
Mini sbibm #1335

Open

manuelgloeckler wants to merge 14 commits into main
Conversation

manuelgloeckler
Contributor

What does this implement/fix? Explain your changes

This is a draft for some "benchmarking" capabilities integrated into sbi.

With pytest, we can roughly check that everything works by passing all tests. Some tests will ensure that the overall methodology works "sufficiently" well on simplified Gaussian analytic examples. Certain changes might still pass all tests but, in the end, negatively impact the performance/accuracy.

Specifically, when implementing new methods or, e.g., changing default parameters, it is important to check that the implementation not only passes the tests but also works sufficiently well.

Does this close any currently open issues?

Prototype for #1325

Any relevant code examples, logs, error output, etc?

It should now work such that one simply runs:

pytest --bm 

This is a custom option that disables regular testing and instead switches to a "benchmark" mode, which only runs tests marked as such and always passes them. Instead of asserting, these tests cache a metric of how well an implemented method solved a specific task (currently some examples in "bm_test.py").

Once it finishes, instead of passed/failed, it prints a table with the metric (we could still color methods that perform worse than expected).
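For illustration, a minimal conftest.py sketch of how such a --bm switch could be wired up with standard pytest hooks (this is an assumption about the mechanism, not necessarily the exact code in this PR):

# conftest.py -- minimal sketch of a --bm benchmark switch (illustrative only)
import pytest

def pytest_addoption(parser):
    # Register the custom --bm flag that switches pytest into benchmark mode.
    parser.addoption("--bm", action="store_true", default=False, help="run benchmark suite")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--bm"):
        # Benchmark mode: only run tests marked with @pytest.mark.benchmark.
        skip = pytest.mark.skip(reason="not a benchmark test")
        for item in items:
            if "benchmark" not in item.keywords:
                item.add_marker(skip)
    else:
        # Regular mode: skip the benchmark tests.
        skip = pytest.mark.skip(reason="benchmark tests only run with --bm")
        for item in items:
            if "benchmark" in item.keywords:
                item.add_marker(skip)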

Any other comments?

  • What tasks to include - clearly, they must be somewhat "fast" to solve.
  • What methods to include (i.e., just the standard methods with default parameters, or more).

@manuelgloeckler manuelgloeckler marked this pull request as draft December 18, 2024 17:30
@manuelgloeckler
Contributor Author

Alright, on the current examples, the output looks like this:

[image: benchmark output table with a metric per method and task]

Runtime increases linearly with the number of training simulations (currently 2k ≈ 10 min on my laptop; with 1k it was about 5 min). It would maybe also be nice to print runtimes on the right.

Overall runtime, of course, also depends on how many different methods should be included. I think some limited control over what is run would be nice, e.g.:

pytest --bm # All base inference classes on defaults (similar to current behavior)
pytest --bm=NPE   # NPE with e.g. different density estimators
pytest --bm=SNPE # SNPE_ABC  2 round test
...

Either way, there needs to be a limit on what is run, and every configuration should finish in a reasonable amount of time.

@manuelgloeckler manuelgloeckler self-assigned this Jan 13, 2025
@manuelgloeckler
Contributor Author

Alright, it is now more or less ready for review. The overall "framework" is done; from here, one could go in a few directions, depending on what scope we want this to have. The current interface is:

pytest --bm # Runs all major methods on default
pytest --bm --bm-mode=npe # Runs all major npe methods with e.g. different density estimators ...
pytest --bm --bm-mode=nre # Runs all major nre methods with different classifiers ...
pytest --bm --bm-mode=snpe # Runs all sequential NPE methods 2 rounds
pytest --bm --bm-mode=snle/snre # As above
pytest --bm --bm-mode=fmpe # Runs fmpe with different nets
pytest --bm --bm-mode=npse # Runs NPSE with different nets and others.
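As a rough illustration, the --bm-mode option and the benchmark_mode fixture that appears later in the diff could be wired up roughly like this (a sketch, not necessarily the PR's exact implementation):

# conftest.py -- sketch of the --bm-mode option; details are illustrative
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--bm-mode", action="store", default=None,
        help="restrict the benchmark to one method family, e.g. npe, nre, snpe",
    )

@pytest.fixture
def benchmark_mode(request):
    # Tests can branch on this value, e.g. to run 2-round variants for snpe/snle/snre.
    return request.config.getoption("--bm-mode")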

Not sure how much "configurability" we want this to have.

Tests currently cannot fail based on performance, but you will get a report showing the relative performance of each method on each task, e.g., like this:
[image: report table showing the relative performance of each method on each task]
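For illustration only, one way such a relative-performance table could be colored (print_benchmark_summary and the results layout are assumptions, not the PR's actual reporting code):

# Sketch: print a per-task summary where the best (lowest) C2ST score is highlighted.
# `results` is assumed to map (method, task) -> C2ST score, as collected by the tests.
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"

def print_benchmark_summary(results: dict) -> None:
    tasks = sorted({task for _, task in results})
    methods = sorted({method for method, _ in results})
    for task in tasks:
        print(f"\n{task}")
        best = min(results[(m, task)] for m in methods if (m, task) in results)
        for m in methods:
            if (m, task) not in results:
                continue
            score = results[(m, task)]
            # Best method in green, clearly worse ones (> best + 0.1) in red.
            color = GREEN if score == best else (RED if score > best + 0.1 else "")
            print(f"  {m:<35} {color}{score:.3f}{RESET}")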

@manuelgloeckler manuelgloeckler marked this pull request as ready for review January 13, 2025 13:23
@manuelgloeckler manuelgloeckler requested a review from janfb January 13, 2025 13:24
janfb (Contributor) left a comment

Overall, this is really great to have - thanks a lot for pushing this!
Love the relative coloring of the results 🎉

Added a couple of comments and questions for clarification.

tests/bm_test.py (5 outdated review threads, resolved)
tests/bm_test.py Outdated
num_simulations1 = NUM_SIMULATIONS // 2
thetas, xs = task.get_data(num_simulations1)
prior = task.get_prior()
idx_eval = 1
Contributor

So sequential methods are evaluated only on observation idx=1? And amortized methods on all observations, with the c2st scores then averaged?

Contributor Author

Yes, currently the amortized methods are also only evaluated on the first 3 observations.

More evaluation adds more runtime. Not sure if we should do a full evaluation.
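For context, a rough sketch of what that evaluation loop could look like (eval_amortized, task.get_observation, and task.get_reference_posterior_samples are assumed, sbibm-style names; the mini_sbibm API in this PR may differ):

# Sketch: average C2ST over the first few observations for an amortized posterior.
import torch
from sbi.utils.metrics import c2st

def eval_amortized(posterior, task, num_obs: int = 3, num_samples: int = 1000) -> float:
    scores = []
    for idx in range(1, num_obs + 1):
        x_o = task.get_observation(idx)
        reference = task.get_reference_posterior_samples(idx)[:num_samples]
        samples = posterior.sample((num_samples,), x=x_o)
        scores.append(c2st(samples, reference).item())
    # Lower is better; 0.5 means the classifier cannot distinguish the two sample sets.
    return float(torch.tensor(scores).mean())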

tests/conftest.py (outdated, resolved)
tests/conftest.py (resolved)
tests/mini_sbibm/__init__.py (outdated, resolved)
tests/mini_sbibm/base_task.py (resolved)

codecov bot commented Jan 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.44%. Comparing base (d3f22b5) to head (8ff30c1).
Report is 12 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1335       +/-   ##
===========================================
- Coverage   89.40%   78.44%   -10.97%     
===========================================
  Files         118      118               
  Lines        8715     8779       +64     
===========================================
- Hits         7792     6887      -905     
- Misses        923     1892      +969     
Flag        Coverage Δ
unittests   78.44% <ø> (-10.97%) ⬇️

Flags with carried forward coverage won't be shown.

see 42 files with indirect coverage changes

janfb (Contributor) left a comment

Thanks for the edits!

All looks very good. I just have a couple of suggestions for renaming and removing comments.

The sample files are small, so no need to save them via git-lfs?

from .mini_sbibm import get_task
from .mini_sbibm.base_task import Task

# NOTE: This might can be improved...
Contributor

Please remove comments like this. Let's rather create an issue with concrete ideas on how to improve it.

SEED = 0
TASKS = ["two_moons", "linear_mvg_2d", "gaussian_linear", "slcp"]
NUM_SIMULATIONS = 2000
EVALUATION_POINTS = 4 # Currently only 3 observation tested for speed
Contributor

The comment should match the number in the code.
Also, I would suggest not calling it "points" but "observations", e.g.,
NUM_EVALUATIONS_OBS

Contributor

I get it now that `range(1, EVALUATION_POINTS)` results in 3 observations, but I think we should rather set `EVALUATION_POINTS = 3` and then use `range(1, EVALUATION_POINTS + 1)` to get 3 observations.
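For example, with that convention the intent is explicit (observation_indices is just an illustrative name):

EVALUATION_POINTS = 3  # number of observations actually evaluated
observation_indices = list(range(1, EVALUATION_POINTS + 1))  # -> [1, 2, 3]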

NUM_SIMULATIONS = 2000
EVALUATION_POINTS = 4 # Currently only 3 observation tested for speed
NUM_ROUNDS_SEQUENTIAL = 2
EVALUATION_POINT_SEQUENTIAL = 1
Contributor

NUM_EVALUATION_OBS_SEQ

return float(c2st_val)


def amortized_inference_eval(
Contributor

I suggest more explicit function names, e.g., train_and_eval_amortized_inference

results_bag.method = inference_method.__name__ + str(extra_kwargs)


def sequential_inference_eval(
Contributor

Analogously to the above, e.g., train_and_eval_sequential_inference


@pytest.mark.benchmark
@pytest.mark.parametrize("task_name", TASKS, ids=str)
def test_benchmark(
Contributor

test_run_benchmark

benchmark_mode: str,
) -> None:
"""
Benchmark test for standard and sequential inference methods.
Contributor

Suggested change:
- Benchmark test for standard and sequential inference methods.
+ Benchmark test for amortized and sequential inference methods.

?

task_name: str,
results_bag,
extra_kwargs: dict,
benchmark_mode: str,
Contributor

What is the difference between benchmark_mode and inference_method?

Contributor

Oh, it's the fixture, right? It's a bit confusing to read if benchmark_mode in ["snpe", "snle", "snre"]: when we also have inference_method as an arg.

@@ -44,3 +212,48 @@ def mcmc_params_accurate() -> dict:
def mcmc_params_fast() -> dict:
"""Fixture for MCMC parameters for fast tests."""
return dict(num_chains=1, thin=1, warmup_steps=1)


# Pytest harvest xdist support - not sure if we need it (for me xdist is always slower).
Contributor

When on a machine with 100 or so cores, this might be really useful. I suggest keeping it and removing this comment.
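For example, with pytest-xdist installed, the benchmark run could then be spread across workers:

pytest --bm -n auto  # pytest-xdist distributes the (independent) benchmark tests across cores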

@@ -0,0 +1,37 @@
# This file is part of sbi, a toolkit for simulation-based inference. sbi is licensed
# under the Apache License Version 2.0, see <https://www.apache.org/licenses/>
# NOTE: This is inspired by the sbibm-package <https://github.com/sbi-benchmark/sbibm>
Contributor

👍
