Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added parameter optimization and evaluation functionality for iotools benchmarks #11

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

DanteNiewenhuis
Copy link

@DanteNiewenhuis DanteNiewenhuis commented Apr 17, 2023

Added python files that can be used to optimize the parameters of reading and writing and RNTuple file.

The main functionality is evaluating single parameter configurations (see tutorials/evaluate_parameters.py),
and optimizing parameters for a specific benchmark (see tutorials/run_annealer.py).

Three different methods of optimizing parameters are provided in Algorithms.py:

  • Hill Climber: A simple algorithm that mutates the current parameter configuration, and accepts the new configuration when it results in higher performance.
  • Simulated Annealer: similar to a Hill Climber, but adds a small chance to accept negative mutations.
  • Random Walker: Similar to a Hill Climber, but every step is accepted. This is not useful for optimization, but can be used to better understand the solution space

To use the files, first the paths in variables.py have to be updated.

This update also adds two flags to the generator files for defining page and cluster size:

-p define page size in bytes (default 65536)
-x define cluster size in bytes (default 52428800)

@DanteNiewenhuis DanteNiewenhuis changed the title Flags for page and cluster size Added parameter optimization and evaluation functionality for iotools benchmarks Jun 19, 2023
Copy link
Contributor

@jalopezg-git jalopezg-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks, @DanteNiewenhuis! 👍

I have attached some preliminary comments, but I guess it's up to @jblomer to finally approve and merge.

gen_atlas.cxx Outdated
Comment on lines 28 to 29
size_t pagesize = 65536;
size_t clustersize = 52428800;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t pagesize = 65536;
size_t clustersize = 52428800;
size_t pageSize = (64 * 1024);
size_t clusterSize = (50 * 1000 * 1000);

Or even better, IMO, set it to 0 or -1U, and only call SetApproxUnzippedPageSize(), etc., if a value was actually set by the user.
If unset, importer should be using the default value from RNTupleWriteOptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments in gen_atlas.cxx apply here.

gen_h1.cxx Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments in gen_atlas.cxx apply here.

gen_lhcb.cxx Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment in gen_atlas.cxx apply here.



import sys
sys.path.insert(0, "/home/dante-niewenhuis/Documents/CERN-parameters-new")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing this path is not mentioned in the README.md. Also, what is it used for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, Python cannot import packages that are in the parent folder of a file.
To fix this, I add the optimization_tools folder to the paths so Python can find it.
I have changed the string so it is clearer that it is something the user should to change.

Comment on lines 58 to 59
page_sizes = [(16, "KB"), (32, "KB"), (64, "KB"), (128, "KB"), (256, "KB"), (512, "KB"),
(1, "MB"), (2, "MB"), (4, "MB"), (8, "MB"), (16, "MB")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the convertToByte() function, MB here refers to a power-of-two unit, i.e. a mebibyte.

However, we typically use the MB (1000 * 1000) bytes as the unit for on-disk data (see, e.g. the default value for the fApproxZippedClusterSize member in RNTupleWriteOptions).

return False


def convertToByte(value: int, form: str = "b") -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to duplicate this function in Parameters.py. I think we should get rid of one of them.

"""
for line in benchmark_output.split("\n"):
if target in line:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

for _ in tqdm(range(evaluations)):

# Reset cache
os.system('sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iotools repository already has clear_page_cache.c. The recipe in the Makefile also sets the suid bit, so that no sudo is required after

$ make clear_page_cache

Copy link
Owner

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks a lot!

Some minor comments.

Feel free to add your name and/or contact details to the README.

.gitignore Outdated
gen_h1
gen_lhcb

*.pyc
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline at the end of the file

gen_atlas.cxx Outdated
Comment on lines 28 to 29
size_t pagesize = (64 * 1024);
size_t clustersize = (50 * 1000 * 1000);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in the other generators: use zero to indicate the default page size should be used if the option is not set. I think also the file name modification should only be done if there is a deviation from the default.

gen_cms.cxx Outdated
{
std::cout << "Usage: " << progname << " -i <ttjet_13tev_june2019.root> -o <ntuple-path> -c <compression> [-m(t)]"
std::cout << "Usage: " << progname << " -o <ntuple-path> -c <compression> -p <page-size> -x <cluster-size> [-m(t)] <H1 root file>"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::cout << "Usage: " << progname << " -o <ntuple-path> -c <compression> -p <page-size> -x <cluster-size> [-m(t)] <H1 root file>"
std::cout << "Usage: " << progname << " -o <ntuple-path> -c <compression> -p <page-size> -x <cluster-size> [-m(t)] <CMS root file>"

@@ -0,0 +1,12 @@
This folder contains python files used to benchmark and optimize parameters based on the iotools benchmarks.

To run the files, first the Paths in Benchmarking/variables.py need to be updated.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To run the files, first the Paths in Benchmarking/variables.py need to be updated.
To run the files, first the paths in Benchmarking/variables.py need to be updated.



import sys
sys.path.insert(0, "/home/dante-niewenhuis/Documents/CERN-parameters-new")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sys.path.insert(0, "/home/dante-niewenhuis/Documents/CERN-parameters-new")
sys.path.insert(0, "CERN-parameters-new")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work in this case because "CERN-parameters-new" is a parent of the current working directory

Comment on lines 3 to 4
# root [12] auto ntuple = ROOT::Experimental::RNTupleReader::Open("DecayTree" , "generated/B2HHH~none.ntuple")
# root [13] ntuple->PrintInfo(ROOT::Experimental::ENTupleInfo::kStorageDetails)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add to the comment what it means / why it is there


for _ in tqdm(range(evaluations)):
# Reset cache
os.system('sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this we have the clean_page_cache utility

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants