Fine-grained partition of CRAM containers and slices #322
+318
−29
Overview
This PR introduces changes to fine-tune how containers and slices are partitioned when writing CRAM files.
The CRAM specification leaves the partitioning of containers and slices up to each CRAM encoder. In CRAM index files, index entries are created for each slice, so the way containers and slices are partitioned affects the performance of index access in CRAM files.
In particular, having multiple-reference (or multi-ref) slices can make index access less efficient. So, CRAM encoders should aim to avoid creating multi-ref slices whenever possible, and even when they are necessary, the size of such slices should be kept to a minimum.
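To illustrate where multi-ref slices come from, here is a minimal sketch in Python (not the language of this PR). It assumes a coordinate-sorted input represented only by a hypothetical list of reference ids, and a naive writer that cuts the stream into fixed-size slices without looking at reference boundaries. The names `naive_slices` and `is_multi_ref` are made up for this example.

```python
from itertools import islice


def naive_slices(ref_ids, records_per_slice):
    """Cut the record stream into fixed-size slices, ignoring reference boundaries."""
    it = iter(ref_ids)
    while chunk := list(islice(it, records_per_slice)):
        yield chunk


def is_multi_ref(slice_refs):
    """A slice is multi-ref when its records map to more than one reference."""
    return len(set(slice_refs)) > 1


# Hypothetical input: one reference id per record, sorted by reference.
refs = [0] * 5 + [1] * 2 + [2] * 5
slices = list(naive_slices(refs, 4))
multi = [s for s in slices if is_multi_ref(s)]

print(slices)  # [[0, 0, 0, 0], [0, 1, 1, 2], [2, 2, 2, 2]]
print(multi)   # [[0, 1, 1, 2]]
```

In this toy input the middle slice mixes three references, so a reader looking for records on any one of them has to touch that whole slice.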
The change introduced in this PR is a bit complex, and this complexity mainly comes from these considerations.
Specification
The size of containers and slices can be controlled by the following three options provided to the CRAM writer:
records-per-slice
slices-per-container
min-single-ref-slice-size
Basically, the CRAM writer tries to pack the number of records specified by records-per-slice into one slice and the number of slices specified by slices-per-container into one container.
Additionally, the CRAM writer adjusts the sizes of containers and slices to minimize the occurrence of multi-ref containers and slices as much as possible:
min-single-ref-slice-size, the records mapped to different references will be put together into one multi-ref slice
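The interaction of the three options can be sketched roughly as follows. This is an illustration in Python, not the PR's implementation, and it rests on assumptions that the text above does not confirm: a run of consecutive records on one reference becomes single-ref slices only if the run has at least min-single-ref-slice-size records, shorter runs are merged into a shared slice, and a container only holds consecutive slices of the same kind. The function names and the "multi" marker are hypothetical.

```python
from itertools import groupby


def partition_slices(ref_ids, records_per_slice, min_single_ref_slice_size):
    """Return a list of (kind, records); kind is a reference id or "multi"."""
    slices = []
    pending = []  # records from short runs, waiting to be merged

    def flush_pending():
        if pending:
            kind = pending[0] if len(set(pending)) == 1 else "multi"
            slices.append((kind, list(pending)))
            pending.clear()

    for ref, run in groupby(ref_ids):
        run = list(run)
        if len(run) >= min_single_ref_slice_size:
            # Assumption: a long enough run gets its own single-ref slices.
            flush_pending()
            for i in range(0, len(run), records_per_slice):
                slices.append((ref, run[i:i + records_per_slice]))
        else:
            # Assumption: short runs share a slice, capped at records_per_slice.
            for r in run:
                pending.append(r)
                if len(pending) == records_per_slice:
                    flush_pending()
    flush_pending()
    return slices


def pack_containers(slices, slices_per_container):
    """Group consecutive slices of the same kind, up to slices_per_container each."""
    containers = []
    for _, group in groupby(slices, key=lambda s: s[0]):
        group = list(group)
        for i in range(0, len(group), slices_per_container):
            containers.append(group[i:i + slices_per_container])
    return containers


# Hypothetical input: one reference id per record, sorted by reference.
refs = [0] * 9 + [1] * 2 + [2] * 1 + [3] * 5
slices = partition_slices(refs, records_per_slice=4, min_single_ref_slice_size=3)
containers = pack_containers(slices, slices_per_container=2)

print([kind for kind, _ in slices])  # [0, 0, 0, 'multi', 3, 3]
print([len(c) for c in containers])  # [2, 1, 1, 2]
```

With these toy settings, only the two short runs (references 1 and 2) end up in a multi-ref slice, and that slice stays small because it holds nothing but those short runs.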