Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fine-grained partition of CRAM containers and slices #322

Merged
merged 2 commits into from
Aug 30, 2024

Conversation

athos
Copy link
Member

@athos athos commented Aug 29, 2024

Overview

This PR introduces changes to fine-tune how containers and slices are partitioned when writing CRAM files.

The CRAM specification has flexibility in how containers and slices are partitioned, and it’s up to each CRAM encoder to decide how to do it. In CRAM index files, index entries are created for each slice, so how containers and slices are partitioned affects the performance of index access in CRAM files.

In particular, having multiple-reference (or multi-ref) slices can make index access less efficient. So, CRAM encoders should aim to avoid creating multi-ref slices whenever possible, and even when they are necessary, the size of such slices should be kept to a minimum.

The change introduced in this PR is a bit complex, and this complexity mainly comes from these considerations.

Specification

The size of containers and slices can be controlled by the following three options provided to the CRAM writer:

  • records-per-slice
  • slices-per-container
  • min-single-ref-slice-size

Basically, the CRAM writer tries to pack the number of records specified by records-per-slice into one slice and the number of slices specified by slices-per-container into one container.

Additionally, the CRAM writer adjusts the sizes of containers and slices to minimize the occurrence of multi-ref containers and slices as much as possible:

  • Records mapped to different references are placed in different slices, ensuring that each slice remains single-ref
  • However, if the size of a slice falls below the value specified by min-single-ref-slice-size, the records mapped to different references will be put together into one multi-ref slice
  • A multi-ref slice will be placed in its own container, and no other slices will be included in the same container
  • If two single-ref slices consist of records mapped to different references, they will not be placed in the same container
  • There are also some other minor adjustments included

@athos athos self-assigned this Aug 29, 2024
@athos athos requested review from alumi and a team as code owners August 29, 2024 06:46
@athos athos requested review from ToshimitsuArai and removed request for a team August 29, 2024 06:46
Copy link

codecov bot commented Aug 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.78%. Comparing base (79205f6) to head (7c93d23).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   89.72%   89.78%   +0.06%     
==========================================
  Files         101      102       +1     
  Lines        9129     9186      +57     
  Branches      480      480              
==========================================
+ Hits         8191     8248      +57     
  Misses        458      458              
  Partials      480      480              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -190,17 +188,11 @@
slices)
(when-let [index-writer (.-index-writer wtr)]
(write-index-entries index-writer container-offset container-header
slices container-records))
counter'))
Copy link
Member Author

@athos athos Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that write-container now doesn't return the next counter value.

Previously, the CRAM writer needed to count the number of container records as writing out the container since the CRAM writer couldn't tell in advance how many records the container had in it.

Now the CRAM writer can tell in advance how many records are contained in each container by counting the number of records and packing them into slices/containers before writing them out.

Copy link
Contributor

@ToshimitsuArai ToshimitsuArai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. LGTM 👍

@ToshimitsuArai ToshimitsuArai merged commit b70109a into master Aug 30, 2024
18 checks passed
@ToshimitsuArai ToshimitsuArai deleted the feature/cram-slice-partition branch August 30, 2024 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants