Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for encoding non-detached mate records #326

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

athos
Copy link
Member

@athos athos commented Nov 6, 2024

This PR adds support for encoding non-detached mate records. Once this has been merged, non-detached mate record encoding will always be enabled.

CRAM Specification

According to the CRAM specification, when mate records are in the same slice, the position of the downstream mate record relative to the upstream record can be indicated using the NF data series. In such cases, information related to mate records (such as RNEXT, PNEXT, and TLEN) can be obtained from the RNAME and POS of the mate record indicated by NF, reducing the need to retain these fields in separate data series. This approach is efficient in terms of CRAM file size.

When the mate record is outside the same slice or when fields like RNEXT and PNEXT are inconsistent between mates, the relevant information is stored in individual data series (NS, NP, etc.). Such mate records are considered "detached" and are flagged with 0x02 (detached) in the CF data series.

Design policy

When encoding mate records as non-detached, it’s essential to verify that the relevant fields are consistent between mates to avoid discrepancies in their stored values. So, in cljam’s CRAM writer, when encoding mate records as non-detached, it checks that the fields are consistent. If the fields are inconsistent, even if both records are in the same slice, they are encoded as detached.

Additionally, to prevent uncertain situations, the CRAM writer adopts a conservative approach and treats mates as detached by default, especially only encoding primary and representative records as non-detached (secondary and supplementary records are never eligible for non-detached encoding). Note that whether a record (or its mate) is unmapped does not affect its eligibility for non-detached encoding.

Implementation

In this PR, the mate resolution process is introduced as part of preprocessing, where it checks whether each record’s mate exists within the same slice and, if so, associates the mate record.

In the mate resolution process, the QNAME is used to locate mate records within the same slice. If a mate is found, it checks whether the fields related to mates, such as RNEXT and PNEXT, are consistent. If no mate is found or if the fields are inconsistent, the record is considered detached. If a mate is found within the same slice and the fields are consistent, the upstream record’s CF data series is flagged with 0x02 (detached) and 0x04 (mate downstream), and the downstream record’s CF is flagged with 0x02. The NF data series of the upstream record is also set to indicate the distance to the downstream mate.

The newly introduced mate resolver is responsible for finding mate records within the same slice. It takes a record and its index within the slice, checks if the QNAME has been observed before, and if so, considers the record associated to the observed QNAME as its mate, returning the mate’s index in the slice. If not observed, it records the QNAME and slice index for future mate resolution.

@athos athos self-assigned this Nov 6, 2024
@athos athos requested review from alumi and a team as code owners November 6, 2024 04:04
@athos athos requested review from matsutomo81 and removed request for a team November 6, 2024 04:04
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 95.58824% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.99%. Comparing base (b1748fd) to head (1168f6b).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/cljam/io/cram/encode/record.clj 94.64% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #326      +/-   ##
==========================================
+ Coverage   89.96%   89.99%   +0.02%     
==========================================
  Files         103      104       +1     
  Lines        9271     9323      +52     
  Branches      485      488       +3     
==========================================
+ Hits         8341     8390      +49     
  Misses        445      445              
- Partials      485      488       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@athos athos requested review from a team and ToshimitsuArai and removed request for matsutomo81 and a team November 6, 2024 05:38
@athos athos force-pushed the feature/non-detached-mate-records branch from 1790cdb to 1168f6b Compare November 18, 2024 11:40
@athos
Copy link
Member Author

athos commented Nov 18, 2024

Rebased this onto the latest master branch.

Copy link
Contributor

@ToshimitsuArai ToshimitsuArai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review 🙏
Thank you and checked for the PR, LGTM 👍

@ToshimitsuArai ToshimitsuArai merged commit a8966d9 into master Nov 20, 2024
17 checks passed
@ToshimitsuArai ToshimitsuArai deleted the feature/non-detached-mate-records branch November 20, 2024 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants