Support for encoding non-detached mate records #326
Merged
This PR adds support for encoding non-detached mate records. Once this has been merged, non-detached mate record encoding will always be enabled.
CRAM Specification
According to the CRAM specification, when mate records are in the same slice, the position of the downstream mate record relative to the upstream record can be indicated using the NF data series. In such cases, information related to mate records (such as RNEXT, PNEXT, and TLEN) can be obtained from the RNAME and POS of the mate record indicated by NF, reducing the need to retain these fields in separate data series. This approach is efficient in terms of CRAM file size.
When the mate record is outside the same slice or when fields like RNEXT and PNEXT are inconsistent between mates, the relevant information is stored in individual data series (NS, NP, etc.). Such mate records are considered "detached" and are flagged with 0x02 (detached) in the CF data series.
Design policy
When mate records are encoded as non-detached, the relevant fields must be consistent between mates. Otherwise, the values reconstructed on decoding would not match the original ones. cljam’s CRAM writer therefore checks the fields for consistency before encoding mates as non-detached. If the fields are inconsistent, the records are encoded as detached even if both are in the same slice.
Additionally, to avoid ambiguous cases, the CRAM writer takes a conservative approach and treats mates as detached by default. Only primary and representative records are encoded as non-detached. Secondary and supplementary records are never eligible for non-detached encoding. Note that whether a record (or its mate) is unmapped does not affect its eligibility for non-detached encoding.
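The eligibility rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not cljam's actual Clojure code: the `Record` class and the function names are hypothetical, and it checks only RNEXT and PNEXT, whereas the real writer may compare further mate-related fields.

```python
from dataclasses import dataclass

# SAM FLAG bits, as defined in the SAM specification.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


@dataclass
class Record:
    """Hypothetical minimal alignment record (not cljam's data model)."""
    qname: str
    flag: int
    rname: str
    pos: int
    rnext: str
    pnext: int


def is_primary(rec: Record) -> bool:
    # Secondary and supplementary records are never eligible.
    return (rec.flag & (SECONDARY | SUPPLEMENTARY)) == 0


def points_at(a: Record, b: Record) -> bool:
    """True if a's RNEXT/PNEXT describe b's RNAME/POS."""
    # In SAM, RNEXT "=" means "same reference as RNAME".
    rnext = a.rname if a.rnext == "=" else a.rnext
    return rnext == b.rname and a.pnext == b.pos


def can_encode_non_detached(a: Record, b: Record) -> bool:
    # Assumption: both records are already known to be in the same slice.
    # Unmapped status is deliberately not consulted here.
    return (is_primary(a) and is_primary(b)
            and points_at(a, b) and points_at(b, a))
```

Any pair that fails this check would simply fall back to the default detached encoding, so the check can stay strict without risking data loss.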
Implementation
In this PR, the mate resolution process is introduced as part of preprocessing, where it checks whether each record’s mate exists within the same slice and, if so, associates the mate record.
In the mate resolution process, the QNAME is used to locate mate records within the same slice. If a mate is found, the process checks whether the fields related to mates, such as RNEXT and PNEXT, are consistent. If no mate is found or if the fields are inconsistent, the record is considered detached. If a mate is found within the same slice and the fields are consistent, the detached flag (0x02) is left unset in the CF data series of both records, and the upstream record’s CF is additionally flagged with 0x04 (mate downstream). The NF data series of the upstream record is also set to indicate the distance to the downstream mate.
The newly introduced mate resolver is responsible for finding mate records within the same slice. It takes a record and its index within the slice and checks whether the record’s QNAME has been observed before. If so, it treats the record associated with the observed QNAME as the mate and returns the mate’s index in the slice. If not, it records the QNAME and slice index for future mate resolution.
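A minimal sketch of the resolver and the resulting CF/NF assignment might look like the following. It is written in Python for illustration only and is not cljam's implementation: `MateResolver`, `assign_mate_flags`, and the dict-based records are hypothetical. It assumes the CF bit meanings from the CRAM specification (0x02 detached, 0x04 mate downstream), assumes NF counts the records to skip between the two mates, and assumes only eligible records are fed in, with each QNAME appearing at most twice.

```python
CF_DETACHED = 0x02         # CRAM CF bit: record is detached
CF_MATE_DOWNSTREAM = 0x04  # CRAM CF bit: mate follows in the same slice


class MateResolver:
    """Remembers the slice index at which each QNAME was first seen."""

    def __init__(self):
        self._seen = {}

    def resolve(self, qname, index):
        """Return the index of the earlier record with this QNAME, or None.

        When the QNAME has not been observed yet, it is recorded together
        with its slice index for future lookups.
        """
        mate_index = self._seen.get(qname)
        if mate_index is None:
            self._seen[qname] = index
        return mate_index


def assign_mate_flags(records, consistent):
    """Set 'cf' and 'nf' on each record dict of one slice.

    `consistent(upstream, downstream)` is a caller-supplied predicate that
    decides whether the mate-related fields of the two records agree.
    """
    resolver = MateResolver()
    for rec in records:
        rec["cf"] = CF_DETACHED  # conservative default: detached
        rec["nf"] = None
    for i, rec in enumerate(records):
        j = resolver.resolve(rec["qname"], i)
        if j is None:
            continue
        upstream = records[j]
        if consistent(upstream, rec):
            # Non-detached pair: clear 0x02 on both, set 0x04 upstream.
            upstream["cf"] = (upstream["cf"] & ~CF_DETACHED) | CF_MATE_DOWNSTREAM
            upstream["nf"] = i - j - 1  # records skipped between the mates
            rec["cf"] &= ~CF_DETACHED
```

Because the lookup is a single pass over the slice with a hash map keyed by QNAME, the cost stays linear in the number of records per slice.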