
Add CRAM reader (alpha) #302

Merged 8 commits from feature/cram-support into master on Mar 28, 2024

Conversation

@athos (Member) commented Mar 6, 2024

This PR implements an experimental CRAM reader.

At this moment, the API only provides the following basic features:

  • create a new CRAM reader for the specified file
  • create a cloned CRAM reader from another CRAM reader
  • read all the alignments as a lazy sequence
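In code, those three basic features might look like this. This is a hypothetical sketch following cljam's conventions for its other file-format readers; the exact function names and option keys (`reader`, `read-alignments`, `clone-reader`, `:reference`) are assumptions, not confirmed by this PR:

```clojure
(require '[cljam.io.cram :as cram])

;; Hypothetical usage sketch; function names and option keys are assumed.
;; A reference file is required for decoding (see the Limitations section).
(with-open [rdr (cram/reader "path/to/file.cram" {:reference "path/to/ref.fa"})]
  ;; read-alignments returns a lazy sequence, so realize what you need
  ;; (here the first 10 records) before the reader is closed.
  (doall (take 10 (cram/read-alignments rdr))))

;; A cloned reader shares the same file and options as the original:
;; (cram/clone-reader rdr)
```

Realizing the lazy sequence inside `with-open` matters: consuming it after the reader is closed would fail once it tries to read past the already-realized chunk.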

The namespace structure looks like:

  • cljam.io.cram: Provides the public API
  • cljam.io.cram.decode.data-series: Implements decoders for data series and tags
  • cljam.io.cram.decode.record: Implements the decoder for CRAM records
  • cljam.io.cram.decode.structure: Implements decoders for CRAM constructs such as containers, slices, and blocks
  • cljam.io.cram.core: Provides the CRAM reader constructors
  • cljam.io.cram.reader: Implements the CRAM reader
  • cljam.io.cram.seq-resolver: Provides an implementation of the "seq resolver", which abstracts the sequence reader away from the CRAM reader implementation (to break a cyclic dependency)
  • cljam.io.cram.seq-resolver.protocol: Provides the protocol definition of the "seq resolver"

Here are the steps I suggest for reading the code:

  • cljam.io.cram.decode.structure(-test): grasp how the CRAM constructs are decoded and what kind of information they contain
  • cljam.io.cram.decode.data-series(-test): grasp how the data series decoders work
  • cljam.io.cram.decode.record(-test): get the big picture of how the CRAM record decoder coordinates with the data series decoders
  • others

Limitations

This implementation is quite naive, aiming to make it easier to reason about its compliance with the CRAM specification, and it's not primarily focused on efficiency. Additionally, it currently supports only a very limited subset of the specification's features. Here are the current limitations (which might not be exhaustive):

  • No support for region access or index access
  • Decoding always requires a reference file
    • According to the CRAM specification, it should be possible to read a CRAM file without a reference file if the reference sequences are embedded in the CRAM file or if the alignments are unmapped
  • The decoder never generates QNAMEs even if they are not embedded in the CRAM file
    • According to the CRAM specification, it's the decoder's responsibility to generate QNAMEs if they are not embedded in the file
  • Does not support decoding from core data blocks
  • Currently does not support the following compression methods and encodings:
    • Compression methods:
      • rans4x16
      • adaptive arithmetic coder
      • fqzcomp
      • name tokenizer
    • Encodings:
      • Huffman (supported only in the trivial single-symbol/0-bit case)
      • Beta
      • Subexponential
      • Gamma
      • Golomb (deprecated)
      • Golomb-Rice (deprecated)
  • The RG tag will always be added to the decoding result if the encoded record has the read group info
  • Does not automatically calculate the MD/NM tags from read features
  • Does not perform any CRC/MD5 checks
  • Does not support resolving mate information among more than two mate reads
  • Does not support reading containers with a header larger than 256 bytes

Despite these limitations, it's still possible to read actual CRAM files that are used in practice.
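As an aside on the Huffman limitation listed above: a Huffman code over a single-symbol alphabet assigns that symbol a zero-length codeword, so decoding consumes no bits at all, which is why the trivial case is easy to support. A minimal, hypothetical sketch (not cljam's actual implementation):

```clojure
;; Hypothetical sketch: a Huffman "decoder" for the trivial case where the
;; alphabet has exactly one symbol encoded with a zero-length codeword.
;; Decoding then never consumes input; every call yields the sole symbol.
(defn huffman-decoder
  "Returns a zero-argument decoder fn for the given alphabet/code lengths.
  Only the single-symbol/0-bit case is handled in this sketch."
  [alphabet code-lengths]
  (if (and (= 1 (count alphabet))
           (= [0] (vec code-lengths)))
    (constantly (first alphabet))
    (throw (ex-info "general Huffman codes not supported in this sketch"
                    {:alphabet alphabet :code-lengths code-lengths}))))

((huffman-decoder [42] [0])) ;=> 42
```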

@athos athos self-assigned this Mar 6, 2024

codecov bot commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 93.02650%, with 50 lines in your changes missing coverage. Please review.

Project coverage is 88.67%. Comparing base (ec9c199) to head (5d3d289).

Files Patch % Lines
src/cljam/io/cram/decode/record.clj 93.00% 14 Missing and 6 partials ⚠️
src/cljam/io/cram/decode/structure.clj 90.95% 16 Missing and 3 partials ⚠️
src/cljam/io/cram/decode/data_series.clj 93.18% 1 Missing and 5 partials ⚠️
src/cljam/io/cram/reader.clj 96.10% 2 Missing and 1 partial ⚠️
src/cljam/io/cram/core.clj 92.85% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #302      +/-   ##
==========================================
+ Coverage   88.25%   88.67%   +0.42%     
==========================================
  Files          85       93       +8     
  Lines        7381     8098     +717     
  Branches      488      505      +17     
==========================================
+ Hits         6514     7181     +667     
- Misses        379      412      +33     
- Partials      488      505      +17     


@athos athos marked this pull request as ready for review March 8, 2024 10:00
@athos athos requested review from alumi and a team as code owners March 8, 2024 10:00
@athos athos requested review from niyarin and removed request for a team March 8, 2024 10:00
@alumi (Member) left a comment


Thank you for implementing the CRAM decoder!
I understand that interpreting the CRAM spec might have been challenging due to its complexity. I found no issues with the decoder's behavior.
I've added a few comments, so please review them when you have a moment.
Thank you for your hard work! 👍 👍

(-> (struct/decode-block bb)
    (update :crc bytes->vec)
    (dissoc :data)))
(range 29))))
Member

The second container contains, according to the header, 29 blocks.


The first is a compression header block, and the second is a slice header block.
I expected there to be 27 remaining blocks, but the slice header block states that there are 29 remaining blocks.

When I checked the actual data, it looks like there are actually 31 blocks in total, including the compression header block and the slice header block.
Is this difference explained in the specification document?

Member Author

Thank you for checking the test data in detail. I can't immediately identify the cause of the discrepancy, so I'll look into it later.

@athos (Member Author) commented Mar 27, 2024

Sorry for the late reply! After investigating this issue, it seems likely that this is a bug in htsjdk's CRAM encoder.

Although not clearly stated in the CRAM specification, the block count stored in the container header and in the slice header appears to be interpreted as the number of blocks contained within that particular container or slice, respectively. Notably, slices themselves do NOT include compression header blocks or slice header blocks.

Actually, htslib counts the compression header block and slice header block in the total block count for the container header (1, 2) while it only includes the number of data blocks in the slice header's block count (3). This interpretation aligns with how the test data provided by hts-specs is encoded. So, a rational interpretation here would be that the block count is 31 for the container header and 29 for the slice header.

In htsjdk, on the other hand, both the container header's and the slice header's block counts are calculated as the number of data blocks including core data blocks (4) resulting in a block count of 29 for both.

27d1c69 (see below) has corrected the block counts in the container header of the CRAM files generated using htsjdk according to the above interpretation.
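For concreteness, the two interpretations can be illustrated with the numbers from the test data discussed in this thread (29 data blocks, plus 1 compression header block and 1 slice header block per container):

```clojure
;; Illustrative only; the numbers come from the test data discussed above.
(def n-data-blocks 29)   ; data blocks belonging to the slice
(def n-header-blocks 2)  ; 1 compression header block + 1 slice header block

;; htslib-style: the container header counts every block in the container,
;; including the compression header and slice header blocks.
(def htslib-container-count (+ n-data-blocks n-header-blocks)) ;=> 31

;; htslib-style: the slice header counts only its data blocks.
(def htslib-slice-count n-data-blocks)                         ;=> 29

;; htsjdk-style: both headers count only the data blocks, hence 29 for both,
;; which is the mismatch observed in the test data.
(def htsjdk-count n-data-blocks)                               ;=> 29
```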

Member Author

Sorry, I forgot to update test.cram, not only medium*.cram! 5d3d289 has superseded 27d1c69 now.

@niyarin (Contributor) left a comment

Thank you for adding the CRAM decode feature.
LGTM.

@alumi alumi mentioned this pull request Mar 22, 2024
@athos athos force-pushed the feature/cram-support branch from 27d1c69 to 5d3d289 Compare March 27, 2024 06:36
@alumi (Member) left a comment

Thank you for the investigation and the update!
I confirmed that the block counts in the CRAM files are now reasonable.
LGTM 👍 👍

@alumi alumi merged commit 43df52d into master Mar 28, 2024
17 checks passed
@alumi alumi deleted the feature/cram-support branch March 28, 2024 03:54