Add CRAM reader (alpha) #302
Conversation
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##           master     #302      +/-   ##
==========================================
+ Coverage   88.25%   88.67%   +0.42%
==========================================
  Files          85       93       +8
  Lines        7381     8098     +717
  Branches      488      505      +17
==========================================
+ Hits         6514     7181     +667
- Misses        379      412      +33
- Partials      488      505      +17
Thank you for implementing the CRAM decoder!
I understand that interpreting the CRAM spec might have been challenging due to its complexity. I found no issues with the decoder's behavior.
I've added a few comments, so please review them when you have a moment.
Thank you for your hard work! 👍 👍
(-> (struct/decode-block bb)
    (update :crc bytes->vec)
    (dissoc :data)))
(range 29))))
According to its header, the second container contains 29 blocks:

:blocks 29

The first is a compression header block and the second is a slice header block, so I expected 27 remaining blocks, but the slice header block states that there are 29 remaining blocks:

:blocks 29

When I checked the actual data, there appear to be 31 blocks in total, including the compression header block and the slice header block.
Is this difference explained in the specification document?
Thank you for checking the test data in detail. I can't pinpoint the cause of the difference at the moment, so I'll look into it later.
Sorry for the late reply! After investigating this issue, it seems likely that this is a bug in htsjdk's CRAM encoder.
Although it's not clearly stated in the CRAM specification, the block count stored in the container header and the one stored in the slice header appear to be interpreted as the number of blocks contained within that particular container and slice, respectively. Notably, slices themselves do NOT include compression header blocks or slice header blocks.
Indeed, htslib counts the compression header block and the slice header block in the container header's total block count (1, 2), while it includes only the data blocks in the slice header's block count (3). This interpretation also matches how the test data provided by hts-specs is encoded. So the rational interpretation here is a block count of 31 for the container header and 29 for the slice header.
htsjdk, on the other hand, calculates both the container header's and the slice header's block counts as the number of data blocks, including core data blocks (4), resulting in a block count of 29 for both.
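To make the difference concrete, here's a minimal sketch of the two counting schemes for a container holding a single slice (the helper names and the single-slice assumption are mine for illustration, not cljam's actual code):

;; Two interpretations of the block counts, for a container with one slice.
;; `slice-data-blocks` = the slice's data blocks (core block + external blocks).
(defn htslib-style-counts
  "htslib: the container header counts every block in the container;
  the slice header counts only the slice's data blocks."
  [slice-data-blocks]
  {:container-header (+ 1 ; compression header block
                        1 ; slice header block
                        slice-data-blocks)
   :slice-header slice-data-blocks})

(defn htsjdk-style-counts
  "htsjdk: both headers count only the data blocks."
  [slice-data-blocks]
  {:container-header slice-data-blocks
   :slice-header slice-data-blocks})

(htslib-style-counts 29) ;=> {:container-header 31, :slice-header 29}
(htsjdk-style-counts 29) ;=> {:container-header 29, :slice-header 29}

With 29 data blocks, the htslib-style count reproduces the 31 blocks observed in the test data, while the htsjdk-style count yields the 29 written in both headers.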
27d1c69 (see below) corrects the block counts in the container headers of the CRAM files generated with htsjdk, following the interpretation above.
Thank you for adding the CRAM decode feature.
LGTM.
Force-pushed from 27d1c69 to 5d3d289
Thank you for the investigation and the update!
I confirmed that the block counts in the CRAM files are now reasonable.
LGTM 👍 👍
This PR implements an experimental CRAM reader.
At this moment, the API only provides the following basic features:
The namespace structure looks like:
cljam.io.cram
cljam.io.cram.decode.data-series
cljam.io.cram.decode.record
cljam.io.cram.decode.structure
cljam.io.cram.core
cljam.io.cram.reader
cljam.io.cram.seq-resolver
cljam.io.cram.seq-resolver.protocol
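For orientation, here's a minimal usage sketch. cljam.io.cram is the entry namespace listed above, but the reader/read-alignments function names and the :reference option are assumptions based on cljam's conventions for its other readers, not the confirmed API of this PR:

;; A usage sketch; function names and options are assumptions based on
;; cljam's conventions, not necessarily the API added here.
(require '[cljam.io.cram :as cram])

(with-open [r (cram/reader "path/to/file.cram" {:reference "path/to/ref.fa"})]
  ;; realize the alignments while the reader is still open
  (doall (take 5 (cram/read-alignments r))))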
Here are the steps I suggest for reading the code:

1. cljam.io.cram.decode.structure(-test): grasp how the CRAM constructs are decoded and what kind of information they contain
2. cljam.io.cram.decode.data-series(-test): grasp how the data series decoders work
3. cljam.io.cram.decode.record(-test): get the big picture of how the CRAM record decoder coordinates with the data series decoders

Limitations
This implementation is quite naive, aiming to make it easier to reason about its compliance with the CRAM specification; it is not primarily focused on efficiency. Additionally, it currently supports only a very limited subset of the specification's features. Here are the current limitations (which might not be exhaustive):

- The RG tag will always be added to the decoding result if the encoded record has the read group info
- MD/NM tags are not restored from read features

Despite these limitations, it's still possible to read actual CRAM files that are used in practice.
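For illustration, given the RG behavior above, a decoded record always carries its read group as an option. A hypothetical example, assuming the records follow cljam's usual SAM alignment map shape (all field values here are made up):

;; A hypothetical decoded record; the values are invented, and the map
;; shape assumes cljam's SAM alignment representation. The :RG option
;; is always present when the encoded record carries read group info.
{:qname "read001"
 :flag  99
 :rname "chr1"
 :pos   12345
 :mapq  60
 :cigar "100M"
 :rnext "="
 :pnext 12545
 :tlen  300
 :seq   "ACGTACGT"
 :qual  "IIIIIIII"
 :options [{:RG {:type "Z", :value "rg001"}}]}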