Support for delta encoding of AP data series for sorted CRAM files #323

Merged
merged 2 commits into from
Oct 3, 2024

Conversation

Member

@athos athos commented Sep 3, 2024

This PR adds support for delta encoding of the AP data series for position-sorted CRAM files.

According to the CRAM specification, setting the AP flag in the container's compression header enables delta encoding for the AP data series. When enabled, the AP data series stores, for each record, the difference between its :pos field and that of the preceding record. For the first record in a slice, the difference is taken relative to the alignment start in the slice header. See the following sections in the CRAM specification for more details:
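As a rough illustration of the delta scheme described above (not the code in this PR), the following sketch encodes and decodes the positions of one slice. The function names `delta-encode-positions` and `delta-decode-positions` are hypothetical, and the slice's alignment start is assumed to be passed in as a plain integer.

```clojure
;; Minimal sketch of AP delta encoding for a single slice.
;; Function names are hypothetical and not part of cljam's API.

(defn delta-encode-positions
  "Returns the AP values for `positions` (the :pos of each record, in order).
  The first value is relative to `slice-start`; every later value is relative
  to the previous record's position."
  [slice-start positions]
  (map - positions (cons slice-start positions)))

(defn delta-decode-positions
  "Inverse of delta-encode-positions: rebuilds absolute positions from deltas."
  [slice-start deltas]
  ;; reductions yields the running sum starting at slice-start;
  ;; drop that initial element to get one position per delta.
  (rest (reductions + slice-start deltas)))

(comment
  (delta-encode-positions 100 [100 105 105 230]) ;; deltas for four records
  (delta-decode-positions 100 [0 5 0 125]))      ;; round-trips to the positions
```

For a coordinate-sorted slice every delta is non-negative and typically small, which is what makes this representation compress well.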

In this implementation, AP delta encoding is enabled if (and only if) the CRAM file is declared as SO:coordinate in the CRAM header. With this change, alignment stats, which were previously calculated during record encoding, must now be calculated during preprocessing. The calculated alignment stats are then passed to record encoding via the container context and slice contexts.
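The preprocessing step described above might look roughly like the following sketch. It is not the code from this PR: the helper names (`coordinate-sorted?`, `alignment-span`, `preprocess-slice`), the header shape `{:HD {:SO "coordinate"}}`, and the use of `:pos` and `:end` keys on records are all assumptions made for illustration. The point is that the slice's alignment start has to be known before the first record is encoded, since the first AP delta is taken relative to it.

```clojure
;; Hypothetical sketch: decide whether AP delta encoding applies and compute
;; alignment stats up front, before any record is encoded.

(defn coordinate-sorted?
  "True if the header declares SO:coordinate.
  Assumes the header is a map shaped like {:HD {:SO \"coordinate\"}}."
  [header]
  (= "coordinate" (get-in header [:HD :SO])))

(defn alignment-span
  "Returns the start, end and span covered by a non-empty seq of records.
  Assumes each record carries 1-based inclusive :pos and :end keys."
  [records]
  (let [start (apply min (map :pos records))
        end   (apply max (map :end records))]
    {:start start, :end end, :span (inc (- end start))}))

(defn preprocess-slice
  "Builds a hypothetical slice context for the record encoder."
  [header records]
  {:ap-delta?       (coordinate-sorted? header)
   :alignment-stats (alignment-span records)})

(comment
  (preprocess-slice {:HD {:SO "coordinate"}}
                    [{:pos 100 :end 149} {:pos 120 :end 180}]))
```

A real implementation would also need to handle unmapped records and slices spanning multiple references, which this sketch ignores.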

To broaden test coverage of AP delta encoding, the existing test resource medium.cram has been re-encoded with AP delta encoding enabled. The re-encoding was done using the following script, which is a modified version of the script previously used:

  (require '[clojure.java.io :as io])
- (import '[htsjdk.samtools CRAMFileWriter SamReaderFactory SAMRecord]
+ (import '[htsjdk.samtools CRAMFileWriter SamReaderFactory SAMRecord SAMFileHeader$SortOrder]
          '[htsjdk.samtools.cram.ref ReferenceSource]
          '[htsjdk.samtools.cram.structure CRAMEncodingStrategy])

  (let [rdr-factory (SamReaderFactory/makeDefault)
        strategy (doto (CRAMEncodingStrategy.)
                   (.setMinimumSingleReferenceSliceSize 300)
                   (.setReadsPerSlice 500)
                   (.setSlicesPerContainer 10))]
    (with-open [r (.open rdr-factory (io/file "test-resources/bam/medium.bam"))
                os (io/output-stream "test-resources/cram/medium.cram")
                w (CRAMFileWriter. strategy
                                   os
                                   nil
-                                  false
+                                  true
                                   (ReferenceSource. (io/file "hg19.fa"))
-                                  (.getFileHeader r)
+                                  (doto (.clone (.getFileHeader r))
+                                    (.setSortOrder SAMFileHeader$SortOrder/coordinate))
                                   "test-resources/cram/medium.cram")]
      (doseq [^SAMRecord aln (iterator-seq (.iterator r))]
        (.addAlignment w aln))))

@athos athos requested review from alumi and a team September 3, 2024 08:07
@athos athos self-assigned this Sep 3, 2024
@athos athos requested review from niyarin and removed request for a team September 3, 2024 08:07
codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.80%. Comparing base (b70109a) to head (9209c52).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #323      +/-   ##
==========================================
+ Coverage   89.78%   89.80%   +0.01%     
==========================================
  Files         102      102              
  Lines        9186     9197      +11     
  Branches      480      480              
==========================================
+ Hits         8248     8259      +11     
  Misses        458      458              
  Partials      480      480              

Contributor

@niyarin niyarin left a comment

Sorry for the late review.
LGTM.

@niyarin niyarin merged commit 66b4d3e into master Oct 3, 2024
18 checks passed
@niyarin niyarin deleted the feature/delta-encoded-ap branch October 3, 2024 06:52
Member Author

athos commented Oct 3, 2024

Thank you both for reviewing!

3 participants