Use XZ for LZMA compression in CRAM #321

Merged
merged 1 commit into from
Aug 27, 2024

@athos (Member) commented Aug 27, 2024

Currently, when the CRAM writer generates a CRAM file containing LZMA-compressed blocks, samtools cannot read it and fails with a decode error.

Repro

(require '[cljam.io.cram :as cram])

;; Write a copy of medium.cram, overriding the compressor for tag data with LZMA.
(with-open [r (cram/reader "test-resources/cram/medium.cram"
                           {:reference "hg19.fa"})
            w (cram/writer "medium_lzma.cram"
                           {:reference "hg19.fa"
                            :tag-compressor-overrides (constantly :lzma)})]
  (cram/write-header w (cram/read-header r))
  (cram/write-alignments w (cram/read-alignments r) (cram/read-header r)))

$ samtools view -T hg19.fa medium_lzma.cram
[E::lzma_mem_inflate] LZMA decode failure (error 7)
[E::cram_next_slice] Failure to decode slice
samtools view: error reading file "medium_lzma.cram"

Cause

According to the CRAM specification, the LZMA compression method used in CRAM is actually based on the XZ format:

CRAM uses the xz Stream format to encapsulate this algorithm, as defined in https://tukaani.org/xz/xz-file-format.txt.
https://github.com/samtools/hts-specs/blob/ebebbc8c2910ef2d4c5e7119c6f9ffac3bb6a0cb/CRAMv3.tex#L2503

In fact, htsjdk uses XZCompressorOutputStream/XZCompressorInputStream from commons-compress when the LZMA method is specified.

The current cljam CRAM writer, on the other hand, uses LZMACompressorOutputStream for LZMA compression, which writes the legacy .lzma format rather than an xz stream. Because of this mismatch, neither samtools nor htsjdk can read the CRAM files it generates with LZMA-compressed blocks.

Change

The PR replaces the use of LZMACompressorOutputStream/LZMACompressorInputStream with XZCompressorOutputStream/XZCompressorInputStream for LZMA compression and decompression.

I have confirmed that with this change, CRAM files generated by the CRAM writer containing LZMA-compressed blocks can be successfully read by samtools without errors.
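
As a rough illustration of the swap, the sketch below round-trips a byte array through the XZ stream classes from commons-compress. It assumes that commons-compress and its optional org.tukaani/xz dependency are on the classpath and that the JVM is Java 9 or later (for readAllBytes). The helper names xz-compress and xz-decompress are hypothetical and are not the functions actually used in cljam.

```clojure
(import '[java.io ByteArrayInputStream ByteArrayOutputStream]
        '[org.apache.commons.compress.compressors.xz
          XZCompressorInputStream XZCompressorOutputStream])

;; Hypothetical helper: compress a byte array into an xz stream.
(defn xz-compress ^bytes [^bytes data]
  (let [baos (ByteArrayOutputStream.)]
    ;; Closing the compressor stream finishes the xz stream (writes the footer).
    (with-open [out (XZCompressorOutputStream. baos)]
      (.write out data))
    (.toByteArray baos)))

;; Hypothetical helper: decompress an xz stream back into a byte array.
(defn xz-decompress ^bytes [^bytes compressed]
  (with-open [in (XZCompressorInputStream. (ByteArrayInputStream. compressed))]
    (.readAllBytes in)))
```

Under these assumptions, (xz-decompress (xz-compress data)) should give back the original bytes, and the compressed output should start with the xz magic bytes rather than an .lzma header.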

@athos athos self-assigned this Aug 27, 2024
@athos athos requested review from alumi and a team as code owners August 27, 2024 06:17
@athos athos requested review from matsutomo81 and removed request for a team August 27, 2024 06:17
codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.72%. Comparing base (b0b8cb4) to head (421df41).

Files                                   Patch %  Lines
src/cljam/io/cram/decode/structure.clj  0.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #321   +/-   ##
=======================================
  Coverage   89.72%   89.72%           
=======================================
  Files         101      101           
  Lines        9129     9129           
  Branches      480      480           
=======================================
  Hits         8191     8191           
  Misses        458      458           
  Partials      480      480           

@matsutomo81 (Contributor) left a comment

Thank you for your work on this. 🙏
LGTM 👍

@alumi (Member) left a comment

Thank you for the quick fix! 👍

Since I was not very familiar with the XZ and LZMA formats, I did some research.
XZ is a format where the data block is enclosed by a header that starts with fd 37 7a 58 5a 00 and a footer that ends with 59 5a.
LZMA is a format with a header that begins with a properties byte (typically 5d) followed by the dictionary size (00 00 80 00, little-endian, i.e. 8 MiB), which depends on the compression level.
I confirmed that the CRAM file output by the modified code indeed has the XZ structure and can be read by samtools.
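
The byte-level check described above could be sketched roughly as follows. This is only an illustration using plain JDK interop: the names xz-stream? and lzma-alone-like? are hypothetical, and the check is meant for the compressed payload of a single block, not for a whole CRAM file. Since the .lzma format has no real magic number, the second predicate is just a heuristic on the common 5d properties byte.

```clojure
(def xz-header-magic
  "First six bytes of an xz stream: FD 37 7A 58 5A 00."
  [0xFD 0x37 0x7A 0x58 0x5A 0x00])

(def xz-footer-magic
  "Last two bytes of an xz stream: 59 5A (ASCII \"YZ\")."
  [0x59 0x5A])

(defn- unsigned-bytes
  "Returns the bytes of `bs` as a vector of integers in the range 0-255."
  [^bytes bs]
  (mapv #(bit-and (long %) 0xFF) bs))

(defn xz-stream?
  "True if `bs` starts with the xz header magic and ends with the xz footer magic.
  Only the magic bytes are inspected; the stream is not validated any further."
  [^bytes bs]
  (let [v (unsigned-bytes bs)
        n (count v)]
    (and (>= n 8) ; room for the 6-byte header magic plus the 2-byte footer magic
         (= xz-header-magic (subvec v 0 6))
         (= xz-footer-magic (subvec v (- n 2))))))

(defn lzma-alone-like?
  "Heuristic only: true if the first byte is the common .lzma properties byte 5D."
  [^bytes bs]
  (and (pos? (alength bs))
       (= 0x5D (bit-and (long (aget bs 0)) 0xFF))))

;; Synthetic example bytes (not a real compressed stream):
(xz-stream? (byte-array (map unchecked-byte
                             [0xFD 0x37 0x7A 0x58 0x5A 0x00 0x00 0x00 0x59 0x5A])))
;; => true
```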

@alumi alumi merged commit 79205f6 into master Aug 27, 2024
18 checks passed
@alumi alumi deleted the fix/use-xz-for-lzma branch August 27, 2024 13:19
@athos (Member, Author) commented Aug 27, 2024

Thank you for reviewing the change and clarifying the difference between those formats! That's helpful as I haven't dug that deep into it.
