Faster GAF sorting #4505

Merged
merged 14 commits into from
Jan 26, 2025

Conversation

jltsiren
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • GAF sorting with vg gamsort is much faster than before.

Description

This PR adds a new GAF sorting algorithm that does not parse the alignments beyond determining the sort key. Each record is handled as a pair consisting of an integer key and a string value.
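A minimal sketch of the key/value idea, not the code from this PR. The key used here (the first node id in the path field, column 6 of a GAF line) is an assumption chosen for illustration, and `first_node_key`, `make_records`, and `sort_records` are hypothetical names. The point is that the line itself is never parsed further or rewritten.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// A record is an integer key plus the untouched GAF line.
using Record = std::pair<std::uint64_t, std::string>;

bool key_less(const Record& a, const Record& b) { return a.first < b.first; }

// Hypothetical key: the first node id in the path field (column 6).
// Lines without a parsable path get the largest key and sort last.
std::uint64_t first_node_key(const std::string& line) {
    const std::uint64_t no_key = std::numeric_limits<std::uint64_t>::max();
    std::size_t pos = 0;
    for (int field = 0; field < 5; field++) {  // skip the first five fields
        pos = line.find('\t', pos);
        if (pos == std::string::npos) { return no_key; }
        pos++;
    }
    if (pos >= line.size() || (line[pos] != '>' && line[pos] != '<')) { return no_key; }
    pos++;
    std::uint64_t id = 0;
    bool found = false;
    while (pos < line.size() && line[pos] >= '0' && line[pos] <= '9') {
        id = id * 10 + static_cast<std::uint64_t>(line[pos] - '0');
        found = true;
        pos++;
    }
    return found ? id : no_key;
}

// Pair each line with its key; the line is stored as-is.
std::vector<Record> make_records(const std::vector<std::string>& lines) {
    std::vector<Record> records;
    records.reserve(lines.size());
    for (const std::string& line : lines) {
        records.emplace_back(first_node_key(line), line);
    }
    return records;
}

// Sort by key only; the string values are moved around but never inspected.
void sort_records(std::vector<Record>& records) {
    std::sort(records.begin(), records.end(), key_less);
    assert(std::is_sorted(records.begin(), records.end(), key_less));
}
```

Because the comparison touches only the integer key, the cost of sorting does not depend on how long or complex the alignment lines are.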

The initial sort splits the input into chunks of reads (e.g. 1 million reads per chunk), and worker threads sort the chunks into temporary files. If the input is bgzip-compressed, 4 or 5 worker threads should be enough.
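The chunking step could look roughly like the sketch below. It is a simplification under stated assumptions: sorted runs are kept in memory instead of being written to temporary files, the chunks are processed one after another instead of by worker threads, and `Run` and `sort_in_chunks` are hypothetical names. Each chunk is independent of the others, which is what makes it possible to hand them to separate workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>;
using Run = std::vector<Record>;  // stands in for one sorted temporary file

// Split the input into chunks of at most chunk_size records and sort
// each chunk by key. No chunk depends on any other chunk.
std::vector<Run> sort_in_chunks(const std::vector<Record>& input, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<Run> runs;
    for (std::size_t start = 0; start < input.size(); start += chunk_size) {
        std::size_t limit = std::min(input.size(), start + chunk_size);
        Run run(input.begin() + static_cast<std::ptrdiff_t>(start),
                input.begin() + static_cast<std::ptrdiff_t>(limit));
        std::sort(run.begin(), run.end(),
                  [](const Record& a, const Record& b) { return a.first < b.first; });
        runs.push_back(std::move(run));
    }
    return runs;
}
```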

Intermediate merges take ranges of temporary files (e.g. 32 files at a time) and merge them into larger temporary files using worker threads. This part parallelizes better than the initial sort.
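One round of intermediate merging can be sketched as a k-way merge over each range of sorted runs. This is an illustration, not the code from this PR: runs are in-memory vectors standing in for temporary files, the ranges are merged sequentially here, and `merge_runs`, `merge_round`, and `fan_in` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>;
using Run = std::vector<Record>;  // stands in for one sorted temporary file

// Merge the sorted runs in the range [first, last) into one sorted run.
Run merge_runs(const std::vector<Run>& runs, std::size_t first, std::size_t last) {
    assert(first <= last && last <= runs.size());
    using Head = std::pair<std::uint64_t, std::size_t>;  // (key, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heads;
    std::vector<std::size_t> offsets(last - first, 0);   // next unread record per run
    for (std::size_t i = first; i < last; i++) {
        if (!runs[i].empty()) { heads.emplace(runs[i][0].first, i); }
    }
    Run merged;
    while (!heads.empty()) {
        std::size_t i = heads.top().second;  // run holding the smallest key
        heads.pop();
        std::size_t& offset = offsets[i - first];
        merged.push_back(runs[i][offset]);
        offset++;
        if (offset < runs[i].size()) { heads.emplace(runs[i][offset].first, i); }
    }
    return merged;
}

// One round: every range of up to fan_in runs becomes a single larger run.
// The ranges do not overlap, so each one could go to a separate worker.
std::vector<Run> merge_round(const std::vector<Run>& runs, std::size_t fan_in) {
    assert(fan_in > 0);
    std::vector<Run> result;
    for (std::size_t start = 0; start < runs.size(); start += fan_in) {
        std::size_t limit = std::min(runs.size(), start + fan_in);
        result.push_back(merge_runs(runs, start, limit));
    }
    return result;
}
```

Repeating `merge_round` shrinks the number of runs by roughly a factor of `fan_in` each time, until few enough remain for a single final merge.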

Once the number of temporary files is low enough, the final merge is performed sequentially. If the output is bgzip-compressed, it may need 5 or 6 compression threads to avoid becoming the bottleneck.

The temporary files are compressed using zstd. This uses zstd_ifstream and zstd_ofstream, which are compressed versions of std::ifstream and std::ofstream with somewhat limited functionality.

With 5 worker threads and 5 bgzip threads, this should sort a 30x GAF in 15-20 minutes on a recent laptop.

There are also options for using a stable sorting algorithm and shuffling the reads instead of sorting them.
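The difference between the two options can be illustrated on key/value records. The description does not say how the shuffle is implemented; the sketch below assumes one possible approach, ordering records by a seeded hash of the value instead of by the real key. All names (`stable_sort_records`, `shuffle_records`, `fnv1a`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>;

bool key_less(const Record& a, const Record& b) { return a.first < b.first; }

// Stable sort: records with equal keys keep their input order.
void stable_sort_records(std::vector<Record>& records) {
    std::stable_sort(records.begin(), records.end(), key_less);
    assert(std::is_sorted(records.begin(), records.end(), key_less));
}

// 64-bit FNV-1a hash of the value, mixed with a seed.
std::uint64_t fnv1a(const std::string& value, std::uint64_t seed) {
    std::uint64_t hash = 14695981039346656037ULL ^ seed;
    for (unsigned char c : value) {
        hash ^= c;
        hash *= 1099511628211ULL;
    }
    return hash;
}

// Assumed shuffle: order the records by hash instead of by key.
// The same seed always gives the same order.
void shuffle_records(std::vector<Record>& records, std::uint64_t seed) {
    std::vector<std::pair<std::uint64_t, std::size_t>> order;  // (hash, original index)
    order.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); i++) {
        order.emplace_back(fnv1a(records[i].second, seed), i);
    }
    std::sort(order.begin(), order.end());
    std::vector<Record> shuffled;
    shuffled.reserve(records.size());
    for (const auto& entry : order) {
        shuffled.push_back(std::move(records[entry.second]));
    }
    records = std::move(shuffled);
}
```

A hash-based order fits the same chunk-and-merge machinery as a regular sort, since only the key changes.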

This PR also updates GBWT and GBWTGraph for more efficient GBZ loading.

@jltsiren jltsiren merged commit c740ed6 into master Jan 26, 2025
2 checks passed
@jltsiren jltsiren deleted the faster-gaf-sorting branch January 26, 2025 06:48
2 participants