Changelog Entry
To be copied to the draft changelog by merger:
`vg gamsort` is much faster than before.
Description
New GAF sorting algorithm that does not parse the alignments beyond determining the key. It handles the records as pairs consisting of an integer key and a string value.
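As a rough illustration of the key/value idea, the sketch below treats each record as an integer key plus the raw line and sorts on the key alone. `KeyedRecord`, `make_records`, and `sort_by_key` are hypothetical names, and the key function is supplied by the caller because this description does not say how the key is derived from a GAF line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type: an integer sort key plus the raw, unparsed line.
struct KeyedRecord {
    std::uint64_t key;
    std::string value;
};

// Build records from raw lines. key_of is supplied by the caller; how the
// real key is computed is an assumption left open here.
template<class KeyFn>
std::vector<KeyedRecord> make_records(const std::vector<std::string>& lines, KeyFn key_of) {
    std::vector<KeyedRecord> records;
    records.reserve(lines.size());
    for (const std::string& line : lines) {
        records.push_back({ static_cast<std::uint64_t>(key_of(line)), line });
    }
    return records;
}

// Sort by key only. The string value is moved around but never inspected.
void sort_by_key(std::vector<KeyedRecord>& records) {
    auto by_key = [](const KeyedRecord& a, const KeyedRecord& b) { return a.key < b.key; };
    std::sort(records.begin(), records.end(), by_key);
    assert(std::is_sorted(records.begin(), records.end(), by_key));
}
```

Because the comparison never touches the value, the cost of a comparison does not depend on the length of the alignment line.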
The initial sort splits the input into chunks of reads (e.g. 1 million reads each), and worker threads sort each chunk into a temporary file. If the input is bgzip-compressed, 4 or 5 worker threads should be enough.
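The chunking step could look roughly like the following sketch. It keeps the sorted chunks ("runs") in memory instead of writing temporary files, and it sorts them one after another instead of handing them to worker threads; both simplifications are assumptions made to keep the example short. `Record`, `Run`, and `make_sorted_runs` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>; // (key, raw line)
using Run = std::vector<Record>;                      // one sorted chunk

// Split the input into chunks of at most chunk_size records and sort each
// chunk by key. The chunks are independent of each other, so each one
// could be given to a separate worker.
std::vector<Run> make_sorted_runs(const std::vector<Record>& input, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<Run> runs;
    for (std::size_t start = 0; start < input.size(); start += chunk_size) {
        std::size_t limit = std::min(input.size(), start + chunk_size);
        Run run(input.begin() + start, input.begin() + limit);
        std::sort(run.begin(), run.end(),
                  [](const Record& a, const Record& b) { return a.first < b.first; });
        runs.push_back(std::move(run));
    }
    return runs;
}
```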
Intermediate merges take ranges of temporary files (e.g. 32 files at a time) and merge them into larger temporary files using worker threads. This part parallelizes better than the initial sort.
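To make the range-based merging concrete, here is a small planning sketch. It only computes which file indices each merge job would cover and how many rounds are needed; it does no I/O. `plan_merge_round`, `rounds_needed`, and the `max_final` threshold are hypothetical, since the description does not state the exact cutoff for switching to the final merge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open range [first, second) of temporary file indices for one merge job.
using FileRange = std::pair<std::size_t, std::size_t>;

// One round of intermediate merges: group the files into consecutive
// ranges of at most fan_in files. Each range is an independent job.
std::vector<FileRange> plan_merge_round(std::size_t num_files, std::size_t fan_in) {
    assert(fan_in >= 2);
    std::vector<FileRange> jobs;
    for (std::size_t start = 0; start < num_files; start += fan_in) {
        jobs.emplace_back(start, std::min(num_files, start + fan_in));
    }
    return jobs;
}

// Number of intermediate rounds until at most max_final files remain.
// max_final is an assumed threshold for starting the final merge.
std::size_t rounds_needed(std::size_t num_files, std::size_t fan_in, std::size_t max_final) {
    assert(fan_in >= 2 && max_final >= 1);
    std::size_t rounds = 0;
    while (num_files > max_final) {
        num_files = plan_merge_round(num_files, fan_in).size();
        ++rounds;
    }
    return rounds;
}
```

Each job in a round reads and writes its own files, which is one plausible reason this stage parallelizes well.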
Once the number of temporary files is low enough, the final merge is performed sequentially. If the output is bgzip-compressed, it may need 5 or 6 compression threads to avoid becoming the bottleneck.
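A sequential multiway merge of sorted runs is commonly done with a min-heap, as in the sketch below. This is a generic illustration and not necessarily how the final merge is implemented here; it merges in-memory runs instead of compressed temporary files, and `merge_runs` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>; // (key, raw line)
using Run = std::vector<Record>;                      // sorted by key

// Merge already-sorted runs into a single sequence sorted by key.
std::vector<Record> merge_runs(const std::vector<Run>& runs) {
    // Heap entry: (key of the next unread record in a run, index of that run).
    using Entry = std::pair<std::uint64_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> offsets(runs.size(), 0);
    std::size_t total = 0;
    for (std::size_t i = 0; i < runs.size(); i++) {
        total += runs[i].size();
        if (!runs[i].empty()) { heap.push({runs[i][0].first, i}); }
    }

    std::vector<Record> merged;
    merged.reserve(total);
    while (!heap.empty()) {
        std::size_t run = heap.top().second; // run holding the smallest key
        heap.pop();
        merged.push_back(runs[run][offsets[run]]);
        offsets[run]++;
        if (offsets[run] < runs[run].size()) {
            heap.push({runs[run][offsets[run]].first, run});
        }
    }
    assert(merged.size() == total);
    return merged;
}
```

The heap holds one entry per run, so each output record costs a logarithmic number of key comparisons in the number of runs.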
The temporary files are compressed using zstd. This uses `zstd_ifstream` and `zstd_ofstream`, which are compressed versions of `std::ifstream` and `std::ofstream` with somewhat limited functionality.
With 5 worker threads and 5 bgzip threads, this should sort a 30x GAF in 15-20 minutes on a recent laptop.
There are also options for using a stable sorting algorithm and shuffling the reads instead of sorting them.
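The difference between the two options can be illustrated with standard library calls. A stable sort keeps records with equal keys in their input order, while a shuffle produces a random permutation. The sketch below uses `std::stable_sort` and `std::shuffle` on in-memory records; how the actual options are implemented is not described here, and the function names and the `seed` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::uint64_t, std::string>; // (key, raw line)

// Stable sort: records with equal keys keep their relative input order.
void stable_sort_by_key(std::vector<Record>& records) {
    auto by_key = [](const Record& a, const Record& b) { return a.first < b.first; };
    std::stable_sort(records.begin(), records.end(), by_key);
    assert(std::is_sorted(records.begin(), records.end(), by_key));
}

// Shuffle: reorder the records pseudo-randomly. A fixed seed gives a
// repeatable order on the same standard library implementation.
void shuffle_records(std::vector<Record>& records, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::shuffle(records.begin(), records.end(), rng);
}
```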
This PR also updates GBWT and GBWTGraph for more efficient GBZ loading.