Deconstruct option to cluster similar alleles together #4301
Changelog Entry
To be copied to the draft changelog by merger:
-L added to vg deconstruct in order to cluster similar allele traversals together. The value given is a (length-weighted) threshold for the Jaccard coefficient between the oriented nodes of two traversals. So if -L 0.75 is given, then alleles that have >= 0.75 similarity based on their graph positions will be merged into one. Two new FORMAT fields are added to keep track of the difference, TS (Jaccard distance) and TL (length difference). Clustering is done greedily starting with selected reference paths.
Description
I don't think the clustering is especially useful on its own (though it can be used to make much simpler VCFs), but it's an important part of improved multi-level VCF support (finally coming in the next PR). It's also why I'm clustering on the graph path and not the actual DNA string: only the paths themselves can be used to anchor the child snarls.
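
For readers who want a feel for the clustering described in the changelog entry, here is a rough Python sketch of a length-weighted Jaccard coefficient over oriented nodes plus a greedy pass that puts reference traversals first. It is not the vg implementation (vg is C++). The names weighted_jaccard and cluster_traversals, the (node id, is_reverse) tuple encoding, the best-match tie-breaking, and the sign of the length difference are all assumptions made for illustration. The sketch reports similarity; whether the real TS field stores similarity or 1 - similarity is not something this sketch settles.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: an oriented node is (node id, is_reverse).
OrientedNode = Tuple[int, bool]
Traversal = List[OrientedNode]


def traversal_length(trav: Traversal, node_len: Dict[int, int]) -> int:
    """Total base-pair length of a traversal."""
    return sum(node_len[node_id] for node_id, _ in trav)


def weighted_jaccard(a: Traversal, b: Traversal, node_len: Dict[int, int]) -> float:
    """Jaccard coefficient of two oriented-node sets, weighted by node length."""
    set_a, set_b = set(a), set(b)
    union_len = sum(node_len[n] for n, _ in set_a | set_b)
    if union_len == 0:
        return 1.0  # two empty traversals are treated as identical
    shared_len = sum(node_len[n] for n, _ in set_a & set_b)
    return shared_len / union_len


def cluster_traversals(
    travs: List[Traversal], node_len: Dict[int, int], threshold: float
) -> List[Tuple[int, float, int]]:
    """Greedy clustering. travs must list reference traversals first.

    Returns, per traversal: (index of cluster representative,
    similarity to it, length difference from it).
    """
    reps: List[int] = []
    result: List[Tuple[int, float, int]] = []
    for i, trav in enumerate(travs):
        best: Optional[int] = None
        best_sim = -1.0
        for r in reps:
            sim = weighted_jaccard(trav, travs[r], node_len)
            if sim >= threshold and sim > best_sim:
                best, best_sim = r, sim
        if best is None:
            # No representative is similar enough: start a new cluster.
            reps.append(i)
            best, best_sim = i, 1.0
        diff = traversal_length(trav, node_len) - traversal_length(travs[best], node_len)
        result.append((best, best_sim, diff))
    return result


if __name__ == "__main__":
    node_len = {1: 10, 2: 1, 3: 1, 4: 10, 5: 30}
    ref = [(1, False), (2, False), (4, False)]
    snp = [(1, False), (3, False), (4, False)]   # differs by one 1 bp node
    ins = [(1, False), (5, False), (4, False)]   # differs by a 30 bp node
    for rep, sim, diff in cluster_traversals([ref, snp, ins], node_len, 0.75):
        print(rep, round(sim, 3), diff)
```

With a threshold of 0.75, the one-base difference shares 20 of 22 weighted bases with the reference and is merged into its cluster, while the 30 bp alternative shares only 20 of 51 and starts its own cluster. Weighting by length is what keeps a short substitution inside a long allele from splitting it off.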