feat: convert samples
arg in Genotypes
classes into a set
#230
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
closes #228 and overrides it
background
The
Genotypes.read
andGenotypes.__init__
methods expect samples to be provided as a list but make no guarantee that the samples will be loaded in the ordering requested. (This is because the underlying libraries that we use do not make this guarantee either.)We first attempted to solve this issue in #225 by automatically reordering after reading the genotypes. But then we realized this would double the memory of the
read()
method.PR #228 takes a different approach to the problem by adding a
reorder_samples
argument to theGenotypes
classes that would do the reordering only if it is requested. This approach would also have required us to do more work to implement extra checks (see #228 (comment)). It also got us asking, from a design perspective, what makes thesamples
argument any different than thevariants
argument, which is currently typed as a set?proposal
We ultimately decided that a better design approach would be to just copy the type of the
variants
argument and make thesamples
argument into a set, as well. If a user passes anything that isn't a set, we warn them and request that they use thesubset()
method after loading the genotypes. This approach ended up being a bit cleaner since the GenotypesPLINK classes actually need it to be a set, anyway.