Merge branch 'main' into feat/py311
aryarm authored Dec 22, 2023
2 parents 52cc05b + a499b0c commit 8843dab
Showing 30 changed files with 912 additions and 143 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -12,6 +12,8 @@ __pycache__
.nox
# poetry
dist/
# filprofiler
fil-result/

# OSX
*.DS_Store*
6 changes: 5 additions & 1 deletion .readthedocs.yaml
@@ -6,12 +6,16 @@

version: 2

build:
os: "ubuntu-22.04"
tools:
python: "3.7"

sphinx:
configuration: docs/conf.py
fail_on_warning: true

python:
version: 3.7
install:
- method: pip
path: .
32 changes: 24 additions & 8 deletions docs/api/data.rst
@@ -168,7 +168,7 @@ All of the other methods in the :class:`Genotypes` class are inherited, but the

GenotypesTR
++++++++++++
The :class:`GenotypesTR` class *extends* :class:`Genotypes` class. The :class:`GenotypesTR` class follows the same structure of :class:`GenotypesVCF`, but can now load repeat count of tandem repeats as the alleles.
The :class:`GenotypesTR` class *extends* the :class:`Genotypes` class. The :class:`GenotypesTR` class follows the same structure of :class:`GenotypesVCF`, but can now load repeat counts of tandem repeats as the alleles.

All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesTR` class' ``load()`` function is unique to loading tandem repeat variants.

@@ -178,6 +178,11 @@ All of the other methods in the :class:`Genotypes` class are inherited, but the
# make the first sample have 4 and 7 repeats for the alleles of the fourth variant
genotypes.data[0, 3] = (4, 7)
The following methods from the :class:`Genotypes` class are disabled, however.

1. ``check_biallelic``
2. ``check_maf``

.. _api-data-genotypestr:

GenotypesPLINK
@@ -188,13 +193,6 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr

The time required to load various genotype file formats.

.. warning::
This class depends on the ``Pgenlib`` python library. This can be installed automatically with ``haptools`` if you specify the "files" extra requirements during installation.

.. code-block:: bash
pip install haptools[files]
The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesVCF` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.

.. code-block:: python
@@ -239,6 +237,24 @@ A large ``chunk_size`` is more likely to result in memory over-use while a small
genotypes = data.GenotypesPLINK('tests/data/simple.pgen', chunk_size=500)
genotypes.read()
GenotypesPLINKTR
++++++++++++++++
The :class:`GenotypesPLINKTR` class extends the :class:`GenotypesPLINK` class to support loading tandem repeat variants.
The :class:`GenotypesPLINKTR` class works similarly to :class:`GenotypesTR` by filling the ``data`` property with repeat counts for each allele.

The following methods from the :class:`GenotypesPLINK` class are disabled, however.

1. ``write``
2. ``check_maf``
3. ``write_variants``
4. ``check_biallelic``

The :class:`GenotypesPLINKTR` uses INFO fields from the PVAR file to determine the repeat unit and the number of repeats for each allele. To ensure your PVAR file contains the necessary information, use the following command when converting from VCF.

.. code-block:: bash
plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --out output
haplotypes.py
~~~~~~~~~~~~~
Overview
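Reviewer note on the ``chunk_size`` context in this file: the docs describe a memory trade-off when choosing the chunk size. A minimal sketch of the general idea, assuming variants are read in consecutive index ranges. ``chunk_ranges`` is a hypothetical helper written for illustration only; it is not part of haptools or Pgenlib.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_variants: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index ranges that cover num_variants."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, num_variants, chunk_size):
        # the last chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, num_variants)


# With 1200 variants and chunk_size=500, three reads are needed
print(list(chunk_ranges(1200, 500)))
# → [(0, 500), (500, 1000), (1000, 1200)]
```

Only one chunk needs to be held in memory at a time, which is why a smaller ``chunk_size`` lowers peak memory at the cost of more read calls.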
20 changes: 10 additions & 10 deletions docs/formats/genotypes.rst
@@ -9,29 +9,29 @@ Genotypes
The time required to load various genotype file formats.

VCF/BCF
-------
~~~~~~~

Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.

.. _formats-genotypesplink:

PLINK2 PGEN
-----------
~~~~~~~~~~~

There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.

If you run out of memory when using PGEN files, consider reading/writing variants from the file in chunks via the ``--chunk-size`` parameter.

.. note::
PLINK2 support depends on the ``Pgenlib`` python library. This can be installed automatically with ``haptools`` if you specify the "files" extra requirements during installation.
Converting from VCF to PGEN
---------------------------
To convert a VCF containing only biallelic SNPs to PGEN, use the following command.

.. code-block:: bash
.. code-block:: bash
pip install haptools[files]
plink2 --snps-only 'just-acgt' --max-alleles 2 --vcf input.vcf --make-pgen --out output
.. warning::
At the moment, only biallelic SNPs can be encoded in PGEN files because of limitations in the ``Pgenlib`` python library. It doesn't properly support multiallelic variants yet (`source <https://github.com/chrchang/plink-ng/blob/c4b8d4361de74c58f0cc11361062eca4f34210d3/2.0/Python/python_api.txt#L88-L89>`_). To ensure your PGEN files only contain SNPs, we recommend use the following command to convert from VCF to PGEN.
To convert a VCF containing tandem repeats to PGEN, use this command instead.

.. code-block:: bash
.. code-block:: bash
plink2 --snps-only 'just-acgt' --max-alleles 2 --vcf input.vcf --make-pgen --out output
plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --out output
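Reviewer note: the tandem repeat conversion relies on the INFO column being kept in the PVAR file. A minimal sketch of a sanity check, assuming an uncompressed, tab-delimited ``.pvar`` whose column-header line starts with ``#CHROM``. ``pvar_has_info_column`` is a hypothetical helper, not a haptools function, and it does not check which INFO keys are present, since the docs above do not name them.

```python
from typing import Iterable


def pvar_has_info_column(lines: Iterable[str]) -> bool:
    """Return True if the PVAR column-header line lists an INFO column."""
    for line in lines:
        if line.startswith("##"):
            continue  # VCF-style meta lines, kept when 'vcfheader' is requested
        if line.startswith("#CHROM"):
            return "INFO" in line.rstrip("\n").split("\t")
        break  # reached a data line without seeing a column header
    return False
```

It could be used as ``pvar_has_info_column(open("output.pvar"))``, where ``output.pvar`` is the file written by the command above.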
4 changes: 2 additions & 2 deletions docs/project_info/contributing.rst
@@ -13,7 +13,7 @@ Types of Contributions
~~~~~~~~~~~~
Report a bug
~~~~~~~~~~~~
If you have found a bug, please report it on `our issues page <https://github.com/aryarm/haptools/issues>`_ rather than emailing us directly. Others may have the same issue and this helps us get that information to them.
If you have found a bug, please report it on `our issues page <https://github.com/CAST-genomics/haptools/issues>`_ rather than emailing us directly. Others may have the same issue and this helps us get that information to them.

Before you submit a bug, please search through our issues to ensure it hasn't already been reported. If you encounter an issue that has already been reported, please upvote it by reacting with a thumbs-up emoji. This helps us prioritize the issue.

@@ -80,7 +80,7 @@ Follow these steps to set up a development environment.

.. code-block:: bash
poetry install -E docs -E tests -E files
poetry install -E docs -E tests
Now, try importing ``haptools`` or running it on the command line.

7 changes: 2 additions & 5 deletions docs/project_info/installation.rst
@@ -9,8 +9,8 @@ Using pip

You can install ``haptools`` from PyPI using ``pip``.

.. warning::
We recommend using ``pip >= 20.3`` because of `an issue in pysam <https://github.com/pysam-developers/pysam/issues/1132>`_.
.. note::
We recommend using ``pip >= 20.3``.

.. code-block:: bash
@@ -29,9 +29,6 @@ We also support installing ``haptools`` from bioconda using ``conda``.
conda install -c conda-forge -c bioconda haptools
.. note::
Installing ``haptools`` from bioconda with PGEN support is not yet possible. See `issue 228 <https://github.com/chrchang/plink-ng/issues/228>`_ for current progress on this challenge.

Installing the latest, unreleased version
-----------------------------------------
Can't wait for us to tag and release our most recent updates? You can install ``haptools`` directly from the ``main`` branch of our Github repository using ``pip``.
8 changes: 4 additions & 4 deletions haptools/__init__.py
@@ -4,7 +4,7 @@
# handles py3.7, since importlib.metadata was introduced in py3.8
from importlib_metadata import version, PackageNotFoundError

try:
__version__ = version(__name__)
except PackageNotFoundError:
__version__ = "unknown"
try:
__version__ = version(__name__)
except PackageNotFoundError:
__version__ = "unknown"
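Reviewer note: the hunk above shows only part of the version lookup. A minimal, self-contained sketch of the same pattern, assuming the standard-library module is preferred and the ``importlib_metadata`` backport is used only when it is missing (Python 3.7). ``get_version`` is a hypothetical wrapper for illustration; the file in this commit assigns ``__version__`` directly.

```python
try:
    # available in the standard library since Python 3.8
    from importlib.metadata import version, PackageNotFoundError
except ImportError:
    # backport for Python 3.7 (assumes importlib_metadata is installed there)
    from importlib_metadata import version, PackageNotFoundError


def get_version(package_name: str) -> str:
    """Return the installed version of a package, or "unknown" if not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "unknown"
```

Falling back to ``"unknown"`` keeps the import from failing when the code is run from a source checkout that was never installed.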
11 changes: 6 additions & 5 deletions haptools/clump.py
@@ -1,13 +1,12 @@
#!/usr/bin/env python

# To test: ./clumpSTR.py --summstats-snps tests/eur_gwas_pvalue_chr19.LDL.glm.linear --clump-snp-field ID --clump-field p-value --clump-chrom-field CHROM --clump-pos-field position --clump-p1 0.2 --out test.clump
import sys
import math
from logging import Logger, getLogger

import numpy as np

from .data import Genotypes, GenotypesVCF, GenotypesTR
from .data import (
    Genotypes,
    GenotypesVCF,
    GenotypesTR,
    GenotypesPLINK,  # used below when the SNP genotypes are in a .pgen file
    GenotypesPLINKTR,
)


class Variant:
@@ -557,15 +556,17 @@ def clumpstr(
strgts = None
gts = None
if gts_snps:
log.debug("Loading SNP Genotypes.")
if str(gts_snps).endswith("pgen"):
log.debug("Loading SNP Genotypes.")
snpgts = GenotypesPLINK.load(gts_snps)
else:
log.debug("Loading SNP Genotypes.")
snpgts = GenotypesVCF.load(gts_snps)
if gts_strs:
log.debug("Loading STR Genotypes.")
strgts = GenotypesTR.load(gts_strs)
if str(gts_strs).endswith("pgen"):
strgts = GenotypesPLINKTR.load(gts_strs)
else:
strgts = GenotypesTR.load(gts_strs)

if gts_snps and gts_strs:
log.debug("Calculating set of overlapping samples between STRs and SNPs.")
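Reviewer note: the hunk above picks a genotype class from the file extension. A minimal sketch of that dispatch, assuming only the four class names that appear in this diff. ``pick_loader_name`` is a hypothetical helper that returns the class name as a string so the example runs without haptools installed. It compares the final suffix rather than calling ``str(path).endswith("pgen")``, which is a slightly stricter check than the one in the commit.

```python
from pathlib import Path
from typing import Union


def pick_loader_name(path: Union[str, Path], tandem_repeats: bool = False) -> str:
    """Return the name of the genotype class suited to the given file."""
    is_pgen = Path(path).suffix == ".pgen"
    if tandem_repeats:
        return "GenotypesPLINKTR" if is_pgen else "GenotypesTR"
    return "GenotypesPLINK" if is_pgen else "GenotypesVCF"


print(pick_loader_name("snps.pgen"))                       # GenotypesPLINK
print(pick_loader_name("strs.vcf", tandem_repeats=True))   # GenotypesTR
```

Keeping the choice in one function would make it straightforward to cover all four combinations in a unit test.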
8 changes: 7 additions & 1 deletion haptools/data/__init__.py
@@ -3,4 +3,10 @@
from .covariates import Covariates
from .breakpoints import Breakpoints, HapBlock
from .haplotypes import Extra, Repeat, Variant, Haplotype, Haplotypes
from .genotypes import Genotypes, GenotypesVCF, GenotypesTR, GenotypesPLINK
from .genotypes import (
Genotypes,
GenotypesVCF,
GenotypesTR,
GenotypesPLINK,
GenotypesPLINKTR,
)