
Pangrowth

pangrowth is an efficient tool for genomic researchers that predicts the openness of a pangenome and estimates the core genome size. It analyzes fasta sequences using k-mers, as well as other genomic elements such as genes, CDS, or ORFs, provided either as a frequency histogram or as a pan-matrix (columns represent genomes and rows represent items; see panmatrix_ecoli_n50.txt for an example).
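To illustrate how the two input forms relate, here is a minimal sketch that collapses a pan-matrix into a frequency histogram. It assumes the matrix is already loaded as rows of 0/1 values (rows are items, columns are genomes); the function name panmatrix_to_hist is hypothetical, and the on-disk layout of panmatrix_ecoli_n50.txt is not assumed here.

```python
def panmatrix_to_hist(matrix):
    """Return hist where hist[i] = number of items present in exactly i genomes.

    `matrix` holds one row per item and one 0/1 entry per genome (column).
    hist[0] is kept as a placeholder so that the index equals the genome count.
    """
    n_genomes = len(matrix[0])
    hist = [0] * (n_genomes + 1)
    for row in matrix:
        present_in = sum(1 for value in row if value)
        hist[present_in] += 1
    return hist


# Toy pan-matrix: 4 items (rows) across 3 genomes (columns).
toy = [
    [1, 1, 1],  # present in all three genomes
    [1, 0, 0],  # present in one genome
    [0, 1, 1],  # present in two genomes
    [1, 1, 1],  # present in all three genomes
]
print(panmatrix_to_hist(toy))  # [0, 1, 1, 2]
```

The histogram discards which genomes contain each item and keeps only how many do.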

Key features

  • k-mer counting: utilizes a modified version of yak to count k-mers
  • growth/core calculation: computes the exact expected pangenome growth and core size in time quadratic in the number of genomes
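For intuition, the expected growth and core size can be written directly in terms of the histogram. The sketch below assumes that "expected" means the average over all subsets of m genomes out of n, and that the core uses a quorum of 1.0; the function names are hypothetical and this is not necessarily how pangrowth implements the computation. An item present in i genomes is missed by C(n-i, m) of the C(n, m) subsets and is shared by C(i, m) of them.

```python
from math import comb


def expected_growth(hist):
    """Expected pangenome size for m = 1..n genomes.

    hist[i] = number of items present in exactly i genomes (hist[0] unused).
    """
    n = len(hist) - 1
    growth = []
    for m in range(1, n + 1):
        total = comb(n, m)
        # An item in i genomes is absent from comb(n - i, m) of the subsets.
        size = sum(h * (1 - comb(n - i, m) / total)
                   for i, h in enumerate(hist) if i > 0)
        growth.append(size)
    return growth


def expected_core(hist):
    """Expected number of items shared by all m sampled genomes, m = 1..n."""
    n = len(hist) - 1
    core = []
    for m in range(1, n + 1):
        total = comb(n, m)
        # An item in i genomes is in every sampled genome for comb(i, m) subsets.
        size = sum(h * comb(i, m) / total
                   for i, h in enumerate(hist) if i > 0)
        core.append(size)
    return core


hist = [0, 1, 1, 2]  # toy histogram for n = 3 genomes
print([round(x, 3) for x in expected_growth(hist)])  # [3.0, 3.667, 4.0]
print([round(x, 3) for x in expected_core(hist)])    # [3.0, 2.333, 2.0]
```

Each function loops over m = 1..n and over the n histogram entries, which is where the quadratic number of terms comes from.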

Publications

Parmigiani, L., Wittler, R., Stoye, J.: Revisiting pangenome openness with k-mers. PCI Comp & Biol. (2024).

Install

git clone https://github.com/gi-bielefeld/pangrowth
cd pangrowth
make 

To plot the results, we need the following Python libraries: numpy, pandas, matplotlib, scipy and seaborn. You can install them with:

pip install -r dependencies.txt

Usage

Histogram from fasta files

./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt
  • pangrowth also accepts a file containing a list of fasta files (one per line), passed with the parameter -i fasta_list.txt
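Such a list file can be produced in any way you like. As one possibility, the following sketch writes one path per line; the helper name write_fasta_list is hypothetical, and the glob pattern in the usage comment simply reuses the path from the command above.

```python
from pathlib import Path


def write_fasta_list(fasta_paths, out_path):
    """Write one fasta path per line to out_path and return the line count."""
    paths = sorted(str(p) for p in fasta_paths)
    text = "".join(path + "\n" for path in paths)
    Path(out_path).write_text(text)
    return len(paths)


# Example usage (paths are illustrative):
# write_fasta_list(Path("data/fa").glob("*.fna.gz"), "fasta_list.txt")
```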

To visualize the histogram:

python scripts/plot_hist.py hist.txt hist.pdf

k-mer frequency histogram of 12 ecoli

If you have multiple histograms computed from different numbers of genomes and want to compare them, you can use:

python scripts/plot_hist.py --norm_x --norm_y=both hist.txt data/hist_ecoli_n50.txt data/hist_ecoli_n200.txt hist_multiple.pdf
  • The flag --norm_x normalizes the x-axis to the interval (0,1].
  • The flag --norm_y allows two types of normalization:
    • multiplicity, which multiplies each histogram value h[i] by its index i (i.e., h[i] * i, so values appearing once remain the same, values appearing twice are doubled, and so on)
    • percentage, which divides each value h[i] by the total sum of h (so that the normalized values sum to 1)
  • --norm_y=both applies both normalizations in series.
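The normalizations described above can be sketched in a few lines. The function names are hypothetical, and the order used for "both" (multiplicity first, then percentage) is an assumption; scripts/plot_hist.py may differ in detail.

```python
def norm_x(n_genomes):
    """Map the frequencies 1..n onto the interval (0, 1]."""
    return [i / n_genomes for i in range(1, n_genomes + 1)]


def norm_multiplicity(hist):
    """Weight each value by its index: h[i] * i (hist[0] is a placeholder)."""
    return [h * i for i, h in enumerate(hist)]


def norm_percentage(hist):
    """Divide each value by the total so that the result sums to 1."""
    total = sum(hist)
    return [h / total for h in hist]


hist = [0, 4, 2, 2]  # toy histogram for n = 3 genomes
weighted = norm_multiplicity(hist)  # [0, 4, 4, 6]
both = norm_percentage(weighted)    # assumed order: multiplicity, then percentage
print(weighted)
```

Normalizing both axes puts histograms built from different numbers of genomes on a common scale.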

k-mer frequency histogram of multiple ecoli

Pangenome growth from histogram (or pan-matrix)

./pangrowth growth -h data/hist_ecoli_n50.txt > growth.txt
#./pangrowth growth -p data/panmatrix_ecoli_n50.txt > growth.txt

To fit the openness and visualize the growth:

python scripts/plot_growth.py growth.txt growth.pdf

k-mer growth of ecoli
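One common way to quantify openness is Heaps' law: the number of new items contributed by the m-th genome is modeled as K * m^(-alpha), and the pangenome is considered open when alpha <= 1. The sketch below fits K and alpha by least squares in log-log space, using only the standard library. It assumes the growth values are already loaded into a list; the function name is hypothetical, and scripts/plot_growth.py may use a different fitting procedure.

```python
from math import exp, log


def fit_heaps_alpha(growth):
    """Fit new_items(m) ~ K * m**(-alpha) and return (K, alpha).

    growth[m - 1] = expected pangenome size for m genomes.
    """
    xs, ys = [], []
    for m in range(2, len(growth) + 1):
        new_items = growth[m - 1] - growth[m - 2]
        if new_items > 0:  # log is undefined for non-positive increments
            xs.append(log(m))
            ys.append(log(new_items))
    if len(xs) < 2:
        raise ValueError("need at least two positive increments to fit")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of log(new_items) vs log(m).
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return exp(intercept), -slope


# Synthetic growth curve whose increments follow 100 * m**-0.5 exactly.
growth = [100.0]
for m in range(2, 21):
    growth.append(growth[-1] + 100 * m ** -0.5)

K, alpha = fit_heaps_alpha(growth)
print("open" if alpha <= 1 else "closed")  # open
```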

We can again pass multiple growth files to scripts/plot_growth.py to compare datasets, for example other species or different numbers of genomes.

python scripts/plot_growth.py growth.txt data/growth_ecoli_n200.txt growth_multiple.pdf

k-mer growth of multiple ecoli

Pangenome core from histogram (or pan-matrix)

./pangrowth core -h data/hist_ecoli_n50.txt > core.txt
#./pangrowth core -p data/panmatrix_ecoli_n50.txt > core.txt
./pangrowth core -h data/hist_ecoli_n50.txt -q 0.9 > core_q90.txt
  • The -q option sets the quorum, i.e., the fraction of genomes in which an item must be present to be considered part of the core (default 1.0).
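To make the effect of the quorum concrete, the sketch below computes an expected core size under one plausible reading: an item counts as core for a sample of m genomes if it occurs in at least ceil(q * m) of them, averaged over all subsets of size m. This interpretation and the function name are assumptions, not a description of pangrowth's internals. The straightforward hypergeometric sum is also slower than quadratic.

```python
from math import ceil, comb


def expected_core_quorum(hist, q=1.0):
    """Expected core size for m = 1..n genomes under quorum q.

    hist[i] = number of items present in exactly i genomes (hist[0] unused).
    """
    n = len(hist) - 1
    core = []
    for m in range(1, n + 1):
        # Small tolerance guards against floating-point noise in q * m.
        need = max(1, ceil(q * m - 1e-9))
        total = comb(n, m)
        size = 0.0
        for i, h in enumerate(hist):
            if i == 0 or h == 0:
                continue
            # Subsets of size m containing at least `need` of the i genomes
            # that carry the item (hypergeometric tail).
            hits = sum(comb(i, j) * comb(n - i, m - j)
                       for j in range(need, min(i, m) + 1))
            size += h * hits / total
        core.append(size)
    return core


hist = [0, 1, 1, 2]  # toy histogram for n = 3 genomes
print([round(x, 3) for x in expected_core_quorum(hist, q=1.0)])  # [3.0, 2.333, 2.0]
print([round(x, 3) for x in expected_core_quorum(hist, q=0.5)])  # [3.0, 3.667, 3.0]
```

Lowering q can only keep or enlarge the core, since items missing from a few genomes still qualify.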

To fit the core genome and report the percentage of core items relative to the expected genome size:

python scripts/plot_core.py core_q90.txt data/core_q90_ecoli_n200.txt core.pdf

The expected genome size is calculated as the total sum of the histogram divided by the number of genomes.

k-mer core size of multiple ecoli

Contact

For any questions, feedback or problems, please feel free to file an issue on GitHub or contact me here and I will get back to you as soon as possible.

Pangrowth is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appreciate it if you would participate in the evaluation of Pangrowth by completing this very short survey.