PanPhlAn profiling 3_0
panphlan_profiling.py merges and processes the panphlan_map.py results to produce the final gene presence/absence profiles of the strains detected in the samples, or to extract the transcriptional activity of individual strains from DNA and RNA-seq pairs of the same sample.
Example:
./panphlan_profiling.py -i map_results/ --o_matrix result_gene_presence_absence.tsv -p pangenome_file.tsv --add_ref
- --i_dna: input directory containing the panphlan_map.py results, OR a text file listing the paths of all input files
- --pangenome: path to the pangenome file
- --o_matrix: final result, the gene-family presence/absence profiles of all detected strains. It is a binary profile matrix in a .tsv file: gene-families are marked 1 when present and 0 when absent.
Example of a gene-family profile table in the result file result_gene_presence_absence.tsv:
sample01 sample04 sample05 sample08
g00001 1 0 1 0
g00002 0 1 1 1
g00003 0 0 0 1
g00004 1 1 1 1
The presence/absence matrix can be used in mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains with heatmaps or PCoA plots, to investigate which gene-families are present in some strains but not in others, and to look for potential associations between diseases and the presence of specific genes.
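As an illustration of this kind of downstream analysis, the following is a minimal Python sketch that computes pairwise Jaccard distances between strain profiles, which could then feed a heatmap or PCoA. It assumes a tab-separated file whose header row lists the sample names (optionally preceded by an empty or label cell). The helper names `read_presence_absence` and `jaccard_distance` are hypothetical and not part of PanPhlAn. The toy matrix reuses the values of the table above with unique row labels.

```python
import csv
import io
from itertools import combinations


def read_presence_absence(handle):
    """Parse a tab-separated 0/1 matrix into {sample: set of present gene-families}."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, data = rows[0], rows[1:]
    # Assumption: the header either starts with an empty/label cell
    # (same length as a data row) or lists the sample names only.
    samples = header[1:] if len(header) == len(data[0]) else header
    present = {sample: set() for sample in samples}
    for row in data:
        gene_family, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            if value == "1":
                present[sample].add(gene_family)
    return present


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty profiles are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy matrix (illustrative values, not real results).
example = (
    "\tsample01\tsample04\tsample05\tsample08\n"
    "g00001\t1\t0\t1\t0\n"
    "g00002\t0\t1\t1\t1\n"
    "g00003\t0\t0\t0\t1\n"
    "g00004\t1\t1\t1\t1\n"
)

profiles = read_presence_absence(io.StringIO(example))
distances = {
    (s1, s2): jaccard_distance(profiles[s1], profiles[s2])
    for s1, s2 in combinations(profiles, 2)
}
for (s1, s2), d in distances.items():
    print(f"{s1}\t{s2}\t{d:.2f}")
```

For a real run, `io.StringIO(example)` would be replaced by an open file handle on the matrix written with --o_matrix.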
usage: panphlan_profiling.py [-h] -p PANGENOME [-i I_DNA] [--i_covmat I_COVMAT] [--o_matrix O_MATRIX] [--o_covmat O_COVMAT] [--o_covplot_normed O_COVPLOT_NORMED] [--o_idx DNA_INDEX_FILE]
[--min_coverage MIN_COVERAGE] [--left_max LEFT_MAX] [--right_min RIGHT_MIN] [--th_non_present TH_NON_PRESENT] [--th_present TH_PRESENT] [--th_multicopy TH_MULTICOPY]
[--strain_similarity_perc SIMILARITY_PERCENTAGE] [--i_rna INPUT_RNA_FOLDER] [--sample_pairs DNA_RNA_MAPPING] [--rna_max_zeros RNA_MAX_ZEROES]
[--rna_norm_percentile RNA_NORM_PERCENTILE] [--o_rna RNA_EXPRS_FILE] [--add_ref] [-v] [--func_annot FUNC_ANNOT] [-f FIELD]
optional arguments:
-h, --help show this help message and exit
-i I_DNA, --i_dna I_DNA
Input directory of panphlan_map.py results or text file with
--i_covmat I_COVMAT Path to precomputed coverage matrix
--o_matrix O_MATRIX Path for presence/absence matrix output
--o_covmat O_COVMAT Write raw gene-family coverage matrix in provided file
--o_covplot_normed O_COVPLOT_NORMED
Filename for normalized gene-family coverage plot.
--o_idx DNA_INDEX_FILE
Write gene-family plateau definitions (1, -1, -2, -3)
--min_coverage MIN_COVERAGE
Minimum coverage threshold, default: 2X
--left_max LEFT_MAX Strain presence/absence plateau curve threshold: left max [1.25]
--right_min RIGHT_MIN
Strain presence/absence plateau curve threshold: right min [0.75]
--th_non_present TH_NON_PRESENT
Gene families threshold: not present if lower
--th_present TH_PRESENT
Gene families threshold: present if higher
--th_multicopy TH_MULTICOPY
Gene families threshold: multicopy if higher
--strain_similarity_perc SIMILARITY_PERCENTAGE
Minimum threshold (percentage) for genome length to add a reference genome to presence/absence matrix (default: 50).
--i_rna INPUT_RNA_FOLDER
RNA-seq: input directory of RNA mapping results SAMPLE_RNA.csv.bz2
--sample_pairs DNA_RNA_MAPPING
RNA-seq: list of DNA-RNA sequencing pairs from the same biological sample.
--rna_max_zeros RNA_MAX_ZEROES
Max accepted percent of zero coveraged gene-families (default: <10 %).
--rna_norm_percentile RNA_NORM_PERCENTILE
Percentile for normalizing RNA/DNA ratios
--o_rna RNA_EXPRS_FILE
Write normalized gene-family transcription values (RNA-seq).
--add_ref Add reference genomes to gene-family presence/absence matrix.
-v, --verbose Show progress information
--func_annot FUNC_ANNOT
Path to file mapping UniRef IDs to GO/KEGG/... annotation for functional characterization
-f FIELD, --field FIELD
Field in the annotation file that must be added to the presence/absence matrix
required arguments:
-p PANGENOME, --pangenome PANGENOME
Path to pangenome tsv file
Depending on the sample coverage depth and expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. This includes a strain minimum coverage and settings for accepted shapes of strain abundance curves.
--min_coverage 5 --left_max 1.18 --right_min 0.82 (very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75 (default)
--min_coverage 1 --left_max 1.70 --right_min 0.30 (very sensitive)
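When the same analysis is run at several sensitivity levels, it can help to keep the three settings above in one place. The sketch below builds the command line for a chosen preset. The preset names, the `PRESETS` table and the `profiling_command` helper are hypothetical conveniences; only the flags and values come from this page, and the file paths are the ones used in the example at the top.

```python
import shlex

# Threshold combinations listed above; values are kept as strings
# so they are passed to the command line exactly as written.
PRESETS = {
    "very_stringent": {"min_coverage": "5", "left_max": "1.18", "right_min": "0.82"},
    "default": {"min_coverage": "2", "left_max": "1.25", "right_min": "0.75"},
    "very_sensitive": {"min_coverage": "1", "left_max": "1.70", "right_min": "0.30"},
}


def profiling_command(preset, pangenome, i_dna, o_matrix):
    """Return the panphlan_profiling.py argument list for one sensitivity preset."""
    cmd = ["./panphlan_profiling.py", "-p", pangenome, "-i", i_dna, "--o_matrix", o_matrix]
    for flag, value in PRESETS[preset].items():
        cmd += [f"--{flag}", value]
    return cmd


cmd = profiling_command(
    "very_sensitive",
    "pangenome_file.tsv",
    "map_results/",
    "result_gene_presence_absence.tsv",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` if the script is to be launched from Python.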
For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
All gene-families with a normalized coverage above 0.5 are considered present (1), and those below 0.5 absent (0).
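The thresholding rule can be sketched as follows. This is a simplified illustration, not PanPhlAn's actual implementation: it ignores --th_non_present and --th_multicopy, treats a value exactly equal to the threshold as absent (an assumption, since the text only covers values above and below), and uses made-up normalized coverage values. The function name `binarize` is hypothetical.

```python
def binarize(normalized_coverage, th_present=0.5):
    """Map each gene-family's normalized coverage to 1 (present) or 0 (absent)."""
    return {
        gene_family: 1 if value > th_present else 0
        for gene_family, value in normalized_coverage.items()
    }


# Made-up normalized coverage values for one sample (illustration only).
normalized = {"g00001": 1.02, "g00002": 0.07, "g00003": 0.48, "g00004": 0.95}

calls = binarize(normalized)
print(calls)  # {'g00001': 1, 'g00002': 0, 'g00003': 0, 'g00004': 1}
```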
PanPhlAn is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.