PanPhlAn profiling 3_0

leonarDubois edited this page Jan 20, 2021 · 2 revisions

panphlan_profiling.py merges and processes the panphlan_map.py results to produce the final gene presence/absence profiles of the strains detected in each sample, or to extract the transcriptional activity of individual strains from paired DNA and RNA-seq data of the same sample.

Example:
./panphlan_profiling.py -i map_results/ --o_matrix result_gene_presence_absence.tsv -p pangenome_file.tsv --add_ref

Input

  • --i_dna (or -i) Input directory containing the panphlan_map.py results, or a text file listing the paths of all input files
  • --pangenome Path to the pangenome file

Output

  • --o_matrix Final gene-family presence/absence profiles of all detected strains, written as a binary matrix in a .tsv file. Gene-families are marked 1 when present and 0 when absent.

Example of a gene-family profile table in the result file result_gene_presence_absence.tsv:

        sample01 sample04 sample05 sample08
g00001      1       0        1        0
g00002      0       1        1        1
g00003      0       0        0        1
g00004      1       1        1        1

The presence/absence matrix can be loaded into mathematical or statistical software (R, Python, Matlab) to visualize similarities between strains with heatmaps or PCoA plots, to identify gene-families that are present in some strains but absent from others, and to look for potential associations between diseases and the presence of specific genes.
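
As an illustration of such downstream use, the following is a minimal sketch that reads a presence/absence matrix and computes pairwise Jaccard distances between samples using only the Python standard library. It is not part of PanPhlAn. The file layout (tab-separated, first column holding gene-family IDs, one column per sample) and the example values are assumptions modelled on the table above, and the function names are hypothetical.

```python
import csv
import io
from itertools import combinations

# Small example matrix modelled on the table above (assumed layout:
# tab-separated, first column = gene-family ID, other columns = samples).
EXAMPLE_TSV = (
    "\tsample01\tsample04\tsample05\tsample08\n"
    "g00001\t1\t0\t1\t0\n"
    "g00002\t0\t1\t1\t1\n"
    "g00003\t0\t0\t0\t1\n"
    "g00004\t1\t1\t1\t1\n"
)


def read_presence_absence(handle):
    """Return {sample: set of gene-families marked 1} from a TSV matrix."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # header row: skip the gene-family column
    present = {sample: set() for sample in samples}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene_family, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            if value == "1":
                present[sample].add(gene_family)
    return present


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(present):
    """Jaccard distance for every unordered pair of samples."""
    return {
        (s1, s2): jaccard_distance(present[s1], present[s2])
        for s1, s2 in combinations(sorted(present), 2)
    }


profiles = read_presence_absence(io.StringIO(EXAMPLE_TSV))
distances = pairwise_distances(profiles)
for (s1, s2), d in distances.items():
    print(f"{s1}\t{s2}\t{d:.3f}")
```

To use a real result file, replace the io.StringIO object with open("result_gene_presence_absence.tsv", newline=""). The resulting distances can then be passed to a clustering, heatmap or PCoA routine of your choice.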

Help

usage: panphlan_profiling.py [-h] -p PANGENOME [-i I_DNA] [--i_covmat I_COVMAT] [--o_matrix O_MATRIX] [--o_covmat O_COVMAT] [--o_covplot_normed O_COVPLOT_NORMED] [--o_idx DNA_INDEX_FILE]
                             [--min_coverage MIN_COVERAGE] [--left_max LEFT_MAX] [--right_min RIGHT_MIN] [--th_non_present TH_NON_PRESENT] [--th_present TH_PRESENT] [--th_multicopy TH_MULTICOPY]
                             [--strain_similarity_perc SIMILARITY_PERCENTAGE] [--i_rna INPUT_RNA_FOLDER] [--sample_pairs DNA_RNA_MAPPING] [--rna_max_zeros RNA_MAX_ZEROES]
                             [--rna_norm_percentile RNA_NORM_PERCENTILE] [--o_rna RNA_EXPRS_FILE] [--add_ref] [-v] [--func_annot FUNC_ANNOT] [-f FIELD]

optional arguments:
  -h, --help            show this help message and exit
  -i I_DNA, --i_dna I_DNA
                        Input directory of panphlan_map.py results or text file with
  --i_covmat I_COVMAT   Path to precomputed coverage matrix
  --o_matrix O_MATRIX   Path for presence/absence matrix output
  --o_covmat O_COVMAT   Write raw gene-family coverage matrix in provided file
  --o_covplot_normed O_COVPLOT_NORMED
                        Filename for normalized gene-family coverage plot.
  --o_idx DNA_INDEX_FILE
                        Write gene-family plateau definitions (1, -1, -2, -3)
  --min_coverage MIN_COVERAGE
                        Minimum coverage threshold, default: 2X
  --left_max LEFT_MAX   Strain presence/absence plateau curve threshold: left max [1.25]
  --right_min RIGHT_MIN
                        Strain presence/absence plateau curve threshold: right min [0.75]
  --th_non_present TH_NON_PRESENT
                        Gene families threshold: not present if lower
  --th_present TH_PRESENT
                        Gene families threshold: present if higher
  --th_multicopy TH_MULTICOPY
                        Gene families threshold: multicopy if higher
  --strain_similarity_perc SIMILARITY_PERCENTAGE
                        Minimum threshold (percentage) for genome length to add a reference genome to presence/absence matrix (default: 50).
  --i_rna INPUT_RNA_FOLDER
                        RNA-seq: input directory of RNA mapping results SAMPLE_RNA.csv.bz2
  --sample_pairs DNA_RNA_MAPPING
                        RNA-seq: list of DNA-RNA sequencing pairs from the same biological sample.
  --rna_max_zeros RNA_MAX_ZEROES
                        Max accepted percent of zero coveraged gene-families (default: <10 %).
  --rna_norm_percentile RNA_NORM_PERCENTILE
                        Percentile for normalizing RNA/DNA ratios
  --o_rna RNA_EXPRS_FILE
                        Write normalized gene-family transcription values (RNA-seq).
  --add_ref             Add reference genomes to gene-family presence/absence matrix.
  -v, --verbose         Show progress information
  --func_annot FUNC_ANNOT
                        Path to file mapping UniRef IDs to GO/KEGG/... annotation for functional characterization
  -f FIELD, --field FIELD
                        Field in the annotation file that must be added to the presence/absence matrix

required arguments:
  -p PANGENOME, --pangenome PANGENOME
                        Path to pangenome tsv file


PanPhlAn profile thresholds

Strain detection thresholds

Depending on the sample coverage depth and the expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. They comprise a minimum strain coverage (--min_coverage) and two settings (--left_max, --right_min) that bound the accepted shape of the strain coverage curve.
--min_coverage 5 --left_max 1.18 --right_min 0.82 (very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75 (default)
--min_coverage 1 --left_max 1.70 --right_min 0.30 (very sensitive)
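
When several samples sets are profiled at different sensitivities, it can help to keep the three settings above in one place. The following is a minimal sketch of a helper that assembles the command line for a chosen preset. The preset names and the function build_profiling_command are hypothetical; only the flags and values are taken from this page. The sketch prints the command and does not run it.

```python
import shlex

# Threshold presets quoted from the three settings listed above.
# The dictionary keys are hypothetical labels, not PanPhlAn options.
PRESETS = {
    "very_stringent": {"min_coverage": 5, "left_max": 1.18, "right_min": 0.82},
    "default": {"min_coverage": 2, "left_max": 1.25, "right_min": 0.75},
    "very_sensitive": {"min_coverage": 1, "left_max": 1.70, "right_min": 0.30},
}


def build_profiling_command(map_dir, pangenome, out_matrix, preset="default"):
    """Return the argument list for panphlan_profiling.py with a given preset."""
    thresholds = PRESETS[preset]  # raises KeyError for an unknown preset
    return [
        "panphlan_profiling.py",
        "-i", map_dir,
        "-p", pangenome,
        "--o_matrix", out_matrix,
        "--min_coverage", str(thresholds["min_coverage"]),
        "--left_max", str(thresholds["left_max"]),
        "--right_min", str(thresholds["right_min"]),
    ]


command = build_profiling_command(
    "map_results/",
    "pangenome_file.tsv",
    "result_gene_presence_absence.tsv",
    preset="very_sensitive",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

The returned list can be handed to subprocess.run(command, check=True) once the printed command looks right.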

Gene-family detection thresholds

For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
Gene-families with a normalized coverage above 0.5 are considered present (1); those below are considered absent (0).
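
The rule above can be sketched in a few lines of Python. This is an illustration of the thresholding idea only, not PanPhlAn's implementation: the function name and the coverage values are hypothetical, a value exactly equal to the threshold is treated as absent by assumption, and the additional --th_non_present and --th_multicopy thresholds are not modelled.

```python
def call_presence(normalized_coverage, th_present=0.5):
    """Map each gene-family to 1 if its normalized coverage exceeds
    th_present, else 0 (a value equal to the threshold counts as absent,
    which is an assumption of this sketch)."""
    return {
        gene_family: 1 if coverage > th_present else 0
        for gene_family, coverage in normalized_coverage.items()
    }


# Hypothetical normalized coverage values for one sample.
example_coverage = {"g00001": 0.95, "g00002": 0.10, "g00003": 0.50, "g00004": 1.20}
calls = call_presence(example_coverage)
print(calls)
```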