-
From
~/hb_database/hb_database_env.txt
create the working environment.conda env create -file hb_database_env.txt
-
Init conda environment:
conda activate hb_database
Download SwissProt uniprot_sprot.fa and create a local database.
makeblastdb \
-in ~/db_uniprot/uniprot_sprot.fasta \
-out ~/db_uniprot/ \
-parse_seqids \
-dbtype prot
Download Cdd database.
makeprofiledb \
-title Cdd \
-in Cdd.pn \
-out db_cdd \
-threshold 9.82 \
-scale 100.0 \
-dbtype rps \
-index true
Download Pfam-A.hmm & Pfam-A.hmm.dat and create a local Pfam database.
hmmpress -f data/seqs/db_pfam/Pfam-A.hmm
Check here for set up steps.
Get from NCBI the .fasta
file for the curated alpha subunits.
Only one alpha subunit per enzyme of interest.
python 01_fetching_aa_seqs.py \
--input data/seqs/01_list_alpha_subunits.csv \
--email <insert your email>
Input table example:
enzyme | alpha_subunits | accession |
---|---|---|
Xylene_monooxygenase | xylM | NP_542887 |
Toluene_dioxygenase | todC1 | BAC05504 |
Toluene_2-monooxygenase | tomA1 | AAK07411.1 |
Do not use symbols such as (",", "/", etc.) in your input table. It may cause various parsing failures.
02_alpha_subunits_aa_sequences
splits your multiple sequence 02_alpha_subunits_aa_sequences.fa
file, and BLASTP each singular
fasta
file.
python 02_do_blast.py \
--input analyses/02_alpha_subunits_aa_sequences.fa \
--database data/seqs/db_uniprot/swiss_prot.fa
Check in the output xml files last/trailing entries; sometimes you can get eukaryotic genes
Parse .xml
files and get UniProt reviewed and unreviewed (outgroup) sequences.
python 03_get_blast_hits.py \
--xml_in analyses/blast_outputs/separated_blast/
Nota Bene:
separated_hits/interest
(SwissProt reviewed sequences)separated_hits/outgroup
(outgroup sequences)separated_hits/merged
(reviewed + outgroup)
Get .fasta
file for each hit.
python 04_get_uniprot_fa.py \
--input analyses/blast_outputs/separated_hits/
05_do_rps_blast.py
uses .fa
files in blast_outputs/separated_hits
and outputs .cdd
, .ko
and .pfam
tables.
python 05_do_rps_blast.py \
--input_fa analyses/blast_outputs/separated_hits/ \
--db_cdd data/seqs/db_cdd/db_cdd \
--evalue 0.01 \
--db_pfam data/seqs/db_pfam/ \
--db_ko data/seqs/db_ko/profiles/ \
--ko_list data/seqs/db_ko/ko_list \
--ko_config data/seqs/db_ko/config.yml
To enhance the performance of Ko annotation check
.yml
file.
Merge .cdd
, .pfam
, .ko
tables in blast_outputs/separated_hits
.
python 06_create_annotated_table.py /
--fasta_input analyses/blast_outputs/separated_hits/
Output example (ABE37059.1_Paraburkholderia_xenovorans_LB400.tsv):
UniProt | Cdd | Ko | Pfam |
---|---|---|---|
Q52438 | CDD:239550 | K08689 | PF00355.27 |
Q53122 | CDD:239550 | K08689 | PF00355.27 |
Perform UniProt direct query.
Keyword implies exact gene name. Filtering implies deleting uncultured and fragments sequences.
python 07_do_sw_query.py \
--input_file data/seqs/01_list_alpha_subunits.csv
Get annotation for initial curated sequences (NBCI) analyses/blast_outputs/separated_fasta/
and SwissProt query sequences analyses/sp_query/
python 05_do_rps_blast.py \
-in analyses/blast_outputs/separated_fasta/ \
-cdd data/seqs/db_cdd/db_cdd \
-e 0.01 \
-pfam data/seqs/db_pfam/ \
-ko data/seqs/db_ko/profiles/ \
-kl data/seqs/db_ko/ko_list \
-kc data/seqs/db_ko/config.yml
python 06_create_annotated_table.py /
--fasta_input analyses/blast_outputs/separated_fasta/
---------------------------------------------
python 05_do_rps_blast.py \
-in analyses/sp_query/ \
-cdd data/seqs/db_cdd/db_cdd \
-e 0.01 \
-pfam data/seqs/db_pfam/ \
-ko data/seqs/db_ko/profiles/ \
-kl data/seqs/db_ko/ko_list \
-kc data/seqs/db_ko/config.yml
python 06_create_annotated_table.py /
-in analyses/sp_query/
python 08_do_format_seqs.py \
--input_fa analyses/sp_query/
python 08_do_format_seqs.py \
--input_fa analyses/blast_outputs/
I would recommend to put
.fa
ofanalyses/sp_query/
in a separate directory
Keys: accession (NCBI accession #); sp_query (List of fasta files gathered after direct UniProt keyword query); filtered# (Number of SwissProt query sequences after filtration of fragments, etc.); unfiltered# (Number of SwissProt raw keyword query sequences); ncbi_ref (Manually curated NCBI files); sp_ref (Files resulted from RPS Blast); sp_ref# (Number of sequences in each RPS Blast file).
python 09_prep_phylo.py \
--initial_table data/seqs/01_list_alpha_subunits.csv \
--ncbi_fa analyses/blast_outputs/separated_fasta/ \
--sp_blast analyses/blast_outputs/separated_hits/merged/ \
--sp_query analyses/sp_query/
Output:
analyses/annotated_table.tsv
- Curated NCBI sequences can be found at
analyses/blast_outputs/separated_fasta/
- SwissProt review sequences can be found at
analyses/blast_outputs/separated_hits/relevant/
- SwissProt outgroup sequences can be found at
analyses/blast_outputs/separated_hits/outgroup/
- SwissProt outgroup + reviewed sequences can be found at
analyses/blast_outputs/separated_hits/merged/
- SwissProt keyword query sequences (filtered) can be found at
analyses/sp_query/*.filtered
- Table with fasta files to be used further in the phylogenetic analyses
analyses/annotated_table.tsv
Create analyses/phylo
directory and put there merged .fasta
files from analyses/blast_outputs/separated_fasta/
,
analyses/blast_outputs/separated_hits
and analyses/sp_query/
.
python 10_phylo_env.py \
--input_table analyses/annotated_table/annotated_table.tsv
Perform multiple alignment with MUSCLE. Use Trimal for alignment trimming (gap threshold of 0.05). Use IQ-TREE for phylogenetic tree rendering. Use UFboot option for bootstrapping.
In order to get accurate phylogenetic tree. Instead of GAMMA method use LG+R7 (default method in
11_generate_best_tree
)
python 11_generate_best_tree.py \
--input analyses/phylo
Left join references titles for each alpha subunit in analyses/annotated_table.tsv
python 12_get_references.py \
--input_table analyses/annotated_table.tsv \
--email <insert your email>
In order to annotate phylogenetic trees in analyses/phylo
use R & ggtree (check the R script in /phylo/*.ipynb
).
Use for jupyter notebook for
.ipynb
Check /flow_charts
for flow chart creation.
Check /analyses/phylo/*.Rmd
for rendering final reports.
Final files can be accessed in analyses/phylo/*
.
Final .html
reports can be accessed in reports/*
.