Building a concatenated gene tree
ITEP contains scripts to help you make a concatenated gene tree (from which you can get an organism phylogeny, an overall phylogeny for a specific operon, etc.). To make such a tree, all of the alignments that you wish to concatenate must:
- Represent the same group of organisms, and
- Have exactly one gene per organism
The alignments should be in FASTA format and should have ITEP gene IDs so that ITEP can look up the organism to which each gene belongs.
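The two requirements above can be checked before concatenating. The following is only a rough sketch, not part of ITEP: the function names `organisms_in_fasta` and `check_alignment_set` are hypothetical, and it assumes that each FASTA header starts with an ITEP gene ID of the form `fig|83333.1.peg.1234`, where `83333.1` is the organism ID.

```shell
#!/bin/sh
# Print the organism ID for every FASTA header in a file, one per line.
# ASSUMPTION: headers start with an ITEP gene ID such as fig|83333.1.peg.1234,
# where 83333.1 identifies the organism.
organisms_in_fasta() {
    awk '/^>/ {
        id = substr($1, 2)               # drop the leading ">"
        sub(/^fig[|]/, "", id)           # drop the "fig|" prefix
        sub(/[.]peg[.][0-9]+$/, "", id)  # drop the ".peg.N" gene suffix
        print id
    }' "$1"
}

# Check every file in a directory against both requirements.
# Prints one line per problem and returns non-zero if any problem was found.
check_alignment_set() {
    dir=$1
    status=0
    reference=""
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        orgs=$(organisms_in_fasta "$f" | sort)
        # Requirement: exactly one gene per organism.
        dups=$(printf '%s\n' "$orgs" | uniq -d)
        if [ -n "$dups" ]; then
            echo "$f: more than one gene for organism(s): $(echo $dups)"
            status=1
        fi
        # Requirement: every file represents the same group of organisms.
        uniq_orgs=$(printf '%s\n' "$orgs" | uniq)
        if [ -z "$reference" ]; then
            reference=$uniq_orgs
        elif [ "$uniq_orgs" != "$reference" ]; then
            echo "$f: organism set differs from the first file"
            status=1
        fi
    done
    return $status
}
```

Calling `check_alignment_set` on the directory that holds the alignments prints nothing when both requirements hold for every file.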
A set of alignments that meets all of these requirements can be made using the following procedure. First, identify a group of organisms that you want to use. Then make a file with the organism names (one on each line) and run the following commands:
# This part gets cluster and run IDs for clusters that have exactly one copy in each input organism.
# However, they do not necessarily have
$ cat [organism_list_file] | db_findClustersByOrganismList.py -a -u all_I_2.0_c_0.4_m_maxbit > [conserved_uniq_clusters_filename]
# This part gets the gene info and filters the results to only contain the organisms in your organism list file.
$ cat [conserved_uniq_clusters_filename] | db_getClusterGeneInformation.py | grep -F -f [organism_list_file] > [geneinfo_filename]
# Finally, this part makes an unaligned FASTA file for each cluster in the above geneinfo file.
$ cat [geneinfo_filename] | getClusterFastas.py [foldername]
These FASTA files can be aligned with the tool of your choice, e.g., with MAFFT:
$ cd [foldername]
$ mkdir [newdir]
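$ for file in *; do [ -f "$file" ] && mafft --auto "$file" > [newdir]/"$file"; done
where [newdir] is a folder you create to store the alignments. The [ -f "$file" ] test skips [newdir] itself, which the * pattern would otherwise match.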
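Before concatenating, it can be worth confirming that every sequence within each aligned file has the same length, since an alignment whose rows differ in length usually means the aligner failed or the file was truncated. The sketch below is an illustration only; `alignment_lengths` and `report_ragged_alignments` are hypothetical helper names, not ITEP or MAFFT commands.

```shell
#!/bin/sh
# Print each distinct sequence length found in a FASTA file, one per line.
# A correctly aligned file yields exactly one line.
alignment_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # sequences may wrap over lines
        END { if (seen) print len }
    ' "$1" | sort -n -u
}

# Report every file in a directory whose sequences are not all the same length.
report_ragged_alignments() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        n=$(alignment_lengths "$f" | wc -l | tr -d ' ')
        if [ "$n" -gt 1 ]; then
            echo "$f: sequences have different lengths"
        fi
    done
}
```

Running `report_ragged_alignments` on the alignment folder prints nothing when every file is consistent.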
[TODO - I need to write some nicer functions for filtering that list of clusters by annotation...]
Once you have a directory containing ONLY the alignments you want to concatenate, they can be concatenated by running
$ catAlignments.py [alignment_directory] > [concatenated_alignment]
The script automatically identifies which proteins belong to the same organism (they must have ITEP IDs, which they will if you make the FASTA files with ITEP tools) and adds them to the alignment sequentially, so that the same protein is in the same position for each organism.
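To make the idea concrete, here is a minimal sketch of this kind of concatenation. It is not the code of catAlignments.py and its output format may differ from that script's. The function name `concat_by_organism` is hypothetical, and the sketch assumes ITEP gene IDs of the form `fig|83333.1.peg.1234` and exactly one sequence per organism in every file.

```shell
#!/bin/sh
# Concatenate aligned sequences across all FASTA files in a directory,
# grouping them by organism.
# ASSUMPTIONS: headers start with an ITEP gene ID such as fig|83333.1.peg.1234,
# and every file contains exactly one sequence for each organism.
concat_by_organism() {
    awk '
        /^>/ {
            org = substr($1, 2)               # drop the leading ">"
            sub(/^fig[|]/, "", org)           # drop the "fig|" prefix
            sub(/[.]peg[.][0-9]+$/, "", org)  # drop the ".peg.N" gene suffix
            if (!(org in seq)) { order[++n] = org; seq[org] = "" }
            next
        }
        { gsub(/[ \t\r]/, ""); seq[org] = seq[org] $0 }  # append to this organism
        END {
            # Emit organisms in the order they were first seen.
            for (i = 1; i <= n; i++) print ">" order[i] "\n" seq[order[i]]
        }
    ' "$1"/*
}
```

Because the shell expands `"$1"/*` in the same sorted order for every run, each file contributes its block at the same position for every organism, regardless of the order of sequences inside the individual files.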