This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Building a concatenated gene tree

mattb112885 edited this page Apr 19, 2013 · 8 revisions

Requirements for making a concatenated gene tree

ITEP contains scripts to help you make a concatenated gene tree (from which you can get an organism phylogeny, an overall phylogeny for a specific operon, etc.). However, to make such a tree, all of the alignments that you wish to concatenate must:

  1. Represent the same group of organisms, and
  2. Have exactly one gene per organism

The alignments should be in FASTA format and should use ITEP gene IDs so that ITEP can look up the organism to which each gene belongs.
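The two requirements above can be checked before concatenating. The following is a minimal sketch, not part of ITEP: the function names are hypothetical, and it assumes gene IDs of the form `fig|83333.1.peg.1234`, where `83333.1` is the organism ID. Adjust the pattern if your IDs look different.

```python
import re
from collections import Counter

# Assumption: ITEP gene IDs look like "fig|83333.1.peg.1234",
# and the part between "fig|" and ".peg." identifies the organism.
GENE_ID_RE = re.compile(r"fig\|(\d+\.\d+)\.peg\.\d+")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def organism_of(header):
    """Extract the organism ID from a FASTA header (hypothetical helper)."""
    match = GENE_ID_RE.search(header)
    if match is None:
        raise ValueError("no gene ID of the assumed form in header: " + header)
    return match.group(1)


def check_alignments(alignments):
    """Check a dict {file name: FASTA text} against both requirements.

    Returns a list of human-readable problems (empty if all is well).
    """
    problems = []
    reference = None
    for name, text in sorted(alignments.items()):
        counts = Counter(organism_of(h) for h, _ in read_fasta(text))
        duplicated = sorted(org for org, n in counts.items() if n > 1)
        if duplicated:
            problems.append(name + ": more than one gene for " + ", ".join(duplicated))
        organisms = set(counts)
        if reference is None:
            reference = organisms  # the first file defines the expected organism set
        elif organisms != reference:
            problems.append(name + ": organism set differs from the first alignment")
    return problems
```

To use it, read each alignment file into a string, pass the resulting dictionary to `check_alignments`, and inspect the returned list before running the concatenation step.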

A set of alignments that meets all of these requirements can be made using the following procedure. First, identify the group of organisms that you want to use. Then make a file with the organism names (one per line) and run:

# This part gets cluster and run IDs for clusters that have exactly one copy in each input organism.
# However, they do not necessarily have 
$ cat [organism_list_file] | db_findClustersByOrganismList.py -a -u all_I_2.0_c_0.4_m_maxbit > [conserved_uniq_clusters_filename]
# This part gets the gene info and filters the results to only contain the organisms in your file.
$ cat [conserved_uniq_clusters_filename] | db_getClusterGeneInformation.py | grep -F -f [organism_list_file] > [geneinfo_filename]
# Finally, this part makes un-aligned FASTA files for each cluster in the above geneinfo file
$ cat [geneinfo_filename] | getClusterFastas.py [foldername]
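The filtering step in the pipeline above relies on `grep -F -f`, which keeps every line containing any of the listed names as a fixed (non-regex) substring. The sketch below mirrors that behavior in Python to make the matching rule explicit. The function name is hypothetical, and unlike grep it ignores blank lines in the name list.

```python
def filter_lines_by_names(lines, names):
    """Keep lines that contain at least one name as a plain substring.

    Mirrors `grep -F -f names_file`, except that blank names are skipped
    (grep would treat an empty pattern as matching every line).
    """
    wanted = [name.strip() for name in names if name.strip()]
    return [line for line in lines if any(name in line for name in wanted)]
```

Because matching is by substring, a name that is a prefix of another organism's name (for example a species name that also appears inside a longer strain name) will match both, so it is worth checking the filtered file for unexpected organisms.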

These FASTA files can then be aligned with the tool of your choice, e.g., with MAFFT:

$ cd [foldername]
$ mkdir [newdir]
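$ for file in *; do [ -f "$file" ] && mafft --auto "$file" > [newdir]/"$file"; done

where [newdir] is a folder you create to store all of the alignments. The `[ -f "$file" ]` test skips anything that is not a regular file, so the loop does not try to align [newdir] itself when it was created inside [foldername].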

Figuring out which alignments you want to concatenate

[TODO - I need to write some nicer functions for filtering that list of clusters by annotation...]

Concatenating alignments

Once you have a directory containing ONLY the alignments you want to concatenate, you can concatenate them by running:

$ catAlignments.py [alignment_directory] > [concatenated_alignment]

The script automatically identifies which proteins belong to the same organism (they must have ITEP IDs, which they will if you made the FASTA files with ITEP tools) and adds them to the alignment one after another, so that the same protein occupies the same position for every organism.
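The idea behind this step can be sketched as follows. This is an illustration of the general technique only, not the actual implementation of catAlignments.py. The function name is hypothetical, and each alignment is assumed to have already been parsed into a dictionary that maps an organism ID to its aligned sequence.

```python
def concatenate_alignments(alignments):
    """Concatenate per-gene alignments organism by organism.

    alignments: list of dicts {organism ID: aligned sequence}, one per gene,
    in the order the genes should appear in the result.
    Returns a dict {organism ID: concatenated sequence}.
    """
    if not alignments:
        return {}
    organisms = sorted(alignments[0])
    parts = {org: [] for org in organisms}
    for index, alignment in enumerate(alignments):
        # Every alignment must cover exactly the same organisms.
        if set(alignment) != set(organisms):
            raise ValueError("alignment %d has a different organism set" % index)
        # All sequences within one alignment must have the same length.
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("alignment %d has rows of unequal length" % index)
        for org in organisms:
            parts[org].append(alignment[org])
    return {org: "".join(chunks) for org, chunks in parts.items()}
```

Because the genes are appended in the same order for every organism, each column of the concatenated alignment still compares residues from the same gene.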
