Searching for gene families by presence and absence patterns
Before running the scripts in this tutorial, you should first complete steps 1 and 2 of the directions for building the ITEP database.
The script db_findClustersByOrganismList.py finds clusters matching different combinations of four bulk properties (ALL, ANY, ONLY, and NONE) with respect to a given list of organisms (the "ingroup") and the remaining organisms in a specific cluster run (the "outgroup"). The table below lists the possible combinations and what they represent. The Ingroup and Outgroup columns give the number of organisms in each group that must have a gene in the cluster, where N is the number of organisms in the ingroup.
Property    | Ingroup | Outgroup | Meaning
+-----------+---------+----------+----------
ALL         | == N    | >= 0     | Conserved genes in the specified list
+-----------+---------+----------+----------
ANY         | >= 1    | >= 0     | Genes present in the specified list
+-----------+---------+----------+----------
ONLY        | >= 1    | == 0     | Genes unique to the specified list
+-----------+---------+----------+----------
NONE        | == 0    | >= 1*    | Genes absent from the specified list
+-----------+---------+----------+----------
ALL + ONLY  | == N    | == 0     | Genes that are conserved and found only in the specified list
+-----------+---------+----------+----------
ANY + ONLY  | >= 1    | == 0     | Genes that are found only in the specified list
+-----------+---------+----------+----------
ALL + NONE  |
ANY + NONE  | Contradictions (raise errors).
ONLY + NONE |
+-----------+---------+----------+----------
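The rules in the table can be sketched in a few lines of Python. This is only an illustration of the logic, not the code used by db_findClustersByOrganismList.py; the function name cluster_matches and its keyword arguments are hypothetical, and a cluster is assumed to be represented simply as the set of organisms that have a gene in it.

```python
def cluster_matches(present, ingroup, all_organisms,
                    all_=False, any_=False, only=False, none=False):
    """Return True if a cluster satisfies the requested bulk properties.

    present       -- set of organisms with at least one gene in the cluster
    ingroup       -- organisms named in the user's list
    all_organisms -- every organism in the cluster run
    (All names here are hypothetical; this is not ITEP's implementation.)
    """
    if none and (all_ or any_ or only):
        # ALL + NONE, ANY + NONE and ONLY + NONE are contradictions.
        raise ValueError("NONE cannot be combined with ALL, ANY, or ONLY")

    ingroup = set(ingroup)
    outgroup = set(all_organisms) - ingroup
    n_in = len(set(present) & ingroup)    # ingroup organisms in the cluster
    n_out = len(set(present) & outgroup)  # outgroup organisms in the cluster

    ok = True
    if all_:
        ok = ok and n_in == len(ingroup)       # Ingroup == N
    if any_:
        ok = ok and n_in >= 1                  # Ingroup >= 1
    if only:
        ok = ok and n_in >= 1 and n_out == 0   # Outgroup == 0
    if none:
        ok = ok and n_in == 0 and n_out >= 1   # Ingroup == 0
    return ok
```

For example, with organisms A, B, and C and an ingroup of A and B, a cluster present in exactly A and B satisfies ALL + ONLY, while a cluster present in all three satisfies ALL but not ONLY.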
You can also specify a UNIQUE property which enforces that the members in the ingroup must be unique.
Create a file containing the following two lines:
Clostridium beijerinckii NCIMB 8052
Clostridium novyi NT
Save it as "Clostridia_names.txt". Then run the following:
$ cat "Clostridia_names.txt" | db_findClustersByOrganismList.py -a -s all_I_2.0_c_0.4_m_maxbit
You get a list of all of the gene clusters that are found in both of the Clostridia species in our test database (-a) but not in Acetobacterium woodii (-s). Piping the output to a line count (wc -l) shows that 566 families have this property, including the following run ID and cluster ID pair:
all_I_2.0_c_0.4_m_maxbit 996
We can find out which genes are in this cluster with the db_getGenesInCluster.py function. To get their annotated functions and sequences directly, use the db_getClusterGeneInformation.py function instead:
$ makeTabDelimitedRow.py "all_I_2.0_c_0.4_m_maxbit" "996" | db_getClusterGeneInformation.py
fig|290402.1.peg.1383 Clostridium beijerinckii NCIMB 8052 290402.1 DEFAULT_1 290402.1.NC_009617.1 1640496 1642154 + 1 radical SAM domain-containing protein_YP_001308529.1_Cbei_1394 ATGAAGGTATTA... MKVLITA... all_I_2.0_c_0.4_m_maxbit 996
fig|386415.1.peg.747 Clostridium novyi NT 386415.1 DEFAULT_2 386415.1.NC_008593.1 845451 847223 + 1 magnesium-protoporphyrin IX monomethyl ester oxidative cyclase_YP_877721.1_NT01CX_1640 ATGAAAAA... MKKLKTLL... all_I_2.0_c_0.4_m_maxbit 996
The DNA and amino acid sequences have been truncated here for clarity. Note that, as expected, both C. beijerinckii and C. novyi are predicted to have this gene, but Acetobacterium is not. We could cluster at lower cutoffs (0.4 is rather stringent) and run similar queries to see if this prediction holds up.
This section requires you to have a tree with sanitized organism names on the leaves. A sanitized ID has everything that is not a letter or number in the organism's name (including spaces) replaced with an underscore. See the "Building alignments and trees" tutorial for more details on how to generate such a file.
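The sanitizing rule described above can be sketched in Python as follows. The function name sanitize_name is hypothetical, and "letter or number" is assumed to mean ASCII letters and digits; the helper that ITEP itself uses may differ in detail.

```python
import re

def sanitize_name(name):
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", name)

print(sanitize_name("Clostridium beijerinckii NCIMB 8052"))
# prints Clostridium_beijerinckii_NCIMB_8052
```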
We have provided a function, "makeCoreClusterAnalysisTree.py", that performs presence-absence analysis for every clade in a tree using the same methods as shown above. It counts the gene families with the specified properties at each clade and displays the counts on the tree. Optionally, it also produces an Excel file containing the run ID and cluster ID pairs and a representative annotation (the annotation shared by the highest number of genes) for each cluster identified at each node in the tree. The sheet names correspond to the node labels on the tree, which allows quick identification of interesting gain and loss patterns at different stages of evolution.
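To make the per-clade idea concrete, here is a minimal sketch that counts ALL + ONLY clusters (conserved in, and unique to, the clade) for every clade of a toy tree. It is not how makeCoreClusterAnalysisTree.py is implemented. The tree is assumed to be a nested tuple with sanitized organism names as leaves, the presence table is assumed to map a cluster ID to the set of organisms it contains, and all function names are hypothetical.

```python
def leaves(node):
    """Return the set of leaf names under a node (leaf = str, clade = tuple)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def clades(node):
    """Yield every internal node (clade) of the tree, root first."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from clades(child)

def count_conserved_unique(tree, presence):
    """Map each clade (as a frozenset of its leaves) to its ALL + ONLY count.

    A cluster is ALL + ONLY for a clade when the set of organisms it is
    present in equals the clade's leaf set exactly.
    """
    counts = {}
    for clade in clades(tree):
        ingroup = leaves(clade)
        counts[frozenset(ingroup)] = sum(
            1 for organisms in presence.values() if set(organisms) == ingroup
        )
    return counts

# Toy data: organisms A, B, C and four made-up cluster IDs.
tree = (("A", "B"), "C")
presence = {
    "c1": {"A", "B"},
    "c2": {"A", "B", "C"},
    "c3": {"A"},
    "c4": {"A", "B"},
}
counts = count_conserved_unique(tree, presence)
```

In this toy example the clade containing A and B has two clusters that are conserved and unique to it (c1 and c4), and the root clade has one (c2). Other property combinations from the table could be counted the same way by swapping the equality test for the corresponding condition.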