This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for gene families by presence and absence patterns

mattb112885 edited this page May 9, 2013 · 9 revisions

Fetching pre-computed presence and absence patterns

The database stores a pre-computed table recording, for each cluster in the database, which genes are present and absent in each organism. To access this table (or subsets of it), use the db_getPresenceAbsenceTable.py function. Calling the function without any arguments prints the entire table (every cluster in every cluster run). Only the header and the first data row are shown in the example below:

$ db_getPresenceAbsenceTable.py
runid   clusterid       annote  Clostridium_beijerinckii_NCIMB_8052     Clostridium_novyi_NT    Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit        1424    glutamate racemase_YP_879120.1_NT01CX_0654_murI fig|290402.1.peg.32     fig|386415.1.peg.2142   NONE

As you can see, by default the function prints the ITEP IDs of the genes found in each organism for each cluster. To get Boolean values instead (0 for absent and 1 for present), call the function with the -b flag:

$ db_getPresenceAbsenceTable.py -b
runid   clusterid       annote  Clostridium_beijerinckii_NCIMB_8052     Clostridium_novyi_NT    Acetobacterium_woodii_DSM_1030
all_I_2.0_c_0.4_m_maxbit        1424    glutamate racemase_YP_879120.1_NT01CX_0654_murI 1     1   0
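The relationship between the two output formats can be sketched in a few lines of Python. This is only an illustration, not the code used by db_getPresenceAbsenceTable.py: the helper name to_boolean_row is hypothetical, and it assumes that the first three tab-delimited columns are runid, clusterid and annote, and that an absent gene is always written as NONE, as in the example above.

```python
def to_boolean_row(line, n_meta=3, absent="NONE"):
    """Convert one tab-delimited table row from ITEP IDs to 0/1 values.

    Assumption: the first n_meta columns are metadata (runid, clusterid,
    annote) and every remaining column holds either gene IDs or `absent`.
    """
    fields = line.rstrip("\n").split("\t")
    meta, calls = fields[:n_meta], fields[n_meta:]
    return meta + ["0" if call == absent else "1" for call in calls]


# The data row from the example above, rebuilt with explicit tabs.
row = "\t".join([
    "all_I_2.0_c_0.4_m_maxbit",
    "1424",
    "glutamate racemase_YP_879120.1_NT01CX_0654_murI",
    "fig|290402.1.peg.32",
    "fig|386415.1.peg.2142",
    "NONE",
])

print("\t".join(to_boolean_row(row)))
```

The last three columns of the printed row correspond to the three organism columns of the table.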

The function also lets you print out the number of gene representatives in each cluster/run pair for each organism, or pull out subsets by run ID or by cluster ID (see the help text for details). Finally, if you have an organism tree with sanitized organism names as leaf names (see Building a concatinated gene tree for how to build such a tree), you can use the -t flag to sort the columns according to the tree's branching order, so that, for example, organisms of the same species are grouped together.

Types of presence and absence patterns that you can search with ITEP

The script db_findClustersByOrganismList.py supports finding clusters with different combinations of four bulk properties, ALL, ANY, ONLY, and NONE, with respect to a given list of organisms (the "ingroup") and the rest of the organisms in a specific cluster run (the "outgroup"). The table below specifies the possible combinations and what they represent. The Ingroup and Outgroup columns give the number of organisms in each group that must have a gene in the cluster, where N is the number of organisms in the ingroup.

  Property    | Ingroup | Outgroup | Meaning
+-------------+---------+----------+----------
  ALL         |  == N   |   >= 0   | Conserved genes in the specified list
+-------------+---------+----------+----------
  ANY         |  >= 1   |   >= 0   | Genes present in the specified list
+-------------+---------+----------+----------
  ONLY        |  >= 1   |   == 0   | Genes unique to the specified list
+-------------+---------+----------+----------
  NONE        |  == 0   |   >= 1*  | Genes absent from the specified list
+-------------+---------+----------+----------
  ALL + ONLY  |  == N   |   == 0   | Genes that are conserved and found only in the specified list
+-------------+---------+----------+----------
  ANY + ONLY  |  >= 1   |   == 0   | Genes that are found only in the specified list
+-------------+---------+----------+----------
  ALL + NONE  |
  ANY + NONE  | Contradictions (raise errors).
  ONLY + NONE |
+-------------+---------+----------+----------
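The rows of the table can be read as simple conditions on how many ingroup and outgroup organisms have a gene in a cluster. The sketch below restates them in Python for a single cluster. It is an illustration of the table only, not the implementation in db_findClustersByOrganismList.py; the function name matches and its arguments are hypothetical.

```python
def matches(counts, ingroup, all_=False, any_=False, only=False, none=False):
    """Test one cluster against the ALL / ANY / ONLY / NONE properties.

    counts  : dict mapping organism name -> number of genes in the cluster
              (every organism in the cluster run should be listed)
    ingroup : set of organism names; all other organisms form the outgroup
    """
    if none and (all_ or any_ or only):
        # ALL + NONE, ANY + NONE and ONLY + NONE are contradictions.
        raise ValueError("NONE cannot be combined with ALL, ANY or ONLY")

    n_in = sum(1 for org in ingroup if counts.get(org, 0) > 0)
    n_out = sum(1 for org, n in counts.items() if org not in ingroup and n > 0)

    ok = True
    if all_:
        ok = ok and n_in == len(ingroup)
    if any_:
        ok = ok and n_in >= 1
    if only:
        ok = ok and n_in >= 1 and n_out == 0
    if none:
        ok = ok and n_in == 0 and n_out >= 1
    return ok


# Cluster 1424 from the presence/absence table above, written as counts.
cluster_1424 = {
    "Clostridium_beijerinckii_NCIMB_8052": 1,
    "Clostridium_novyi_NT": 1,
    "Acetobacterium_woodii_DSM_1030": 0,
}
clostridia = {"Clostridium_beijerinckii_NCIMB_8052", "Clostridium_novyi_NT"}

print(matches(cluster_1424, clostridia, all_=True, only=True))
print(matches(cluster_1424, clostridia, none=True))
```

Combining properties simply requires every selected condition to hold, which is why ANY + ONLY selects the same clusters as ONLY alone.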

You can also specify a UNIQUE property which enforces that the members in the ingroup must be unique.

Examples with a specific list of organisms

Create a file containing the following two lines:

Clostridium beijerinckii NCIMB 8052
Clostridium novyi NT

Save it as "Clostridia_names.txt". Then run the following:

$ cat "Clostridia_names.txt" | db_findClustersByOrganismList.py -a -s all_I_2.0_c_0.4_m_maxbit

You get a list of all of the gene clusters that are found in both of the Clostridium species in our test database (-a) but not in Acetobacterium woodii (-s). A line count (wc -l) shows that 566 families have this property, including the following run ID/cluster ID pair:

all_I_2.0_c_0.4_m_maxbit        996

We can find out which genes are in this cluster using the db_getGenesInCluster.py function, or get all of their functions and sequences directly using the db_getClusterGeneInformation.py function:

$ makeTabDelimitedRow.py "all_I_2.0_c_0.4_m_maxbit" "996" | db_getClusterGeneInformation.py
fig|290402.1.peg.1383   Clostridium beijerinckii NCIMB 8052     290402.1        DEFAULT_1 290402.1.NC_009617.1    1640496 1642154 +       1       radical SAM domain-containing protein_YP_001308529.1_Cbei_1394 ATGAAGGTATTA...     MKVLITA...      all_I_2.0_c_0.4_m_maxbit        996
fig|386415.1.peg.747    Clostridium novyi NT    386415.1        DEFAULT_2       386415.1.NC_008593.1    845451  847223  +       1       magnesium-protoporphyrin IX monomethyl ester oxidative cyclase_YP_877721.1_NT01CX_1640 ATGAAAAA...   MKKLKTLL...      all_I_2.0_c_0.4_m_maxbit        996

The DNA and amino acid sequences have been truncated for clarity. Note that, as expected, both C. beijerinckii and C. novyi are predicted to have this gene, but Acetobacterium woodii is not. We could perform clustering at lower cutoffs (0.4 is rather stringent) and run similar queries to see if this prediction holds up.

Finding presence and absence across an organism tree

This section requires you to have a tree with sanitized organism names on the leaves. A sanitized name is the organism's name with every character that is not a letter or number (including spaces) replaced with an underscore. See the Building a concatinated gene tree tutorial for more details on how to generate such a file.
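The sanitizing rule described above can be sketched as a one-line regular expression substitution. The function name sanitize is hypothetical, and the sketch assumes that "letter or number" means ASCII letters and digits; ITEP's own helper may differ in detail.

```python
import re


def sanitize(name):
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", name)


for organism in ("Clostridium beijerinckii NCIMB 8052", "Clostridium novyi NT"):
    print(sanitize(organism))
```

Applied to the organism names used earlier, this gives the same names that appear as column headers in the presence/absence table.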

We have provided a function "makeCoreClusterAnalysisTree.py" that performs presence-absence analysis for every clade in a tree using the same methods as shown above. It computes the number of gene families with the specified properties at each node and displays these counts on the tree. Optionally, it also produces an Excel file containing, for each node in the tree, the run ID/cluster ID pairs and a representative annotation (the one with the highest number of genes) for each cluster identified at that node. The names of the sheets correspond to the node labels on the tree, which allows quick identification of interesting gain and loss patterns at different stages of evolution.
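The idea of repeating the analysis for every clade can be sketched as follows. This is not the code in makeCoreClusterAnalysisTree.py: the tree is written as nested tuples rather than read from a tree file, the topology and the three toy clusters are made up for illustration, and only the ALL + ONLY combination is counted.

```python
def leaves(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Yield the leaf set of every internal node, children before parents."""
    if isinstance(node, str):
        return
    for child in node:
        yield from clades(child)
    yield frozenset(leaves(node))


def count_all_only(tree, presence):
    """For each clade, count clusters present in ALL and ONLY its organisms.

    presence: dict mapping cluster ID -> set of organisms with a gene in it.
    """
    return {
        clade: sum(1 for orgs in presence.values() if orgs == clade)
        for clade in clades(tree)
    }


# Hypothetical topology: the two Clostridia as sisters, A. woodii outside.
tree = (("Clostridium_beijerinckii_NCIMB_8052", "Clostridium_novyi_NT"),
        "Acetobacterium_woodii_DSM_1030")

# Toy presence data, invented for this example (not taken from the database).
presence = {
    "toy_cluster_1": {"Clostridium_beijerinckii_NCIMB_8052", "Clostridium_novyi_NT"},
    "toy_cluster_2": {"Clostridium_beijerinckii_NCIMB_8052", "Clostridium_novyi_NT",
                      "Acetobacterium_woodii_DSM_1030"},
    "toy_cluster_3": {"Acetobacterium_woodii_DSM_1030"},
}

for clade, n in count_all_only(tree, presence).items():
    print(sorted(clade), n)
```

With ALL + ONLY, a cluster is counted at a node exactly when the set of organisms containing it equals that node's leaf set, so each cluster is counted at most once across the tree.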
