-
Notifications
You must be signed in to change notification settings - Fork 15
Comparing the results of different clustering approaches
In ITEP, you are allowed to run multiple clustering approaches and save them in the same database. Each different clustering approach that you run is given its own Run ID, which is then used in combination with cluster IDs to identify a specific protein family identified with a specific clustering technique
This tutorial will show how to visualize the effects of parameter choices on clustering results. For simplicity we focus on MCL results but the same procedures can be applied to compare other clustering procedures provided the data has been imported into ITEP.
If you just want to use MCL to cluster your organisms, perform the following steps.
- Set up a groups file containing the group of organisms you want to cluster (see directions here).
- Use setup_step2.sh as many times as you want, varying the cutoff and inflation parameters. Each time the clustering results will be saved to the database and presence/absence data will be re-computed.
If you want to use other methods (not MCL) to cluster your organisms, follow the steps outlined here to extract data to cluster from the database and then re-import the clustering results when it is done running.
You can check the list of all cluster runs you have available in the database at any time by running:
$ db_getAllClusterRuns.py
In this example, we will use the test data from two Clostridia and one Acetobacterium that is provided in the virtual machine (and has been used for previous tutorials) to explore the effects of clustering cutoffs on the composition of clusters. In the VM we have defined three groups for the purposes of the tutorials: 'all' which contains all of the organisms in the database, 'Clostridia' which contains only the Clostridia species and 'woodii_novyi' which only contains Acetobacterium woodii and Clostridium novyi (see this tutorial for a primer on how we generated the groups file).
Unlike other tutorials, we need to perform some more clustering before we can do testing. Since we are using MCL, we will do this by using setup_step2.sh (note - the VM already has groups set up for you). Set up several screens and run one of these commands in each of them:
./setup_step2.sh 2.0 maxbit 0.3
./setup_step2.sh 2.0 maxbit 0.2
./setup_step2.sh 2.0 maxbit 0.1
This will take about 5-10 minutes to run.
When all of them are done, use db_getAllClusterRuns.py to verify that all of them are loaded. If they are not, run one of the above commands again (it doesn't matter which) to load them. The results should be the following (this result includes the three cluster runs we've been using in the rest of the tutorials):
$ db_getAllClusterRuns.py
Clostridia_I_2.0_c_0.1_m_maxbit
Clostridia_I_2.0_c_0.2_m_maxbit
Clostridia_I_2.0_c_0.3_m_maxbit
Clostridia_I_2.0_c_0.4_m_maxbit
all_I_2.0_c_0.1_m_maxbit
all_I_2.0_c_0.2_m_maxbit
all_I_2.0_c_0.3_m_maxbit
all_I_2.0_c_0.4_m_maxbit
woodii_novyi_I_2.0_c_0.1_m_maxbit
woodii_novyi_I_2.0_c_0.2_m_maxbit
woodii_novyi_I_2.0_c_0.3_m_maxbit
woodii_novyi_I_2.0_c_0.4_m_maxbit
Now that we have multiple cutoffs for our pre-defined groups of organisms, lets look at how we can compute and visualize the differences
If you can define a reference gene, ITEP can identify all of the clusters to which it belongs and the other members of those clusters. You can get a convenient table of which of these genes are in common in different clustering runs using the following command:
$ db_makeClusterComparisonTable.py 'reference_gene' -a;
For example, for the C beijerinckii phosphofructokinase:
$ db_makeClusterComparisonTable.py 'fig|290402.1.peg.992' -a
creates this table. The genes on the left are all genes that are identified in the same cluster as 'fig|290402.1.peg.992' in any cluster run, and the labels on the top are the run IDs and the cluster IDs (e.g. 341) in which it is found.
#names Clostridia_I_2.0_c_0.1_m_maxbit_341 Clostridia_I_2.0_c_0.2_m_maxbit_459 Clostridia_I_2.0_c_0.3_m_maxbit_321 Clostridia_I_2.0_c_0.4_m_maxbit_4759 all_I_2.0_c_0.1_m_maxbit_374 all_I_2.0_c_0.2_m_maxbit_430 all_I_2.0_c_0.3_m_maxbit_278 all_I_2.0_c_0.4_m_maxbit_6096
fig|290402.1.peg.992 1 1 1 1 1 1 1 1
fig|290402.1.peg.4768 1 1 1 0 1 1 1 0
fig|290402.1.peg.581 1 0 0 0 1 0 0 0
fig|386415.1.peg.406 1 1 1 0 1 1 1 0
fig|931626.1.peg.1249 0 0 0 0 1 1 1 0
The above could be difficult to interpret because some cluster runs have different sets of organisms (e.g. Acetobacterium is not present in the "Clostridia" group). You can filter (with regular expressions) for specific cluster runs using the -f flag, which can be used to narrow down the list to only cluster runs with the same set of organisms:
$ db_makeClusterComparisonTable.py 'fig|290402.1.peg.992' -a -f 'Clostridia'
#names Clostridia_I_2.0_c_0.1_m_maxbit_341 Clostridia_I_2.0_c_0.2_m_maxbit_459 Clostridia_I_2.0_c_0.3_m_maxbit_321 Clostridia_I_2.0_c_0.4_m_maxbit_4759
fig|290402.1.peg.4768 1 1 1 0
fig|290402.1.peg.992 1 1 1 1
fig|290402.1.peg.581 1 0 0 0
fig|386415.1.peg.406 1 1 1 0
Note that at at the weakest cutoff (0.1) there were four proteins identified in the cluster, but at the strictest cutoff (0.4) the target gene 'fig|290402.1.peg.992' was the only one in its cluster. Also notice that the 'fig|290402.1.peg.581' gene fell away from the cluster before the others, so it apparently is the most divergent by this metric.
Recall that you can replace the gene names with human-readable IDs using several methods; for example, you can replace it with the annotation and organism using db_replaceGeneNameWithAnnotation.py:
$ db_makeClusterComparisonTable.py 'fig|290402.1.peg.992' -a -f 'Clostridia' | db_replaceGeneNameWithAnnotation.py -a -o
#names Clostridia_I_2.0_c_0.1_m_maxbit_341 Clostridia_I_2.0_c_0.2_m_maxbit_459 Clostridia_I_2.0_c_0.3_m_maxbit_321 Clostridia_I_2.0_c_0.4_m_maxbit_4759
Clostridium_beijerinckii_NCIMB_8052_6_phosphofructokinase_YP_001311914_1_Cbei_4852 1 1 1 0
Clostridium_beijerinckii_NCIMB_8052_6_phosphofructokinase_YP_001308138_1_Cbei_0998 1 1 1 1
Clostridium_beijerinckii_NCIMB_8052_6_phosphofructokinase_YP_001307727_1_Cbei_0584 1 0 0 0
Clostridium_novyi_NT_6_phosphofructokinase_YP_877380_1_NT01CX_1297 1 1 1 0
However, you should only use this for visualization or if this is the final result you want - further ITEP analyses require you to use the ITEP IDs as inputs.
Finally, note that you can add the --yesno flag to the above command to change 1 and 0 to 'yes' and 'no', if you prefer.
You can filter the list of genes returned by db_makeClusterComparisonTable.py in two ways: by providing a list of genes that you are interested in, or by providing a Newick tree with ITEP IDs as leaf names (built as shown here).
In either case, only genes in the provided list are printed in the final output.
The output from db_makeClusterComparisonTable.py is provided in the format necessary for input into db_displayTree.py. You will first need a tree with the same gene IDs as provided in the output table. We create one using the following set of commands:
$ db_makeClusterComparisonTable.py 'fig|290402.1.peg.992' -a -f 'Clostridia' > comp
$ cat comp | grep 'fig\|' | db_getAlignmentBetweenGenes.py | FastTreeMP -wag -gamma > tree
Display the tree with the results of the comparison attached to it using:
$ db_displaytree -d -m comp tree