This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Building your database 2 mcl clustering

mattb112885 edited this page May 9, 2013 · 11 revisions

Performing a cluster run

Before running MCL clustering, make sure you have set up any groups of organisms that you want to cluster (see "Specifying lists of organisms to cluster"). By default, clustering is run on all organisms you have downloaded and imported into ITEP. You must also have already run main.sh (see "Building your database 1") to get the all-vs-all BLASTP results.

To run MCL clustering, run the following command from $root (it will NOT work if you run it from another directory):

./main2.sh [inflation_value] [scoring_criteria] [score_cutoff]

For the sake of illustration, most of this tutorial uses the following settings:

./main2.sh 2.0 maxbit 0.4

The inflation value must be larger than 1 and controls the granularity of the clusters (higher values give finer-grained clusters). The MCL default is 2.0, so you can specify that if you don't want to tune it.

The scoring criterion is based on the BLAST hits obtained from main.sh and can be "minbit" or "maxbit". "minbit" is the bit score divided by the minimum of the self-bit scores of the query and target genes, while "maxbit" is the bit score divided by the maximum of the self-bit scores. The "minbit" criterion emphasizes strong hits over the entirety of the smaller protein (so it can pick up pseudogenes but is less sensitive to events such as gene fusions); the "maxbit" criterion emphasizes strong hits over the entire query and target proteins.

A typical cutoff for a single genus is 0.4, but you should experiment with it and look at the score distributions to see what is appropriate for the protein families you are interested in studying.
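To make the difference between the two scoring criteria concrete, here is a minimal Python sketch of the normalization described above. It is not the ITEP implementation: the function name, the made-up bit scores, and the choice to keep hits whose score is greater than or equal to the cutoff are all assumptions made for illustration.

```python
def normalized_score(bit_score, query_self_bit, target_self_bit, method="maxbit"):
    """Divide a BLAST bit score by the min or max of the two self-bit scores.

    Hypothetical helper for illustration; not part of ITEP.
    """
    if method == "minbit":
        denominator = min(query_self_bit, target_self_bit)
    elif method == "maxbit":
        denominator = max(query_self_bit, target_self_bit)
    else:
        raise ValueError("method must be 'minbit' or 'maxbit'")
    if denominator <= 0:
        raise ValueError("self-bit scores must be positive")
    return float(bit_score) / denominator


# Made-up numbers: a short protein (self-bit score 200) that hits part of a
# longer protein (self-bit score 1000) with a bit score of 180.
cutoff = 0.4
for method in ("minbit", "maxbit"):
    score = normalized_score(180, 200, 1000, method=method)
    # Assumption: a hit is kept when its score is >= the cutoff.
    print(method, score, "kept" if score >= cutoff else "dropped")
```

With these invented numbers the minbit score is 0.9 and the maxbit score is 0.18, so the same hit would be kept under "minbit" and dropped under "maxbit" at a cutoff of 0.4. This is the behavior the text describes: "minbit" rewards full coverage of the smaller protein, while "maxbit" requires coverage of both.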

If the groups file contains more than one group, the same parameters are used to run MCL on every group.

The main2.sh script performs these tasks:

  1. Running MCL clustering on each cluster group (group of organisms) with the specified parameters.
  2. Reformatting the clustering output files to assign each cluster to its run ID and a numeric cluster ID.
  3. Building a presence-absence table for all organisms in the cluster group based on the clustering results.
  4. Importing the calculation results into the ITEP sqlite database.
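As a rough illustration of the presence-absence step in the list above, the sketch below marks, for each cluster, which organisms contribute at least one gene to it. The data structures, gene names, and the 0/1 encoding are assumptions chosen for the example; the table that ITEP actually builds may be laid out differently.

```python
def presence_absence(clusters, gene_to_organism):
    """Build a 0/1 presence-absence table from clustering results.

    clusters: dict mapping cluster ID -> list of gene IDs (hypothetical layout)
    gene_to_organism: dict mapping gene ID -> organism name
    Returns (organisms, table), where table[cluster_id] is a list of 0/1
    flags in the same order as organisms.
    """
    organisms = sorted(set(gene_to_organism.values()))
    table = {}
    for cluster_id, genes in clusters.items():
        # Organisms that have at least one gene in this cluster.
        present = {gene_to_organism[gene] for gene in genes}
        table[cluster_id] = [1 if org in present else 0 for org in organisms]
    return organisms, table


# Made-up genes and organisms, for illustration only.
gene_to_organism = {
    "a1": "orgA", "a2": "orgA",
    "b1": "orgB", "b2": "orgB",
    "c1": "orgC",
}
clusters = {1: ["a1", "b1", "c1"], 2: ["a2", "b2"]}

organisms, table = presence_absence(clusters, gene_to_organism)
print("\t".join(["cluster"] + organisms))
for cluster_id in sorted(table):
    print("\t".join([str(cluster_id)] + [str(flag) for flag in table[cluster_id]]))
```

In this toy input, cluster 1 is present in all three organisms, while cluster 2 is absent from orgC.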

Differentiating multiple cluster runs

One nice thing about ITEP is that you can run main2.sh with multiple combinations of inflation values, scoring criteria, and cutoffs, and it will store each of them under a separate "run ID" so you can compare them. A run ID has the following form:

groupID_I_inflation_c_cutoff_m_metric

groupID is the ID for the group of organisms from the "groups" file; the value after I is the inflation parameter, the value after c is the cutoff, and the value after m is the homology metric (for example, all_I_2.0_c_0.4_m_maxbit).
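If you need to build or take apart run IDs in your own scripts, the naming pattern above can be handled with a few lines of Python. This is a sketch based only on the pattern shown here; the function names are hypothetical and are not ITEP scripts, and the values are kept as strings so that "2.0" is not reformatted.

```python
import re

# Matches groupID_I_inflation_c_cutoff_m_metric. The group ID itself may
# contain underscores (e.g. "woodii_novyi"), so it is matched greedily.
RUN_ID_PATTERN = re.compile(
    r"^(?P<group>.+)_I_(?P<inflation>[^_]+)_c_(?P<cutoff>[^_]+)_m_(?P<metric>[^_]+)$"
)


def make_run_id(group_id, inflation, cutoff, metric):
    """Assemble a run ID string from its four parts (hypothetical helper)."""
    return "{}_I_{}_c_{}_m_{}".format(group_id, inflation, cutoff, metric)


def parse_run_id(run_id):
    """Split a run ID into a dict of its parts; raise ValueError if malformed."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError("not a recognizable run ID: {!r}".format(run_id))
    return match.groupdict()


print(make_run_id("all", "2.0", "0.4", "maxbit"))
print(parse_run_id("woodii_novyi_I_2.0_c_0.4_m_maxbit"))
```

Parsing is only reliable under the assumption that the inflation, cutoff, and metric values contain no underscores, which holds for the run IDs shown in this tutorial.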

If a particular cluster run already exists when main2.sh is called, it is skipped and the script moves on to the next group.

Run IDs and cluster IDs

Many of the ITEP scripts require a (runID, clusterID) pair as input (as two columns in an input file). The run ID is described above; a cluster ID is an integer assigned to each cluster in the order in which MCL outputs them (the largest cluster has ID 1). Whenever a script returns cluster IDs, it provides the corresponding run IDs as well. In turn, you must provide both of these values (in a tab-delimited row) to get information about a particular cluster.
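The numbering scheme can be sketched in a few lines of Python. The sketch assumes that the MCL output has one cluster per line with whitespace-separated gene IDs, already ordered from largest to smallest cluster; the function name and the gene IDs are invented for the example and this is not the code ITEP uses.

```python
def number_clusters(run_id, mcl_lines):
    """Assign cluster IDs 1, 2, 3, ... to clusters in the order they appear.

    Returns a list of (run_id, cluster_id, gene_id) tuples.
    Assumes one cluster per line, genes separated by whitespace.
    """
    rows = []
    cluster_id = 0
    for line in mcl_lines:
        genes = line.split()
        if not genes:
            continue  # ignore blank lines
        cluster_id += 1
        for gene in genes:
            rows.append((run_id, cluster_id, gene))
    return rows


# Made-up clustering output: three clusters, largest first.
example_output = ["geneA\tgeneB\tgeneC", "geneD\tgeneE", "geneF"]
for row in number_clusters("all_I_2.0_c_0.4_m_maxbit", example_output):
    print("\t".join(str(field) for field in row))
```

Because the IDs depend only on output order, the same cluster ID in two different runs generally refers to different clusters, which is why the run ID must always accompany it.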

If you know a run ID and a cluster ID that you are interested in and want to make a tab-delimited row, we have provided a convenient way to do that:

$ makeTabDelimitedRow.py all_I_2.0_c_0.4_m_maxbit 1
all_I_2.0_c_0.4_m_maxbit      1

The results of this can then be piped into the commands that require both a runID and a clusterID to analyze a cluster.
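If you write your own script that consumes such rows, reading them back is straightforward. The following is a sketch under the assumption that each input line holds a run ID and an integer cluster ID separated by a tab; both function names are hypothetical and are not part of ITEP.

```python
def make_row(run_id, cluster_id):
    """Format a (runID, clusterID) pair as one tab-delimited row."""
    return "{}\t{}".format(run_id, cluster_id)


def read_run_cluster_pairs(lines):
    """Parse tab-delimited rows into a list of (run_id, cluster_id) tuples.

    Only the first two columns are used; blank lines are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected runID<TAB>clusterID, got {!r}".format(line))
        pairs.append((fields[0], int(fields[1])))
    return pairs


rows = [make_row("all_I_2.0_c_0.4_m_maxbit", 1)]
print(read_run_cluster_pairs(rows))
```

In a script that sits at the receiving end of a pipe, `sys.stdin` could be passed in place of the `rows` list, since iterating over it also yields one line at a time.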

How to get a list of cluster runs

A list of cluster runs currently imported into ITEP is always available via the db_getAllClusterRuns.py script. For example, if we use the three groups generated in the prior tutorial to run clustering with an inflation value of 2.0, a cutoff of 0.4, and the maxbit score, we get the following list of run IDs:

$ db_getAllClusterRuns.py
Clostridia_I_2.0_c_0.4_m_maxbit
all_I_2.0_c_0.4_m_maxbit
woodii_novyi_I_2.0_c_0.4_m_maxbit

This is a good script to keep in mind, since many ITEP scripts require a run ID as input.
