This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Analyzing gene neighborhoods

mattb112885 edited this page Feb 28, 2014 · 7 revisions

Motivation and introduction

We have analyzed the C. beijerinckii 6-phosphofructokinase in some detail, but how do we know whether it has the same function as its counterparts in other organisms? In microbes, one of the strongest forms of evidence is conserved gene context: the gene and its homologs are surrounded by the same neighboring genes.

ITEP includes two main ways to look at gene neighborhoods: one is to generate a tab-delimited list of a gene's neighbors and the other is to add gene neighborhood information to a protein tree.

Finding the gene neighborhood for a list of genes

The gene neighborhoods for a list of genes can be obtained in a convenient tabular format using the db_getGeneNeighborhoods.py function, which calls upon a pre-cached table of calculated neighborhoods for each gene in your genomes (up to a maximum of 10 genes in each direction).

We determined earlier that fig|290402.1.peg.4768 was the ITEP ID for one of the three annotated 6-phosphofructokinases (and, as we'll see later, the most conserved of the three). The neighborhood of this gene is obtained as follows:

$ echo "fig|290402.1.peg.4768" | db_getGeneNeighborhoods.py
fig|290402.1.peg.4768   fig|290402.1.peg.4765   -3      290402.1.NC_009617.1    5673448 5674506 +       galactoside ABC transporter periplasmic D-galactose/D-glucose-binding protein_YP_001311911.1_Cbei_4849
fig|290402.1.peg.4768   fig|290402.1.peg.4766   -2      290402.1.NC_009617.1    5676687 5675308 -       RNA methyltransferase_YP_001311912.1_Cbei_4850
fig|290402.1.peg.4768   fig|290402.1.peg.4767   -1      290402.1.NC_009617.1    5679010 5677589 -       pyruvate kinase_YP_001311913.1_Cbei_4851
fig|290402.1.peg.4768   fig|290402.1.peg.4768   0       290402.1.NC_009617.1    5680111 5679155 -       6-phosphofructokinase_YP_001311914.1_Cbei_4852
fig|290402.1.peg.4768   fig|290402.1.peg.4769   1       290402.1.NC_009617.1    5684143 5680568 -       DNA polymerase III DnaE_YP_001311915.1_Cbei_4853_dnaE
fig|290402.1.peg.4768   fig|290402.1.peg.4770   2       290402.1.NC_009617.1    5684531 5684247 -       stress responsive alpha-beta barrel domain-containing protein_YP_001311916.1_Cbei_4854
fig|290402.1.peg.4768   fig|290402.1.peg.4771   3       290402.1.NC_009617.1    5685502 5684555 -       hypothetical protein_YP_001311917.1_Cbei_4855

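In this table, the first column is the center gene's ID, the second is the neighboring gene's ID, the third is the number of genes away from the center (negative on one side, positive on the other, and 0 for the center gene itself), the fourth is the contig ID, the fifth and sixth are the start and stop locations of the neighboring gene, the seventh is the strand of the neighboring gene, and the last is the annotation. Note that for genes on the - strand the start coordinate is larger than the stop coordinate, as in most of the rows above.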
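If you want to work with this table programmatically, the column layout described above maps directly onto a small record type. The following is only a minimal sketch, not part of ITEP: the names Neighbor and parse_neighborhood_line are hypothetical, and it assumes the output is tab-delimited with exactly the eight columns listed above.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the column description above.
Neighbor = namedtuple(
    "Neighbor",
    "center_gene neighbor_gene offset contig start stop strand annotation",
)


def parse_neighborhood_line(line):
    """Parse one tab-delimited line of neighborhood output into a Neighbor."""
    # maxsplit=7 keeps the annotation (last column) in one piece.
    fields = line.rstrip("\n").split("\t", 7)
    if len(fields) != 8:
        raise ValueError("expected 8 tab-delimited columns, got %d" % len(fields))
    center, neighbor, offset, contig, start, stop, strand, annotation = fields
    return Neighbor(center, neighbor, int(offset), contig,
                    int(start), int(stop), strand, annotation)


# One row from the example output above, rebuilt with explicit tabs.
example_line = "\t".join([
    "fig|290402.1.peg.4768",
    "fig|290402.1.peg.4767",
    "-1",
    "290402.1.NC_009617.1",
    "5679010",
    "5677589",
    "-",
    "pyruvate kinase_YP_001311913.1_Cbei_4851",
])

row = parse_neighborhood_line(example_line)
print(row.neighbor_gene, row.offset, row.strand)
```

In practice you would redirect the output of db_getGeneNeighborhoods.py to a file and apply the parser to each non-empty line.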

If the genome is incomplete, genes will often fall near the ends of contigs. In such a case, the db_getGeneNeighborhoods.py function will only print out neighbors until it hits the end of a contig.
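One way to notice such truncated neighborhoods is to count how many neighbors were reported on each side of the center gene, using the third column of the output. This is a rough sketch under stated assumptions: the helper names are hypothetical, the input is the tab-delimited output for a single center gene, and expected_per_side is a value you supply yourself. A lopsided count only suggests a contig edge; it does not prove one.

```python
def neighbor_counts(lines):
    """Count neighbors with negative and positive offsets (third column)."""
    offsets = [int(line.split("\t")[2]) for line in lines if line.strip()]
    negative = sum(1 for o in offsets if o < 0)
    positive = sum(1 for o in offsets if o > 0)
    return negative, positive


def possibly_truncated(lines, expected_per_side):
    """Return True if either side has fewer neighbors than expected."""
    negative, positive = neighbor_counts(lines)
    return negative < expected_per_side or positive < expected_per_side


# Toy input: only the first three columns matter for this check.
full = ["c\tn%d\t%d" % (i, i) for i in range(-3, 4)]
clipped = ["c\tn%d\t%d" % (i, i) for i in range(-1, 4)]

print(possibly_truncated(full, 3))
print(possibly_truncated(clipped, 3))
```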

Adding neighborhood visualization to a protein tree

In another tutorial, we created the following Newick tree for one of the 6-phosphofructokinase clusters:

(fig_290402_1_peg_4768:0.15942,fig_931626_1_peg_1249:0.69610,fig_386415_1_peg_406:0.19652);

Create a file called "pfk_tree" containing this string. To visualize the neighborhoods of these genes, you need to pick a cluster run to use as a basis for coloring (we used all_I_2.0_c_0.4_m_maxbit):

$ cat pfk_tree | db_makeNeighborhoodTree.py -r all_I_2.0_c_0.4_m_maxbit -p pfk_tree -d

(You can omit the -d flag if you only want to save the results to a file without displaying them.) The result should look like the figure below.

[Figure: PFK tree]

The function automatically replaces the gene IDs with human-readable labels (including organism name and annotation). The arrows for the genes that are leaves of the tree have red borders to distinguish them from their neighbors. The legend above the tree shows the cluster ID corresponding to each color on the tree.
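Note that the leaf names in the Newick string differ from the ITEP IDs used in the neighborhood table: fig|290402.1.peg.4768 appears in the tree as fig_290402_1_peg_4768. If you need to move between the two forms in your own scripts, something like the sketch below may help. The helper names are hypothetical and not part of ITEP; the conversion rule (replacing "|" and "." with "_") is inferred from this one example, and the reverse mapping assumes IDs of the form fig|N.N.peg.N. The leaf extraction only handles simple, unquoted Newick names.

```python
import re


def sanitize_gene_id(gene_id):
    """Replace '|' and '.' with '_' (rule inferred from the example tree)."""
    return re.sub(r"[|.]", "_", gene_id)


def restore_gene_id(sanitized):
    """Rebuild an ID of the assumed form fig|<n>.<n>.peg.<n>."""
    match = re.fullmatch(r"fig_(\d+)_(\d+)_peg_(\d+)", sanitized)
    if match is None:
        raise ValueError("unrecognized leaf name: %r" % sanitized)
    return "fig|{}.{}.peg.{}".format(*match.groups())


def leaf_names(newick):
    """Extract unquoted leaf names: tokens that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


tree = ("(fig_290402_1_peg_4768:0.15942,"
        "fig_931626_1_peg_1249:0.69610,"
        "fig_386415_1_peg_406:0.19652);")

for leaf in leaf_names(tree):
    print(leaf, "->", restore_gene_id(leaf))
```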

A protein tree with tBLASTn IDs (last column of a results table from the db_TBlastN_wrapper.py script) instead of gene IDs can also be used as input to this function. The db_makeNeighborhoodTree.py function will automatically search for genes neighboring the location of the tBLASTn hit and append those to the tree.
