Skip to content

basic_mining_agglom

Wei-ju Wu edited this page Oct 3, 2017 · 8 revisions
  |  

Basic Mining

EGRIN2.0 MongoDB query using agglom

In a nutshell

Set-up

Make sure ./egrin-tools/ folder is in your python path

You can do this on Mac/Linux by adding the path to you Bash Shell Startup Files, e.g. ~/.bashrc or ~/.bash_profile

for example in ~/.bash_profile add the following line:

export PYTHONPATH=$PYTHONPATH:path/to/egrin2-tools/

Load required modules

import query.egrin2_query as e2q
import pymongo
client = pymongo.MongoClient(host='baligadev')
db = client['eco_db']

There are several dependencies that need to be satisfied, including:

  • pymongo
  • numpy
  • pandas
  • joblib
  • scipy
  • statsmodels
  • itertools

agglom function

The agglom function is the centerpiece for querying information in the ensemble.

Using this function you can co-associate any two bicluster features across the entire ensemble: so - biclusters, genes, conditions, and GREs with biclusters, genes, conditions, or GREs.

If you stack multiple agglom calls you can quickly build very complex queries.

The agglom function has a simple, universal format. Basically it needs an input (x), the type of that input (x_type), the desired output type (y_type), and some infomration about the location of the EGRIN 2.0 MongoDB.

The function will output a table of pvalues that reports the signifigance for each element in your output query type.

Let's take a look at the function itself, then we will explore serveral examples:

?e2q.agglom

Example 1: find co-regulated genes

In this example, we will mine the ensemble for genes discovered in biclusters with the gene E. coli carbamoyl phosphate synthetase, carA (b0032)

carA_genes = e2q.agglom(db, x="b0032", x_type="genes", y_type="genes")


carA_genes
counts all_counts pval qval_BH qval_bonferroni
b1062 152 198 0.000000e+00 0.000000e+00 0.000000e+00
b0336 153 198 0.000000e+00 0.000000e+00 0.000000e+00
b0337 154 198 0.000000e+00 0.000000e+00 0.000000e+00
b4244 167 198 0.000000e+00 0.000000e+00 0.000000e+00
b4245 169 198 0.000000e+00 0.000000e+00 0.000000e+00
b0033 173 198 0.000000e+00 0.000000e+00 0.000000e+00
b0032 198 198 0.000000e+00 0.000000e+00 0.000000e+00
b2499 142 198 3.783051e-300 4.073085e-298 3.665776e-297
b2476 142 198 3.783051e-300 4.073085e-298 3.665776e-297
b2313 141 198 5.664712e-297 4.574255e-295 5.489106e-294
b2312 141 198 5.664712e-297 4.574255e-295 5.489106e-294
b2557 141 198 5.664712e-297 4.574255e-295 5.489106e-294
b0522 138 198 1.477589e-287 1.101372e-285 1.431783e-284
b1849 136 198 2.280977e-281 1.578762e-279 2.210267e-278
b4006 132 198 3.382871e-269 2.185335e-267 3.278002e-266
b0523 129 198 3.043903e-260 1.843464e-258 2.949542e-257
b3654 128 198 2.728256e-257 1.555106e-255 2.643680e-254
b4246 126 198 1.964952e-251 1.057799e-249 1.904038e-248
b2500 122 198 6.646122e-240 3.389522e-238 6.440093e-237
b2497 112 198 4.061863e-212 1.967973e-210 3.935945e-209
b4064 102 198 9.823972e-186 4.533061e-184 9.519429e-183
b0945 96 198 1.511792e-170 6.658758e-169 1.464927e-167
b4005 89 198 2.020583e-153 8.512805e-152 1.957945e-150
b3941 58 198 3.940658e-85 1.591041e-83 3.818498e-82
b3359 57 198 3.886443e-83 1.448447e-81 3.765963e-80
b3829 57 198 3.886443e-83 1.448447e-81 3.765963e-80
b4024 56 198 3.714681e-81 1.333158e-79 3.599526e-78
b0908 55 198 3.440262e-79 1.149522e-77 3.333614e-76
b3774 55 198 3.440262e-79 1.149522e-77 3.333614e-76
b0198 54 198 3.086539e-77 9.647922e-76 2.990856e-74
b3008 54 198 3.086539e-77 9.647922e-76 2.990856e-74
b0197 48 198 8.154522e-66 2.469291e-64 7.901732e-63
b4013 46 198 4.011807e-62 1.143365e-60 3.887441e-59
b0907 46 198 4.011807e-62 1.143365e-60 3.887441e-59
b2600 45 198 2.673409e-60 7.401525e-59 2.590534e-57
b2601 44 198 1.720887e-58 4.632054e-57 1.667539e-55
b3212 43 198 1.069629e-56 2.801271e-55 1.036470e-53
b3172 37 198 2.884778e-46 6.988374e-45 2.795350e-43
b3958 37 198 2.884778e-46 6.988374e-45 2.795350e-43
b3213 37 198 2.884778e-46 6.988374e-45 2.795350e-43
b4254 36 198 1.385367e-44 3.196239e-43 1.342420e-41
b2818 36 198 1.385367e-44 3.196239e-43 1.342420e-41
b0004 34 198 2.840051e-41 6.254567e-40 2.752010e-38
b4488 34 198 2.840051e-41 6.254567e-40 2.752010e-38
b0860 32 198 4.953595e-38 1.043486e-36 4.800034e-35
b0199 32 198 4.953595e-38 1.043486e-36 4.800034e-35
b3714 31 198 1.943526e-36 3.843422e-35 1.883277e-33
b2156 31 198 1.943526e-36 3.843422e-35 1.883277e-33
b2152 31 198 1.943526e-36 3.843422e-35 1.883277e-33
b2913 30 198 7.306285e-35 1.335810e-33 7.079791e-32
b1101 30 198 7.306285e-35 1.335810e-33 7.079791e-32
b0031 30 198 7.306285e-35 1.335810e-33 7.079791e-32
b0273 30 198 7.306285e-35 1.335810e-33 7.079791e-32
b3771 29 198 2.629355e-33 4.549723e-32 2.547845e-30
b3959 29 198 2.629355e-33 4.549723e-32 2.547845e-30
b3770 29 198 2.629355e-33 4.549723e-32 2.547845e-30
b0002 28 198 9.049612e-32 1.538434e-30 8.769074e-29
b3772 27 198 2.975709e-30 4.805771e-29 2.883462e-27
b0003 27 198 2.975709e-30 4.805771e-29 2.883462e-27
b3960 27 198 2.975709e-30 4.805771e-29 2.883462e-27
... ... ... ... ...

259 rows × 5 columns

There are several things to note in this query.

First the arguments:

  • x specfies the query. This can be gene(s), condition(s), GRE(s), or bicluster(s). x can be a single entity or a list of entitites of the same type.
  • x_type indicates the type of x. This can include gene, condition, gres, and biclusters. Basically: "what is x?" The parameter typing is pretty flexible, so - for example - rows can be used instead of genes.
  • y_type is the type of. Again, genes, conditions, gres, or biclusters.
  • host specifies where the MongoDB database is running. In this case it is running on a machine called baligadev. If you are hosting the database locally this would be localhost
  • db is the name of the database you want to perform the query in. Typically databases are specified by the three letter organism code (e.g., eco) followed by _db. A list of maintained databases is available here.

Also note the output:

This is a table of all genes that are frequently co-discovered in biclusters with b0032 (carA). They are sorted by their relative "enrichment" in the sample x (where lower pval = greater enrichment). This is quantified by the hypergeometric distribution. To account for multiple testing, two corrections are included: Benjamini-Hochberg correction (qval_BH; less stringent) and Bonferrnoni correction (qval_bonferroni; more stringent).

Example 2: what if gene names have different formats?

We can easily change the format of x, which makes it easy to specify genes in any format, e.g. NCBI annotation or common name. Let's try to include a list of genes with two different naming formats. This time we will only return the first few elements using .head()

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type= "genes").head()
counts all_counts pval qval_BH qval_bonferroni
b4244 179 198 0 0 0
b0033 180 198 0 0 0
b4245 185 198 0 0 0
b0031 198 198 0 0 0
b4246 198 198 0 0 0

5 rows × 5 columns

Here, two of the genes are Blattner Ids (b****), while the other is a common name. This is no problem for the query.

The lookup table is pretty extensive for genes (it comes from MicrobesOnline) so anything from common name to GI number can be matched. Cool, huh?

Example 3: alternative logic

You can specify the type of logic to use in grouping your query.

For example, do you want do find biclusters where all genes x in your query are present (and logic) or biclusters where at least one of the genes is present (or logic). You can change the logical grouping of your query by specifying the logic parameter.

This parameter can take one of three values: and, or, or nor. By default the command uses or logic.

Here we change the logic employed in the previous query to "and", meaning that a bicluster must contain all of the query genes. Notice how the values change.

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="genes", logic="and").head()
counts all_counts pval qval_BH qval_bonferroni
b4246 25 198 0.000000e+00 0.000000e+00 0.000000e+00
b0032 25 198 0.000000e+00 0.000000e+00 0.000000e+00
b0031 25 198 0.000000e+00 0.000000e+00 0.000000e+00
b4006 24 198 5.510260e-57 3.981163e-55 1.592465e-54
b1062 23 198 2.604624e-53 1.075338e-51 7.527363e-51

5 rows × 5 columns

Logical operations can provide a substantial amount of power to your queries. For example, imagine that you not only want to know when b0032, pyrL, and b0031 are co-regulated but also when they are not. This can be done using the logical operator nor.

Example 4: other associations

So far we have mined gene-gene co-associations, but there are other associations in the ensemble that are also available, including conditions, and GREs, and biclusters.

Each of these can be accessed by changing the x_type and y_type parameter.

Conditions

For example to find conditions where these genes are co-regulated, we change y_type to conditions:

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="conditions", logic="or").head()
counts all_counts pval qval_BH qval_bonferroni
rb_del_rpoS_biofilm 240 12921 2.849586e-13 7.228851e-11 1.327907e-10
biofilm_24hr_del_yceP 239 12879 3.889790e-13 7.228851e-11 1.812642e-10
ik_H2_T4 232 12380 4.653767e-13 7.228851e-11 2.168655e-10
MG1063_uninduced_t180 232 12543 2.190004e-12 2.551355e-10 1.020542e-09
rb_del_rpoS_exponential 226 12198 5.366065e-12 5.001173e-10 2.500586e-09

5 rows × 5 columns

GREs

To discover GREs located upstream of these genes we change y_type to gre:

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="gre", logic="or").head()
counts all_counts pval qval_BH qval_bonferroni
11 187 231 9.726498e-267 5.252309e-265 5.252309e-265
81 42 48 2.936386e-64 7.928242e-63 1.585648e-62
12 41 211 6.248311e-26 1.124696e-24 3.374088e-24
309 14 19 2.195152e-21 2.606271e-20 1.185382e-19
197 16 27 2.413214e-21 2.606271e-20 1.303136e-19

5 rows × 5 columns

Note that GRE names in ensemble are integer values, thus GRE #1 is simply called 1

Biclusters

Finally: to return biclusters we change y_type to bicluster.

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="bicluster", logic="or").head()

WARNING:root:Will return bicluster _id. The results might surprise you...
_id
0 54ee42396a208812a205295a
1 54ee42396a208812a2052982
2 54ee42826a208812a2052b61
3 54ee42836a208812a2052ce0
4 54ee430d6a208812a2053120

5 rows × 1 columns

Please note that this returns MongoDB _ids for the biclusters, which are not by themselves useful or informative. Rather, these _ids can be used in other queries (e.g. fimoFinder) or to query MongoDB directly.