-
Notifications
You must be signed in to change notification settings - Fork 2
basic_mining_agglom
Make sure ./egrin-tools/ folder is in your python path
You can do this on Mac/Linux by adding the path to you Bash Shell Startup Files, e.g. ~/.bashrc or ~/.bash_profile
for example in ~/.bash_profile add the following line:
export PYTHONPATH=$PYTHONPATH:path/to/egrin2-tools/
import query.egrin2_query as e2q
import pymongo
client = pymongo.MongoClient(host='baligadev')
db = client['eco_db']
There are several dependencies that need to be satisfied, including:
- pymongo
- numpy
- pandas
- joblib
- scipy
- statsmodels
- itertools
The agglom function is the centerpiece for querying information in the ensemble.
Using this function you can co-associate any two bicluster features across the entire ensemble: so - biclusters, genes, conditions, and GREs with biclusters, genes, conditions, or GREs.
If you stack multiple agglom
calls you can quickly build very complex queries.
The agglom
function has a simple, universal format. Basically it needs an input (x
), the type of that input (x_type
), the desired output type (y_type
), and some infomration about the location of the EGRIN 2.0 MongoDB.
The function will output a table of pvalues that reports the signifigance for each element in your output query type.
Let's take a look at the function itself, then we will explore serveral examples:
?e2q.agglom
In this example, we will mine the ensemble for genes discovered in biclusters with the gene E. coli carbamoyl phosphate synthetase, carA (b0032)
carA_genes = e2q.agglom(db, x="b0032", x_type="genes", y_type="genes")
carA_genes
counts | all_counts | pval | qval_BH | qval_bonferroni | |
---|---|---|---|---|---|
b1062 | 152 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0336 | 153 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0337 | 154 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b4244 | 167 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b4245 | 169 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0033 | 173 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0032 | 198 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b2499 | 142 | 198 | 3.783051e-300 | 4.073085e-298 | 3.665776e-297 |
b2476 | 142 | 198 | 3.783051e-300 | 4.073085e-298 | 3.665776e-297 |
b2313 | 141 | 198 | 5.664712e-297 | 4.574255e-295 | 5.489106e-294 |
b2312 | 141 | 198 | 5.664712e-297 | 4.574255e-295 | 5.489106e-294 |
b2557 | 141 | 198 | 5.664712e-297 | 4.574255e-295 | 5.489106e-294 |
b0522 | 138 | 198 | 1.477589e-287 | 1.101372e-285 | 1.431783e-284 |
b1849 | 136 | 198 | 2.280977e-281 | 1.578762e-279 | 2.210267e-278 |
b4006 | 132 | 198 | 3.382871e-269 | 2.185335e-267 | 3.278002e-266 |
b0523 | 129 | 198 | 3.043903e-260 | 1.843464e-258 | 2.949542e-257 |
b3654 | 128 | 198 | 2.728256e-257 | 1.555106e-255 | 2.643680e-254 |
b4246 | 126 | 198 | 1.964952e-251 | 1.057799e-249 | 1.904038e-248 |
b2500 | 122 | 198 | 6.646122e-240 | 3.389522e-238 | 6.440093e-237 |
b2497 | 112 | 198 | 4.061863e-212 | 1.967973e-210 | 3.935945e-209 |
b4064 | 102 | 198 | 9.823972e-186 | 4.533061e-184 | 9.519429e-183 |
b0945 | 96 | 198 | 1.511792e-170 | 6.658758e-169 | 1.464927e-167 |
b4005 | 89 | 198 | 2.020583e-153 | 8.512805e-152 | 1.957945e-150 |
b3941 | 58 | 198 | 3.940658e-85 | 1.591041e-83 | 3.818498e-82 |
b3359 | 57 | 198 | 3.886443e-83 | 1.448447e-81 | 3.765963e-80 |
b3829 | 57 | 198 | 3.886443e-83 | 1.448447e-81 | 3.765963e-80 |
b4024 | 56 | 198 | 3.714681e-81 | 1.333158e-79 | 3.599526e-78 |
b0908 | 55 | 198 | 3.440262e-79 | 1.149522e-77 | 3.333614e-76 |
b3774 | 55 | 198 | 3.440262e-79 | 1.149522e-77 | 3.333614e-76 |
b0198 | 54 | 198 | 3.086539e-77 | 9.647922e-76 | 2.990856e-74 |
b3008 | 54 | 198 | 3.086539e-77 | 9.647922e-76 | 2.990856e-74 |
b0197 | 48 | 198 | 8.154522e-66 | 2.469291e-64 | 7.901732e-63 |
b4013 | 46 | 198 | 4.011807e-62 | 1.143365e-60 | 3.887441e-59 |
b0907 | 46 | 198 | 4.011807e-62 | 1.143365e-60 | 3.887441e-59 |
b2600 | 45 | 198 | 2.673409e-60 | 7.401525e-59 | 2.590534e-57 |
b2601 | 44 | 198 | 1.720887e-58 | 4.632054e-57 | 1.667539e-55 |
b3212 | 43 | 198 | 1.069629e-56 | 2.801271e-55 | 1.036470e-53 |
b3172 | 37 | 198 | 2.884778e-46 | 6.988374e-45 | 2.795350e-43 |
b3958 | 37 | 198 | 2.884778e-46 | 6.988374e-45 | 2.795350e-43 |
b3213 | 37 | 198 | 2.884778e-46 | 6.988374e-45 | 2.795350e-43 |
b4254 | 36 | 198 | 1.385367e-44 | 3.196239e-43 | 1.342420e-41 |
b2818 | 36 | 198 | 1.385367e-44 | 3.196239e-43 | 1.342420e-41 |
b0004 | 34 | 198 | 2.840051e-41 | 6.254567e-40 | 2.752010e-38 |
b4488 | 34 | 198 | 2.840051e-41 | 6.254567e-40 | 2.752010e-38 |
b0860 | 32 | 198 | 4.953595e-38 | 1.043486e-36 | 4.800034e-35 |
b0199 | 32 | 198 | 4.953595e-38 | 1.043486e-36 | 4.800034e-35 |
b3714 | 31 | 198 | 1.943526e-36 | 3.843422e-35 | 1.883277e-33 |
b2156 | 31 | 198 | 1.943526e-36 | 3.843422e-35 | 1.883277e-33 |
b2152 | 31 | 198 | 1.943526e-36 | 3.843422e-35 | 1.883277e-33 |
b2913 | 30 | 198 | 7.306285e-35 | 1.335810e-33 | 7.079791e-32 |
b1101 | 30 | 198 | 7.306285e-35 | 1.335810e-33 | 7.079791e-32 |
b0031 | 30 | 198 | 7.306285e-35 | 1.335810e-33 | 7.079791e-32 |
b0273 | 30 | 198 | 7.306285e-35 | 1.335810e-33 | 7.079791e-32 |
b3771 | 29 | 198 | 2.629355e-33 | 4.549723e-32 | 2.547845e-30 |
b3959 | 29 | 198 | 2.629355e-33 | 4.549723e-32 | 2.547845e-30 |
b3770 | 29 | 198 | 2.629355e-33 | 4.549723e-32 | 2.547845e-30 |
b0002 | 28 | 198 | 9.049612e-32 | 1.538434e-30 | 8.769074e-29 |
b3772 | 27 | 198 | 2.975709e-30 | 4.805771e-29 | 2.883462e-27 |
b0003 | 27 | 198 | 2.975709e-30 | 4.805771e-29 | 2.883462e-27 |
b3960 | 27 | 198 | 2.975709e-30 | 4.805771e-29 | 2.883462e-27 |
... | ... | ... | ... | ... |
259 rows × 5 columns
There are several things to note in this query.
First the arguments:
-
x
specfies the query. This can begene(s)
,condition(s)
,GRE(s)
, orbicluster(s)
.x
can be a single entity or a list of entitites of the same type. -
x_type
indicates the type ofx
. This can includegene
,condition
,gres
, andbiclusters
. Basically: "what is x?" The parameter typing is pretty flexible, so - for example -rows
can be used instead ofgenes
. -
y_type
is the type of. Again,genes
,conditions
,gres
, orbiclusters
. -
host
specifies where the MongoDB database is running. In this case it is running on a machine calledbaligadev
. If you are hosting the database locally this would belocalhost
-
db
is the name of the database you want to perform the query in. Typically databases are specified by the three letter organism code (e.g., eco) followed by _db. A list of maintained databases is available here.
Also note the output:
This is a table of all genes that are frequently co-discovered in biclusters with b0032 (carA). They are sorted by their relative "enrichment" in the sample x
(where lower pval = greater enrichment). This is quantified by the hypergeometric distribution. To account for multiple testing, two corrections are included: Benjamini-Hochberg correction (qval_BH
; less stringent) and Bonferrnoni correction (qval_bonferroni
; more stringent).
We can easily change the format of x
, which makes it easy to specify genes in any format, e.g. NCBI annotation or common name. Let's try to include a list of genes with two different naming formats. This time we will only return the first few elements using .head()
e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type= "genes").head()
counts | all_counts | pval | qval_BH | qval_bonferroni | |
---|---|---|---|---|---|
b4244 | 179 | 198 | 0 | 0 | 0 |
b0033 | 180 | 198 | 0 | 0 | 0 |
b4245 | 185 | 198 | 0 | 0 | 0 |
b0031 | 198 | 198 | 0 | 0 | 0 |
b4246 | 198 | 198 | 0 | 0 | 0 |
5 rows × 5 columns
Here, two of the genes are Blattner Ids (b****), while the other is a common name. This is no problem for the query.
The lookup table is pretty extensive for genes (it comes from MicrobesOnline) so anything from common name to GI number can be matched. Cool, huh?
You can specify the type of logic to use in grouping your query.
For example, do you want do find biclusters where all genes x
in your query are present (and
logic) or biclusters where at least one of the genes is present (or
logic). You can change the logical grouping of your query by specifying the logic
parameter.
This parameter can take one of three values: and
, or
, or nor
. By default the command uses or
logic.
Here we change the logic employed in the previous query to "and", meaning that a bicluster must contain all of the query genes. Notice how the values change.
e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="genes", logic="and").head()
counts | all_counts | pval | qval_BH | qval_bonferroni | |
---|---|---|---|---|---|
b4246 | 25 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0032 | 25 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b0031 | 25 | 198 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
b4006 | 24 | 198 | 5.510260e-57 | 3.981163e-55 | 1.592465e-54 |
b1062 | 23 | 198 | 2.604624e-53 | 1.075338e-51 | 7.527363e-51 |
5 rows × 5 columns
Logical operations can provide a substantial amount of power to your queries. For example, imagine that you not only want to know when b0032, pyrL, and b0031 are co-regulated but also when they are not. This can be done using the logical operator nor
.
So far we have mined gene-gene co-associations, but there are other associations in the ensemble that are also available, including conditions, and GREs, and biclusters.
Each of these can be accessed by changing the x_type
and y_type
parameter.
For example to find conditions where these genes are co-regulated, we change y_type
to conditions
:
e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="conditions", logic="or").head()
counts | all_counts | pval | qval_BH | qval_bonferroni | |
---|---|---|---|---|---|
rb_del_rpoS_biofilm | 240 | 12921 | 2.849586e-13 | 7.228851e-11 | 1.327907e-10 |
biofilm_24hr_del_yceP | 239 | 12879 | 3.889790e-13 | 7.228851e-11 | 1.812642e-10 |
ik_H2_T4 | 232 | 12380 | 4.653767e-13 | 7.228851e-11 | 2.168655e-10 |
MG1063_uninduced_t180 | 232 | 12543 | 2.190004e-12 | 2.551355e-10 | 1.020542e-09 |
rb_del_rpoS_exponential | 226 | 12198 | 5.366065e-12 | 5.001173e-10 | 2.500586e-09 |
5 rows × 5 columns
To discover GREs located upstream of these genes we change y_type
to gre
:
e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="gre", logic="or").head()
counts | all_counts | pval | qval_BH | qval_bonferroni | |
---|---|---|---|---|---|
11 | 187 | 231 | 9.726498e-267 | 5.252309e-265 | 5.252309e-265 |
81 | 42 | 48 | 2.936386e-64 | 7.928242e-63 | 1.585648e-62 |
12 | 41 | 211 | 6.248311e-26 | 1.124696e-24 | 3.374088e-24 |
309 | 14 | 19 | 2.195152e-21 | 2.606271e-20 | 1.185382e-19 |
197 | 16 | 27 | 2.413214e-21 | 2.606271e-20 | 1.303136e-19 |
5 rows × 5 columns
Note that GRE names in ensemble are integer values, thus GRE #1
is simply called 1
Finally: to return biclusters we change y_type
to bicluster
.
e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="bicluster", logic="or").head()
WARNING:root:Will return bicluster _id. The results might surprise you...
_id | |
---|---|
0 | 54ee42396a208812a205295a |
1 | 54ee42396a208812a2052982 |
2 | 54ee42826a208812a2052b61 |
3 | 54ee42836a208812a2052ce0 |
4 | 54ee430d6a208812a2053120 |
5 rows × 1 columns
Please note that this returns MongoDB _id
s for the biclusters, which are not by themselves useful or informative. Rather, these _id
s can be used in other queries (e.g. fimoFinder
) or to query MongoDB directly.