-
Notifications
You must be signed in to change notification settings - Fork 2
basic_mining_coremfinder
Corems or condition-specific co-regulated modules are sets of genes that are tightly co-expressed in a condition-specific manner and whose expression is often controlled by common transcriptional regulators.
Expert note: A different perspective on corems
is that they are highly-refined, reproducibly-detected, biclusters that violate some constraints imposed on cMonkey-detected biclusters.
coremFinder
mines information about corems that are detected by EGRIN 2.0.
Typically, about half of the genes in the genome may be discovered in corems. For this subset of genes, the coremFinder
function can return information about the corem, including:
- gene composition
- condition-specific activity
- edges contained in corem
This function can be combined with the agglom
function to drive analysis of EGRIN 2.0 ensembles beyond genes contained in corems.
Make sure ./egrin-tools/ folder is in your python path
You can do this on Mac/Linux by adding the path to you Bash Shell Startup Files, e.g. ~/.bashrc or ~/.bash_profile
for example in ~/.bash_profile add the following line:
export PYTHONPATH=$PYTHONPATH:path/to/egrin2-tools/
import query.egrin2_query as e2q
import pymongo
client = pymongo.MongoClient(host='baligadev')
db = client['eco_db']
There are several dependencies that need to be satisfied, including:
- pymongo
- numpy
- pandas
- joblib
- scipy
- statsmodels
- itertools
The find_corem_info()
function is very similar to the agglom()
function. Basically, it co-associates information about corems, where the infomration supplied and retrieved is modulated by defining the arguments: x
, x_type
, and y_type
.
The function returns the requested information about a corem.
You can find out more about this function and its parameters by issuing the following commmand:
?e2q.find_corem_info
The most straightforward way to use coremFinder
is to find genes contained in a corem.
For example, to find all genes in E. coli corem #1
we would type:
corem_1 = e2q.find_corem_info(db, x=1, x_type="corem", y_type="genes")
corem_1
genes | |
---|---|
0 | b3317 |
1 | b3320 |
2 | b3319 |
3 | b3313 |
4 | b3315 |
5 | b3318 |
6 | b3314 |
7 | b3321 |
8 | b3316 |
9 rows × 1 columns
There are several things to note in this query.
First the arguments:
-
x
specfies the query. This can begene(s)
,condition(s)
,GRE(s)
, oredge(s)
.x
can be a single entity or a list of entitites of the same type. -
x_type
indicates the type ofx
. This can includegene
,condition
,gres
, andedges
. Basically: "what is x?" The parameter typing is pretty flexible, so - for example -rows
can be used instead ofgenes
. -
y_type
is the type of. Again,genes
,conditions
,gres
, oredges
. -
host
specifies where the MongoDB database is running. In this case it is running on a machine calledbaligadev
. If you are hosting the database locally this would belocalhost
-
db
is the name of the database you want to perform the query in. Typically databases are specified by the three letter organism code (e.g., eco) followed by _db. A list of maintained databases is available here.
Also notice that corems (like GREs) are named as integer values.
It should also be noted that corems are ordered by their weighted-density. Thus, corem #1
is the most densly connected corem in the network. Basically, this means that each gene in the corem is co-discovered frequently in biclusters with every other gene in that corem (strongly connected subnetwork).
Here we see that if we translate the names of these genes, we find that they are part of a ribosomal operon, which makes sense in light of the fact that ribosomal genes are tightly co-expressed.
e2q.row2id_batch(db, corem_1.genes.tolist(), return_field="name")
[u'rplB',
u'rplC',
u'rplD',
u'rplP',
u'rplV',
u'rplW',
u'rpsC',
u'rpsJ',
u'rpsS']
More commonly, you want to know the corems to which a particualr gene belongs.
This can be accomplished by changing x
, x_type
, and y_type
, as follows:
carA_corems = e2q.find_corem_info(db, x="carA", x_type="gene", y_type="corems")
carA_corems
corems | |
---|---|
0 | 107 |
1 | 471 |
2 | 835 |
3 | 847 |
4 rows × 1 columns
We can see from this query that carA belongs to four corems. We could retrieve the genes in these corems like in Example 1
:
e2q.find_corem_info(db, x=carA_corems.corems.tolist(), x_type="corems", y_type="genes")
genes | |
---|---|
0 | b0002 |
1 | b0003 |
2 | b0004 |
3 | b0032 |
4 | b0033 |
5 | b0197 |
6 | b0198 |
7 | b0273 |
8 | b0287 |
9 | b0336 |
10 | b0337 |
11 | b0522 |
12 | b0523 |
13 | b0572 |
14 | b0573 |
15 | b0750 |
16 | b0754 |
17 | b0775 |
18 | b0776 |
19 | b0777 |
20 | b0778 |
21 | b0860 |
22 | b0907 |
23 | b0908 |
24 | b0931 |
25 | b0945 |
26 | b1062 |
27 | b1761 |
28 | b1849 |
29 | b2103 |
30 | b2104 |
31 | b2312 |
32 | b2313 |
33 | b2476 |
34 | b2497 |
35 | b2499 |
36 | b2500 |
37 | b2557 |
38 | b2600 |
39 | b2601 |
40 | b2818 |
41 | b2838 |
42 | b2913 |
43 | b2942 |
44 | b3008 |
45 | b3089 |
46 | b3172 |
47 | b3212 |
48 | b3213 |
49 | b3359 |
50 | b3654 |
51 | b3769 |
52 | b3770 |
53 | b3771 |
54 | b3772 |
55 | b3774 |
56 | b3824 |
57 | b3829 |
58 | b3941 |
59 | b3956 |
... |
75 rows × 1 columns
####Example 3: Logical operations
Similar to the agglom
function we can implement logical operations. For example, if we wanted to know the genes that belonged to all of the corems in which carA is a member we would simply set logic = "and"
e2q.find_corem_info(db, x=carA_corems.corems.tolist(), x_type="corems", y_type="genes", logic="and")
genes | |
---|---|
0 | b0032 |
1 | b0033 |
2 | b0197 |
3 | b0198 |
4 | b0287 |
5 | b2500 |
6 | b2600 |
7 | b2601 |
8 | b3769 |
9 | b3770 |
10 | b3771 |
11 | b3772 |
12 | b3956 |
13 | b4005 |
14 | b4006 |
15 | b4064 |
16 | b4246 |
17 | b4488 |
18 rows × 1 columns
Notice that only 17 out of the 75 genes in these four corems are present in every one of the four corems
Similar to the gene example above, corems can be discovered based on the experimental conditions in which the genes in a corem are co-regulated as well. For example, the experimental conditions associated with corem #1
can be discovered by changing the y_type
supplied to one of the previous commands. Since there are many conditions, we will only display the first 10.
corem_1_conditions = e2q.find_corem_info(db, x=1, x_type="corem", y_type="conditions")
corem_1_conditions[0:10]
conditions | |
---|---|
0 | str_ctrl_0m |
1 | str_str_K_relA_M9 |
2 | str_ctrl_K_relA_M9 |
3 | str_ctrl_M9 |
4 | W3110_wt_luxS_glucose |
5 | W3110_K_luxS_glucose |
6 | suspension_24hr |
7 | suspension_15hr |
8 | biofilm_15hr |
9 | str_str_LV_20m |
10 rows × 1 columns
Likewise, we could retrieve all of the other corems that are also "active" in the first 10 conditions annotated to corem #1
by:
e2q.find_corem_info(db, x=corem_1_conditions.conditions[0:10].tolist(), x_type="conditions", y_type="corems", logic="and")
corems | |
---|---|
0 | 1 |
1 | 3 |
2 | 45 |
3 | 46 |
4 | 55 |
5 | 74 |
6 | 76 |
7 | 104 |
8 | 111 |
9 | 114 |
10 | 117 |
11 | 119 |
12 | 120 |
13 | 135 |
14 | 139 |
15 | 145 |
16 | 151 |
17 | 311 |
18 | 344 |
19 | 391 |
20 | 392 |
21 | 402 |
22 | 410 |
23 | 416 |
24 | 432 |
25 | 477 |
26 | 493 |
27 | 559 |
28 | 560 |
29 | 570 |
30 | 572 |
31 | 576 |
32 | 578 |
33 | 579 |
34 | 585 |
35 | 595 |
36 | 629 |
37 | 647 |
38 | 650 |
39 | 655 |
40 | 665 |
41 | 693 |
42 | 749 |
43 | 768 |
44 | 770 |
45 | 811 |
46 | 832 |
47 | 835 |
48 | 836 |
49 | 837 |
50 | 840 |
51 | 884 |
52 | 886 |
53 | 889 |
54 | 891 |
55 | 917 |
56 rows × 1 columns
Thus there are 55 corems that are co-regulated in all of the (first) ten conditions in which the genes in corem #1
are also co-regulated.
Technically, corems are "link-communities", meaning that they are sets of edges, where the edge is a co-regulatory assocaition between two genes (nodes). This is why a single gene (node) can belong to multiple corems (link-communities).
To retrieve that actual edges
that define a corem, set y_type
to "edges"
:
e2q.find_corem_info(db, x=1, x_type="corem", y_type="edges")
edges | |
---|---|
0 | b3313-b3314 |
1 | b3313-b3315 |
2 | b3313-b3316 |
3 | b3313-b3317 |
4 | b3313-b3318 |
5 | b3313-b3319 |
6 | b3313-b3320 |
7 | b3313-b3321 |
8 | b3314-b3315 |
9 | b3314-b3316 |
10 | b3314-b3317 |
11 | b3314-b3318 |
12 | b3314-b3319 |
13 | b3314-b3320 |
14 | b3314-b3321 |
15 | b3316-b3315 |
16 | b3317-b3315 |
17 | b3317-b3316 |
18 | b3318-b3315 |
19 | b3318-b3316 |
20 | b3318-b3317 |
21 | b3318-b3320 |
22 | b3319-b3315 |
23 | b3319-b3316 |
24 | b3319-b3317 |
25 | b3319-b3318 |
26 | b3319-b3320 |
27 | b3320-b3315 |
28 | b3320-b3316 |
29 | b3320-b3317 |
30 | b3321-b3315 |
31 | b3321-b3316 |
32 | b3321-b3317 |
33 | b3321-b3318 |
34 | b3321-b3319 |
35 | b3321-b3320 |
36 rows × 1 columns