Cistrome DB Data
This wiki page contains information about the technical aspects of the Cistrome Data Browser (Cistrome DB), such as API endpoints and file types. Fixes #101.
So far, most of this information has been gathered by reverse-engineering the Cistrome DB website, so some of it may be inaccurate.
The following API call returns metadata for many samples. This metadata populates the main table on the Cistrome DB home page, http://cistrome.org/db/#/
http://dc2.cistrome.org/api/main_filter_ng?allqc=false&cellinfos=all&completed=false&curated=false&factors=all&keyword=&page=1&peakqc=false&run=false&species=all
This URL returns data in the following JSON format
{
"datasets": [
{
"id": 1,
"factor__name": "BTAF1",
"species__name": "Homo sapiens",
"tissue_type__name": "Cervix",
"cell_line__name": "HeLa",
"strain__name": null,
"cell_pop__name": null,
"cell_type__name": "Epithelium",
"qc_judge": {},
"biological_source": "HeLa;Epithelium;Cervix",
"paper__reference": "Johannes F. et al, ...",
"paper__pub_summary": "Johannes F. et al, Bioinformatics 2010",
"status": "completed"
},
...
],
"cellinfos": [
[ "ct", "1-cell-pronuclei" ],
[ "cl", "1015c" ],
...
],
"factors": [
"AATF",
"ABCC9",
...
],
"species": [
"Homo sapiens",
"Mus musculus"
],
"num_pages": 2823,
"request_page": 1
}
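The "num_pages" and "request_page" fields suggest a simple way to walk the full result set. The following is a minimal sketch, not a confirmed client: `iter_datasets` and its `fetch_page` argument are hypothetical names, and `fetch_page` stands for any callable that takes a 1-based page number and returns the decoded JSON for that page.

```python
def iter_datasets(fetch_page, max_pages=None):
    """Yield every record in "datasets", page by page.

    fetch_page: hypothetical callable, page number -> decoded JSON dict.
    max_pages: optional cap, useful when testing against the live API.
    """
    page = 1
    while True:
        response = fetch_page(page)
        yield from response.get("datasets", [])
        # Assumption: "num_pages" is the index of the last available page.
        last_page = response.get("num_pages", page)
        if max_pages is not None:
            last_page = min(last_page, max_pages)
        if page >= last_page:
            break
        page += 1
```

Passing the fetch function in keeps the paging logic testable without network access, and `max_pages` avoids requesting all pages by accident.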
For subsequent pages, change the page=1 parameter. For example, for page 2:
http://dc2.cistrome.org/api/main_filter_ng?allqc=false&cellinfos=all&completed=false&curated=false&factors=all&keyword=&page=2&peakqc=false&run=false&species=all
Example of a filtered query:
http://dc2.cistrome.org/api/main_filter_ng?allqc=true&cellinfos=all&completed=false&curated=false&factors=H3K27ac&keyword=h3k27ac&page=1&peakqc=false&run=false&species=Homo+sapiens
where allqc=true restricts the results to samples that pass all quality control checks.
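The query strings above can be assembled programmatically. This is a sketch under the assumption that the endpoint accepts exactly the parameters seen in the URLs on this page; `build_filter_url` and `fetch_filter_page` are hypothetical helper names, and only the parameters varied in the examples are exposed as arguments.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://dc2.cistrome.org/api/main_filter_ng"


def build_filter_url(page=1, species="all", factors="all", keyword="", allqc=False):
    """Build a main_filter_ng URL with the parameters observed on this page."""
    params = {
        "allqc": str(allqc).lower(),  # the API uses lowercase "true"/"false"
        "cellinfos": "all",
        "completed": "false",
        "curated": "false",
        "factors": factors,
        "keyword": keyword,
        "page": page,
        "peakqc": "false",
        "run": "false",
        "species": species,
    }
    # urlencode encodes spaces as "+", e.g. "Homo sapiens" -> "Homo+sapiens".
    return BASE_URL + "?" + urlencode(params)


def fetch_filter_page(page=1, **filters):
    """Request one page and decode the JSON body (requires network access)."""
    with urlopen(build_filter_url(page=page, **filters), timeout=30) as resp:
        return json.load(resp)
```

For instance, `build_filter_url(page=2)` reproduces the page 2 URL shown above.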
Each sample from the Cistrome DB is associated with a unique ID. For example, these IDs can be obtained by clicking a sample in the main table on the home page http://cistrome.org/db/#/ and then viewing the URL for the "Wash U Browser" button. The end of the Wash U URL contains the following Cistrome URL:
http://dc2.cistrome.org/api/datahub/{cid}
where {cid} is replaced by the sample ID (i.e., "id").
For example, http://dc2.cistrome.org/api/datahub/387 returns
[
{
"name": "387_treat.bw",
"url": "http://dc2.cistrome.org/genome_browser/bw/387_treat.bw",
"type": "bigwig",
"mode": "show",
"showOnHubLoad": true,
"options": {
"height": 100
}
}
]
The "url" value points to the associated bigWig file.
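Extracting the bigWig location from a datahub response might look like the sketch below. The helper names `datahub_url` and `bigwig_urls` are hypothetical, and the code assumes the response is always a list of track objects shaped like the one shown above.

```python
DATAHUB_URL = "http://dc2.cistrome.org/api/datahub/{cid}"


def datahub_url(cid):
    """Return the datahub URL for a Cistrome sample ID."""
    return DATAHUB_URL.format(cid=cid)


def bigwig_urls(datahub_tracks):
    """Collect the "url" of every track whose "type" is "bigwig"."""
    return [
        track["url"]
        for track in datahub_tracks
        if track.get("type") == "bigwig" and "url" in track
    ]
```

Filtering on "type" rather than taking the first element guards against responses that contain additional, non-bigWig tracks, although no such response has been observed here.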
By checking the network requests when selecting a sample on the Cistrome DB main table, we see a request for the following "inspector" URL, which returns JSON with metadata about the selected sample.
http://dc2.cistrome.org/api/inspector?id={cid}
For example, http://dc2.cistrome.org/api/inspector?id=387 returns
{
"status": "completed",
"motif": false,
"treats": [
{
"other_ids": "{\"pmid\": \"21737748\", \"sra\": \"81095\", \"gse\": \"200029600\"}",
"cell_line__name": "CUTLL1",
"factor__name": "H3K27me3",
"is_correcting": false,
"strain__name": null,
"cell_pop__name": null,
"paper__reference": "Wang H, et al. Genome-wide analysis reveals conserved and divergent features of Notch1/RBPJ binding in human and murine T-lymphoblastic leukemia cells. Proc. Natl. Acad. Sci. U.S.A. 2011",
"cell_type__name": "T Lymphocyte",
"paper__pmid": 21737748,
"paper__journal__name": "Proc. Natl. Acad. Sci. U.S.A.",
"name": "CUTLL-H3K27me3",
"disease_state__name": "T-cell Lymphoma",
"link": "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM732912",
"unique_id": "GSM732912",
"species__name": "Homo sapiens",
"tissue_type__name": "Blood",
"paper__lab": "Aster JC"
}
],
"sign": "eyJpZCI6IjM4NyJ9:1j6dGT:XhNVieMeI6MDjsobT1S_mnIfAOg",
"qc": {
"judge": {
"map": true,
"peaks": false,
"fastqc": true,
"frip": true,
"pbc": true,
"motif_judge": false,
"dhs": true
},
"table": {
"meta_orig": {
"intron": 0.3573820395738204,
"inter": 0.46164383561643835,
"exon": 0.04961948249619483,
"promoter": 0.13135464231354643
},
"map": [ "85.9%" ],
"treat_number": 1,
"peaks": [
6631,
78,
15
],
"control_number": 0,
"fastqc": [ 37 ],
"frip": [ "1.5%" ],
"sample": [ "387_treat_rep1" ],
"meta": [ "13.1% / 5.0% / 35.7% / 46.2%" ],
"map_number": [ 29450835 ],
"pbc": [ "99.3%" ],
"motif": false,
"dhs": "70.8%",
"raw_number": [ 34289570 ]
}
},
"motif_url": "",
"id": "387"
}
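Two details of this response are easy to miss: "other_ids" is itself a JSON-encoded string, and several QC values are percentages stored as text (for example "85.9%"). The sketch below flattens the fields most likely to be useful as sample metadata. `summarize_inspector` and `percent_to_fraction` are hypothetical names, and the code assumes the first entry of "treats" describes the sample.

```python
import json


def percent_to_fraction(text):
    """Convert a string such as "85.9%" into a fraction such as 0.859."""
    return float(text.rstrip("%")) / 100.0


def summarize_inspector(response):
    """Flatten selected fields of an inspector response into one dict."""
    treat = response["treats"][0]  # assumption: first entry describes the sample
    other_ids = json.loads(treat["other_ids"])  # nested JSON stored as a string
    judge = response["qc"]["judge"]
    table = response["qc"]["table"]
    return {
        "id": response["id"],
        "factor": treat["factor__name"],
        "gsm": treat["unique_id"],
        "pmid": other_ids.get("pmid"),
        "mapped_fraction": percent_to_fraction(table["map"][0]),
        "frip": percent_to_fraction(table["frip"][0]),
        # Names of the QC checks whose "judge" value is false.
        "failed_qc": sorted(name for name, ok in judge.items() if not ok),
    }
```

Applied to the response for sample 387, the "failed_qc" entry would list "motif_judge" and "peaks", the two checks marked false above.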
bigWig (typically files with the .bw extension) is an indexed binary file format for the kind of data stored in WIG files.
- .bw files can be opened using bwtool.
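As a quick sanity check that a downloaded file really is a bigWig, the fixed-size header can be inspected with the standard library alone. This sketch assumes the 64-byte header layout from the published bigWig format description (magic number 0x888FFC26 followed by version, zoom level count, and offsets); `read_bigwig_header` is a hypothetical helper and does not replace bwtool or a full parser.

```python
import struct

BIGWIG_MAGIC = 0x888FFC26
# magic, version, zoomLevels, chromTreeOffset, fullDataOffset, fullIndexOffset,
# fieldCount, definedFieldCount, autoSqlOffset, totalSummaryOffset,
# uncompressBufSize, reserved  (64 bytes in total)
HEADER_FORMAT = "IHHQQQHHQQIQ"
HEADER_SIZE = 64


def read_bigwig_header(data):
    """Parse the first 64 bytes of a file; raise ValueError if not bigWig."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 64 bytes of header")
    # The file may be written in either byte order, so try both.
    for byte_order in ("<", ">"):
        fields = struct.unpack(byte_order + HEADER_FORMAT, data[:HEADER_SIZE])
        if fields[0] == BIGWIG_MAGIC:
            return {
                "byte_order": "little" if byte_order == "<" else "big",
                "version": fields[1],
                "zoom_levels": fields[2],
                "full_data_offset": fields[4],
                "uncompress_buf_size": fields[10],
            }
    raise ValueError("magic number does not match bigWig")
```

A typical call would read only the header, for example `read_bigwig_header(open("387_treat.bw", "rb").read(64))`, which avoids loading the whole file.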
HiGlass can ingest individual bigWig files, as described in the docs here: https://docs.higlass.io/data_preparation.html#bigwig-files
However, for our needs, we want each bigWig file to be a row of a multivec file, and we also want the sample metadata to be stored in the multivec file.
For debugging purposes, we can view the bigWig files directly in the WashU browser (http://epigenomegateway.wustl.edu/browser/) using Local Tracks.
We need to determine how computationally intensive it is to convert bigWig files to multivec files, both in terms of execution time and disk space.
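One way to collect these numbers is a small timing harness around whatever conversion routine is eventually chosen. In the sketch below, `measure_conversion` is a hypothetical helper and `convert` is a placeholder for the actual bigWig-to-multivec step, which is not specified on this page; it is assumed to take a list of input paths and an output path.

```python
import os
import time


def measure_conversion(convert, input_paths, output_path):
    """Run convert(input_paths, output_path) and report time and disk usage."""
    input_bytes = sum(os.path.getsize(path) for path in input_paths)
    start = time.perf_counter()
    convert(input_paths, output_path)  # placeholder for the real converter
    elapsed = time.perf_counter() - start
    output_bytes = os.path.getsize(output_path)
    return {
        "samples": len(input_paths),
        "seconds": elapsed,
        "input_bytes": input_bytes,
        "output_bytes": output_bytes,
        # Ratio > 1 means the multivec file is larger than its inputs combined.
        "size_ratio": output_bytes / input_bytes if input_bytes else None,
    }
```

Running this for increasing sample counts (for example 10, 50, and 228) would show whether time and output size grow roughly linearly with the number of samples.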
Questions:
- is it possible to aggregate and convert all (2823) Cistrome DB bigWig files to a single .multivec file?
- how about grouping them by some metadata variable, e.g. cell type or species or factor, first?
- (it looks like there are currently 1523 human and 1301 mouse samples)
- we know that aggregating at least 228 samples is feasible, as that is the number of samples present in the cistrome-higlass-demo multivec file.
- is it possible to convert a limited number of bigWig files to multivec "on-the-fly"? For example, how long does it take to aggregate and convert the files for 10 samples?
- note that the hierarchical clustering step would also need to be run on the selected samples, to generate the hierarchical clustering metadata.
- if a separate HiGlass server instance is required for this Cistrome use case, who is responsible for managing the instance?
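The grouping idea raised in the questions above can be prototyped directly on the dataset records returned by the main_filter_ng call. This is a minimal sketch; `group_ids_by` is a hypothetical helper, and it assumes each record carries an "id" plus the metadata key being grouped on (for example "species__name" or "factor__name").

```python
from collections import defaultdict


def group_ids_by(datasets, key):
    """Map each value of a metadata field to the list of sample IDs having it."""
    groups = defaultdict(list)
    for record in datasets:
        # Records lacking the key (or holding null) are grouped under None.
        groups[record.get(key)].append(record["id"])
    return dict(groups)
```

Each resulting group of IDs could then be converted into its own multivec file, which keeps individual files closer to the 228-sample size already known to be feasible.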