
Cistrome DB Data

Sehi L'Yi edited this page Mar 3, 2020 · 13 revisions

This wiki page contains information about the technical aspects of the Cistrome Data Browser (Cistrome DB), such as API endpoints and file types. Fixes #101.

For now, most of this information has been gathered by reverse-engineering the Cistrome DB website, so some of it may be inaccurate.

List of Samples

The following API call returns metadata for many samples. This metadata populates the main table on the Cistrome DB home page (http://cistrome.org/db/#/).

http://dc2.cistrome.org/api/main_filter_ng?allqc=false&cellinfos=all&completed=false&curated=false&factors=all&keyword=&page=1&peakqc=false&run=false&species=all

This URL returns data in the following JSON format

{
  "datasets": [
    {
      "id": 1,
      "factor__name": "BTAF1",
      "species__name": "Homo sapiens",
      "tissue_type__name": "Cervix",
      "cell_line__name": "HeLa",
      "strain__name": null,
      "cell_pop__name": null,
      "cell_type__name": "Epithelium",
      "qc_judge": {},
      "biological_source": "HeLa;Epithelium;Cervix",
      "paper__reference": "Johannes F. et al, ...",
      "paper__pub_summary": "Johannes F. et al, Bioinformatics 2010",
      "status": "completed"
    },
    ...
  ],
  "cellinfos": [
    [ "ct", "1-cell-pronuclei" ],
    [ "cl", "1015c" ],
    ...
  ],
  "factors": [
    "AATF",
    "ABCC9",
    ...
  ],
  "species": [
    "Homo sapiens",
    "Mus musculus"
  ],
  "num_pages": 2823,
  "request_page": 1
}

For subsequent pages, change the page=1 parameter. For example, for page 2:

http://dc2.cistrome.org/api/main_filter_ng?allqc=false&cellinfos=all&completed=false&curated=false&factors=all&keyword=&page=2&peakqc=false&run=false&species=all
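As a minimal sketch of the paging described above, the following Python walks the pages until "num_pages" is reached. The function names (`page_url`, `fetch_page`, `iter_datasets`) are illustrative and not part of any Cistrome client. The sketch assumes the response keeps the "datasets" and "num_pages" fields shown in the example, which may not hold since the API was reverse-engineered.

```python
import json
import urllib.request
from typing import Callable, Iterator

# URL copied from the example above, with the page number as a placeholder.
PAGE_URL = (
    "http://dc2.cistrome.org/api/main_filter_ng?allqc=false&cellinfos=all"
    "&completed=false&curated=false&factors=all&keyword=&page={page}"
    "&peakqc=false&run=false&species=all"
)


def page_url(page: int) -> str:
    """Return the sample-list URL for the given page number."""
    return PAGE_URL.format(page=page)


def fetch_page(page: int) -> dict:
    """Download and parse one page (performs a network request)."""
    with urllib.request.urlopen(page_url(page)) as response:
        return json.load(response)


def iter_datasets(fetch: Callable[[int], dict] = fetch_page) -> Iterator[dict]:
    """Yield every dataset record, page by page.

    Assumption: "num_pages" in each response gives the last valid page.
    """
    page = 1
    while True:
        result = fetch(page)
        yield from result["datasets"]
        if page >= result["num_pages"]:
            break
        page += 1
```

Passing a different `fetch` function makes it possible to try the loop against saved responses instead of the live server. With thousands of pages, a real crawl would probably also want a delay between requests.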

List of Samples with Filter

Example:

http://dc2.cistrome.org/api/main_filter_ng?allqc=true&cellinfos=all&completed=false&curated=false&factors=H3K27ac&keyword=h3k27ac&page=1&peakqc=false&run=false&species=Homo+sapiens

where allqc=true restricts the results to samples that pass all quality control checks.
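The filtered URL above can be assembled from its query parameters, as in this hedged Python sketch. `filter_url` is a hypothetical helper. Only allqc, factors, keyword, species, and page are exposed as arguments; the remaining flags (completed, curated, peakqc, run) are left at the values seen in the example because their meaning has not been confirmed.

```python
from urllib.parse import urlencode

BASE_URL = "http://dc2.cistrome.org/api/main_filter_ng"


def filter_url(factors="all", species="all", keyword="", allqc=False, page=1):
    """Build a sample-list URL with the given filters.

    Parameter names and order follow the example URL; urlencode turns
    spaces into "+", so "Homo sapiens" becomes "Homo+sapiens".
    """
    params = {
        "allqc": "true" if allqc else "false",
        "cellinfos": "all",
        "completed": "false",  # meaning unconfirmed, kept as in the example
        "curated": "false",    # meaning unconfirmed, kept as in the example
        "factors": factors,
        "keyword": keyword,
        "page": page,
        "peakqc": "false",     # meaning unconfirmed, kept as in the example
        "run": "false",        # meaning unconfirmed, kept as in the example
        "species": species,
    }
    return BASE_URL + "?" + urlencode(params)
```

For instance, `filter_url(factors="H3K27ac", species="Homo sapiens", keyword="h3k27ac", allqc=True)` reproduces the example URL.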

Individual Sample Data

Each sample from the Cistrome DB is associated with a unique ID. For example, these IDs can be obtained by clicking a sample in the main table on the home page http://cistrome.org/db/#/ and then viewing the URL for the "Wash U Browser" button. The end of the Wash U URL contains the following Cistrome URL:

http://dc2.cistrome.org/api/datahub/{cid}

where {cid} is replaced by the sample ID (i.e., the "id" value of the sample).

For example, http://dc2.cistrome.org/api/datahub/387 returns

[
  {
    "name": "387_treat.bw",
    "url": "http://dc2.cistrome.org/genome_browser/bw/387_treat.bw",
    "type": "bigwig",
    "mode": "show",
    "showOnHubLoad": true,
    "options": {
      "height": 100
    }
  }
]

The "url" value is for the associated bigWig file.
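A small Python sketch of how the bigWig URL could be pulled out of a datahub response. `datahub_url` and `bigwig_urls` are illustrative names, and `EXAMPLE` is the response for sample 387 copied from above. Whether a response can contain several tracks or other track types is not known, so the filter on "type" is a precaution rather than documented behavior.

```python
def datahub_url(cid):
    """Return the datahub URL for a sample ID."""
    return f"http://dc2.cistrome.org/api/datahub/{cid}"


def bigwig_urls(tracks):
    """Return the "url" of every track whose "type" is "bigwig"."""
    return [t["url"] for t in tracks if t.get("type") == "bigwig"]


# Response for sample 387, as shown above.
EXAMPLE = [
    {
        "name": "387_treat.bw",
        "url": "http://dc2.cistrome.org/genome_browser/bw/387_treat.bw",
        "type": "bigwig",
        "mode": "show",
        "showOnHubLoad": True,
        "options": {"height": 100},
    }
]

print(datahub_url(387))
print(bigwig_urls(EXAMPLE))
```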

Individual Sample Metadata

By checking the network requests when selecting a sample on the Cistrome DB main table, we see a request for the following "inspector" URL, which returns JSON with metadata about the selected sample.

http://dc2.cistrome.org/api/inspector?id={cid}

For example, http://dc2.cistrome.org/api/inspector?id=387 returns

{
  "status": "completed",
  "motif": false,
  "treats": [
    {
      "other_ids": "{\"pmid\": \"21737748\", \"sra\": \"81095\", \"gse\": \"200029600\"}",
      "cell_line__name": "CUTLL1",
      "factor__name": "H3K27me3",
      "is_correcting": false,
      "strain__name": null,
      "cell_pop__name": null,
      "paper__reference": "Wang H, et al. Genome-wide analysis reveals conserved and divergent features of Notch1/RBPJ binding in human and murine T-lymphoblastic leukemia cells. Proc. Natl. Acad. Sci. U.S.A. 2011",
      "cell_type__name": "T Lymphocyte",
      "paper__pmid": 21737748,
      "paper__journal__name": "Proc. Natl. Acad. Sci. U.S.A.",
      "name": "CUTLL-H3K27me3",
      "disease_state__name": "T-cell Lymphoma",
      "link": "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM732912",
      "unique_id": "GSM732912",
      "species__name": "Homo sapiens",
      "tissue_type__name": "Blood",
      "paper__lab": "Aster JC"
    }
  ],
  "sign": "eyJpZCI6IjM4NyJ9:1j6dGT:XhNVieMeI6MDjsobT1S_mnIfAOg",
  "qc": {
    "judge": {
      "map": true,
      "peaks": false,
      "fastqc": true,
      "frip": true,
      "pbc": true,
      "motif_judge": false,
      "dhs": true
    },
    "table": {
      "meta_orig": {
        "intron": 0.3573820395738204,
        "inter": 0.46164383561643835,
        "exon": 0.04961948249619483,
        "promoter": 0.13135464231354643
      },
      "map": [ "85.9%" ],
      "treat_number": 1,
      "peaks": [
        6631,
        78,
        15
      ],
      "control_number": 0,
      "fastqc": [ 37 ],
      "frip": [ "1.5%" ],
      "sample": [ "387_treat_rep1" ],
      "meta": [ "13.1% / 5.0% / 35.7% / 46.2%" ],
      "map_number": [ 29450835 ],
      "pbc": [ "99.3%" ],
      "motif": false,
      "dhs": "70.8%",
      "raw_number": [ 34289570 ]
    }
  },
  "motif_url": "",
  "id": "387"
}
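Two details of this response are easy to miss: "other_ids" is itself a JSON document stored as a string, so it needs a second decoding step, and several QC values are percentage strings rather than numbers. The Python sketch below shows one way to handle both. `summarize` and `parse_percent` are hypothetical helpers, and reading a false value in "judge" as a failed check is an assumption. `EXAMPLE` is a trimmed copy of the response above.

```python
import json


def parse_percent(text):
    """Convert a string such as "1.5%" to a fraction (0.015)."""
    return float(text.rstrip("%")) / 100


def summarize(inspector):
    """Collect a few fields from an inspector response.

    Assumptions: only the first entry of "treats" is used, and a false
    value in qc.judge is read as a failed QC check.
    """
    treat = inspector["treats"][0]
    other_ids = json.loads(treat["other_ids"])  # nested JSON string
    qc = inspector["qc"]
    return {
        "id": inspector["id"],
        "factor": treat["factor__name"],
        "gsm": treat["unique_id"],
        "pmid": other_ids.get("pmid"),
        "frip": [parse_percent(v) for v in qc["table"]["frip"]],
        "failed_qc": sorted(k for k, ok in qc["judge"].items() if not ok),
    }


# Trimmed copy of the response for sample 387.
EXAMPLE = {
    "treats": [
        {
            "other_ids": '{"pmid": "21737748", "sra": "81095", "gse": "200029600"}',
            "factor__name": "H3K27me3",
            "unique_id": "GSM732912",
        }
    ],
    "qc": {
        "judge": {"map": True, "peaks": False, "fastqc": True, "frip": True,
                  "pbc": True, "motif_judge": False, "dhs": True},
        "table": {"frip": ["1.5%"]},
    },
    "id": "387",
}

print(summarize(EXAMPLE))
```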

bigWig File Format

bigWig (typically with the .bw extension) is an indexed binary format for the data stored in WIG files.

  • .bw files can be opened using bwtool.

HiGlass can ingest individual bigWig files, as described in the docs here: https://docs.higlass.io/data_preparation.html#bigwig-files

However, for our needs, we want each bigWig file to be a row of a multivec file, and we also want the sample metadata to be stored in the multivec file.

For debugging purposes, we can view the bigWig files directly in the WashU browser (http://epigenomegateway.wustl.edu/browser/) using Local Tracks.

TODO

We need to determine how computationally intensive it is to convert bigWig files to multivec files, both in terms of execution time and disk space.

Questions:

  • is it possible to aggregate and convert all (2823) Cistrome DB bigWig files to a single .multivec file?
    • how about grouping them by some metadata variable, e.g. cell type or species or factor, first?
      • (it looks like there are currently 1523 human and 1301 mouse samples)
    • we know that aggregating at least 228 samples is feasible, as that is the number of samples present in the cistrome-higlass-demo multivec file.
  • is it possible to convert some limited number of bigWig files to multivec "on-the-fly"? For example, how long does it take to aggregate and convert the files for 10 samples?
    • note that the hierarchical clustering step would also need to be run on the selected samples, to generate the hierarchical clustering metadata.
    • if a separate higlass server instance for this cistrome use case is required, who is responsible for managing the instance?
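For the grouping question above, sample IDs could be bucketed by a metadata field before any conversion is attempted. This is only a sketch: `group_ids` is a hypothetical helper, the records are assumed to look like the "datasets" entries returned by the sample-list API, and the two records in the usage lines are made-up placeholders rather than real Cistrome DB entries.

```python
from collections import defaultdict


def group_ids(datasets, key="species__name"):
    """Map each value of a metadata field to the list of sample IDs.

    Records missing the field are grouped under None.
    """
    groups = defaultdict(list)
    for record in datasets:
        groups[record.get(key)].append(record["id"])
    return dict(groups)


# Placeholder records for illustration only.
records = [
    {"id": 1, "species__name": "Homo sapiens", "factor__name": "BTAF1"},
    {"id": 2, "species__name": "Mus musculus", "factor__name": "BTAF1"},
]
print(group_ids(records))
print(group_ids(records, key="factor__name"))
```

Each resulting ID list could then be passed to the datahub endpoint to collect the bigWig files for one multivec file per group.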