cPath2PreMerge
Download both provider metadata and pathway data with as little user intervention as possible. This data is then made available to the Pathway Commons system for import.
Note: original data can be remote or local and packaged in various ways, so some preprocessing (obtaining licences, downloading, extracting what we need, re-packaging) will need to be done case by case by a content manager, either manually or with custom scripting.
The data fetching module can be broken into two logical components. The first fetches provider metadata, which is stored in the warehouse and used later within the pre-merge pipeline. The second fetches pathway or protein data, which is then used within the import pipeline.
Provider metadata is obtained first by the content manager. The manager visits each provider web site and collects the following information:
- IDENTIFIER - unique (per cPath2 instance) text key, e.g., "REACTOME_HUMAN"
- NAME - one or more (optional) standard data source names, separated by semicolons, as follows: [displayName;]standardName[;name1;name2;..] (optional inside the .md metadata page)
- DESCRIPTION
- DATA URL
- HOMEPAGE URL
- LOGO URL
The manager then adds this information to the instance's metadata.conf file (ideally, a special administration web app is to be developed for this). The following additional data is then added by the manager (an illustrative entry is shown after this list):
- DATA_TYPE
- CLEANER CLASS NAME
- CONVERTER CLASS NAME
The data type indicates whether the data is BioPAX, PSI-MI, or protein/small molecule warehouse data. Cleaner and converter classes, if any are required, are precompiled Java classes that are used to convert (in the case of protein or small molecule data) or clean (in the case of pathway data).
- PMID
- AVAILABILITY
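For illustration only, a metadata.conf entry might combine the above fields into a single record per provider; the field order, separator, and class names shown below are assumptions, not the actual cPath2 syntax:

```
# hypothetical record: IDENTIFIER;NAME;DESCRIPTION;DATA URL;HOMEPAGE URL;LOGO URL;DATA_TYPE;CLEANER CLASS;CONVERTER CLASS;PMID;AVAILABILITY
REACTOME_HUMAN;Reactome;Reactome human pathways (BioPAX L3);http://www.reactome.org/download/biopax.zip;http://www.reactome.org;http://www.reactome.org/images/logo.png;BIOPAX;cpath.cleaner.internal.ReactomeCleaner;;;free
```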
Based on the metadata, warehouse data are cleaned, converted to BioPAX L3 (if required), and saved for later merging with normalized pathway data. Pathway data go through a different pre-merge pipeline: they are cleaned/converted, validated/normalized, and made available for the final merging.
cPath2 will use Identifiers.org resolvable URIs (URLs; formerly Miriam standard URNs) for all the CVs and Entity References in the Warehouse and the main database, if possible (that is what data normalization means here). Where required, ID lookup is done via the MIRIAM Java library. Therefore, for a ProteinReference, CV, or SmallMoleculeReference, it will be possible to map it to a primary identifier and find the corresponding one in the cPath2 Warehouse (in the future, advanced ID-mapping can be implemented using BridgeDb tools), which allows merging and re-using data from different pathway data sources.
Update: we are still OK with using a simple id-mapping based on the information contained in UniProt (some IDs can always be mapped to the primary accessions); i.e., id-mapping tables are built when the warehouse is built and are used both at the merge stage and in the web service's internal query pre-processing.
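As a sketch only (the class and method names below are hypothetical, not the actual cPath2 code, and the real tables are persisted in the warehouse rather than kept in memory), such an id-mapping table can be thought of as a map from secondary IDs to primary UniProt accessions, filled while parsing UniProt during the warehouse build:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, in-memory illustration of the UniProt-based id-mapping table.
public class IdMappingSketch {
    private final Map<String, String> toPrimaryAccession = new HashMap<String, String>();

    // Called while parsing a UniProt entry: register secondary IDs (secondary
    // accessions, RefSeq, gene IDs, etc.) against the primary accession.
    public void register(String primaryAccession, String... otherIds) {
        toPrimaryAccession.put(primaryAccession, primaryAccession);
        for (String id : otherIds) {
            toPrimaryAccession.put(id, primaryAccession);
        }
    }

    // Used at the merge stage and in web service query pre-processing.
    public String mapToPrimary(String anyId) {
        return toPrimaryAccession.get(anyId); // null if unknown
    }
}
```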
CVs will be fetched using BioPAX Validator modules that expose several convenient classes (BiopaxOntologyManager, CvTermRestriction, AbstractCvRule, etc.), which allow working with the particular subset of the OBO ontologies and terms documented in the BioPAX standard.
The cleaner is responsible for pre-processing of pathway data. All existing cooker script logic will be placed in this module.
A Java framework will be created which allows the insertion of individual cleaners (one per data provider) to handle the cooking of pathway data. Cleaners will be loaded dynamically into the JVM via the classpath. For each pathway provider specified on the metadata wiki page, a cleaner class name will be specified. Each cleaner will implement the cpath.cleaner.Cleaner interface.
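A minimal sketch of a provider-specific cleaner, assuming the cpath.cleaner.Cleaner interface exposes a single String-in/String-out clean method (the actual method signature and package may differ between cPath2 versions):

```java
import cpath.cleaner.Cleaner;

// Hypothetical example cleaner; assumes Cleaner declares String clean(String pathwayData).
public class ExampleProviderCleaner implements Cleaner {
    public String clean(String pathwayData) {
        // Provider-specific "cooking": fix known issues in the original file
        // before validation/normalization, e.g. a wrong xref db name.
        return pathwayData.replace(">UNIPROT<", ">UniProt<");
    }
}
```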
The converter is responsible for converting protein and small molecule background data into BioPAX entity references for storage in the warehouse.
A Java framework will be created which allows the insertion of individual converters (one per data provider) to handle conversion of protein and small molecule data. Converters will be loaded dynamically into the JVM via the classpath. For each protein/small molecule provider specified on the metadata wiki page, a converter class name will be specified. Each converter will implement the cpath.converter.Converter interface.
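Similarly, a sketch of a converter, assuming the cpath.converter.Converter interface turns an input stream of the original data into a BioPAX model (the actual signature may differ); the Paxtools calls used here are standard:

```java
import java.io.InputStream;

import org.biopax.paxtools.model.BioPAXLevel;
import org.biopax.paxtools.model.Model;
import org.biopax.paxtools.model.level3.ProteinReference;

import cpath.converter.Converter;

// Hypothetical UniProt-like converter; assumes Converter declares Model convert(InputStream).
public class ExampleUniprotConverter implements Converter {
    public Model convert(InputStream data) {
        Model model = BioPAXLevel.L3.getDefaultFactory().createModel();
        // ... parse each record from 'data' and create a ProteinReference
        // with a normalized Identifiers.org URI, e.g.:
        ProteinReference pr = model.addNew(ProteinReference.class,
                "http://identifiers.org/uniprot/P62158");
        pr.setDisplayName("CALM_HUMAN");
        return model;
    }
}
```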
BioPAX element identifiers (URIs, defined using either rdf:ID or rdf:about) of EntityReference, ControlledVocabulary, BioSource, PublicationXref and Provenance (including all sub-classes, where applicable) will be "normalized" to Identifiers.org URLs (formerly Miriam URNs). Xref subclasses, except for PublicationXref, will however be normalized using "local" URIs (see below).
We are aware of the Biohackathon 2010 advice on URI use: http://hackathon3.dbcls.jp/wiki/URI. Unlike pathwaycommons.org/pc/ (cPath1-based, which replaced all original URIs with CPATH IDs), we now want to keep the original providers' URIs untouched as much as possible (even though these might not be the best or most stable choice of URIs), especially for all Entity (incl. sub-class) individuals. Though our normalization approach mainly aims at better results in data merging and querying in Pathway Commons 2, it also facilitates loading the data into a semantic web triple/quad store and sharing the knowledge that way (making our merged BioPAX model available as Linked Data).
The pre-merge normalizer was originally planned to live in the cpath-import module; the Normalizer code was then moved to the BioPAX Validator project to make it easier to support, discuss, and try by more people in the BioPAX community.
Example:
http://identifiers.org/uniprot/Q5BJF6 (formerly urn:miriam:uniprot:Q5BJF6) will be the URI (rdf:about value) of a ProteinReference, and urn:biopax:UnificationXref:UNIPROT_Q5BJF6_3 the URI of its UnificationXref.
Note: In BioPAX, Xref is a "local", somewhat "artificial", utility class (from a data integration point of view). Unlike an EntityReference (e.g., ProteinReference) instance, which is going to be referred to "globally", an Xref's URI is not that important; still, it is worth generating xref URIs consistently, because doing so facilitates merging and prevents duplicates.
- The original URI is kept when normalization fails (e.g., an "unknown" db name or null db/id properties).
- In "pre-merge", i.e., in the provider's pathway data conversion/validation/normalization pipeline, Xref URIs are generated consistently by the normalizer using the current namespace (the xml:base value must be set for the cpath2 instance in advance); the earlier scheme was "urn:biopax:{xrefClassName}#{db}{id}{idVersion}", with the db_id_ver part URL-encoded (see the sketch after this list).
- During the UniProt and ChEBI warehouse data build, Xref URIs are generated using the same technique as in the normalizer.
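A sketch of generating such "local", consistent xref URIs from the instance's xml:base plus URL-encoded db/id/idVersion values; the separator and exact encoding here are assumptions, and the real normalizer (now part of the BioPAX Validator) may differ:

```java
import java.net.URLEncoder;

// Illustrative only: build a "local" xref URI from xml:base, the xref class name,
// and URL-encoded db_id_ver (the actual normalizer may use a different scheme).
public class XrefUriSketch {
    public static String xrefUri(String xmlBase, String xrefClassName,
                                 String db, String id, String idVersion) throws Exception {
        String dbIdVer = db + "_" + id + (idVersion == null ? "" : "_" + idVersion);
        return xmlBase + xrefClassName + "_" + URLEncoder.encode(dbIdVer, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // prints e.g. http://pathwaycommons.org/pc2/UnificationXref_UniProt_Q5BJF6
        System.out.println(xrefUri("http://pathwaycommons.org/pc2/",
                "UnificationXref", "UniProt", "Q5BJF6", null));
    }
}
```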
The normalizer will generate the Identifiers.org URI (URL; formerly a Miriam URN) for an EntityReference using the db/id properties of one of its UnificationXrefs, taken in a particular order (to be consistent). For example, if a ProteinReference has a UnificationXref with db='UniProt' and id='P62158', then the new absolute URI of the protein reference will be "http://identifiers.org/uniprot/P62158".
For each ER whose URI has not been normalized yet (a sketch follows this list):
- get its unification xrefs and order them consistently (e.g., by 'db', then by 'id', alphabetically)
- find the first one, ux, that refers to the uniprot db, if it exists, or simply take the first one otherwise
- query for the new URI using ux.db and ux.id
- update the model so that the ER gets its new URI
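A Paxtools-based sketch of these steps (illustrative only; the actual Normalizer code lives in the BioPAX Validator project and resolves URIs via the MIRIAM registry rather than by string concatenation):

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

import org.biopax.paxtools.model.level3.EntityReference;
import org.biopax.paxtools.model.level3.UnificationXref;
import org.biopax.paxtools.model.level3.Xref;

// Illustrative sketch of ER URI normalization; not the actual Normalizer code.
public class ErNormalizationSketch {
    static String pickNewUri(EntityReference er) {
        // 1. collect unification xrefs in a consistent (db, then id) order
        Set<UnificationXref> uxs = new TreeSet<UnificationXref>(new Comparator<UnificationXref>() {
            public int compare(UnificationXref a, UnificationXref b) {
                int byDb = String.valueOf(a.getDb()).compareToIgnoreCase(String.valueOf(b.getDb()));
                return (byDb != 0) ? byDb
                    : String.valueOf(a.getId()).compareToIgnoreCase(String.valueOf(b.getId()));
            }
        });
        for (Xref x : er.getXref()) {
            if (x instanceof UnificationXref) uxs.add((UnificationXref) x);
        }
        if (uxs.isEmpty()) return null; // nothing to normalize with

        // 2. prefer a UniProt xref if present, otherwise take the first one
        UnificationXref ux = uxs.iterator().next();
        for (UnificationXref candidate : uxs) {
            if ("uniprot".equalsIgnoreCase(candidate.getDb())) { ux = candidate; break; }
        }

        // 3. build the new URI (a real implementation resolves it via MIRIAM)
        return "http://identifiers.org/" + ux.getDb().toLowerCase() + "/" + ux.getId();
        // 4. the caller then updates the model so the ER uses this new URI
    }
}
```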
The normalizer will not reconstruct missing unification xrefs, nor validate term names. It takes the "cooked" BioPAX model, where most datasource-specific critical issues have already been fixed by the corresponding Cleaner implementation.
Miriam now covers all the OBO ontologies recommended by BioPAX: GO, MI, CL, BTO, SO, PATO. Miriam provides a convenient Java library (MiriamLink), which we modified (it is now less coupled with the web service) and included in the biopax-validator codebase and distribution.
- Using a unification xref (its 'db' and 'id' properties), fetch the new URI from Miriam. For example, db="Gene Ontology", id="GO:0005654" will result in http://identifiers.org/go/GO:0005654 (previously urn:miriam:obo.go:GO%3A0005654, where the local part of the id was URL-encoded, i.e., GO:0005654 => GO%3A0005654).
- Replace the CV's RDF ID with the new one.
- Update the properties that refer to this CV.
An approach similar to that used for ERs and CVs is applied (one of the unification xrefs plus a Miriam lookup).
Validation (including auto-fix and normalization) can be performed by the BioPAX Validator component. It will be integrated (and perhaps extended) into the cPath2 project and tuned/configured to be more forgiving.
The Warehouse stores data used by the pre-merge and merge components of the cPath2 system and might also support future advanced queries (secondary lookups, advanced validation, etc.).
The following data will be stored within the warehouse:
- id mapping
- protein references (load uniprot; "enrich" with other xrefs: RefSeq, EntrezGene, secondary IDs) as BioPAX Entity Refs
- small molecule references (chebi, adding more xrefs as well) as BioPAX Entity Refs
- cv (several OBO, hierarchy (relationships))
- provider metadata, including name, icon, version, date, id, stats
- validation/normalization results
- provider pathway data (original before cleaning)
Then, during the "merge" stage and later on, ID mapping can be done by simply searching the Warehouse. The warehouse will provide interfaces for fetching, inserting, and accessing data.
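Purely as an illustration of the kind of access interface meant here (the names are hypothetical, not the actual cPath2 API):

```java
import java.util.Set;

import org.biopax.paxtools.model.level3.EntityReference;

// Hypothetical warehouse access interface (illustrative; not the actual cPath2 API).
public interface WarehouseSketch {
    // id-mapping lookup, e.g. a secondary accession or gene symbol -> primary accession
    String mapToPrimaryId(String db, String id);

    // fetch canonical entity references (protein or small molecule) by xref db/id
    Set<EntityReference> findEntityReferences(String db, String id);

    // insert/update warehouse objects produced by the converters
    void save(EntityReference er);
}
```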