This documentation is out of date and no longer maintained.
The reference and up to date documentation is available at: https://ec-jrc.github.io/datacite-to-dcat-ap/
This document is a draft meant to report work in progress concerning an exercise, carried out at the Joint Research Centre of the European Commission (Units B.6 & G.I.4), for the alignment of DataCite metadata with DCAT-AP.
As such, it can be updated any time and it must be considered as unstable.
This document describes the background and methodology for the design of the DataCite profile of DCAT-AP (DataCite+DCAT-AP).
The mappings defined in DataCite+DCAT-AP are illustrated in a separate document:
Mappings defined in DataCite+DCAT-AP
- Background
- Methodology
- Comparison with DC2AP & DataCite2RDF
- DataCite and DCAT-AP at a glance
- Summary of alignment issues
DCAT-AP [DCAT-AP] is a metadata profile developed in the framework of the EU Programme Interoperability Solutions for European Public Administrations (ISA), and based on and compliant with the W3C Data Catalog vocabulary (DCAT) - currently, one of the most widely used Semantic Web vocabularies for describing datasets and data catalogues.
The purpose of DCAT-AP is to define a common interchange metadata format for data portals of the EU and of EU Member States. In order to achieve this, DCAT-AP defines a set of classes and properties, grouped into mandatory, recommended and optional. Such classes and properties correspond to information on datasets and data catalogues that are shared by many European data portals, aiding interoperability. Although DCAT-AP is designed to be independent from its actual implementation, RDF [RDF] and Linked Data [LDBOOK] are the reference technologies.
DataCite [DataCite] is an international initiative meant to enable citation for scientific datasets. To achieve this, DataCite operates a metadata infrastructure, following the same approach used by CrossRef for scientific publications. As such, the DataCite infrastructure is responsible for issuing persistent identifiers (in particular, DOIs) for datasets, and for registering dataset metadata. Such metadata are to be provided according to the DataCite metadata schema - which is basically an extension to the DOI one.
Currently, DataCite is the de facto standard for data citation. Therefore, the ability to transform metadata records from and to the DataCite metadata schema would enable, respectively, the harvesting of DataCite records, and the publication of metadata records in the DataCite infrastructure (thus enabling their citation).
The motivation for investigating the possibility of aligning DataCite metadata with DCAT-AP is twofold:
1. To identify how to create a DCAT-AP-compliant representation of DataCite metadata, in order to enable their sharing across DCAT-AP-enabled data catalogues. This analysis is not meant to provide a complete representation of all DataCite metadata elements, but only of those included in DCAT-AP.
2. To identify how to create a DataCite-compliant representation of DCAT-AP metadata, in order to enable their publishing on the DataCite infrastructure. This analysis is meant to develop an extension of DCAT-AP, covering all DataCite metadata elements.
Regarding point (2), the DataCite-based extension of DCAT-AP is also meant to integrate into DCAT-AP all the information required for data citation.
Based on these considerations, two versions of a DataCite profile of DCAT-AP have been defined, namely, DataCite+DCAT-AP Core (addressing the requirements of point (1)) and DataCite+DCAT-AP Extended (addressing the requirements of point (2)). More precisely, the core version includes alignments only for the subset of DataCite metadata elements included in the DCAT-AP specification, whereas the extended version tries to define alignments for all the DataCite metadata elements using DCAT-AP and other Semantic Web vocabularies (whenever DCAT-AP does not offer suitable candidates). As such, DataCite+DCAT-AP Extended is a superset of DataCite+DCAT-AP Core, and both are conformant with DCAT-AP.
The reference DCAT-AP and DataCite specifications on which DataCite+DCAT-AP is based are the following ones:
- DCAT-AP v1.1 (October 2015)
- GeoDCAT-AP v1.0.1 (August 2016)
- DataCite v4.1 (October 2017)
For the mappings, existing work has been taken into account concerning the mapping of DataCite to other metadata standards. In particular:
- DataCite Dublin Core Application Profile (DC2AP). Version 1.8 (February 2016)
- DataCite2RDF: Mapping DataCite Metadata Schema 3.1 Terms to RDF. Version 3.3 (February 2016)
DataCite+DCAT-AP builds upon these specifications to provide as complete a mapping as possible of all the metadata elements in version 4.1 of the DataCite metadata schema. Moreover, the defined mappings are backward compatible with earlier versions of the DataCite metadata schema.
The resulting mappings have been grouped into two classes, corresponding to two different DataCite+DCAT-AP profiles:
- DataCite+DCAT-AP Core: This profile defines alignments for the subset of DataCite metadata elements supported by DCAT-AP.
- DataCite+DCAT-AP Extended: This profile defines alignments for all the DataCite metadata elements using DCAT-AP and other Semantic Web vocabularies (whenever DCAT-AP does not provide suitable candidates).
As far as the extended profile is concerned, the reference vocabularies have been chosen based on the following criteria:
- They have clear persistence and versioning policies.
- Preferably, they should be used across domains and data communities.
These criteria determine the main differences from the mappings defined in DC2AP and DataCite2RDF, which are illustrated in the following section.
DC2AP and DataCite2RDF provide a full mapping of version 3.1 of the DataCite metadata schema. To achieve this, they re-use a number of vocabularies, which can be grouped into two main classes:
- General purpose and widely used vocabularies, such as Dublin Core, FOAF, GeoSPARQL, and SKOS.
- A set of vocabularies developed specifically to model the publishing and academic domain. They include PRISM (Publishing Requirements for Industry Standard Metadata) and FRBR (Functional Requirements for Bibliographic Records), plus a set of ontologies developed in the framework of the SPAR (Semantic Publishing and Referencing Ontologies) project.
The current version of DataCite+DCAT-AP follows the mappings based on the former group of vocabularies, but not the ones based on the latter group. The reason is twofold. First, the persistence and versioning policies of these vocabularies are unclear, so it has been considered safer to reconsider their use when DataCite+DCAT-AP is more consolidated. Second, the mappings defined in DC2AP and DataCite2RDF are not compliant with the general requirements of DCAT-AP. For instance, although the resource types defined in DataCite include datasets and metadata records, DC2AP and DataCite2RDF do not make use of DCAT.
It is worth noting that the second group of ontologies, and in particular the ones developed in the SPAR project, provide interesting solutions to modelling some aspects not explicitly addressed in DCAT-AP - e.g., the possibility of associating a temporal dimension to agent roles - but also alternative solutions for specifying the same information - one of the examples being resource identifiers. Complementing and aligning the different approaches would be mutually beneficial.
The following sections provide a high-level comparison of the metadata elements defined in DataCite and DCAT-AP.
The following table provides the complete list of DataCite metadata elements, and shows whether they are supported in DCAT-AP.
For each of the DataCite metadata elements, the table specifies whether they are mandatory (M), recommended (R), or optional (O).
DataCite 4.1 element | Obligation | DCAT-AP 1.1 | Comments |
---|---|---|---|
Identifier | M | Partially | DataCite requires this to be a DOI, whereas DCAT-AP does not have such a requirement |
Creator | M | No | This agent role is supported in GeoDCAT-AP |
Title | M | Yes | |
Publisher | M | Yes | |
Publication year | M | Yes | |
Subject | R | Yes | |
Contributor | R | Partially | DCAT-AP supports only 1 out of the 21 DataCite contributor types (namely, contact point / person). GeoDCAT-AP supports 1 additional DataCite contributor type, namely, rights holder. |
Date | R | Partially | DCAT-AP supports only 2 out of the 9 DataCite date types (namely, issue date and last modified date). GeoDCAT-AP also supports an additional date type, namely, creation date. |
Resource type | R | Partially | DCAT-AP supports just one resource type, namely, dataset |
Related identifier | R | Yes | |
Description | R | Yes | |
Geolocation | R | Yes | |
Language | O | Yes | |
Alternate identifier | O | Yes | |
Size | O | Yes | In DCAT-AP, this is a property of the dataset distribution, and not of the dataset itself |
Format | O | Yes | In DCAT-AP, this is a property of the dataset distribution, and not of the dataset itself |
Version | O | Yes | |
Rights | O | Yes | DataCite does not use specific elements for use conditions (i.e., licences) and access rights. In DCAT-AP, use conditions are a property of the dataset distribution, whereas access rights are associated with the dataset. |
Funding Reference | O | No | This element specifies: (a) title, identifier and, possibly, URI of the funding project, and (b) name and identifier of the organisation that awarded that project |
The following table provides the list of classes and properties defined in DCAT-AP, and shows whether they are supported in DataCite.
NB: The list of DCAT-AP classes and properties is here limited to those that are either mandatory (M) or recommended (R).
DCAT-AP 1.1 class | Class obligation | Property | Property obligation | DataCite 4.1 | Comments |
---|---|---|---|---|---|
Agent | M | name | M | Yes | |
Agent | M | type | R | No | |
Catalogue | M | dataset | M | No | |
Catalogue | M | description | M | No | |
Catalogue | M | publisher | M | No | |
Catalogue | M | title | M | No | |
Catalogue | M | homepage | R | No | |
Catalogue | M | language | R | No | |
Catalogue | M | licence | R | No | |
Catalogue | M | release date | R | No | |
Catalogue | M | themes | R | No | |
Catalogue | M | update / modification date | R | No | |
Dataset | M | description | M | Yes | In DataCite, this property is recommended, not mandatory |
Dataset | M | title | M | Yes | |
Dataset | M | contact point | R | Yes | |
Dataset | M | dataset distribution | R | No | |
Dataset | M | keyword / tag | R | Yes | |
Dataset | M | publisher | R | Yes | |
Dataset | M | theme / category | R | Yes | |
Category | R | preferred label | M | Yes | |
Category scheme | R | title | M | Yes | |
Distribution | R | accessURL | M | No | |
Distribution | R | description | R | No | |
Distribution | R | format | R | Yes | In DataCite, this is always a property of the resource itself - even when such resource is a dataset |
Distribution | R | licence | R | Yes | In DataCite, this is always a property of the resource itself - even when such resource is a dataset |
Licence document | R | type | R | No | |
As shown in the previous section, DCAT-AP is able to represent all DataCite mandatory elements, with the exception of "creator". This poses an issue for the possible use of DCAT-AP for data citation purposes, since the "creator" element is one of the required components. Notably, GeoDCAT-AP supports this agent role, so it can be re-used for this purpose.
On the other hand, DataCite includes all the DCAT-AP mandatory classes and related properties, with the only notable exception of `dcat:Catalog`. However, this does not pose particular compliance issues, since the catalogue description could be obtained separately from the relevant DataCite records. Actually, since DataCite records are supposed to be all available via the DataCite catalogue, the catalogue description can potentially be the same for all DataCite records. Of course, this does not apply to those records following the DataCite schema but not registered in the DataCite infrastructure.
There are, however, some key differences between the DCAT-AP and DataCite data models that need to be addressed. The following sections outline the solutions adopted in DataCite+DCAT-AP, as well as open issues.
DataCite supports 14 different resource types - namely: audiovisual, collection, dataset, event, image, interactive resource, model, physical object, service, software, sound, text, workflow, other. They basically correspond to the classes included in the DCMI Type vocabulary, with the exception of model and workflow.
The definition of `dcat:Dataset` is broad enough to cover most of the DataCite resource types, the exceptions being event, physical object, and service, which are not supported in DCAT-AP. However, the notion of "service" is supported in GeoDCAT-AP via `dctype:Service`. For the rest, it is possible to re-use the DCMI Type vocabulary, which includes classes for event (`dctype:Event`) and physical object (`dctype:PhysicalObject`).
DataCite+DCAT-AP re-uses the approach outlined above. Moreover, in order to preserve the original information, it uses `dct:type` with the relevant classes of the DCMI Type vocabulary to denote the DataCite resource type. This is basically the solution adopted in GeoDCAT-AP to model the resource types defined in ISO 19115 - namely, dataset, dataset series, and services.
As said above, the DCMI Type vocabulary does not include classes for model and workflow, and no suitable candidates have been found in the reference vocabularies. As a result, in DataCite+DCAT-AP both are modelled only as `dcat:Dataset`, thus losing the original information.
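The resource type rules described above can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name is hypothetical, the DataCite type names are written in the CamelCase form used by the schema, and the sketch covers only the types whose DCMI Type class is either named in the text or has the same name as the DataCite type (types such as "Audiovisual" and "Other" are left out on purpose).

```python
# DataCite resource types that cannot be modelled as dcat:Dataset,
# paired with the DCMI Type class named in the text.
NON_DATASET_TYPES = {
    "Event": "dctype:Event",
    "PhysicalObject": "dctype:PhysicalObject",
    "Service": "dctype:Service",
}

# DataCite types with a same-named DCMI Type class (assumption of this
# sketch; "Audiovisual" and "Other" are deliberately not covered).
SAME_NAME_IN_DCMI = {
    "Collection", "Dataset", "Image", "InteractiveResource",
    "Software", "Sound", "Text",
}

# No DCMI Type class exists for these, so the original type is lost.
NO_DCMI_CLASS = {"Model", "Workflow"}


def map_resource_type(datacite_type):
    """Return (rdf:type, dct:type or None) for a DataCite resource type."""
    if datacite_type in NON_DATASET_TYPES:
        dcmi_class = NON_DATASET_TYPES[datacite_type]
        return dcmi_class, dcmi_class
    if datacite_type in SAME_NAME_IN_DCMI:
        return "dcat:Dataset", "dctype:" + datacite_type
    if datacite_type in NO_DCMI_CLASS:
        return "dcat:Dataset", None
    raise ValueError("resource type not covered by this sketch: " + datacite_type)
```

For example, `map_resource_type("Software")` yields a `dcat:Dataset` typed additionally with `dctype:Software`, whereas `map_resource_type("Workflow")` yields a plain `dcat:Dataset` with no `dct:type`.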
The requirements are basically the following ones:
- DataCite requires the dataset identifier to be a DOI.
- DataCite distinguishes between primary and secondary identifiers.
- DataCite models the "type" of identifier (DOIs, ORCIDs, ISNIs, ISSNs, etc.).
DCAT-AP already provides a mechanism to model primary and secondary identifiers, as well as the identifier type. More precisely:
- Property `dct:identifier` is used to model primary identifiers.
- Property `adms:identifier` is used to model secondary/alternative identifiers.
- Class `adms:Identifier` allows the specification of information about the identifier - identifier scheme included.
Such solutions basically reflect the DataCite approach to modelling identifiers. However, identifiers modelled in this way are of no use for effectively linking the relevant resources. For this purpose, an option would be encoding identifiers as HTTP URIs, whenever possible. This is the case, e.g., for ORCIDs, ISNIs, and DOIs. As for distinguishing primary from secondary/alternative identifiers, the resource URI can denote the primary identifier, whereas URIs corresponding to alternative identifiers can be specified by using `owl:sameAs`.
Based on the above, DataCite+DCAT-AP models identifiers as follows:
- Identifiers are encoded as HTTP URIs, whenever possible, or URNs, using `owl:sameAs` for URIs concerning secondary/alternative identifiers.
- In addition:
  - Primary identifiers are specified, as literals, with `dct:identifier`.
  - Secondary/alternative identifiers are specified, as literals, with `adms:identifier`.
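The identifier rules above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names, the `(scheme, value)` input shape, the prefix tables, and the example DOI and ISBN are all assumptions made for the sketch, and triples are represented as plain tuples without distinguishing URIs from literals.

```python
# Schemes that can be turned into HTTP URIs (assumed prefix table).
HTTP_PREFIXES = {"DOI": "https://doi.org/", "ORCID": "https://orcid.org/"}
# Schemes that fall back to URNs (assumed prefix table).
URN_PREFIXES = {"ISBN": "urn:isbn:", "ISSN": "urn:issn:"}


def identifier_to_uri(scheme, value):
    """Return an HTTP URI if possible, else a URN, else None."""
    scheme = scheme.upper()
    if scheme in HTTP_PREFIXES:
        return HTTP_PREFIXES[scheme] + value
    if scheme in URN_PREFIXES:
        return URN_PREFIXES[scheme] + value
    return None


def identifier_triples(primary, alternates=()):
    """Build (subject, property, object) tuples for a resource.

    `primary` and each item of `alternates` are (scheme, value) pairs.
    """
    subject = identifier_to_uri(*primary)
    if subject is None:
        raise ValueError("primary identifier cannot be encoded as a URI")
    # The primary identifier is both the resource URI and a literal.
    triples = [(subject, "dct:identifier", primary[1])]
    for scheme, value in alternates:
        uri = identifier_to_uri(scheme, value)
        if uri is not None:
            # Alternative identifiers with a URI form are linked via owl:sameAs.
            triples.append((subject, "owl:sameAs", uri))
        # In addition, the alternative identifier is kept as a literal.
        triples.append((subject, "adms:identifier", value))
    return triples
```

Calling `identifier_triples(("DOI", "10.1234/example"), [("ISBN", "9780000000002")])` uses the DOI URI as the subject and adds one `owl:sameAs` link plus one `adms:identifier` literal for the ISBN.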
DataCite supports the possibility of specifying references to resources related to the one described in the metadata record via element RelatedIdentifier, carrying an optional attribute that can be used to express the related resource type.
Roughly half of the supported related resource "types" (25, in total) correspond to the ones defined in Dublin Core. Moreover, with the exception of type "IsIdenticalTo", they denote a relationship (e.g., "IsReferencedBy") and its inverse (e.g., "References").
Based on this, DataCite+DCAT-AP models this information as follows:
- The default mapping of element RelatedIdentifier is `dct:relation`
- Dublin Core is used for all the relevant related resource types (i.e., 9, excluding `dct:relation`)
- 3 are modelled by using FOAF
- 3 are modelled by using Schema.org
- 1 ("IsIdenticalTo") is modelled by using OWL (`owl:sameAs`)
- 1 ("IsSourceOf") is modelled by using the W3C PROV Ontology (`prov:hadDerivation`)
- 1 ("IsDescribedBy") is modelled by using the W3C POWDER-S Vocabulary (`wdrs:describedby`)
- 11 are left unmapped, since the reference vocabularies do not provide suitable candidates
DCAT-AP supports 6 of the mapped related resource types. All the other ones are supported only in the extended profile of DataCite+DCAT-AP.
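As a rough illustration of the lookup implied by the list above, the sketch below covers only the three relation types whose target property is named explicitly in this document, and falls back to the default mapping otherwise. The function name is hypothetical, and treating every unlisted type as `dct:relation` is an assumption of the sketch, not a statement about how unmapped types are actually handled.

```python
# Only the relation types whose property is stated in the text are listed;
# the Dublin Core, FOAF and Schema.org mappings are omitted from this sketch.
RELATION_PROPERTIES = {
    "IsIdenticalTo": "owl:sameAs",
    "IsSourceOf": "prov:hadDerivation",
    "IsDescribedBy": "wdrs:describedby",
}

# Default mapping of element RelatedIdentifier.
DEFAULT_PROPERTY = "dct:relation"


def related_identifier_property(relation_type=None):
    """Return the property for a DataCite relation type (assumed fallback)."""
    return RELATION_PROPERTIES.get(relation_type, DEFAULT_PROPERTY)
```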
DataCite supports three main types of agent roles, namely, creator, publisher, and contributor. The last can be further specialised by specifying a contributor "type". DataCite supports 22 contributor types, including, e.g., "contact person", "editor", "funder", "producer", "rights holder", "sponsor", "other".
DCAT-AP supports only two agent roles, namely, publisher and contact point (corresponding to contributor type "contact person" in DataCite). GeoDCAT-AP includes two other DataCite agent roles - namely, creator and rights holder.
As a result, together, DCAT-AP and GeoDCAT-AP cover publisher, creator, and 2 contributor types, namely, contact point and rights holder. For the other ones, DataCite+DCAT-AP includes the following mappings:
- `dct:contributor` is used when the contributor is untyped, or when the contributor type is "other".
- `schema:editor`, `schema:funder`, `schema:producer`, and `schema:sponsor` are used for the corresponding DataCite contributor types.
- `duv:hasDistributor` is used for the corresponding DataCite contributor type.
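The direct contributor mappings listed above can be sketched as a lookup table. The function name and the CamelCase type names are assumptions of this sketch. The properties for contact point (`dcat:contactPoint`) and rights holder (`dct:rightsHolder`) are the ones commonly used in DCAT-AP and GeoDCAT-AP respectively, and are not spelled out in this document. Contributor types that need the activity-based pattern, or that have no mapping, return `None` here.

```python
CONTRIBUTOR_PROPERTIES = {
    "ContactPerson": "dcat:contactPoint",  # DCAT-AP (assumed property name)
    "RightsHolder": "dct:rightsHolder",    # GeoDCAT-AP (assumed property name)
    "Editor": "schema:editor",
    "Funder": "schema:funder",
    "Producer": "schema:producer",
    "Sponsor": "schema:sponsor",
    "Distributor": "duv:hasDistributor",
    "Other": "dct:contributor",
}


def contributor_property(contributor_type=None):
    """Return the property linking a resource to a contributor.

    Untyped contributors map to dct:contributor; types without a direct
    mapping in this sketch return None.
    """
    if contributor_type is None:
        return "dct:contributor"
    return CONTRIBUTOR_PROPERTIES.get(contributor_type)
```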
It is worth noting that some of the DataCite contributor types cannot be modelled with a direct relationship. This is the case of roles "project leader", "project manager", "project member", "researcher", "supervisor", and "workpackage leader". Such roles are not directly describing the relationship between a resource and an agent, but rather the role of the individual in the "activity" that created the resource. E.g., "project leader" can be described as "the leader of the project that created the resource".
In such cases, the approach used in DataCite+DCAT-AP is as follows:
- The resource is linked to the agent (`foaf:Agent`) with property `dct:contributor`
- The agent is linked to the activity (`prov:Activity`) with a property denoting the role played
- The resource is linked to the activity with property `prov:wasGeneratedBy`
In case of roles "project leader", "project manager", and "project member", the activity is additionally typed as a `foaf:Project`.
The following code snippet shows how contributor type "project member" is modelled in DataCite+DCAT-AP:

    a:Dataset a dcat:Dataset ;
      dct:contributor a:Contributor ;
      prov:wasGeneratedBy a:Project .

    a:Contributor a foaf:Agent , prov:Agent .

    a:Project a prov:Activity , foaf:Project ;
      foaf:member a:Contributor .

The issue is that the reference vocabularies do not provide candidates for modelling such contributor types, with the exception of "project member".
For the remaining 14 DataCite contributor types, no candidates have been found in the reference vocabularies, so they are left unmapped in DataCite+DCAT-AP.
The DataCite data model does not distinguish between a dataset and its embodiment(s) ("distribution(s)", in the DCAT terminology).
As a consequence, attributes that in DCAT/DCAT-AP are specific to distributions (such as format, licence, size) are associated with the dataset in DataCite. Moreover, in DataCite there is no attribute equivalent to `dcat:accessURL` or `dcat:downloadURL`. Actually, the only information that can be used to access the dataset, and, possibly, its distribution(s), is the resource DOI.
Based on this, the approach used in DataCite+DCAT-AP to map DataCite records is as follows:
- If the described resource is an event, physical object, or service (i.e., if it cannot be modelled as a dataset), the notion of "distribution" does not apply. Therefore, all DataCite elements are used in DataCite+DCAT-AP to describe the resource.
- Otherwise:
  - Each record is modelled in DataCite+DCAT-AP as a dataset (`dcat:Dataset`), having exactly 1 distribution.
  - The resulting distribution gets the relevant DataCite elements (such as format, licence, size), as per the DCAT/DCAT-AP schema, whereas the remaining ones are used to describe the dataset.
  - The dataset DOI is used both as the dataset identifier / URI and as the distribution access URL.
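The dataset/distribution split described above can be sketched as follows. This is an illustration under assumptions: the record is a flat dictionary with hypothetical keys (`doi`, `resource_type`, `format`, `licence`, `size`, and so on), the function name is invented for the sketch, and the DOI is turned into a URI with the `https://doi.org/` prefix.

```python
# Resource types for which the notion of "distribution" does not apply.
NON_DATASET_TYPES = {"Event", "PhysicalObject", "Service"}

# Elements that DCAT/DCAT-AP attaches to the distribution (hypothetical keys).
DISTRIBUTION_ELEMENTS = {"format", "licence", "size"}


def split_record(record):
    """Split a flat DataCite-like record into resource and distribution parts."""
    doi_uri = "https://doi.org/" + record["doi"]
    elements = {
        key: value for key, value in record.items()
        if key not in ("doi", "resource_type")
    }
    if record.get("resource_type") in NON_DATASET_TYPES:
        # No distribution: every element describes the resource itself.
        return {"uri": doi_uri, "resource": elements, "distribution": None}
    dataset = {k: v for k, v in elements.items() if k not in DISTRIBUTION_ELEMENTS}
    distribution = {k: v for k, v in elements.items() if k in DISTRIBUTION_ELEMENTS}
    # The DOI doubles as the access URL of the single distribution.
    distribution["dcat:accessURL"] = doi_uri
    return {"uri": doi_uri, "resource": dataset, "distribution": distribution}
```

With a record such as `{"doi": "10.1234/example", "resource_type": "Dataset", "title": "Example", "format": "text/csv"}`, the title stays on the dataset while the format and the access URL end up on the distribution.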
DataCite includes a single element, namely, "rights", to specify use and access conditions. This element is also supported in DCAT-AP (`dct:rights`), but, in addition, specific properties are used for use conditions (`dct:license`) and access rights (`dct:accessRights`). Moreover, in DCAT-AP use conditions are associated with distributions, whereas access rights are associated with datasets.
Based on this, DataCite+DCAT-AP maps by default DataCite "rights" to `dct:rights`. In addition, they are mapped to `dct:license` and `dct:accessRights` when DataCite rights make explicit reference to some known licences and access rights vocabularies. More precisely, the recognised vocabularies are the following ones:
- For licences:
- For access rights:
DataCite supports the specification of both free-text keywords and keywords from controlled vocabularies.
For the latter case, DCAT-AP recommends the use of URIs, but in DataCite only textual labels are used.
To comply with the DCAT-AP recommendation, an option is to implement mappings from textual labels to URIs. However, this poses two main issues:
- DataCite does not require / recommend the use of specific vocabularies, nor a particular format for the textual labels.
- It is often the case that no URIs are available for the used vocabularies.
This situation makes it difficult to implement vocabulary mapping effectively.
For this reason, DataCite+DCAT-AP preserves keywords from controlled vocabularies as textual labels.