The Gen3 Discovery Page allows the visualization of metadata. A collection of SDK/CLI functionality assists with managing such metadata in Gen3.
Running gen3 discovery --help will provide the most up-to-date information about CLI functionality.
Like other CLI functions, the CLI code mostly just wraps an SDK function call.
So you can choose to use the CLI or write your own Python script and use the SDK functions yourself. Writing your own script generally provides the most flexibility, at the cost of some convenience.
Gen3's SDK supports minting DOIs from DataCite, storing DOI metadata in a Gen3 instance, and visualizing the DOI metadata in our Discovery Page to serve as a DOI "Landing Page".
What is a DOI? A digital object identifier (DOI) is a persistent identifier or handle used to identify objects uniquely, standardized by the International Organization for Standardization (ISO). DOIs are in wide use mainly to identify academic, professional, and government information, such as journal articles, research reports, data sets, and official publications. However, they have also been used to identify other types of information resources, such as commercial videos.
The general overview for how Gen3 supports DOIs is as follows:
- Gen3 SDK/CLI used to gather Metadata from External Public Metadata Sources
- Gen3 SDK/CLI used to do any conversions to DOI Metadata
- Gen3 SDK/CLI communicates with DataCite API to mint DOI
- NOTE: The gathering of metadata, conversion to DOI fields, and final minting may or may not be part of regular data ingestion. These steps may instead be run ad hoc, as needed.
- Gen3 SDK/CLI persists metadata in Gen3
- Persisted metadata in Gen3 exposed via Discovery Page
- Discovery Page is used as the required DOI Landing Page
What is DataCite? In order to create a DOI, one must use a DOI registration service. In the US there are two: CrossRef and DataCite. We are focusing on DataCite, because that is what we were provided access to.
Prerequisites:
- Environment variable DATACITE_USERNAME set as a valid DataCite username for interacting with their API
- Environment variable DATACITE_PASSWORD set as a valid DataCite password for interacting with their API
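The example scripts on this page read these variables with os.environ.get, which returns None when a variable is unset, so a missing credential would only surface once a DataCite request is made. A minimal sketch of an up-front check follows; get_datacite_credentials is a hypothetical helper written for illustration, not part of the Gen3 SDK.

```python
import os


def get_datacite_credentials(environ=None):
    """Return (username, password) or raise a clear error if either is unset.

    Hypothetical helper for illustration; not part of the Gen3 SDK.
    """
    environ = os.environ if environ is None else environ
    required = ("DATACITE_USERNAME", "DATACITE_PASSWORD")
    # Treat unset and empty values the same way
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return environ["DATACITE_USERNAME"], environ["DATACITE_PASSWORD"]
```

The returned pair can be passed to HTTPBasicAuth(username, password) in place of the two os.environ.get calls.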
This shows a full example of:
- Setting up the necessary classes for interacting with Gen3 & DataCite
- Getting the DOI metadata (ideally from some external source like a file or another API, but here we've hard-coded it)
- Creating/Minting the DOI in DataCite
- Persisting the DOI metadata into a Gen3 Discovery record in the metadata service
import os
from requests.auth import HTTPBasicAuth

from cdislogging import get_logger

from gen3.doi import (
    DataCite,
    DigitalObjectIdentifier,
    DigitalObjectIdentifierCreator,
    DigitalObjectIdentifierTitle,
)
from gen3.auth import Gen3Auth

logging = get_logger(__name__, log_level="info")

# This prefix should be provided by DataCite
PREFIX = "10.12345"
PUBLISHER = "Example"
COMMONS_DISCOVERY_PAGE = "https://example.com/discovery"

DOI_DISCLAIMER = ""
DOI_ACCESS_INFORMATION = "You can find information about how to access this resource in the link below."
DOI_ACCESS_INFORMATION_LINK = "https://example.com/more/info"
DOI_CONTACT = "https://example.com/contact/"


def test_manual_single_doi(publish_dois=False):
    # Setup
    gen3_auth = Gen3Auth()
    datacite = DataCite(
        api=DataCite.TEST_URL,
        auth_provider=HTTPBasicAuth(
            os.environ.get("DATACITE_USERNAME"),
            os.environ.get("DATACITE_PASSWORD"),
        ),
    )

    gen3_metadata_guid = "Example-Study-01"

    # Get DOI metadata (ideally from some external source)
    identifier = "10.82483/BDC-268Z-O151"
    creators = [
        DigitalObjectIdentifierCreator(
            name="Bar, Foo",
            name_type=DigitalObjectIdentifierCreator.NAME_TYPE_PERSON,
        ).as_dict()
    ]
    titles = [DigitalObjectIdentifierTitle("Some Example Study in Gen3").as_dict()]
    publisher = "Example Gen3 Sponsor"
    publication_year = 2023
    doi_type = "Dataset"
    version = 1

    doi_metadata = {
        "identifier": identifier,
        "creators": creators,
        "titles": titles,
        "publisher": publisher,
        "publication_year": publication_year,
        "doi_type": doi_type,
        "version": version,
    }

    # Create/Mint the DOI in DataCite
    doi = DigitalObjectIdentifier(root_url=COMMONS_DISCOVERY_PAGE, **doi_metadata)

    if publish_dois:
        logging.info(f"Publishing DOI `{identifier}`...")
        doi.event = "publish"

    # works for only new DOIs
    # You can use this for updates: `datacite.update_doi(doi)`
    response = datacite.create_doi(doi)
    doi = DigitalObjectIdentifier.from_datacite_create_doi_response(response)

    # Persist necessary DOI Metadata in Gen3 Discovery to support the landing page
    metadata = datacite.persist_doi_metadata_in_gen3(
        guid=gen3_metadata_guid,
        doi=doi,
        auth=gen3_auth,
        additional_metadata={
            "disclaimer": DOI_DISCLAIMER,
            "access_information": DOI_ACCESS_INFORMATION,
            "access_information_link": DOI_ACCESS_INFORMATION_LINK,
            "contact": DOI_CONTACT,
        },
        prefix="doi_",
    )
    logging.debug(f"Gen3 Metadata for GUID `{gen3_metadata_guid}`: {metadata}")


def main():
    test_manual_single_doi()


if __name__ == "__main__":
    main()
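Because the DOI prefix is assigned by DataCite, it can help to confirm that an identifier actually starts with your prefix before trying to mint it; DataCite is expected to reject identifiers under a prefix that is not assigned to your account. The following is a minimal sketch of such a check. The function name doi_matches_prefix is hypothetical and not part of the Gen3 SDK.

```python
def doi_matches_prefix(identifier, prefix):
    """Return True if a DOI such as '10.12345/ABC' sits under the given prefix.

    Hypothetical helper for illustration; not part of the Gen3 SDK.
    """
    # A DOI has the form "<prefix>/<suffix>"; split on the first slash only
    doi_prefix, separator, suffix = identifier.partition("/")
    return bool(separator) and bool(suffix) and doi_prefix == prefix


print(doi_matches_prefix("10.12345/EXAMPLE-AB12-CD34", "10.12345"))
print(doi_matches_prefix("10.82483/BDC-268Z-O151", "10.12345"))
```

The first call prints True and the second prints False, since the second identifier sits under a different prefix.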
This is the portion of the Gen3 Data Portal configuration that pertains to the Discovery Page. The code provided shows an example of how to configure the visualization of the DOI metadata.
In order to be compliant with Landing Page requirements, the URL you provide during minting needs to automatically display all this information. So if you have other tabs of non-DOI information, they cannot be the first focused tab upon resolving the DOI URL.
"discoveryConfig": {
  // ...
  "detailView": {
    // ...
    "tabs": [
      {
        "tabName": "DOI",
        "groups": [
          {
            "header": "Dataset Information",
            "fields": [
              {
                "type": "block",
                "label": "",
                "sourceField": "doi_disclaimer",
                "default": ""
              },
              {
                "type": "text",
                "label": "Title:",
                "sourceField": "doi_titles",
                "default": "Not specified"
              },
              {
                "type": "link",
                "label": "DOI:",
                "sourceField": "doi_resolveable_link",
                "default": "None"
              },
              {
                "type": "text",
                "label": "Data available:",
                "sourceField": "doi_is_available",
                "default": "None"
              },
              {
                "type": "text",
                "label": "Citation:",
                "sourceField": "doi_citation",
                "default": "Not specified"
              },
              {
                "type": "link",
                "label": "Contact:",
                "sourceField": "doi_contact",
                "default": "Not specified"
              }
            ]
          },
          {
            "header": "How to Access the Data",
            "fields": [
              {
                "type": "block",
                "label": "How to access the data:",
                "sourceField": "doi_access_information",
                "default": "Not specified"
              },
              {
                "type": "link",
                "label": "Data and access information:",
                "sourceField": "doi_access_information_link",
                "default": "Not specified"
              }
            ]
          },
          {
            "header": "Additional Information",
            "fields": [
              {
                "type": "text",
                "label": "Publisher:",
                "sourceField": "doi_publisher",
                "default": "Not specified"
              },
              {
                "type": "text",
                "label": "Funded by:",
                "sourceField": "doi_fundingReferences",
                "default": "Not specified"
              },
              {
                "type": "text",
                "label": "Publication Year:",
                "sourceField": "doi_publication_year",
                "default": "Not specified"
              },
              {
                "type": "text",
                "label": "Resource Type:",
                "sourceField": "doi_resource_type",
                "default": "Not specified"
              },
              {
                "type": "text",
                "label": "Version:",
                "sourceField": "doi_version_information",
                "default": "Not specified"
              }
            ]
          },
          {
            "header": "Description",
            "fields": [
              {
                "type": "block",
                "label": "Description:",
                "sourceField": "doi_descriptions",
                "default": "Not specified"
              }
            ]
          }
        ]
      },
      // ...
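Each sourceField in the configuration names a key in the persisted Discovery metadata, so a record that lacks a key will presumably display the configured default instead. Before relying on the tab as a Landing Page, it may be worth checking which fields a record is missing. The sketch below assumes the record is available as a flat Python dict keyed by those field names; find_missing_fields is a hypothetical helper, not part of the Gen3 SDK.

```python
# A few of the source fields referenced by the DOI tab configuration
DOI_TAB_SOURCE_FIELDS = [
    "doi_titles",
    "doi_resolveable_link",
    "doi_citation",
    "doi_contact",
    "doi_publisher",
    "doi_publication_year",
]


def find_missing_fields(record, source_fields):
    """Return the source fields that are absent or empty in a metadata record.

    Hypothetical helper for illustration; assumes `record` is a flat dict.
    """
    return [
        field for field in source_fields if record.get(field) in (None, "", [])
    ]


example_record = {"doi_titles": "Some Example Study in Gen3", "doi_publisher": ""}
print(find_missing_fields(example_record, DOI_TAB_SOURCE_FIELDS))
```

For the example record, every listed field except doi_titles is reported, because doi_publisher is present but empty.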
- TODO: Push DOI from submitted to registered
See below for a full example of DOI metadata gathering, minting, and persisting into Gen3.
import os
from requests.auth import HTTPBasicAuth

from cdislogging import get_logger

from gen3.auth import Gen3Auth
from gen3.discovery_dois import mint_dois_for_dbgap_discovery_datasets
from gen3.utils import get_random_alphanumeric

logging = get_logger(__name__, log_level="info")

PREFIX = "10.12345"
PUBLISHER = "Example"
COMMONS_DISCOVERY_PAGE = "https://example.com/discovery"

DOI_DISCLAIMER = ""
DOI_ACCESS_INFORMATION = "You can find information about how to access this resource in the link below."
DOI_ACCESS_INFORMATION_LINK = "https://example.com/more/info"
DOI_CONTACT = "https://example.com/contact/"


def get_doi_identifier():
    # Build an identifier of the form <PREFIX>/EXAMPLE-XXXX-XXXX
    return (
        PREFIX + "/EXAMPLE-" + get_random_alphanumeric(4) + "-" + get_random_alphanumeric(4)
    )


def main():
    auth = Gen3Auth()

    # Discovery metadata field that holds the dbGaP phsid
    dbgap_phsid_field = "dbgap_accession"

    mint_dois_for_dbgap_discovery_datasets(
        gen3_auth=auth,
        datacite_auth=HTTPBasicAuth(
            os.environ.get("DATACITE_USERNAME"),
            os.environ.get("DATACITE_PASSWORD"),
        ),
        dbgap_phsid_field=dbgap_phsid_field,
        get_doi_identifier_function=get_doi_identifier,
        publisher=PUBLISHER,
        commons_discovery_page=COMMONS_DISCOVERY_PAGE,
        doi_disclaimer=DOI_DISCLAIMER,
        doi_access_information=DOI_ACCESS_INFORMATION,
        doi_access_information_link=DOI_ACCESS_INFORMATION_LINK,
        doi_contact=DOI_CONTACT,
    )


if __name__ == "__main__":
    main()
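The get_doi_identifier function builds its suffix from random characters, and nothing in the script itself guards against producing the same identifier twice. Whether the SDK or DataCite handles a repeated identifier is not covered here. The following standard-library sketch retries until it finds an identifier that is not in a set of known ones. The names random_block and make_unique_doi_identifier, and the uppercase-plus-digits alphabet, are assumptions for illustration and do not reflect how get_random_alphanumeric is implemented.

```python
import secrets
import string

# Assumed character set; a stand-in for gen3.utils.get_random_alphanumeric
ALPHABET = string.ascii_uppercase + string.digits


def random_block(length=4):
    """Return `length` random characters drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def make_unique_doi_identifier(prefix, existing, max_attempts=100):
    """Return '<prefix>/EXAMPLE-XXXX-XXXX' that is not already in `existing`."""
    for _ in range(max_attempts):
        candidate = f"{prefix}/EXAMPLE-{random_block()}-{random_block()}"
        if candidate not in existing:
            return candidate
    raise RuntimeError(f"No unused identifier found in {max_attempts} attempts")
```

Since the script above passes a function that takes no arguments, a wrapper such as lambda: make_unique_doi_identifier(PREFIX, already_minted) would be needed to use this in its place, where already_minted is a set you maintain yourself.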
For the CLI, see gen3 discovery combine --help.
This section describes how to use the SDK functions directly. If you use the CLI, it will automatically read the current Discovery metadata and then combine it with the file you provide (after applying a prefix to all the columns, if you specify one).
Note: This supports CSV and TSV formats for the metadata file
Let's assume:
- You don't have the current Discovery metadata in a file locally
- You want to merge new metadata (parsed from dbGaP's FHIR server) with the existing Discovery metadata
- You want to prefix all the new columns with DBGAP_FHIR_
Here's how you would do that without using the CLI:
from gen3.auth import Gen3Auth
from gen3.tools.metadata.discovery import (
    output_expanded_discovery_metadata,
    combine_discovery_metadata,
)
from gen3.external.nih.dbgap_fhir import dbgapFHIR
from gen3.utils import get_or_create_event_loop_for_thread


def main():
    """
    Read current Discovery metadata, then combine with dbgapFHIR metadata.
    """
    # Get current Discovery metadata
    loop = get_or_create_event_loop_for_thread()
    auth = Gen3Auth(refresh_file="credentials.json")
    current_discovery_metadata_file = loop.run_until_complete(
        output_expanded_discovery_metadata(auth, endpoint=auth.endpoint)
    )

    # Get dbGaP FHIR Metadata
    studies = [
        "phs000007.v31",
        "phs000166.v2",
        "phs000179.v6",
    ]
    dbgapfhir = dbgapFHIR()
    simplified_data = dbgapfhir.get_metadata_for_ids(phsids=studies)
    dbgapFHIR.write_data_to_file(simplified_data, "fhir_metadata_file.tsv")

    # Combine new FHIR Metadata with existing Discovery Metadata
    metadata_filename = "fhir_metadata_file.tsv"
    discovery_column_to_map_on = "guid"
    metadata_column_to_map = "Id"
    output_filename = "combined_discovery_metadata.tsv"
    metadata_prefix = "DBGAP_FHIR_"

    output_file = combine_discovery_metadata(
        current_discovery_metadata_file,
        metadata_filename,
        discovery_column_to_map_on,
        metadata_column_to_map,
        output_filename,
        metadata_prefix=metadata_prefix,
    )

    # You now have a file with the combined information that you can publish
    # NOTE: Combining does NOT publish automatically into Gen3. You should
    # QA the output (make sure the result is correct), and then publish.


if __name__ == "__main__":
    main()
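Since combining does not publish anything, the QA step is left to you. One possible starting point is a small summary of the combined file: how many rows it has, which prefixed columns were added, whether any key appears more than once, and which rows received no new metadata. The sketch below uses only the standard library and assumes the combined output is a tab-separated file with a header row and a guid column; summarize_combined_metadata is a hypothetical helper, not part of the Gen3 SDK.

```python
import csv
import io


def summarize_combined_metadata(tsv_text, key_column="guid", prefix="DBGAP_FHIR_"):
    """Summarize a combined TSV: row count, new columns, duplicates, unmatched rows.

    Hypothetical helper for illustration; not part of the Gen3 SDK.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    prefixed_columns = [c for c in (reader.fieldnames or []) if c.startswith(prefix)]

    seen, duplicates, unmatched = set(), [], []
    for row in rows:
        key = row[key_column]
        if key in seen:
            duplicates.append(key)
        seen.add(key)
        # A row with every prefixed column empty received no new metadata
        if not any(row[c] for c in prefixed_columns):
            unmatched.append(key)

    return {
        "rows": len(rows),
        "prefixed_columns": prefixed_columns,
        "duplicate_keys": duplicates,
        "rows_without_new_metadata": unmatched,
    }
```

In practice the text could come from reading combined_discovery_metadata.tsv, for example with open(output_filename, newline="").read().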