The OpenTrials project has 4 main components:

- Collectors: contains logic for gathering data (e.g. scrapers) and manages the schema for our `warehouse` database that keeps the data collected from different sources
- Processors: contains logic for processing and inserting data from `warehouse` into our `api` database
- OpenTrials API: manages the schema for our `api` database and contains logic for exposing and indexing the data inside it
- OpenTrials Explorer: displays data from our API and manages the `explorer` database that keeps users and user-related data
This system is responsible for normalizing and enriching the data in our `warehouse` and `api` databases, and for managing our file storage.
Processors are fully compatible with Python 2.7.
We use PostgreSQL for our databases and Docker Cloud to deploy and run the processors in production.
Processors are independent Python modules that share the following signature:

```python
def process(conf, conn, *args):
    pass
```

Where the arguments are:

- `conf` - config dict
- `conn` - connections dict
- `args` - processor arguments
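As a minimal, self-contained sketch of this signature (the connection keys used below are illustrative assumptions, not the project's actual configuration):

```python
# A trivial processor module following the process(conf, conn, *args) signature.

def process(conf, conn, *args):
    # conf - config dict; conn - connections dict; args - processor arguments.
    # The connection keys ('warehouse', 'database') are assumptions for
    # illustration; the real keys depend on the project configuration.
    names = sorted(conn.keys())
    print('Available connections: %s' % ', '.join(names))
    return names
```

The print and return value are purely illustrative; real processors typically read from and write to the databases instead.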
To run a processor from the command line:

```
$ make start <name> [<args>]
```

This triggers a `processors.<name>.process(conf, conn, *args)` call.
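The dispatch from `<name>` to that call can be sketched with `importlib` — a guess at the mechanism for illustration, not the project's actual entry point:

```python
import importlib

def run_processor(name, conf, conn, args):
    # Resolve the processors.<name> module and invoke its process function,
    # mirroring what `make start <name> [<args>]` ultimately triggers.
    module = importlib.import_module('processors.%s' % name)
    return module.process(conf, conn, *args)
```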
One of the most common use cases for processors is to extract and standardize data from our `warehouse` database into entities that comply with the structure of our `api` database.
Extractors are functions that map entity representations in different registries to the OpenTrials `api` database schema.
E.g. given two registries, NCT and EUCTR, and their corresponding extractors (an NCT trial extractor and a EUCTR trial extractor):
```python
# NCT
nct_record = {
    'nct_number': 'nct15',
    'main_title': 'name1',
    ...
}
trial = extract_trial(nct_record)
print(trial)
# {
#     'primary_id': 'nct15',
#     'public_title': 'name1',
#     ...
# }

# EUCTR
euctr_record = {
    'trial_id': 'euctr2004',
    'euro_title': 'name3',
    ...
}
trial = extract_trial(euctr_record)
print(trial)
# {
#     'primary_id': 'euctr2004',
#     'public_title': 'name3',
#     ...
# }
```
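A minimal NCT extractor behind the example above could be a plain field mapping — a sketch only; real extractors normalize many more fields and edge cases:

```python
def extract_trial(record):
    # Map NCT source field names onto the API trial schema.
    # Only two fields are shown; the '...' in the example stands for the rest.
    return {
        'primary_id': record['nct_number'],
        'public_title': record['main_title'],
    }
```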
Writers are modules that hold the logic for creating and updating entities in a database without creating duplicates. In the folder `processors/base/writers` we already have writers for different database entities (e.g. `trial`, `person`, etc.) that you can use. See the documentation in the source code for how to use them.
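The essential writer behavior is an upsert keyed on a stable identifier, so re-running a processor updates existing rows instead of duplicating them. A toy sketch of that pattern over plain dicts — the helper name and arguments are hypothetical, not the actual writers' API:

```python
def write_entity(db, table, entity, key):
    # Upsert: create the row if its key is new, otherwise update it in place.
    rows = db.setdefault(table, {})
    identifier = entity[key]
    if identifier in rows:
        rows[identifier].update(entity)  # update existing row, no duplicate
    else:
        rows[identifier] = dict(entity)  # create new row
    return identifier
```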
In the folder `processors/base/processors` there are a few lower-level processors that contain the logic surrounding certain `api` database entities (e.g. `trial`, `publication`, etc.). Their main role is to manage the Extractors and Writers. These base processors cannot be invoked directly from the command line; they are meant for use in other processors.

E.g. the `trial` processor extracts and writes a `trial`, and also creates and links the `trial`'s related entities. It contains the following function:
```python
def process_trials(conn, table, extractors):
    pass
```
Where the arguments are:

- `conn` - connections dict
- `table` - name of the table from `warehouse` that contains unstructured records
- `extractors` - dict of functions that map the unstructured records into `trial`, `document` and other `trial`-related entities
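A registry-specific processor would then mostly wire its extractors into the base processor. A rough sketch — the module path follows the folder layout above, but the extractor names, dict keys and table name are illustrative assumptions:

```python
def process(conf, conn, *args):
    # Imported inside the function only to keep this sketch self-contained;
    # a real module would import at the top of the file.
    from processors.base.processors.trial import process_trials

    def extract_trial(record):
        # Hypothetical minimal mapping for illustration.
        return {'primary_id': record.get('nct_number'),
                'public_title': record.get('main_title')}

    extractors = {'extract_trial': extract_trial}
    # 'nct' is an assumed warehouse table name holding unstructured records.
    process_trials(conn, 'nct', extractors)
```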
Processors can perform any operation needed to manage the data stores: removing records, linking records, etc. Just make sure to keep the logic for gathering data from outside sources in the Collectors.