-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Submission portal metadata ingest #118
base: main
Are you sure you want to change the base?
Conversation
#independent functions that I've added to this overarching class | ||
|
||
#API things are still missing as I only just got access to them ~1 hour before pushing this code up | ||
class submission_portal_client: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class names should typically be PascalCase. So you would have SubmissionPortalClient
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ive made this update in the code.
|
||
|
||
def mint_nmdc_ids(self,num_ids): | ||
json_dic = {"schema_class": {"id": "nmdc:Activity"},"how_many": num_ids} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you want to be minting ids for the nmdc:Activity
type? Isn't it nmdc:Biosample
because you're trying to assign ids to each row from the Submission Portal interface, and each of those rows are biosamples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was copying the example for minting IDs and hadnt realized there were multiple types. Ive made this change and it now mints biosample IDs.
sample_annotator/main.py
Outdated
import sys | ||
from clients.submission_portal_client import submission_portal_client | ||
|
||
|
||
|
||
client = submission_portal_client('.env') | ||
r = client.get_submission_json('e4cf7c2c-29b4-4924-8452-2bd3c8e518db') | ||
biosample_db_lis = client.process_json_response(r) | ||
for i in range(0,len(biosample_db_lis)): | ||
client.dump_db(biosample_db_lis[i],"output_" + str(i) + ".json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a test for your submission_portal_client.py
code instead of a Python module this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main.py is removed and a unit test file has been created in the tests directory with a few tests.
param_dic = {} | ||
for key in samp_dic.keys(): | ||
|
||
if(samp_dic[key] == 'nan' or samp_dic[key] == 'null' or samp_dic[key] == None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you're using parentheses in your if
conditionals? It's not required if you're not trying to enforce an order of operations. If you had a combination of or
, and
and ()
groupings it would be useful to wrap if
conditionals in parentheses, but I don't think it's necessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theres no reason other than habit. I also work with a few other languages that do require parentheses, so I got into the habit of always using them. I can remove them if needed.
|
||
if (key in string_params): | ||
if (key == 'id'): | ||
param_dic['alternative_identifiers'] = samp_dic[key] #this vlue might need to be different than 'alternative_identifiers' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, as you correctly anticipated, it might need to be source_mat_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ive made this change, checking the updated submission portal API output this field is now labeled as source_mat_id there as well.
|
||
else: | ||
if(";" in samp_dic[key]): | ||
temp = samp_dic[key].split(";") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The split()
operation returns a list. Did you mean to pick the first element and use that as the id after you've done the split?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were a few issues with fields that had semicolon delimiters so if they were present i was converting them to a list. With the updated api output this isnt a problem anymore so ive removed this section of code.
temp = nmdc.ControlledTermValue(term=nmdc.OntologyClass(id=env_code)) | ||
param_dic[key] = temp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be a ControlledIdentifiedTermValue and not a ControlledTermValue. You can what dataclasses to use for slots by looking at this page: https://microbiomedata.github.io/nmdc-schema/Biosample/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ive made this update and the output now looks correct, including the id, name, and raw value.
|
||
#if something shows up in the else statement it is missing from the mappings | ||
else: | ||
print(key + " not found in any current param list") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General tip: Use logging module logger statements instead of print statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ive added logging to replace prints now.
temp = nmdc.ControlledTermValue(term=nmdc.OntologyClass(id="0000")) | ||
param_dic[item] = temp | ||
|
||
sample = nmdc.Biosample(**param_dic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent 🌟🌟
import logging | ||
from clients.nmdc.runtime_api_client import RuntimeApiSiteClient | ||
from dotenv import dotenv_values | ||
import dateutil.parser as parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from dateutil import parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this change.
@@ -0,0 +1,189 @@ | |||
import nmdc_schema.nmdc as nmdc | |||
import linkml | |||
import linkml_runtime.dumpers.json_dumper as dumper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from linkml_runtime.dumpers import json_dumper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this change.
#database: an NMDC database object | ||
#outfile_path: a string, filepath for desired output | ||
def dump_db(self,database,outfile_path): | ||
dumper.dump(element = database, to_file = outfile_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use indent=4
as argument to dump()
for better formatted JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldnt figure this one out, i didnt see an indent parameter for the function so this is unchanged at the moment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left very trivial comments on the PR again @JamesTessmer. Good work!
Updated documentation and finalized some functions in sub portal client.
@sujaypatil96 just pushed up some changes with suggested modifications. This includes the water content fix. |
Code for ingesting submission portal metadata -> mongo.
I used Sujay's gold-bioscales branch as a base as we had discussed needing to use the RuntimeApiSiteClient.
So far the code includes:
Additions for this task include an example driver file (main.py), a python file that has the submission_portal_client object and helper functions (submission_portal_client.py).
Initially a CSV file mapping the submission portal metadata slots to the MongoDB slots was needed, but with the updated to the submission portal these now match and it has been removed.