Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submission portal metadata ingest #118

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

JamesTessmer
Copy link

@JamesTessmer JamesTessmer commented Mar 13, 2023

Code for ingesting submission portal metadata -> mongo.

I used Sujay's gold-bioscales branch as a base as we had discussed needing to use the RuntimeApiSiteClient.
So far the code includes:

  • API call to collect json output from the submission portal
  • Processing of samples in that response into floats/textValues/quanitityValues/etc as needed, including minting new IDs
  • Putting these into an NMDC database object and then dumping that as a json.

Additions for this task include an example driver file (main.py), a python file that has the submission_portal_client object and helper functions (submission_portal_client.py).

Initially a CSV file mapping the submission portal metadata slots to the MongoDB slots was needed, but with the updated to the submission portal these now match and it has been removed.

sujaypatil96 and others added 30 commits December 15, 2022 15:27
@JamesTessmer JamesTessmer marked this pull request as draft March 13, 2023 21:35
#independent functions that I've added to this overarching class

#API things are still missing as I only just got access to them ~1 hour before pushing this code up
class submission_portal_client:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class names should typically be PascalCase. So you would have SubmissionPortalClient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive made this update in the code.



def mint_nmdc_ids(self,num_ids):
json_dic = {"schema_class": {"id": "nmdc:Activity"},"how_many": num_ids}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you want to be minting ids for the nmdc:Activity type? Isn't it nmdc:Biosample because you're trying to assign ids to each row from the Submission Portal interface, and each of those rows are biosamples?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was copying the example for minting IDs and hadnt realized there were multiple types. Ive made this change and it now mints biosample IDs.

Comment on lines 1 to 10
import sys
from clients.submission_portal_client import submission_portal_client



client = submission_portal_client('.env')
r = client.get_submission_json('e4cf7c2c-29b4-4924-8452-2bd3c8e518db')
biosample_db_lis = client.process_json_response(r)
for i in range(0,len(biosample_db_lis)):
client.dump_db(biosample_db_lis[i],"output_" + str(i) + ".json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a test for your submission_portal_client.py code instead of a Python module this way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main.py is removed and a unit test file has been created in the tests directory with a few tests.

param_dic = {}
for key in samp_dic.keys():

if(samp_dic[key] == 'nan' or samp_dic[key] == 'null' or samp_dic[key] == None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why you're using parentheses in your if conditionals? It's not required if you're not trying to enforce an order of operations. If you had a combination of or, and and () groupings it would be useful to wrap if conditionals in parentheses, but I don't think it's necessary here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theres no reason other than habit. I also work with a few other languages that do require parentheses, so I got into the habit of always using them. I can remove them if needed.


if (key in string_params):
if (key == 'id'):
param_dic['alternative_identifiers'] = samp_dic[key] #this vlue might need to be different than 'alternative_identifiers'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, as you correctly anticipated, it might need to be source_mat_id

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive made this change, checking the updated submission portal API output this field is now labeled as source_mat_id there as well.


else:
if(";" in samp_dic[key]):
temp = samp_dic[key].split(";")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split() operation returns a list. Did you mean to pick the first element and use that as the id after you've done the split?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a few issues with fields that had semicolon delimiters so if they were present i was converting them to a list. With the updated api output this isnt a problem anymore so ive removed this section of code.

Comment on lines 102 to 103
temp = nmdc.ControlledTermValue(term=nmdc.OntologyClass(id=env_code))
param_dic[key] = temp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a ControlledIdentifiedTermValue and not a ControlledTermValue. You can what dataclasses to use for slots by looking at this page: https://microbiomedata.github.io/nmdc-schema/Biosample/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive made this update and the output now looks correct, including the id, name, and raw value.


#if something shows up in the else statement it is missing from the mappings
else:
print(key + " not found in any current param list")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General tip: Use logging module logger statements instead of print statements

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive added logging to replace prints now.

temp = nmdc.ControlledTermValue(term=nmdc.OntologyClass(id="0000"))
param_dic[item] = temp

sample = nmdc.Biosample(**param_dic)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent 🌟🌟

import logging
from clients.nmdc.runtime_api_client import RuntimeApiSiteClient
from dotenv import dotenv_values
import dateutil.parser as parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from dateutil import parser

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change.

@@ -0,0 +1,189 @@
import nmdc_schema.nmdc as nmdc
import linkml
import linkml_runtime.dumpers.json_dumper as dumper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from linkml_runtime.dumpers import json_dumper

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change.

#database: an NMDC database object
#outfile_path: a string, filepath for desired output
def dump_db(self,database,outfile_path):
dumper.dump(element = database, to_file = outfile_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use indent=4 as argument to dump() for better formatted JSON.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldnt figure this one out, i didnt see an indent parameter for the function so this is unchanged at the moment

Copy link
Contributor

@sujaypatil96 sujaypatil96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left very trivial comments on the PR again @JamesTessmer. Good work!

James Tessmer added 2 commits March 30, 2023 14:09
Updated documentation and finalized some functions in sub portal client.
@JamesTessmer
Copy link
Author

@sujaypatil96 just pushed up some changes with suggested modifications. This includes the water content fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants