Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs page containing table that maps typecodes to schema classes #2319

Conversation

eecavanna
Copy link
Collaborator

@eecavanna eecavanna commented Jan 12, 2025

On this branch, I implemented a new page of the schema docs—a page that contains a table that can be used to map a typecode to a schema class. I added a link to that page, to the sidebar navigation menu. The page gets generated dynamically during the documentation build process.

Here's what the page looks like on the website:

image

Highlights:

  • Each schema class name appears at most once in the table. When multiple typecodes map to the same schema class, the typecodes are listed on the same row as one another, as a comma-separated list. (FYI, @turbomam)
  • Each schema class name is a link to the schema documentation page for that class. The linked page opens in a new web browser tab (unless the web browser is configured to not do that). (FYI, @SamuelPurvine)

Fixes #2285

@eecavanna eecavanna self-assigned this Jan 12, 2025
@eecavanna eecavanna linked an issue Jan 12, 2025 that may be closed by this pull request
Copy link

github-actions bot commented Jan 12, 2025

PR Preview Action v1.6.0

🚀 View preview at
https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2319/

Built to branch gh-pages at 2025-01-16 21:19 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@eecavanna
Copy link
Collaborator Author

Until someone opens up another PR in this repo, the newly-implemented page will be preview-able at https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2319/typecode-to-class-map/.

@turbomam
Copy link
Member

turbomam commented Jan 13, 2025

This is a good improvement to the documentation and I am inclined to approve it.

Question: what is the native sort order of the table? The sorting controls work for me, so it isn't a critical problem. I just couldn't figure out the order on my own.

Request: please add some brief documentation about the differences between how this code determines the typecodes of classes, vs the code that implements https://api.microbiomedata.org/docs#/metadata/get_nmdc_schema_typecodes_nmdcschema_typecodes_get retrieves

What are the advantages and disadvantages of both? Is there a plan for ensuring that the two tools stay in sync?

@turbomam turbomam closed this Jan 13, 2025
@turbomam turbomam reopened this Jan 13, 2025
@SamuelPurvine
Copy link
Contributor

Well, this certainly looks like I was imagining! I'm fine with sorting it in-page, or having it sort by typecode ascending on open. I definitely approve of this one, though I'll defer to Mark for the closing of the issue.

Many thanks for being willing to make this! It's a nice snapshot of what we "do", where the rubber hits the road in terms of entities we track, at least from my point of view.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 13, 2025

Thanks, @turbomam and @SamuelPurvine.

sort order

I haven't done any explicit sorting in the code. As a result, the current order is based on the order in which a SchemaView instance happens to return class definitions in response to me calling schema_view.all_classes().

I will update the table to be sorted as follows (based upon the latest comment from @SamuelPurvine above):

  • For each row, sort the typecodes (A-Z)
  • For the table, sort the rows by the first typecode in that row (A-Z)

It's a two-stage sorting algorithm because each "Typecode(s)" cell can contain multiple values, and those values have some order within the context of that cell. For example, if the values in a given "Typecode(s)" cell are "ccc, aaa", the first stage will sort them to be "aaa, ccc", and the second stage will move that row into an appropriate place in the table overall (e.g. above the row where the "Typecode(s)" value is, say, "bbb").

Request: please add some brief documentation about the differences [...]

I'd like to chat with you about this as I think there is something else that could be at play: storing the "typecode for newly-minted IDs" in its own slot in the class's definition, rather than storing it in a special place within a regex (I don't know whether the latter is documented anywhere other than in the minter's code and in meeting notes).

Since we do currently have that "dual use" of that slot, I'd like the have a couple helper functions implemented in the nmdc-schema repo that are used by (a) the script that generates this table and (b) the nmdc-runtime code used by that endpoint and by the minter. After my 11-12pm meeting today, I'll explore adding such a helper function to this PR. I think having import-table functions like that would facilitate "keeping the two tools in sync."

Long-term, I want the "typecode for newly-minted IDs" to be stored separately in the schema, than being in a special place in the regex of "ID format that is legal for instances of this class."

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 13, 2025

As planned, I made it so the native sort order is the result of:

  1. On each row, the typecodes are sorted alphabetically
  2. Then, the rows are sorted by the first typecode on the row

Here's what the result looks like:

image

@eecavanna
Copy link
Collaborator Author

Also, I extracted the id pattern-parsing logic into helper functions that can be imported by the Runtime (which I think will facilitate using the same algorithm in both places). In one of the helper functions, I included this note:

    Note: We currently use the `pattern` property of the `id` slot for two different things: (a) to specify which
          typecode we want ID generators (e.g., instances of the NMDC ID Minter) to use when generating new identifiers
          (specifically, we want them to use only the _first_ typecode that occurs in the pattern), and (b) to specify
          which typecodes we want ID validators (e.g., instance of the NMDC Runtime) to allow identifiers of instances
          of that class to contain (they can contain _any_ one of the typecodes that occurs in the pattern). This
          function returns the list for "use (b)." Eventually, we may specify the above two things using separate
          schema elements.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 13, 2025

Is there a plan for ensuring that the two tools stay in sync?

My plan is to update the Runtime to use the newly-implemented helper method named get_typecode_for_future_ids from the version of the nmdc-schema PyPI package that this PR gets introduced in, instead of the Runtime having its own code that parses the pattern value from the schema.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 13, 2025

Request: please add some brief documentation about the differences between [...]

What are the advantages and disadvantages of both?

Since I have moved the typecode-extraction code into import-able helper functions and — once this PR gets merged into main and a PyPI package containing those helper functions gets published — I plan to update the Runtime to use those helper functions; I think the aforementioned request is obsolete.

@eecavanna
Copy link
Collaborator Author

FYI:

Until someone opens up another PR in this repo

Someone has opened up another PR in this repo, so the preview deployment of this PR no longer exists.

@eecavanna
Copy link
Collaborator Author

I'm ready for this branch to be merged in.

@eecavanna
Copy link
Collaborator Author

During today's (1/15/2025), @aclum mentioned that she wants this table to include a column that indicates which MongoDB collection instances of a given class can reside in (according to the schema).

I'd prefer to merge this in without that column and then create a follow-on ticket about adding that column to the table. Are you OK with that sequence of events, @aclum?

@aclum
Copy link
Contributor

aclum commented Jan 16, 2025

@eecavanna i agree that it makes sense to merge this in and add mongo collection as a new request. Let's get this merged so it can be part of the January release.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 16, 2025

Thanks, @aclum—I will create an issue for that now and add a link to this PR to it.

Edit: Here's a link to the newly-created ticket: #2323

Hi @turbomam, are you OK merging this in as is? If not, is there anything in particular you want me to do on this branch?

@eecavanna
Copy link
Collaborator Author

I'll resolve those merge conflicts locally now and push up the resulting branch.

Copy link
Member

@turbomam turbomam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the implementation of identifying future typecodes, but everything else looks great!

…de-to-schema-class-name

# Conflicts:
#	Makefile
#	mkdocs.yml
@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 16, 2025

Thanks, @turbomam. I'll try to explain that here (I think you are saying you don't understand the get_typecode_for_future_ids function).

Here's that function, including its doctests:

def get_typecode_for_future_ids(slot_pattern: str) -> Optional[str]:
    r"""
    Returns the typecode, if any, that schema authors want future values of the specified `id` field to contain.
    :param slot_pattern: The value of the `pattern` property of the `id` slot definition
    >>> get_typecode_for_future_ids("(nmdc):foo-...") is None
    True
    >>> get_typecode_for_future_ids("^(nmdc):foo-...")
    'foo'
    >>> get_typecode_for_future_ids("^(nmdc):(foo|bar)-...")
    'foo'
    >>> get_typecode_for_future_ids("^(nmdc):(foo|bar|baz)-...")
    'foo'
    """
    compatible_typecodes = get_compatible_typecodes(slot_pattern)
    return compatible_typecodes[0] if len(compatible_typecodes) > 0 else None

Given a regex pattern like the following...

syntax: "{id_nmdc_prefix}:(omprc|dgns)-{id_shoulder}-{id_blade}$"

...here's what this function does:

  1. It uses the get_compatible_typecodes function to extract the typecodes from that regex pattern. I'll call these compatible_typecodes, since they are all of the typecodes that a valid ID for an instance of that class can have. In this case, compatible_typecodes is a list consisting of two items: "omprc" and "dgns".
  2. It returns the first typecode (if any) in that list. That is what team members have expressed that they want the minter to use when generating IDs for instances of that class. I refer to those IDs as "future IDs" because they haven't been generated yet.

@eecavanna eecavanna merged commit 55e96d9 into main Jan 16, 2025
3 checks passed
@eecavanna eecavanna deleted the 2285-make-a-quick-reference-that-maps-typecode-to-schema-class-name branch January 16, 2025 21:26
@eecavanna
Copy link
Collaborator Author

I plan to update the Runtime to use those helper functions

Here's the ticket in the Runtime repo where I am tracking the task of updating the Runtime to use the helper function that is now being exported from the nmdc-schema package: microbiomedata/nmdc-runtime#866

@eecavanna
Copy link
Collaborator Author

Here's a link to the Runtime PR where I replaced its own "extract typecode from pattern" function with the one implemented in the nmdc-schema package: microbiomedata/nmdc-runtime#878

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make a quick reference that maps typecode to schema class name
4 participants