Portable and stateless entities #226

Open
goodmami opened this issue Dec 8, 2024 · 0 comments

goodmami commented Dec 8, 2024

Is your feature request related to a problem? Please describe.

Entities returned from queries, like Words, Senses, Synsets, etc., have some hidden fields that aid in secondary queries, but are tied to the database the entity data came from. Namely:

  • _id -- rowid of the entity in the database
  • _lexid -- rowid of the entity's lexicon in the database
  • _wordnet -- Wordnet object used to query the entity

This means if you query the same thing from different contexts, you'll have slightly different entities. For example, starting from a blank database:

>>> import wn
>>> wn.download("omw-es")  # Add Spanish wordnet
>>> wn.download("omw-en")  # Add OMW English wordnet
>>> palabra_1 = wn.words("palabra", pos="n", lexicon="omw-es")[0]  # get a word
>>> wn.remove("omw-es")  # Remove and re-add Spanish wordnet
>>> wn.download("omw-es")
>>> palabra_2 = wn.words("palabra", pos="n", lexicon="omw-es")[0]  # get same word
>>> palabra_1 == palabra_2  # same word doesn't compare equal
False
>>> # same word from same installed lexicon does compare equal
>>> palabra_2 == wn.words("palabra", pos="n", lexicon="omw-es")[0]
True

The entities are therefore stateful: an entity's value depends not just on what it is but also on how, when, and where you got it.
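
To make this concrete without needing a database, here is a minimal sketch of the effect. `StatefulWord` is a hypothetical stand-in, not wn's actual class, and the ID string and rowid values are made up; it only assumes that equality takes the hidden rowids into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatefulWord:
    """Hypothetical stand-in for an entity that carries database rowids."""

    id: str      # public ID from the lexicon (made-up value below)
    _id: int     # rowid of the entity; changes when the lexicon is re-added
    _lexid: int  # rowid of the entity's lexicon; also database-specific


# The same word, fetched before and after removing and re-adding the lexicon.
before = StatefulWord(id="omw-es-palabra-n", _id=101, _lexid=1)
after = StatefulWord(id="omw-es-palabra-n", _id=5207, _lexid=3)

print(before == after)        # False: the rowids differ
print(before.id == after.id)  # True: the public ID is unchanged
```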

Describe the solution you'd like

Entities should be stateless: the same entity retrieved from differently constructed databases should compare equal, pickled objects should be portable (#84), etc.

In order for this to happen, we need to make some changes:

  • _id -- remove; entities are looked up by their regular id within the lexicon; entities without guaranteed IDs in WN-LMF (like forms and relations) are fully resolved (metadata, tags, pronunciations, etc.)
  • _lexid -- replace with something like lexicon_id that holds the lexicon specifier, like omw-es:1.4; the .lexicon() method should use this to look up the lexicon data
  • _wordnet -- replace with some new data structure (BaseWordnet?) that only stores the list of primary and expand lexicons
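
As a rough illustration of the first two changes, an entity that stores only its regular ID and a lexicon specifier could look like the sketch below. `PortableWord` and the ID string are hypothetical; `lexicon_id` follows the naming floated above. Nothing here is wn's actual API.

```python
import pickle
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PortableWord:
    """Hypothetical stateless entity: no rowids, no database handle."""

    id: str          # regular ID within the lexicon (made-up value below)
    lexicon_id: str  # lexicon specifier such as "omw-es:1.4"


w1 = PortableWord(id="omw-es-palabra-n", lexicon_id="omw-es:1.4")
w2 = PortableWord(id="omw-es-palabra-n", lexicon_id="omw-es:1.4")

assert w1 == w2            # equal no matter which database produced them
assert len({w1, w2}) == 1  # hashable, so duplicates collapse in a set

# The state is just two strings, so it survives serialization unchanged.
payload = pickle.dumps(astuple(w1))
restored = PortableWord(*pickle.loads(payload))
assert restored == w1
```

With this shape, any query that previously used the rowid would have to resolve the entity from the `(lexicon_id, id)` pair instead.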

Further regarding _wordnet, it is stored so that secondary queries (Word.synsets(), Synset.relations(), etc.) consider the primary and expand lexicons used in the primary query. The other configurables of Wordnet, like the lemmatizer, normalizer, and search_all_forms flag, are only used in primary queries and are thus unneeded on entities. The lemmatizer and normalizer are also hard to serialize because they are functions.
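
A minimal sketch of what such a structure might hold, using the tentative `BaseWordnet` name from above. The field names `lexicons` and `expand` and the specifier values are assumptions for illustration, not a settled design.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class BaseWordnet:
    """Hypothetical sketch: only the lexicon lists, no functions."""

    lexicons: Tuple[str, ...]     # primary lexicon specifiers
    expand: Tuple[str, ...] = ()  # expand lexicon specifiers


ctx = BaseWordnet(lexicons=("omw-es:1.4",), expand=("omw-en:1.4",))

# Only plain strings are stored, so no custom serialization hooks are needed.
payload = json.dumps(asdict(ctx))
data = json.loads(payload)
restored = BaseWordnet(tuple(data["lexicons"]), tuple(data["expand"]))
assert restored == ctx
```

Because the lemmatizer and normalizer are left out, two contexts built from the same lexicon lists compare equal even if the originating Wordnet objects were configured differently.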

Describe alternatives you've considered

#84 proposed a custom deepcopy method as an option, but that would only solve one use case affected by the stateful entities.

Additional context

Not using the db-internal rowids will complicate and possibly slow down the queries. Hopefully it is not too bad, but maybe benchmarking is a good idea.
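
One possible shape for such a benchmark is sketched below. The two-table schema, column names, and row counts are invented for the example and are not wn's real schema; the point is only to compare a rowid lookup against a lookup by public ID plus lexicon specifier.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified schema (not wn's actual tables).
    CREATE TABLE lexicons (
        rowid INTEGER PRIMARY KEY, id TEXT, version TEXT,
        UNIQUE (id, version)
    );
    CREATE TABLE entries (
        rowid INTEGER PRIMARY KEY, id TEXT, lexicon_rowid INTEGER,
        UNIQUE (id, lexicon_rowid)
    );
    """
)
conn.execute("INSERT INTO lexicons (id, version) VALUES ('omw-es', '1.4')")
conn.executemany(
    "INSERT INTO entries (id, lexicon_rowid) VALUES (?, 1)",
    ((f"omw-es-w{i}-n",) for i in range(10_000)),
)


def by_rowid():
    """Current approach: look the entry up by its database rowid."""
    sql = "SELECT id FROM entries WHERE rowid = ?"
    return conn.execute(sql, (5000,)).fetchone()


def by_public_id():
    """Proposed approach: look it up by entry ID and lexicon specifier."""
    sql = (
        "SELECT e.id FROM entries AS e "
        "JOIN lexicons AS l ON l.rowid = e.lexicon_rowid "
        "WHERE e.id = ? AND l.id = ? AND l.version = ?"
    )
    return conn.execute(sql, ("omw-es-w4999-n", "omw-es", "1.4")).fetchone()


# Both lookups must return the same row before timings mean anything.
assert by_rowid() == by_public_id()

for fn in (by_rowid, by_public_id):
    seconds = timeit.timeit(fn, number=2000)
    print(f"{fn.__name__}: {seconds:.4f}s for 2000 lookups")
```

The UNIQUE constraints give SQLite an index for the ID-based lookup; without an equivalent index in the real schema, the gap would likely be larger.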

@goodmami goodmami added the enhancement New feature or request label Dec 8, 2024
@goodmami goodmami added this to the v1.0 milestone Dec 8, 2024
goodmami added a commit that referenced this issue Apr 3, 2025
This is mainly around getting lexicons, dependencies, and extensions.
Some temporary non-public functions were needed. If/when lexicon
specifiers are used more generally, these functions will no longer be
needed.

Part of #226