Is your feature request related to a problem? Please describe.
Entities returned from queries, like Words, Senses, Synsets, etc., have some hidden fields that aid in secondary queries, but are tied to the database the entity data came from. Namely:
_id -- rowid of the entity in the database
_lexid -- rowid of the entity's lexicon in the database
_wordnet -- Wordnet object used to query the entity
This means if you query the same thing from different contexts, you'll have slightly different entities. For example, starting from a blank database:
>>> import wn
>>> wn.download("omw-es") # Add Spanish wordnet
>>> wn.download("omw-en") # Add OMW English wordnet
>>> palabra_1 = wn.words("palabra", pos="n", lexicon="omw-es")[0] # get a word
>>> wn.remove("omw-es") # Remove and re-add Spanish wordnet
>>> wn.download("omw-es")
>>> palabra_2 = wn.words("palabra", pos="n", lexicon="omw-es")[0] # get same word
>>> palabra_1 == palabra_2 # same word doesn't compare equal
False
>>> # same word from same installed lexicon does compare equal
>>> palabra_2 == wn.words("palabra", pos="n", lexicon="omw-es")[0]
True
The entities are therefore stateful: an entity's value depends not just on what it is but also on how, when, and where you got it.
Describe the solution you'd like
Entities should be stateless; the same entity retrieved from differently-constructed databases should compare as the same, pickled objects should be portable (#84), etc.
In order for this to happen, we need to make some changes:
_id -- remove; entities are looked up by their regular id within the lexicon; entities without guaranteed IDs in WN-LMF (like forms and relations) are fully resolved (metadata, tags, pronunciations, etc.)
_lexid -- replace with something like lexicon_id that holds the lexicon specifier (e.g., omw-es:1.4); the .lexicon() method should use this to look up the lexicon data
_wordnet -- replace with some new data structure (BaseWordnet?) that only stores the list of primary and expand lexicons
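To make the proposed changes concrete, here is a minimal sketch of what a stateless entity might look like. Everything in it is an assumption for illustration: `StatelessWord`, `LexiconConfig` (a stand-in for the tentative BaseWordnet), the field names, and the example entry ID are hypothetical and not part of wn. Whether the lexicon lists should take part in equality is an open design choice; the sketch excludes them.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class LexiconConfig:
    """Hypothetical stand-in for BaseWordnet: only the lexicon lists."""
    lexicons: Tuple[str, ...] = ()  # primary lexicon specifiers
    expands: Tuple[str, ...] = ()   # expand lexicon specifiers


@dataclass(frozen=True)
class StatelessWord:
    """Hypothetical entity with no database rowids."""
    id: str          # the entity's regular ID within its lexicon
    lexicon_id: str  # lexicon specifier, e.g. "omw-es:1.4"
    # Kept for secondary queries but ignored by == and hash().
    config: LexiconConfig = field(default_factory=LexiconConfig, compare=False)


# The entry ID below is made up for the example.
w1 = StatelessWord("example-palabra-n", "omw-es:1.4",
                   LexiconConfig(("omw-es:1.4",), ("omw-en:1.4",)))
w2 = StatelessWord("example-palabra-n", "omw-es:1.4")
print(w1 == w2)
```

Because equality here depends only on the regular ID and the lexicon specifier, removing and re-adding a lexicon would not change how two such objects compare.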
Further regarding _wordnet, it is stored so secondary queries (Word.synsets(), Synset.relations(), etc.) consider the primary and expand lexicons used in the primary query. The other configurables of Wordnet, like the lemmatizer, normalizer, and search_all_forms flag, are only used in primary queries and are thus unneeded on the entity. The lemmatizer and normalizer are also hard to serialize because they are functions.
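The serialization point can be illustrated with the standard pickle module, which stores functions by reference only, so lambdas and closures cannot be pickled at all. The snippet below is a sketch using stand-in values: the `normalizer` lambda and the `lexicon_lists` dict are hypothetical and are not wn's actual objects.

```python
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Hypothetical stand-in for a user-supplied normalizer function.
normalizer = lambda s: s.lower()

# Hypothetical stand-in for the data a slimmed-down structure would keep.
lexicon_lists = {"lexicons": ("omw-es:1.4",), "expands": ("omw-en:1.4",)}

print(is_picklable(normalizer))
print(is_picklable(lexicon_lists))
```

A named module-level function does pickle, but only as a reference that must be importable wherever the pickle is loaded, which is another reason to keep such callables off the entities.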
Describe alternatives you've considered
#84 proposed a custom deepcopy method as an option, but that would only solve one use case affected by the stateful entities.
Additional context
Not using the db-internal rowids will complicate and possibly slow down the queries. Hopefully it is not too bad, but maybe benchmarking is a good idea.
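A rough way to gauge the cost is to time both lookup styles against a toy SQLite database. The schema below is an assumption made up for this sketch (it is not wn's actual schema), as are the function names and the synthetic IDs; real numbers would have to come from the real database.

```python
import sqlite3
import timeit


def build_db(n=10000):
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lexicons (rowid INTEGER PRIMARY KEY, specifier TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE entries (rowid INTEGER PRIMARY KEY, lexicon_rowid INTEGER,"
        " id TEXT, UNIQUE (lexicon_rowid, id))"
    )
    conn.execute("INSERT INTO lexicons (specifier) VALUES ('omw-es:1.4')")
    conn.executemany(
        "INSERT INTO entries (lexicon_rowid, id) VALUES (1, ?)",
        ((f"w{i}",) for i in range(n)),
    )
    return conn


def by_rowid(conn, rowid):
    """Current style: look up an entry by its database rowid."""
    return conn.execute(
        "SELECT rowid, id FROM entries WHERE rowid = ?", (rowid,)
    ).fetchone()


def by_public_id(conn, specifier, entry_id):
    """Proposed style: look up by lexicon specifier plus regular ID."""
    return conn.execute(
        "SELECT e.rowid, e.id FROM entries e"
        " JOIN lexicons l ON l.rowid = e.lexicon_rowid"
        " WHERE l.specifier = ? AND e.id = ?",
        (specifier, entry_id),
    ).fetchone()


if __name__ == "__main__":
    conn = build_db()
    t_rowid = timeit.timeit(lambda: by_rowid(conn, 5000), number=10000)
    t_public = timeit.timeit(
        lambda: by_public_id(conn, "omw-es:1.4", "w4999"), number=10000
    )
    print(f"rowid lookup:     {t_rowid:.3f}s")
    print(f"public-id lookup: {t_public:.3f}s")
```

The UNIQUE constraint gives the public-ID query an index to use, so the comparison is index lookup plus join against a direct rowid lookup. Without such an index the proposed style would degrade to a table scan.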
This is mainly around getting lexicons, dependencies, and extensions.
Some temporary non-public functions were needed. If/when lexicon
specifiers are used more generally, these functions will no longer be
needed.
Part of #226