-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Sequence Length Index to speed up initialisation of the data loader #3
base: main
Are you sure you want to change the base?
Conversation
@davnov134 Please review. |
_SEQUENCE_LENGTHS_TABLE: str = "sequence_lengths" | ||
|
||
|
||
class IndexProtocol(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add docstrings to the final version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, are we sure we need to use Protocols here? I do understand it is the right thing to do, but historically we learned that OSS ML users are rarely interested in advanced features they need to study. A simple base class with NotImplementedError()
s inside the implementations can be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation detail and not part of the API, and I probably should move these classes to a different file. In general, protocol does not affect the runtime, it is just a static-analysis-type-checking-time thing, so hard to break something even if you don’t understand it.
@@ -33,6 +33,30 @@ def test_iterate_dataset(self): | |||
for i in load_idx: | |||
_ = dataset[i] | |||
|
|||
def test_get_frame_annotations(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test that iterates over a dataset with FullIndex and an equivalent SequenceLenghtIndex and checks that the loaded annotation objects and batches are equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think currently the order will be different due to the way how we sort the index. Do you think it is a problem?
if self.subsets: | ||
if self.subset_lists_file is None: | ||
if self.use_sequence_lengths_index: | ||
if self.remove_empty_masks or self.pick_frames_sql_clause: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a function self.is_filtered() -> bool
- can we use it instead? If not, can we hide this check into a similar a function that returns a bool?
Are we sure we are covering all possible cases for getting out-of-sync with the sequence_lengths table?
"`subsets` is set but `subset_lists_file` is not set. " | ||
+ "Either provide the self.subset_lists_file to load, or " | ||
+ "set self.subsets=None." | ||
"Cannot use these filters with use_sequence_lengths_index." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be more verbose = user-friendly.
"Cannot use remove_empty_masks
and pick_frames_sql_clause
filters with use_sequence_lengths_index
.
@@ -131,8 +368,12 @@ class UCO3DDataset: | |||
remove_empty_masks_poll_whole_table_threshold: int = 300_000 | |||
preload_metadata: bool = False | |||
store_sql_engine: bool = False | |||
# we set it manually in the constructor | |||
# _index: pd.DataFrame = field(init=False) | |||
use_sequence_lengths_index: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring for this switch.
Some edge cases are not supported and 3 tests are failing, but overall this is the idea.