Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Sequence Length Index to speed up initialisation of the data loader #3

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

shapovalov
Copy link
Contributor

Some edge cases are not supported and 3 tests are failing, but overall this is the idea.

@shapovalov shapovalov requested a review from davnov134 December 30, 2024 16:13
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 30, 2024
@shapovalov
Copy link
Contributor Author

@davnov134 Please review.

_SEQUENCE_LENGTHS_TABLE: str = "sequence_lengths"


class IndexProtocol(Protocol):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add docstrings to the final version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are we sure we need to use Protocols here? I do understand it is the right thing to do, but historically we learned that OSS ML users are rarely interested in advanced features they need to study. A simple base class with NotImplementedError()s inside the implementations can be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an implementation detail and not part of the API, and I probably should move these classes to a different file. In general, protocol does not affect the runtime, it is just a static-analysis-type-checking-time thing, so hard to break something even if you don’t understand it.

@@ -33,6 +33,30 @@ def test_iterate_dataset(self):
for i in load_idx:
_ = dataset[i]

def test_get_frame_annotations(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test that iterates over a dataset with FullIndex and an equivalent SequenceLenghtIndex and checks that the loaded annotation objects and batches are equivalent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently the order will be different due to the way how we sort the index. Do you think it is a problem?

if self.subsets:
if self.subset_lists_file is None:
if self.use_sequence_lengths_index:
if self.remove_empty_masks or self.pick_frames_sql_clause:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a function self.is_filtered() -> bool - can we use it instead? If not, can we hide this check into a similar a function that returns a bool?

Are we sure we are covering all possible cases for getting out-of-sync with the sequence_lengths table?

"`subsets` is set but `subset_lists_file` is not set. "
+ "Either provide the self.subset_lists_file to load, or "
+ "set self.subsets=None."
"Cannot use these filters with use_sequence_lengths_index."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more verbose = user-friendly.

"Cannot use remove_empty_masks and pick_frames_sql_clause filters with use_sequence_lengths_index.

@@ -131,8 +368,12 @@ class UCO3DDataset:
remove_empty_masks_poll_whole_table_threshold: int = 300_000
preload_metadata: bool = False
store_sql_engine: bool = False
# we set it manually in the constructor
# _index: pd.DataFrame = field(init=False)
use_sequence_lengths_index: bool = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring for this switch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants