[WIP] Sequence Length Index to speed up initialisation of the data loader #3

shapovalov · 2024-12-30T16:13:04Z

Some edge cases are not supported and 3 tests are failing, but overall this is the idea.

shapovalov · 2025-01-02T11:39:15Z

@davnov134 Please review.

davnov134 · 2025-01-09T09:45:01Z

uco3d/uco3d_dataset.py

+_SEQUENCE_LENGTHS_TABLE: str = "sequence_lengths"
+
+
+class IndexProtocol(Protocol):


Let's add docstrings to the final version.

Also, are we sure we need to use Protocols here? I understand it is the right thing to do, but we have learned over time that OSS ML users are rarely interested in advanced features they have to study first. A simple base class whose methods raise NotImplementedError() can be enough.

shapovalov · 2025-01-14

This is an implementation detail and not part of the API, and I should probably move these classes to a different file. In general, a Protocol does not affect the runtime; it only matters at static type-checking time, so it is hard to break something even if you don't understand it.
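
For reference, a minimal sketch of the two options discussed in this thread. All class and function names here (IndexLike, IndexBase, ListIndex, count_rows) are hypothetical and are not taken from the PR; the sketch only illustrates that a Protocol annotation is not enforced at runtime, while a base class fails when an unimplemented method is called.

```python
from typing import Protocol


class IndexLike(Protocol):
    """Hypothetical structural interface; only static type checkers look at it."""

    def __len__(self) -> int: ...


class IndexBase:
    """Hypothetical nominal alternative: unimplemented methods fail when called."""

    def __len__(self) -> int:
        raise NotImplementedError()


class ListIndex:
    """Satisfies IndexLike without inheriting from it."""

    def __init__(self, rows: list) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)


def count_rows(index: IndexLike) -> int:
    # The annotation is not checked at runtime; any object with __len__ works.
    return len(index)
```

With the Protocol, ListIndex needs no base class; with IndexBase, a subclass that forgets to override __len__ raises NotImplementedError on first use.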

davnov134 · 2025-01-09T09:48:37Z

tests/test_dataloader.py

@@ -33,6 +33,30 @@ def test_iterate_dataset(self):
        for i in load_idx:
            _ = dataset[i]

+    def test_get_frame_annotations(self):


Please add a test that iterates over a dataset with FullIndex and over an equivalent one with SequenceLengthIndex, and checks that the loaded annotation objects and batches are equivalent.

shapovalov · 2025-01-14

I think the order will currently differ because of how we sort the index. Do you think that is a problem?
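
If the differing order turns out to be acceptable, the requested test could compare the two indices without depending on order. A minimal sketch, assuming each entry can be reduced to a hashable key; the (sequence_name, frame_number) tuples and the helper name are hypothetical stand-ins, not the dataset's actual API:

```python
from collections import Counter
from typing import Hashable, Iterable


def same_items_any_order(a: Iterable[Hashable], b: Iterable[Hashable]) -> bool:
    """True if both iterables hold the same items with the same multiplicity."""
    return Counter(a) == Counter(b)


# Hypothetical (sequence_name, frame_number) keys listed in two different orders.
full_index_keys = [("seq_b", 0), ("seq_a", 0), ("seq_a", 1)]
length_index_keys = [("seq_a", 0), ("seq_a", 1), ("seq_b", 0)]
```

A stricter variant would sort both key lists first and then compare the loaded annotations pairwise.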

davnov134 · 2025-01-09T09:51:01Z

uco3d/uco3d_dataset.py

-        if self.subsets:
-            if self.subset_lists_file is None:
+        if self.use_sequence_lengths_index:
+            if self.remove_empty_masks or self.pick_frames_sql_clause:


There's a function self.is_filtered() -> bool. Can we use it instead? If not, can we hide this check in a similar function that returns a bool?

Are we sure we cover all the cases in which the index can get out of sync with the sequence_lengths table?
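
A minimal sketch of what such a bool-returning helper might look like. DatasetOptions and has_index_incompatible_filters are hypothetical names; only the two fields checked in the diff above are modelled, so any other way of getting out of sync is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetOptions:
    """Hypothetical stand-in holding only the two fields from the diff."""

    remove_empty_masks: bool = False
    pick_frames_sql_clause: Optional[str] = None

    def has_index_incompatible_filters(self) -> bool:
        # Hypothetical helper: True when any filter checked in the diff is set.
        return bool(self.remove_empty_masks or self.pick_frames_sql_clause)
```

Keeping the condition in one method means a newly added filter only has to be registered in one place.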

davnov134 · 2025-01-09T09:55:35Z

uco3d/uco3d_dataset.py

-                    "`subsets` is set but `subset_lists_file` is not set. "
-                    + "Either provide the self.subset_lists_file to load, or "
-                    + "set self.subsets=None."
+                    "Cannot use these filters with use_sequence_lengths_index."


Let's be more verbose, which is more user-friendly:

"Cannot use remove_empty_masks and pick_frames_sql_clause filters with use_sequence_lengths_index."

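One possible way to go further is to name only the filters that are actually set and to suggest a fix. This is a sketch under assumptions: the function name and the exact wording are hypothetical, and only the option names come from the diff.

```python
from typing import Optional


def sequence_lengths_index_error(
    remove_empty_masks: bool, pick_frames_sql_clause: Optional[str]
) -> Optional[str]:
    """Return a descriptive message, or None when no conflicting filter is set."""
    active = []
    if remove_empty_masks:
        active.append("remove_empty_masks")
    if pick_frames_sql_clause:
        active.append("pick_frames_sql_clause")
    if not active:
        return None
    return (
        f"Cannot use {' and '.join(active)} with use_sequence_lengths_index=True. "
        "Either unset the listed filters or set use_sequence_lengths_index=False."
    )
```

The caller would raise a ValueError with this message whenever the function returns a string.
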
davnov134 · 2025-01-09T10:01:15Z

uco3d/uco3d_dataset.py

@@ -131,8 +368,12 @@ class UCO3DDataset:
    remove_empty_masks_poll_whole_table_threshold: int = 300_000
    preload_metadata: bool = False
    store_sql_engine: bool = False
-    # we set it manually in the constructor
-    # _index: pd.DataFrame = field(init=False)
+    use_sequence_lengths_index: bool = True


Add a docstring for this switch.
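
A sketch of how the switch could be documented in the class docstring. DatasetConfig is a hypothetical stand-in for the dataset class, and the wording is an assumption pieced together from the PR title and the filter check above; the authors would need to confirm it.

```python
from dataclasses import dataclass


@dataclass
class DatasetConfig:
    """Hypothetical stand-in for the dataset class.

    Args:
        use_sequence_lengths_index: If True, use the sequence-length index
            to speed up initialisation of the data loader. Assumed wording:
            cannot be combined with `remove_empty_masks` or
            `pick_frames_sql_clause`.
    """

    use_sequence_lengths_index: bool = True
```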

shapovalov added 2 commits December 30, 2024 13:29

Refactor: Splitting out IndexProtocol

2f47829

SequenceLengthIndex implementation

751da08

shapovalov requested a review from davnov134 December 30, 2024 16:13

facebook-github-bot added the CLA Signed label Dec 30, 2024

shapovalov added 3 commits December 31, 2024 16:06

Tests fixed.

5146946

Subsets fixed.

f960e56

Bug fixes, supporting get_frame_annotations.

f1012c0

davnov134 reviewed Jan 9, 2025

shapovalov replied to review comments Jan 14, 2025
