Support out-of-core and distributed IVF_PQ ingestion #531

Merged: 31 commits merged into main on Oct 17, 2024


@jparismorgan jparismorgan commented Sep 23, 2024

What

This PR adds support for out-of-core and distributed ingestion in IVF_PQ by updating the IVF_PQ API. The general design is to reuse the IVF_FLAT code path, but call these IVF_PQ C++ functions:

  • create()
  • train()
  • ingest_parts()
  • consolidate_partitions()

Note that we also use create_temp_data_group() to support re-ingestion with a new temp data group.

In the future we may also want to move compute_partition_indexes() into C++, but for now we leave it in Python.

This approach still supports C++-only use of the index. To make that easier, we also support ingest(), so that a user only needs to call create(), train(), and then ingest().
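To illustrate the two call sequences described above, here is a minimal sketch in Python. `ToyIndex` is a hypothetical stand-in, not the real IVF_PQ class: only the method names come from this PR, while the signatures, the in-memory lists, and the idea that ingest() simply wraps ingest_parts() and consolidate_partitions() are assumptions made for illustration.

```python
from typing import List


class ToyIndex:
    """Hypothetical stand-in for the index; NOT the real IVF_PQ API."""

    def __init__(self) -> None:
        self.calls: List[str] = []      # records the order of API calls
        self.trained = False
        self.parts: List[List[int]] = []  # ingested but not yet consolidated
        self.vectors: List[int] = []      # consolidated data

    @classmethod
    def create(cls) -> "ToyIndex":
        index = cls()
        index.calls.append("create")
        return index

    def train(self, sample: List[int]) -> None:
        # A real index would learn centroids here; the toy only sets a flag.
        self.calls.append("train")
        self.trained = True

    def ingest_parts(self, part: List[int]) -> None:
        # Called once per chunk, so the full dataset never has to be in memory.
        if not self.trained:
            raise RuntimeError("train() must be called before ingesting")
        self.calls.append("ingest_parts")
        self.parts.append(list(part))

    def consolidate_partitions(self) -> None:
        # Merge the separately ingested parts into the final index data.
        self.calls.append("consolidate_partitions")
        for part in self.parts:
            self.vectors.extend(part)
        self.parts.clear()

    def ingest(self, vectors: List[int]) -> None:
        # Assumed convenience path: one part, then consolidate.
        self.ingest_parts(vectors)
        self.consolidate_partitions()


# Chunked (out-of-core style) path: several parts, one consolidation.
chunked = ToyIndex.create()
chunked.train([1, 2, 3])
for part in ([1, 2], [3, 4]):
    chunked.ingest_parts(part)
chunked.consolidate_partitions()

# Convenience path: create(), train(), then ingest().
simple = ToyIndex.create()
simple.train([1, 2, 3])
simple.ingest([1, 2, 3, 4])
```

In this toy model both paths end with the same consolidated data; the chunked path just gives the caller control over when each part is written.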

Testing

  • Adds new C++ tests.
  • Existing Python tests pass.

@jparismorgan jparismorgan marked this pull request as ready for review October 17, 2024 02:17
+ "".join(random.choices(string.ascii_letters, k=10))
)
if index_type == "IVF_PQ":
PARTIAL_WRITE_ARRAY_DIR = storage_formats[storage_version][
Collaborator


Why not follow the same dir structure? The random number part is required in order to support parallel or failed ingestions.

Contributor Author


Ah, I did not realize it was for parallel ingestions; I thought it was just for failed ingestions. So I added logic to delete the group before use in C++, because I didn't want Python and C++ to have to communicate about the name of the dir. But I'll update to follow the pattern in this follow-up PR: #554
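For context on the random-suffix pattern discussed in this thread, here is a minimal sketch of how a per-run directory name could be built. The `"".join(random.choices(string.ascii_letters, k=10))` expression comes from the diff above; the helper name `temp_dir_name`, the base name, and the underscore separator are hypothetical and chosen only for illustration.

```python
import random
import string


def temp_dir_name(base: str, k: int = 10) -> str:
    """Return `base` plus a random suffix of k ASCII letters.

    Hypothetical helper: a fresh suffix per ingestion means that parallel
    runs, or a rerun after a failed ingestion, do not reuse the same
    temporary directory.
    """
    suffix = "".join(random.choices(string.ascii_letters, k=k))
    return f"{base}_{suffix}"


# Two ingestions started from the same base name get separate directories
# (barring a random collision, which is very unlikely at k=10).
first = temp_dir_name("temp_data")
second = temp_dir_name("temp_data")
```

The tradeoff raised in the thread is that a random name has to be communicated between Python and C++, whereas a fixed name does not but must be deleted before reuse.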

@jparismorgan jparismorgan changed the title Support OOC in IVF_PQ ingestion Support out-of-core and distributed IVF_PQ ingestion Oct 17, 2024
@jparismorgan jparismorgan merged commit eaf1a5f into main Oct 17, 2024
6 checks passed
@jparismorgan jparismorgan deleted the jparismorgan/ivf-pq-ooc-rewrite branch October 17, 2024 17:56