Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Vector Index Build] Add download + indexOuput#write implementation to RemoteIndexBuildStrategy #2554

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jed326
Copy link
Contributor

@jed326 jed326 commented Feb 24, 2025

Description

As a follow-up to #2550, this PR includes the implementation + testing for downloading a graph file from the remote repository and then writing to an IndexOutput.

The key change here is that in order to manage memory consumption, we use a buffer of size 64kb, the same as used in the IndexOutputWithBuffer class, to buffer the graph file from the remote repository into the IndexOutput, else we will go OOM on the download of larger graph files.

For now, we are only using sequential download for the graph file. Parallel download of the graph file has several difficulties associated with it, both on the download side as well as on the writing to IndexOutput side. For more details on these, see: #2464. We take on parallel blob download as a future improvement.

Related Issues

Relates #2465
Relates #2392
Relates #2391

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jed326 jed326 force-pushed the remote-vector-dev-2 branch from 3674aa2 to 6d950eb Compare February 24, 2025 22:27
@@ -93,12 +100,12 @@ public void buildAndWriteIndex(BuildIndexParams indexInfo) throws IOException {
log.debug("Submit vector build took {} ms for vector field [{}]", time_in_millis, indexInfo.getFieldName());

stopWatch = new StopWatch().start();
awaitVectorBuild();
BlobPath downloadPath = awaitVectorBuild();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume that awaitVectorBuild returns a BlobPath constructed from the response the remote index build client gave. This may change in future PRs, see: #2548

@jed326 jed326 force-pushed the remote-vector-dev-2 branch 2 times, most recently from bc0f66a to c945c88 Compare February 25, 2025 00:02
BlobContainer blobContainer = getRepository().blobStore().blobContainer(downloadPath.parent());
// TODO: We are using the sequential download API as multi-part parallel download is difficult for us to implement today and
// requires some changes in core. For more details, see: https://github.com/opensearch-project/k-NN/issues/2464
String fileName = downloadPath.toArray()[downloadPath.toArray().length - 1];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a place that can cause NPE, can we ensure that length checks and other things to ensure that this line is not throwing NPEs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that I'm trying to not take the file path resolution as part of this PR as whether we get it as a BlobPath or String or any other format is still tbd by the client implementation. Let me add some assertions / checks here for now though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So BlobPath is not supposed to be used to represent a specific file, only a given directory. So in latest revision I'm going to take in a String and do the BlobPath creation manually.

…ategy

Signed-off-by: Jay Deng <jayd0104@gmail.com>
@jed326 jed326 force-pushed the remote-vector-dev-2 branch from c945c88 to 6b6c3b1 Compare February 25, 2025 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants