Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce RemoteIndexClient #2548

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

owenhalpert
Copy link
Contributor

@owenhalpert owenhalpert commented Feb 20, 2025

Description

First PR for #2518

This PR introduces the initial implementation of the Remote Index client, which is awaiting implementation on the following:

Related Issues

#2391 Meta Issue
#2393 HLD by @jed326
#2518 My LLD

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@owenhalpert owenhalpert force-pushed the owen-remote-vector-dev branch from 57290c1 to 23faf72 Compare February 20, 2025 21:14
Comment on lines 46 to 47
public static final long BASE_DELAY_MS = 100;
public static final long INITIAL_DELAY_MS = 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are currently set low for testing, though we haven't thought on strong defaults for these anyway.

@owenhalpert owenhalpert force-pushed the owen-remote-vector-dev branch from 23faf72 to 9bb2d92 Compare February 20, 2025 22:54
long pollInterval = ((TimeValue) (KNNSettings.state().getSettingValue(KNNSettings.KNN_REMOTE_BUILD_SERVICE_POLL_INTERVAL)))
.getMillis();

Thread.sleep(INITIAL_DELAY_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to look into this some more, but I'm not sure this is the best implementation. Off the top of my head I am thinking of some sort of job scheduler semantic with a latch and we can do latch.await() here or something like that. Also it seems a little dangerous to do this in a Singleton class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, shouldn't INITIAL_DELAY_MS just be pollInterval? or maybe some function of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the two as related. I was thinking of the INITIAL_DELAY_MS as a buffer to prevent the client from sending any status requests when we know for sure that the build is still in progress. For example finding a lower bound t on the GPU build or calculating this based on workload size, then waiting 0.5t so to not send any unnecessary status checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking on it some more I think Thread.sleep is fine, but let's make sure we have tests for it across concurrent jobs.

For the other point, the polling interval itself should also be a function of t else it's also going to send unnecessary status checks.

* @param jobId to check
* @return HttpExecuteResponse for the status request
*/
public String getBuildStatus(String jobId) throws IOException, URISyntaxException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of these methods can be private or protected

Add RemoteIndexClient initial implementation, its accompanying dependencies, and Build Request, Retry Strategy, and test files

Signed-off-by: owenhalpert <ohalpert@gmail.com>
@owenhalpert owenhalpert force-pushed the owen-remote-vector-dev branch from 9bb2d92 to 907c6be Compare February 21, 2025 16:56
Copy link
Contributor

@jed326 jed326 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @owenhalpert!

At a high level I think we should separate out the client changes into the following individual PRs:

  1. Client skeleton + build request, including polling logic + tests, failure handling
  2. Polling + failure handling
  3. Metrics

I think there is a lot more testing that needs to be written for all of these and I think it will be much easier to focus on each component individually.

@@ -88,12 +89,12 @@ public void buildAndWriteIndex(BuildIndexParams indexInfo) throws IOException {
log.debug("Repository write took {} ms for vector field [{}]", time_in_millis, indexInfo.getFieldName());

stopWatch = new StopWatch().start();
submitVectorBuild();
String jobId = RemoteIndexClient.getInstance().submitVectorBuild();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense to build the requests out here and then pass the request itself into RemoteIndexClient, else you may need to pass a lot of other parameters through too.

Comment on lines +69 to +78
/**
* Initialize the httpClient to be used
* @return The HTTP Client
*/
private CloseableHttpClient createHttpClient() {
return HttpClients.custom()
.setRetryStrategy(new RemoteIndexClientRetryStrategy())
.setConnectionBackoffStrategy(new DefaultBackoffStrategy())
.build();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to recreate this client if/when any of the retry strategy settings change right? Let's add a TODO for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to implement it such that the 5 settings (endpoint, timeout, poll interval, username, password) were fetched when needed so they could be dynamically updated. We don't have a user setting for MAX_RETRY but you are right that the client would need to be rebuilt if we needed to update the retry interval, max retries, or retriable status codes.

Comment on lines +51 to +52
// To be used for round-robin task assignment to know which endpoint accepted the given job.
private final Map<String, String> jobEndpoints = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this, there's no need for job A to have access to this and be able to look up the endpoints of all the other jobs, that seems like it's unnecessarily leaking information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could pass the working endpoint back to RemoteIndexBuildStrategy (along with the job ID) and have it awaitVectorBuild(acceptedEndpoint, jobID) and eliminate the need for this map.

Comment on lines +91 to +110
int i = 0;
while (i < endpoints.size()) {
HttpPost buildRequest = constructBuildRequest(URI.create(endpoints.get(i)));
authenticateRequest(buildRequest);
String response = httpClient.execute(buildRequest, body -> {
if (body.getCode() == 507) {
return null;
}
if (body.getCode() != 200) {
throw new IOException("Failed to submit build request after retries with code: " + body.getCode());
}
return EntityUtils.toString(body.getEntity());
});
if (response != null) {
String jobId = getValueFromResponse(response, "job_id");
jobEndpoints.put(jobId, endpoints.get(i));
return jobId;
}
i++;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's abstract out the round robin aspect into a getEndpoint method so we can test it more specifically.

Also since i is initialized within this method, won't we always start with endpoint 0? It seems to me if we want to perform round robin across requests then this class needs to track the the list of endpoints in some sort of circular queue, otherwise it's basically just a list of backup endpoints.

long pollInterval = ((TimeValue) (KNNSettings.state().getSettingValue(KNNSettings.KNN_REMOTE_BUILD_SERVICE_POLL_INTERVAL)))
.getMillis();

Thread.sleep(INITIAL_DELAY_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking on it some more I think Thread.sleep is fine, but let's make sure we have tests for it across concurrent jobs.

For the other point, the polling interval itself should also be a function of t else it's also going to send unnecessary status checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants