-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce RemoteIndexClient #2548
base: main
Are you sure you want to change the base?
Introduce RemoteIndexClient #2548
Conversation
57290c1
to
23faf72
Compare
public static final long BASE_DELAY_MS = 100; | ||
public static final long INITIAL_DELAY_MS = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are currently set low for testing, though we haven't thought on strong defaults for these anyway.
23faf72
to
9bb2d92
Compare
long pollInterval = ((TimeValue) (KNNSettings.state().getSettingValue(KNNSettings.KNN_REMOTE_BUILD_SERVICE_POLL_INTERVAL))) | ||
.getMillis(); | ||
|
||
Thread.sleep(INITIAL_DELAY_MS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to look into this some more, but I'm not sure this is the best implementation. Off the top of my head I am thinking of some sort of job scheduler semantic with a latch and we can do latch.await() here or something like that. Also it seems a little dangerous to do this in a Singleton class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, shouldn't INITIAL_DELAY_MS
just be pollInterval
? or maybe some function of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the two as related. I was thinking of the INITIAL_DELAY_MS
as a buffer to prevent the client from sending any status requests when we know for sure that the build is still in progress. For example finding a lower bound t
on the GPU build or calculating this based on workload size, then waiting 0.5t
so to not send any unnecessary status checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on it some more I think Thread.sleep
is fine, but let's make sure we have tests for it across concurrent jobs.
For the other point, the polling interval itself should also be a function of t
else it's also going to send unnecessary status checks.
* @param jobId to check | ||
* @return HttpExecuteResponse for the status request | ||
*/ | ||
public String getBuildStatus(String jobId) throws IOException, URISyntaxException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a lot of these methods can be private or protected
Add RemoteIndexClient initial implementation, its accompanying dependencies, and Build Request, Retry Strategy, and test files Signed-off-by: owenhalpert <ohalpert@gmail.com>
9bb2d92
to
907c6be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @owenhalpert!
At a high level I think we should separate out the client changes into the following individual PRs:
- Client skeleton + build request, including polling logic + tests, failure handling
- Polling + failure handling
- Metrics
I think there is a lot more testing that needs to be written for all of these and I think it will be much easier to focus on each component individually.
@@ -88,12 +89,12 @@ public void buildAndWriteIndex(BuildIndexParams indexInfo) throws IOException { | |||
log.debug("Repository write took {} ms for vector field [{}]", time_in_millis, indexInfo.getFieldName()); | |||
|
|||
stopWatch = new StopWatch().start(); | |||
submitVectorBuild(); | |||
String jobId = RemoteIndexClient.getInstance().submitVectorBuild(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes sense to build the requests out here and then pass the request itself into RemoteIndexClient
, else you may need to pass a lot of other parameters through too.
/** | ||
* Initialize the httpClient to be used | ||
* @return The HTTP Client | ||
*/ | ||
private CloseableHttpClient createHttpClient() { | ||
return HttpClients.custom() | ||
.setRetryStrategy(new RemoteIndexClientRetryStrategy()) | ||
.setConnectionBackoffStrategy(new DefaultBackoffStrategy()) | ||
.build(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to recreate this client if/when any of the retry strategy settings change right? Let's add a TODO for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to implement it such that the 5 settings (endpoint, timeout, poll interval, username, password) were fetched when needed so they could be dynamically updated. We don't have a user setting for MAX_RETRY
but you are right that the client would need to be rebuilt if we needed to update the retry interval, max retries, or retriable status codes.
// To be used for round-robin task assignment to know which endpoint accepted the given job. | ||
private final Map<String, String> jobEndpoints = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, there's no need for job A to have access to this and be able to look up the endpoints of all the other jobs, that seems like it's unnecessarily leaking information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could pass the working endpoint back to RemoteIndexBuildStrategy (along with the job ID) and have it awaitVectorBuild(acceptedEndpoint, jobID)
and eliminate the need for this map.
int i = 0; | ||
while (i < endpoints.size()) { | ||
HttpPost buildRequest = constructBuildRequest(URI.create(endpoints.get(i))); | ||
authenticateRequest(buildRequest); | ||
String response = httpClient.execute(buildRequest, body -> { | ||
if (body.getCode() == 507) { | ||
return null; | ||
} | ||
if (body.getCode() != 200) { | ||
throw new IOException("Failed to submit build request after retries with code: " + body.getCode()); | ||
} | ||
return EntityUtils.toString(body.getEntity()); | ||
}); | ||
if (response != null) { | ||
String jobId = getValueFromResponse(response, "job_id"); | ||
jobEndpoints.put(jobId, endpoints.get(i)); | ||
return jobId; | ||
} | ||
i++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's abstract out the round robin aspect into a getEndpoint
method so we can test it more specifically.
Also since i
is initialized within this method, won't we always start with endpoint 0
? It seems to me if we want to perform round robin across requests then this class needs to track the the list of endpoints in some sort of circular queue, otherwise it's basically just a list of backup endpoints.
long pollInterval = ((TimeValue) (KNNSettings.state().getSettingValue(KNNSettings.KNN_REMOTE_BUILD_SERVICE_POLL_INTERVAL))) | ||
.getMillis(); | ||
|
||
Thread.sleep(INITIAL_DELAY_MS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on it some more I think Thread.sleep
is fine, but let's make sure we have tests for it across concurrent jobs.
For the other point, the polling interval itself should also be a function of t
else it's also going to send unnecessary status checks.
Description
First PR for #2518
This PR introduces the initial implementation of the Remote Index client, which is awaiting implementation on the following:
Related Issues
#2391 Meta Issue
#2393 HLD by @jed326
#2518 My LLD
Check List
- [ ] New functionality has been documented.- [ ] API changes companion pull request created.--signoff
.- [ ] Public documentation issue/PR created.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.