-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix scheduler initialisation startup race #4132
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Currently the SubmitCheck.Run loads all the queues from the queue cache + executors from the database, this relies on QueueCache already having the queues loaded As SubmitCheck.Run and QueueCache.Run are called at the same time in separate go routines, sometimes SubmitCheck.Run happens first and can't find the queues - When this happens, it blocks scheduling for 1 minute (until next SubmitCheck executor refresh) I've added an Initialise method to QueueCache so we can call this during component creation, preventing the race described above Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
…s, remove 70s sleep Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
d80tb7
previously approved these changes
Jan 17, 2025
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> # Conflicts: # internal/scheduler/submitcheck.go
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
…esearch/armada into fix_queue_cache_startup_race
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
…esearch/armada into fix_queue_cache_startup_race
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
d80tb7
approved these changes
Feb 3, 2025
richscott
pushed a commit
to richscott/armada
that referenced
this pull request
Feb 5, 2025
* Fix scheduler initialisation startup race Currently the SubmitCheck.Run loads all the queues from the queue cache + executors from the database, this relies on QueueCache already having the queues loaded As SubmitCheck.Run and QueueCache.Run are called at the same time in separate go routines, sometimes SubmitCheck.Run happens first and can't find the queues - When this happens, it blocks scheduling for 1 minute (until next SubmitCheck executor refresh) I've added an Initialise method to QueueCache so we can call this during component creation, preventing the race described above Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Change executor update frequency - test if this fixes CI * Fix ci Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Make sure queue cache and submit check are initialise on startup Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Revert config changes Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Respect JobLeaseRequestTimeout, reduce CI JobLeaseRequestTimeout to 5s, remove 70s sleep Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Format Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Set scheduler depending on server Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Wait for postgres to start running Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Remove server dependency Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Improve startup tests Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Fix func calls Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Move to separate funcs for checking running/ready Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Add timeouts + log errors on startup rather than exit Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Pass in timeout Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> * Set longer timeout Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> --------- Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com> Signed-off-by: Rich Scott <richscott@sent.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add an Initialise() func to SubmitCheck and QueueCache, which are called at startup to prevent a race that can happen where SubmitCheck tries to get the queues but they haven't been loaded yet
The Initialise calls don't block startup if they fail, as that the current behaviour
Also made some tweaks the the CI - to better ensure components are up/ready before we start running the integration tests