Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix scheduler initialisation startup race #4132

Merged
merged 27 commits into from
Feb 4, 2025

Conversation

JamesMurkin
Copy link
Contributor

@JamesMurkin JamesMurkin commented Jan 10, 2025

Add an Initialise() func to SubmitCheck and QueueCache, which are called at startup to prevent a race that can happen where SubmitCheck tries to get the queues but they haven't been loaded yet

  • This caused an annoying 1minute delay for the scheduler to actually work, which is poor UX when debugging locally

The Initialise calls don't block startup if they fail, as that the current behaviour

  • Ideally we'd change this, but blocking the scheduler starting up means the ExecutorAPI is down. As the ExecutorAPI doesn't rely on queues, it'd be overall worse to prevent it being available because queues are unavailable. At some point we may split these components, meaning this issue goes away an we can block scheduler startup.

Also made some tweaks the the CI - to better ensure components are up/ready before we start running the integration tests

Currently the SubmitCheck.Run loads all the queues from the queue cache + executors from the database, this relies on QueueCache already having the queues loaded

As SubmitCheck.Run and QueueCache.Run are called at the same time in separate go routines, sometimes SubmitCheck.Run happens first and can't find the queues
 - When this happens, it blocks scheduling for 1 minute (until next SubmitCheck executor refresh)

I've added an Initialise method to QueueCache so we can call this during component creation, preventing the race described above

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
…s, remove 70s sleep

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
d80tb7
d80tb7 previously approved these changes Jan 17, 2025
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

# Conflicts:
#	internal/scheduler/submitcheck.go
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
@JamesMurkin JamesMurkin enabled auto-merge (squash) February 4, 2025 17:39
@JamesMurkin JamesMurkin merged commit 7c96fa0 into master Feb 4, 2025
20 checks passed
@JamesMurkin JamesMurkin deleted the fix_queue_cache_startup_race branch February 4, 2025 17:45
richscott pushed a commit to richscott/armada that referenced this pull request Feb 5, 2025
* Fix scheduler initialisation startup race

Currently the SubmitCheck.Run loads all the queues from the queue cache + executors from the database, this relies on QueueCache already having the queues loaded

As SubmitCheck.Run and QueueCache.Run are called at the same time in separate go routines, sometimes SubmitCheck.Run happens first and can't find the queues
 - When this happens, it blocks scheduling for 1 minute (until next SubmitCheck executor refresh)

I've added an Initialise method to QueueCache so we can call this during component creation, preventing the race described above

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Change executor update frequency - test if this fixes CI

* Fix ci

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Make sure queue cache and submit check are initialise on startup

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Revert config changes

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Respect JobLeaseRequestTimeout, reduce CI JobLeaseRequestTimeout to 5s, remove 70s sleep

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Format

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Set scheduler depending on server

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Wait for postgres to start running

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Remove server dependency

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Improve startup tests

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Fix func calls

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Move to separate funcs for checking running/ready

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Add timeouts + log errors on startup rather than exit

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Pass in timeout

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

* Set longer timeout

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>

---------

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
Signed-off-by: Rich Scott <richscott@sent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants