Venice Push Job Diagram as a refresher:

Problem Statement
There is no timeout on the Compute Engine `DataWriterComputeJob`. As a result, some push jobs take days, and sometimes even weeks, to complete, which significantly delays debugging. There is an existing timeout (`bootstrapToOnlineTimeoutInHours`) on the job status polling, but polling takes place after the data writer job portion, which is where we've been witnessing significant delays. With the increased frequency of KIF repushes, this issue will become even more pronounced.

Solution
There should be a timeout on the push job as a whole. If it is exceeded, the push job should be cancelled. The existing configuration `bootstrapToOnlineTimeoutInHours` can be repurposed for this, which should have been the original purpose of the configuration.

Concurrency-Specific Checks
Both reviewer and PR author to verify
`ConcurrentHashMap`, `CopyOnWriteArrayList`).

How was this PR tested?
- `testPushJobTimeout()`, which tests that the job status polling gets cancelled.
- `testDataWriterComputeJobTimeout()`, which tests that the data writer job gets killed.
- `testPushJobPollStatus()` and `testPushJobUnknownPollStatusDoesWaiting()`, which pertained to the old usage of the `bootstrapToOnlineTimeoutInHours` configuration.

Does this PR introduce any user-facing or breaking changes?