-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure build_query_plan()
cancels when the request cancels
#6840
Conversation
This comment has been minimized.
This comment has been minimized.
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 0 new, 2 changed, 0 removed
Build ID: 6f6b4476b6c27277dc04f455 URL: https://www.apollographql.com/docs/deploy-preview/6f6b4476b6c27277dc04f455 |
@@ -339,6 +354,7 @@ impl<'a: 'b, 'b> QueryPlanningTraversal<'a, 'b> { | |||
)] | |||
fn find_best_plan_inner(&mut self) -> Result<Option<&BestQueryPlanInfo>, FederationError> { | |||
while !self.open_branches.is_empty() { | |||
self.parameters.check_cancellation()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checks like this would need to be added to all loops where the planner can spend significant amount of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think handle_open_branch
is the one. Other functions won't take too long. Even if advance_with_operation_element
created a million options, it returns reasonably quickly. It's the handle_open_branch
that actually visits them. One decision point is how often we want to check (like every 100 or 1000 iterations?). Also, since handle_open_branch
is called recursively, it may check either only at the top-level or always.
/// | ||
/// In this case, a long-running job should try to cancel itself | ||
/// to avoid needless resource consumption. | ||
pub(crate) fn check_for_cooperative_cancellation(&self) -> ControlFlow<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty cool! While tying it to the client request (disconnect/time out) is a great start - if we were to expose this feature as a user specified config value it sounds like it should be as simple as wrapping those QP tasks with a timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be exposing this specifically as a feature. Cancellation at the query planning level should work in the same way as backpressure in other parts of the router. That is, query planning should adhere to the router's existing traffic_shaping.router.timeout
configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that for now this feature should be tied to the request lifecycle. That being said, sometime in the future, I think we should revisit whether it would make sense to further expand on this functionality as there are some potential additional usecases (eg folks might want to still plan the query even though their request timed out with the assumption that be subsequent request would hit the cached plan, similarly we might want to have some limits when doing qp cache prewarming etc).
build_query_plan()
cancels when the request cancels
d051bb8
to
e3bb09b
Compare
e3bb09b
to
ba9877f
Compare
From what I understand, the cancellation check is relatively cheap, so as long as it's not in a tight loop it should be fine to run it. I've added cancellation checks to other potentially problematic loops/iterations in query planner, added a PR description and changelog, and marked this PR as ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is calling check_for_cooperative_cancellation
too often, since It tries access memory that is not in the cpu cache. But, I may be wrong because CPU architecture might have changed since the last time I studied it :)
fragments, | ||
schema, | ||
&Default::default(), | ||
&|| Ok(()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we declare a named thing in place of || Ok(())
, so it's more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. A named function can be used directly like &never_cancel
with:
fn never_cancel() -> Result<(), SingleFederationError> {
Ok(())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I've shifted test files to using &never_cancel
in 083ebc1
Ok, I've done some reading and I'm convinced this won't slowdown our QP. In the future, when we add a timeout in the cancelation check, that may add up some cpu cycles. But, that still seems reasonable. |
&|| { | ||
QueryPlanningParameters::check_cancellation_with( | ||
&options.check_for_cooperative_cancellation, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why we needed the check_cancellation_with
method. Then, I realized normalize_operation
has to be called before QueryPlanningParameters
is constructed. That's unfortunate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. Only a minor suggestion to refactor some obscure closures in tests.
fragments, | ||
schema, | ||
&Default::default(), | ||
&|| Ok(()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. A named function can be used directly like &never_cancel
with:
fn never_cancel() -> Result<(), SingleFederationError> {
Ok(())
}
@@ -3488,6 +3488,7 @@ pub(crate) fn compute_nodes_for_tree( | |||
initial_node_path: FetchDependencyGraphNodePath, | |||
initial_defer_context: DeferContext, | |||
initial_conditions: &OpGraphPathContext, | |||
check_cancellation: &impl Fn() -> Result<(), SingleFederationError>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using impl Fn
here makes the whole method generic and monomorphized separately for every distinct callback that can be passed. This has some compilation time and code size cost.
Assuming that check_cancellation
isn’t called in very hot loop, would it be better to use &dyn Fn()
? (Here and in other places with a similar argument)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be in any hot loops, I've switched it to using &dyn Fn()
in 083ebc1
67fad15
to
083ebc1
Compare
083ebc1
to
4072854
Compare
When a request is cancelled, resources (compute time/memory) allocated to that request should be released. However, query planning is an exception, and currently keeps running after the request is cancelled. This PR adds cooperative cancellation to `build_query_plan()`, to ensure that when the request is cancelled, query planning errors shortly after. Co-authored-by: Sachin D. Shinde <sachin@apollographql.com> (cherry picked from commit 2c554fc) # Conflicts: # apollo-federation/src/query_graph/path_tree.rs # apollo-router/src/compute_job.rs # apollo-router/src/introspection.rs # apollo-router/src/query_planner/query_planner_service.rs # apollo-router/src/services/layers/query_analysis.rs
When a request is cancelled, resources (compute time/memory) allocated to that request should be released. However, query planning is an exception, and currently keeps running after the request is cancelled. This PR adds cooperative cancellation to
build_query_plan()
, to ensure that when the request is cancelled, query planning errors shortly after.