Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure build_query_plan() cancels when the request cancels #6840

Merged
merged 4 commits into from
Mar 3, 2025

Conversation

SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented Feb 20, 2025

When a request is cancelled, resources (compute time/memory) allocated to that request should be released. However, query planning is an exception, and currently keeps running after the request is cancelled. This PR adds cooperative cancellation to build_query_plan(), to ensure that when the request is cancelled, query planning errors shortly after.

This comment has been minimized.

@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Feb 20, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 2 changed, 0 removed
* graphos/reference/upgrade/from-router-v1.mdx
* graphos/routing/(latest)/customization/rhai.mdx

Build ID: 6f6b4476b6c27277dc04f455

URL: https://www.apollographql.com/docs/deploy-preview/6f6b4476b6c27277dc04f455

@@ -339,6 +354,7 @@ impl<'a: 'b, 'b> QueryPlanningTraversal<'a, 'b> {
)]
fn find_best_plan_inner(&mut self) -> Result<Option<&BestQueryPlanInfo>, FederationError> {
while !self.open_branches.is_empty() {
self.parameters.check_cancellation()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checks like this would need to be added to all loops where the planner can spend significant amount of time.

Copy link
Contributor

@duckki duckki Feb 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think handle_open_branch is the one. Other functions won't take too long. Even if advance_with_operation_element created a million options, it returns reasonably quickly. It's the handle_open_branch that actually visits them. One decision point is how often we want to check (like every 100 or 1000 iterations?). Also, since handle_open_branch is called recursively, it may check either only at the top-level or always.

///
/// In this case, a long-running job should try to cancel itself
/// to avoid needless resource consumption.
pub(crate) fn check_for_cooperative_cancellation(&self) -> ControlFlow<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty cool! While tying it to the client request (disconnect/time out) is a great start - if we were to expose this feature as a user specified config value it sounds like it should be as simple as wrapping those QP tasks with a timeout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be exposing this specifically as a feature. Cancellation at the query planning level should work in the same way as backpressure in other parts of the router. That is, query planning should adhere to the router's existing traffic_shaping.router.timeout configuration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that for now this feature should be tied to the request lifecycle. That being said, sometime in the future, I think we should revisit whether it would make sense to further expand on this functionality as there are some potential additional usecases (eg folks might want to still plan the query even though their request timed out with the assumption that be subsequent request would hit the cached plan, similarly we might want to have some limits when doing qp cache prewarming etc).

@sachindshinde sachindshinde changed the title Initial scaffolding for cooperative planner cancellation Ensure build_query_plan() cancels when the request cancels Feb 28, 2025
@sachindshinde sachindshinde force-pushed the simon/coop branch 2 times, most recently from d051bb8 to e3bb09b Compare February 28, 2025 18:53
@sachindshinde sachindshinde marked this pull request as ready for review February 28, 2025 18:53
@sachindshinde sachindshinde requested a review from a team as a code owner February 28, 2025 20:21
@sachindshinde
Copy link
Contributor

sachindshinde commented Feb 28, 2025

From what I understand, the cancellation check is relatively cheap, so as long as it's not in a tight loop it should be fine to run it. I've added cancellation checks to other potentially problematic loops/iterations in query planner, added a PR description and changelog, and marked this PR as ready for review.

@sachindshinde sachindshinde added the backport-1.x Backport this PR to 1.x label Feb 28, 2025
Copy link
Contributor

@duckki duckki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is calling check_for_cooperative_cancellation too often, since It tries access memory that is not in the cpu cache. But, I may be wrong because CPU architecture might have changed since the last time I studied it :)

fragments,
schema,
&Default::default(),
&|| Ok(()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we declare a named thing in place of || Ok(()), so it's more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. A named function can be used directly like &never_cancel with:

fn never_cancel() -> Result<(), SingleFederationError> {
    Ok(())
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've shifted test files to using &never_cancel in 083ebc1

@duckki
Copy link
Contributor

duckki commented Feb 28, 2025

Ok, I've done some reading and I'm convinced this won't slowdown our QP. In the future, when we add a timeout in the cancelation check, that may add up some cpu cycles. But, that still seems reasonable.

&|| {
QueryPlanningParameters::check_cancellation_with(
&options.check_for_cooperative_cancellation,
)
Copy link
Contributor

@duckki duckki Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why we needed the check_cancellation_with method. Then, I realized normalize_operation has to be called before QueryPlanningParameters is constructed. That's unfortunate.

Copy link
Contributor

@duckki duckki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Only a minor suggestion to refactor some obscure closures in tests.

@apollographql apollographql deleted a comment from duckki Mar 3, 2025
fragments,
schema,
&Default::default(),
&|| Ok(()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. A named function can be used directly like &never_cancel with:

fn never_cancel() -> Result<(), SingleFederationError> {
    Ok(())
}

@@ -3488,6 +3488,7 @@ pub(crate) fn compute_nodes_for_tree(
initial_node_path: FetchDependencyGraphNodePath,
initial_defer_context: DeferContext,
initial_conditions: &OpGraphPathContext,
check_cancellation: &impl Fn() -> Result<(), SingleFederationError>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using impl Fn here makes the whole method generic and monomorphized separately for every distinct callback that can be passed. This has some compilation time and code size cost.

Assuming that check_cancellation isn’t called in very hot loop, would it be better to use &dyn Fn()? (Here and in other places with a similar argument)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be in any hot loops, I've switched it to using &dyn Fn() in 083ebc1

@sachindshinde sachindshinde merged commit 2c554fc into dev Mar 3, 2025
15 checks passed
@sachindshinde sachindshinde deleted the simon/coop branch March 3, 2025 22:57
SimonSapin added a commit that referenced this pull request Mar 3, 2025
When a request is cancelled, resources (compute time/memory) allocated to that request should be released. However, query planning is an exception, and currently keeps running after the request is cancelled. This PR adds cooperative cancellation to `build_query_plan()`, to ensure that when the request is cancelled, query planning errors shortly after.

Co-authored-by: Sachin D. Shinde <sachin@apollographql.com>
(cherry picked from commit 2c554fc)

# Conflicts:
#	apollo-federation/src/query_graph/path_tree.rs
#	apollo-router/src/compute_job.rs
#	apollo-router/src/introspection.rs
#	apollo-router/src/query_planner/query_planner_service.rs
#	apollo-router/src/services/layers/query_analysis.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-1.x Backport this PR to 1.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants