fail fast LeaderCheck on CoordinationStateRejectedException #17400

Open: wants to merge 2 commits into base: main

@anuragrai16 anuragrai16 commented Feb 20, 2025

Description

This PR makes the leader check fail fast when it receives a CoordinationStateRejectedException. See the related issue for more details.

Related Issues

Resolves #17155

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for c60fc77: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
@anuragrai16 anuragrai16 force-pushed the fail-fast-leader-check branch from 72452c7 to 9292aa3 Compare February 20, 2025 15:47
❌ Gradle check result for 9292aa3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +316 to +319
    } else if (exp.getCause() instanceof CoordinationStateRejectedException) {
        logger.debug(new ParameterizedMessage("leader [{}] rejected coordination state", leader), exp);
        leaderFailed(new CoordinationStateRejectedException("node [" + leader + "] rejected coordination state", exp));
        return;

It's quite possible this check would succeed on the next liveness check interval. Could failing fast turn out to be disruptive?
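
To make the trade-off raised here concrete, the following is a minimal, hypothetical sketch and not code from this PR or from OpenSearch. It models a follower that tolerates a fixed number of consecutive failed leader checks, plus a separate tolerance for consecutive rejections. Every name in it (`LeaderCheckPolicy`, `Failure`, `Action`, `rejectionTolerance`) is invented for illustration, and the retry count of 3 in `main` is an assumed value. A `rejectionTolerance` of 1 corresponds to the fail-fast behavior proposed in the diff; a larger value would let a rejection clear on the next check interval before the leader is declared failed.

```java
// Hypothetical sketch: none of these types or names exist in OpenSearch.
final class LeaderCheckPolicy {
    /** Kinds of leader-check failure this sketch distinguishes. */
    enum Failure { TIMEOUT, REJECTED }

    /** What the follower does after observing a check result. */
    enum Action { KEEP_LEADER, FAIL_LEADER }

    private final int retryCount;         // consecutive failures of any kind tolerated
    private final int rejectionTolerance; // consecutive rejections tolerated; 1 = fail fast
    private int consecutiveFailures;
    private int consecutiveRejections;

    LeaderCheckPolicy(int retryCount, int rejectionTolerance) {
        if (retryCount < 1 || rejectionTolerance < 1) {
            throw new IllegalArgumentException("both thresholds must be at least 1");
        }
        this.retryCount = retryCount;
        this.rejectionTolerance = rejectionTolerance;
    }

    /** A successful check clears both counters. */
    Action onSuccess() {
        consecutiveFailures = 0;
        consecutiveRejections = 0;
        return Action.KEEP_LEADER;
    }

    /** A failed check bumps the counters and fails the leader once either threshold is hit. */
    Action onFailure(Failure failure) {
        consecutiveFailures++;
        if (failure == Failure.REJECTED) {
            consecutiveRejections++;
        } else {
            consecutiveRejections = 0; // a non-rejection failure breaks the rejection streak
        }
        boolean giveUp = consecutiveRejections >= rejectionTolerance
            || consecutiveFailures >= retryCount;
        return giveUp ? Action.FAIL_LEADER : Action.KEEP_LEADER;
    }

    public static void main(String[] args) {
        // Fail fast: the first rejection fails the leader.
        LeaderCheckPolicy failFast = new LeaderCheckPolicy(3, 1);
        System.out.println(failFast.onFailure(Failure.REJECTED));

        // Tolerant: one rejection is forgiven if the next check succeeds.
        LeaderCheckPolicy tolerant = new LeaderCheckPolicy(3, 2);
        System.out.println(tolerant.onFailure(Failure.REJECTED));
        System.out.println(tolerant.onSuccess());
    }
}
```

With a tolerance of 2, a single transient rejection followed by a successful check leaves the leader in place, at the cost of one extra check interval before a persistent rejection is acted on.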

Labels: bug, Cluster Manager
Successfully merging this pull request may close these issues.

[BUG] Node disconnection for long duration due to Encrypting network mesh during Mesh deployment