Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: Added new session option to set timeout for session creation #16385

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AngstyDuck
Copy link
Contributor

When communicating via gRPC, etcd sets the gRPC option WaitForReady = true to minimize client request error responses due to transient failures. But because of this option, the creation of new sessions hang without errors when all nodes could not be contacted by the client. This new session option allows client to receive an error if the nodes cannot be contacted after the specified amount of time.

This seeks to resolve #14631

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch 4 times, most recently from cd6fd75 to ffe5ef5 Compare August 9, 2023 08:33
@ahrtr
Copy link
Member

ahrtr commented Aug 9, 2023

What if the etcdserver shutdown after the connection has already established? Will the session timeout or be blocked waiting for the cluster to recover? The expected behaviour should be the latter. Can you add a test case to verify it?

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from ffe5ef5 to 77ec301 Compare August 13, 2023 17:11
@AngstyDuck
Copy link
Contributor Author

What if the etcdserver shutdown after the connection has already established? Will the session timeout or be blocked waiting for the cluster to recover? The expected behaviour should be the latter. Can you add a test case to verify it?

Thanks for pointing this out, I've fixed the bug and added another test case to verify this. Do let me know what you think.

@AngstyDuck
Copy link
Contributor Author

AngstyDuck commented Aug 20, 2023

Hey @ahrtr, any updates regarding the changes I've made? I'm not quite sure how to write a test that verifies that hanging would occur (without using timeout, as the behavior has to be differentiated from the timeout that the new session option creates), so I have verified the context object instead.

@AngstyDuck
Copy link
Contributor Author

Hey @fuweid, thanks for your review. I've addressed them accordingly, so do let me know what you think 🙏

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ahrtr
Copy link
Member

ahrtr commented Sep 4, 2023

Please also squash the commits, thx

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 9f1ce8d to 976cff5 Compare September 5, 2023 16:17
@AngstyDuck
Copy link
Contributor Author

Hey @ahrtr, I've added a new test case where we verify that the default behavior isn't affected if an operation hangs beyond the timeout specified by the new option CreationTimeout. Do let me know if this works for you 🙏

// if Put operation is blocked beyond the timeout specified using WithCreationTimeout,
// that timeout is not used by the Put operation
case <-time.After(2 * time.Second):
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please start the etcdserver again, and verify that eventually the cli.Put will succeed.

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 011d138 to 357f70d Compare September 6, 2023 17:25
@AngstyDuck
Copy link
Contributor Author

AngstyDuck commented Sep 6, 2023

Hey @ahrtr, I've altered the test TestTimeoutDoesntAffectSubsequentConnections where I've restarted the test and checked that the ongoing Put succeeds. Unfortunately this has made the test flaky (53.29% failing after 152 runs) and I'm unsure how to proceed. The log is as attached here: log.txt

  1. My first choice would be to omit the testing of this specific logic until we've resolved the flakiness issue. If we go with this, I'll be creating the issue and would like to look into this myself.
  2. My second choice would be to resolve the flakiness before merging this into main, but the scope of this PR may be inflated.

How do you think we should proceed?

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 4869140 to d0ac5df Compare September 7, 2023 14:24
@AngstyDuck
Copy link
Contributor Author

AngstyDuck commented Sep 7, 2023

Hey @ahrtr, I've added the logging as suggested, and the error returned was session_test.go:198: Put failed: etcdserver: requested lease not found. This error is raised with or without the usage of the new option during session creation, so I think this might be a separate issue. I can look into this if you'd like

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from d0ac5df to 98fa529 Compare September 7, 2023 14:26
Comment on lines 193 to 195
case <-donec:
case err := <-errorc:
t.Errorf("Put failed: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential issue here. When the Put finishes, both donec and errorc might be ready to be read, then it's uncertain which case is selected.

So I suggest to remove donec, and always feed the err (returned by Put) into errroc, even it's nil.

Sorry for the confusion.

@AngstyDuck
Copy link
Contributor Author

Hey @ahrtr no worries, I've removed donec as instructed. However I'm still facing the same error session_test.go:199: Put failed: etcdserver: requested lease not found from errorc, with or without the new session creation option. How should I proceed?

@ahrtr
Copy link
Member

ahrtr commented Sep 7, 2023

However I'm still facing the same error session_test.go:199: Put failed: etcdserver: requested lease not found from errorc, with or without the new session creation option. How should I proceed?

  • I will take a closer look sometime next week.
  • The workflow checks are all green. You only occasionally see the error?

@AngstyDuck
Copy link
Contributor Author

The workflow checks are all green. You only occasionally see the error?

Hmm no, oddly this happens every time I've ran the test TestTimeoutDoesntAffectSubsequentConnections. I'll see if I can figure it out as well.


// create new cluster
clus := integration2.NewCluster(t, &integration2.ClusterConfig{Size: 1})
defer clus.Terminate(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defer clus.Terminate(t)
defer clus.Terminate(t)
clus.Members[0].KeepDataDirTerminate = true

}

// restarting and ensuring that the Put operation will eventually succeed
clus.Members[0].Restart(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
clus.Members[0].Restart(t)
clus.Members[0].Restart(t)
clus.Members[0].WaitOK(t)

@ahrtr
Copy link
Member

ahrtr commented Sep 9, 2023

The reason why the workflow is green is the test was skipped.

=== SKIP: integration/clientv3/concurrency TestTimeoutDoesntAffectSubsequentConnections (0.05s)
    leak.go:102: Found leaked goroutined BEFORE test appears to have leaked :
        github.com/soheilhy/cmux.muxListener.Accept(...)
        	/go/pkg/mod/github.com/soheilhy/cmux@v0.1.5/cmux.go:262
        net/http.(*Server).Serve(0xc00001c0f0, {0x1992220, 0xc0003b9820})
        	/__t/go/1.20.8/x64/src/net/http/server.go:3059 +0x5a7
        net/http/httptest.(*Server).goServe.func1()
        	/__t/go/1.20.8/x64/src/net/http/httptest/server.go:310 +0xbc
        created by net/http/httptest.(*Server).goServe
        	/__t/go/1.20.8/x64/src/net/http/httptest/server.go:308 +0xa7

When terminating the member, the data is removed automatically. You need to keep the data dir so that the lease is still valid after restarting the member. Please apply the suggested change, and try again.

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 288707e to b3f356b Compare September 12, 2023 16:08
Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @AngstyDuck - Can you please rebase this pr to address conflicts, thanks!

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 2354cb9 to 895e9b8 Compare September 24, 2023 16:08
@AngstyDuck
Copy link
Contributor Author

AngstyDuck commented Sep 24, 2023

Hey @jmhbnz, I've rebased the PR 👍 thanks!

@jmhbnz
Copy link
Member

jmhbnz commented Jan 18, 2024

Discussed during sig-etcd triage meeting, @AngstyDuck can you please rebase this pr once more? Thanks!

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from 895e9b8 to ba58a49 Compare January 22, 2024 15:38
@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch 2 times, most recently from 311eea4 to 6f76f6e Compare January 30, 2024 03:44
@AngstyDuck
Copy link
Contributor Author

Hey @jmhbnz, the linting errors have been fixed, do let me know if you need anything else 👍

Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks @AngstyDuck and apologies this pr has been in review for so long.

@jmhbnz
Copy link
Member

jmhbnz commented May 23, 2024

/test all

@jmhbnz
Copy link
Member

jmhbnz commented May 23, 2024

@AngstyDuck to be safe we should rebase one final time before we merge.

@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch 4 times, most recently from 2b5f97c to c723ffd Compare May 26, 2024 16:47
When communicating via gRPC, etcd sets the gRPC option WaitForReady = true
to minimize client request error responses due to transient failures. But
because of this option, the creation of new sessions hang without errors
when all nodes could not be contacted by the client. This new session option
allows client to receive an error if the nodes cannot be contacted after the
specified amount of time.

Signed-off-by: AngstyDuck <solsticedante@gmail.com>
@AngstyDuck AngstyDuck force-pushed the session-creation-timeout-option branch from c723ffd to b37c822 Compare August 4, 2024 17:47
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, AngstyDuck, fuweid, jmhbnz, lhmzhou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@a-nych
Copy link

a-nych commented Sep 17, 2024

Any updates on this?

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fuweid
Copy link
Member

fuweid commented Oct 11, 2024

@a-nych sorry for late update. Please rebase it
ping @AngstyDuck . Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

concurrency.NewSession hang after etcd server is killed with SIGSTOP(19)
7 participants