
Ensure websocket connections persist until done on queue-proxy drain #15759

Open · elijah-rou wants to merge 1 commit into main

Conversation

@elijah-rou (Author)

Fixes: Websockets closing abruptly when queue-proxy undergoes drain.

Because net/http's server.Shutdown does not wait for hijacked connections, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, incremented when a request comes in and decremented when its handler terminates. During drain, the queue-proxy waits for this counter to reach zero (or for the revision timeout to elapse) before calling .Shutdown.
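For illustration only, a minimal sketch of that mechanism (not the actual patch; the names pendingRequests, withRequestCounting, drain, and revisionTimeout are placeholders):

package queue

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// pendingRequests counts in-flight requests, including hijacked
// websocket connections, which server.Shutdown alone would not wait for.
var pendingRequests atomic.Int64

// withRequestCounting increments the counter when a request arrives and
// decrements it when the handler returns (i.e. when the connection is done).
func withRequestCounting(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		pendingRequests.Add(1)
		defer pendingRequests.Add(-1)
		next.ServeHTTP(w, r)
	})
}

// drain polls until the counter reaches zero or the revision timeout
// elapses, and only then calls Shutdown.
func drain(srv *http.Server, revisionTimeout time.Duration) error {
	deadline := time.Now().Add(revisionTimeout)
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if pendingRequests.Load() <= 0 || time.Now().After(deadline) {
			break
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), revisionTimeout)
	defer cancel()
	return srv.Shutdown(ctx)
}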

Release Note

Introduce an atomic pending-request counter which the queue-proxy uses to gracefully terminate all connections (including hijacked connections) during drain


knative-prow bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


knative-prow bot commented Feb 7, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 7, 2025
@knative-prow knative-prow bot requested review from dprotaso and skonto February 7, 2025 20:01
@dprotaso (Member) commented Feb 9, 2025

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 9, 2025

codecov bot commented Feb 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 22 lines in your changes missing coverage. Please review.

Project coverage is 80.74%. Comparing base (6265a8e) to head (99dd71c).
Report is 25 commits behind head on main.

Files with missing lines           Patch %   Lines
pkg/queue/sharedmain/main.go       0.00%     14 Missing ⚠️
pkg/queue/sharedmain/handlers.go   0.00%     8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15759      +/-   ##
==========================================
- Coverage   80.84%   80.74%   -0.11%     
==========================================
  Files         222      222              
  Lines       18070    18096      +26     
==========================================
+ Hits        14609    14611       +2     
- Misses       3089     3114      +25     
+ Partials      372      371       -1     


@dprotaso (Member) left a comment

Can we add an e2e test to validate that the failure is fixed by these changes?

pkg/queue/sharedmain/handlers.go (review thread, outdated, resolved)
pkg/queue/sharedmain/main.go (review thread, outdated, resolved)
@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from d6da0ea to 99dd71c Compare February 12, 2025 00:59
break WaitOnPendingRequests
}
}
time.Sleep(drainSleepDuration)
@skonto (Contributor) commented Feb 12, 2025

This adds unnecessary overhead if there are no websocket connections though. Could we avoid it for non-websocket workloads?

I suppose this is needed so that there is enough time for the app to notify the client by initiating a websocket close action? Note that the QP does not do the actual draining of the connection (due to the known reasons).

@skonto (Contributor) commented Feb 12, 2025

Another idea is to have something like:

done := make(chan struct{})
Server.RegisterOnShutdown(func() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	logger.Infof("Drain: waiting for %d pending requests to complete", pendingRequests.Load())
WaitOnPendingRequests:
	for range ticker.C {
		if pendingRequests.Load() <= 0 {
			logger.Infof("Drain: all pending requests completed")
			break WaitOnPendingRequests
		}
	}

	// wait some configurable time, e.g. WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS

	done <- struct{}{}
})

Server.Shutdown(...)
<-done

WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS could be zero by default for regular workloads. Or use something similar for the sleep time above in this PR.

Another thing (a bit of a hack and thinking out loud) is whether we could detect hijacked connections via the Upgrade and Connection headers, which are mandatory 🤔 (not sure about wss, but I think we don't support it, do we?). cc @dprotaso
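For what it's worth, a rough sketch of that header check (isWebsocketUpgrade is a hypothetical helper; gorilla/websocket has a similar IsWebSocketUpgrade function):

import (
	"net/http"
	"strings"
)

// isWebsocketUpgrade reports whether the request asks to be upgraded to a
// websocket, based on the mandatory Upgrade and Connection headers. Only
// such requests would then need to be tracked past server.Shutdown.
func isWebsocketUpgrade(r *http.Request) bool {
	return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") &&
		strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade")
}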

@elijah-rou (Author)

I think I may have moved that sleep to the wrong place; it should probably just be moved to before the wait on pending requests, which will avoid the additional wait. drainSleepDuration should only be a minimum wait period.

As for avoiding the pending-request check entirely, we may be able to, but that would probably require checking specifics about the connection before incrementing the counter to make it websocket-specific (so instead of just wrapping the reverse proxy we would probably have to inspect each request). IMO it's likely not worth doing, since you're already expecting a wait from .Shutdown, and this would only spare .Shutdown the work of waiting for non-hijacked connections.
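For clarity, the reordering described here would look roughly like this (sketch only, reusing the names from the quoted snippet):

// Sleep first, so drainSleepDuration acts purely as a minimum wait period,
// then poll until every pending request (including hijacked websocket
// connections) has completed.
time.Sleep(drainSleepDuration)

ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()

WaitOnPendingRequests:
for range ticker.C {
	if pendingRequests.Load() <= 0 {
		break WaitOnPendingRequests
	}
}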

@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from 99dd71c to 54db92b Compare February 12, 2025 16:48
add e2e for ws beyond queue drain; move sleep to appropriate loc

add ref to go issue
@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from 54db92b to e8dec25 Compare February 12, 2025 16:49