-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protect the .Conn
property of tabletHealthCheck
from being read/modified concurrently.
#16362
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #16362 +/- ##
==========================================
+ Coverage 68.69% 68.74% +0.04%
==========================================
Files 1547 1548 +1
Lines 198297 200153 +1856
==========================================
+ Hits 136228 137598 +1370
- Misses 62069 62555 +486 ☔ View full report in Codecov by Sentry. |
.Conn
property of tabletHealthCheck
from being modified concurrently..Conn
property of tabletHealthCheck
from being read/modified concurrently.
@@ -840,7 +840,7 @@ func (hc *HealthCheckImpl) TabletConnection(ctx context.Context, alias *topodata | |||
hc.mu.Lock() | |||
thc := hc.healthByAlias[tabletAliasString(topoproto.TabletAliasString(alias))] | |||
hc.mu.Unlock() | |||
if thc == nil || thc.Conn == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this because I don't think this particular check makes sense (and it causes the race detector to be unhappy).
thc.Conn
can become nil
at any time, including right after this check.
Calling thc.Connection
a few lines below will return a new connection object that will either be usable (if the vttablet on the other side of the connection can be accessed), or it will return an error when the connection is attempted to be used.
Removing this line seems to have impact on some integration tests and I'll try if I can figure out how to update them accordingly, but as-is this check simply leads to flaky behaviour and can't be trusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems correct to remove this check for the reasons you stated. We should wait and see what your test failures tell you.
One thing to note is that there seem to be exactly two paths to thc.Connection
which, as you noted, is responsible for initializing the connection object. One is from here, and the other is from stream
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and looked at the blame for this. Perhaps it is actually protecting us from some other race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may end up recreating a connection to a tablet that we are trying to drop.
…odified concurrently. Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
3436cc4
to
c22b815
Compare
Signed-off-by: Arthur Schreiber <arthurschreiber@github.com>
utils.AssertContainsError(t, readConn, fetchAllCustomers, "is either down or nonexistent") | ||
utils.AssertContainsError(t, readConn, fetchAllCustomers, "connect: connection refused") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need a replacement test that produces the original error.
// Wait for the tablet to become healthy | ||
time.Sleep(30 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid flakiness, we should probably check in hc
for the healthy tablet instead of sleeping for such a short amount of time.
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
This PR was closed because it has been stale for 7 days with no activity. |
Description
During multiple external primary failovers +
TabletExternallyReparented
calls across different keyspaces we've run into situations where differentvtgate
processes would end up being "stuck" and unable to serve queries that are supposed to be routed to the newly elected primary.Those queries all fail with a
vttablet: Connection Closed
error, which can only happen ifgRPCQueryClient.cc
is set tonil
. The only place in the code base wheregRPCQueryClient.cc
is modified is ingRPCQueryClient.Close
, which in turn is only ever called fromtabletHealthCheck.closeConnection
ortabletHealthCheck.finalizeConn
.This leads me to believe that there are situations where we can end up with a
tabletHealthCheck
on whichConn
is non-nil
butgRPCQueryClient.cc
isnil
. This would explain thevttablet: Connection Closed
errors we're seeing, while at the same time not seeing the grpc connection being re-established.I noticed that the modification of
tabletHealthCheck.Conn
in bothtabletHealthCheck.closeConnection
andtabletHealthCheck.finalizeConn
is not protected by a mutex like it is intabletHealthCheck.Connection
. I'm not sure this really explains the issue we're seeing, but it seems like a good idea to protect the.Conn
property oftabletHealthCheck
from being read/modified concurrently.I added a test case that simulates how I believe this code is used in
vtgate
- by havingtabletHealthCheck.checkConn
run in one goroutine while callingtabletHealthCheck.Connection
from another goroutine while the connection sporadically receives errors (so as to triggertabletHealthCheck.closeConnection
to be called).Related Issue(s)
Checklist
Deployment Notes