safekeeper: flush WAL on transaction commit #9694
base: main
Conversation
@arssher Submitting this for an early look at the overall approach. Needs tweaks, tests, and benchmarks.
5355 tests run: 5132 passed, 1 failed, 222 skipped (full report).
Failures on Postgres 17.
Flaky tests (2): Postgres 17, Postgres 14.
Test coverage report is not available. The comment gets automatically updated with the latest test results.
a81df96 at 2024-11-11T17:56:19.242Z
Force-pushed from 10223d1 to 1cf5b69.
Note to self: don't do this when the ingested LSN lags the commit LSN. This implies that the local node is catching up, and not on the commit critical path.
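A minimal sketch of that check, assuming hypothetical field names (`ingested_lsn`, `commit_lsn` are illustrative here, not the actual safekeeper fields): only flush immediately on a commit record when the locally ingested WAL has caught up with the advertised commit LSN, i.e. when we are actually on the commit critical path.

```rust
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

struct IngestState {
    ingested_lsn: Lsn, // end LSN of the WAL written locally so far
    commit_lsn: Lsn,   // commit LSN advertised by the compute
}

impl IngestState {
    fn should_flush_on_commit(&self, record_is_commit: bool) -> bool {
        // If ingested_lsn lags commit_lsn, this node is catching up on
        // historical WAL and nobody is waiting on our flush, so rely on
        // the periodic (once-per-second) flush instead.
        record_is_commit && self.ingested_lsn >= self.commit_lsn
    }
}
```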
Force-pushed from 1cf5b69 to 71f04f9.
Force-pushed from 97a9cd3 to a81df96.
No objections, the approach is OK. Note though that Postgres replication doesn't do this (or flush on segment boundaries), AFAIR, and doesn't suffer much. Probably the combination of a dense stream (so dense that no fsync and reply is issued) and commit-latency sensitivity is rare.
Interesting. Postgres also has …
AFAIR it is the same as what we have: write as long as there is data in the socket and flush once there is no data. It is a natural amortizer: as long as the sender can send, we try not to block it (but cap it with a timeout). If it is stuck waiting for the receiver, it likely won't be able to write/send either. Except when this doesn't work, but again, that seems to be rare.
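A minimal sketch of that "natural amortizer", with hypothetical trait and callback names (this is not the actual safekeeper receive loop): keep writing WAL while the sender has more data queued, and only flush (and reply) once the socket drains or a timeout cap is hit.

```rust
use std::time::{Duration, Instant};

const FLUSH_TIMEOUT: Duration = Duration::from_secs(1);

trait WalSocket {
    /// Returns the next chunk of WAL, or None when no data is currently pending.
    fn try_recv(&mut self) -> Option<Vec<u8>>;
}

fn receive_loop<S: WalSocket>(
    sock: &mut S,
    mut write: impl FnMut(&[u8]), // append to the local WAL file
    mut flush: impl FnMut(),      // fsync and send a reply
) {
    let mut last_flush = Instant::now();
    while let Some(buf) = sock.try_recv() {
        // Data keeps arriving: append without blocking the sender...
        write(&buf);
        // ...but cap how long we go without flushing.
        if last_flush.elapsed() >= FLUSH_TIMEOUT {
            flush();
            last_flush = Instant::now();
        }
    }
    // The socket drained: flush and acknowledge now.
    flush();
}
```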
Postgres group commit is different: it optimizes a single fsync for many backends, but the write to the safekeeper is single-threaded. I doubt batching on the safekeeper side is very useful, because very similar batching already happens on the compute. If the compute writes a small xact and commits it, it wants the reply ASAP, so we send this little piece immediately and want to flush it ASAP. If, OTOH, it writes a lot of WAL and doesn't need to commit, we send it in large chunks, up to 128KB. This value is easy to increase if wanted (it indeed might give some benefit).
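For illustration, a sketch of the compute-side batching described above (the 128 KiB cap comes from the comment; the constant and function names are hypothetical): pending WAL is drained in chunks of at most `MAX_SEND_SIZE`, so a small committed transaction is sent immediately on its own, while a long stream of WAL goes out in large chunks.

```rust
const MAX_SEND_SIZE: usize = 128 * 1024;

fn drain_pending_wal(pending: &mut Vec<u8>, mut send: impl FnMut(&[u8])) {
    while !pending.is_empty() {
        // Send at most MAX_SEND_SIZE bytes per message; a small commit fits
        // in a single, immediately sent message.
        let n = pending.len().min(MAX_SEND_SIZE);
        send(&pending[..n]);
        pending.drain(..n);
    }
}
```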
This is pretty workload-dependent though. I was doing some testing with a script that used … You're right, though, that it doesn't matter where we do this batching. But I guess any batching in Postgres would be per backend, and not across all backends?
No, it's across all backends, because a separate walproposer process sends everything. Each time it tries to send something, it sends as much as possible (limited by these 128KB).
Ok, that's not what I saw with this script at least -- the Safekeeper did lots of small 10KB appends. Anyway, I have a prototype for Safekeeper batching, and it gives a ~1500% throughput increase for small appends. So we should ensure we properly batch things somewhere, at least. I'll open a draft PR and we can move this discussion there.
Force-pushed from 4bfdf46 to e7e2683.
Problem
Currently, if the Safekeeper is processing pipelined WAL, it will only flush the WAL and respond with an AppendResponse once per second. If a transaction commits in the middle of this pipeline, the transaction will see up to 1 second of latency.

Resolves #9690.
Requires #9692.

Summary of changes
Flush the WAL and respond with an AppendResponse when ingesting a transaction commit WAL record. This provides the same behavior as Postgres, which immediately flushes the WAL on transaction commit.
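A hedged sketch of the behavior described in the summary of changes, using illustrative types rather than the real safekeeper code: while ingesting a batch of WAL records, trigger an immediate flush and AppendResponse if the batch contains a transaction commit record, instead of waiting for the periodic once-per-second flush.

```rust
struct WalRecord {
    is_commit: bool,
    // payload elided
}

struct Safekeeper {
    dirty: bool, // unflushed WAL has been written since the last flush
}

impl Safekeeper {
    fn write_record(&mut self, _rec: &WalRecord) {
        // Append the record to the local WAL file (not shown).
        self.dirty = true;
    }

    fn flush_and_respond(&mut self) {
        // fsync the WAL and send an AppendResponse to the walproposer (not shown).
        self.dirty = false;
    }

    fn ingest_batch(&mut self, records: &[WalRecord]) {
        let mut saw_commit = false;
        for rec in records {
            self.write_record(rec);
            saw_commit |= rec.is_commit;
        }
        // Flush immediately on commit so the transaction doesn't wait for the
        // periodic flush; otherwise leave flushing to the 1 s timer.
        if saw_commit && self.dirty {
            self.flush_and_respond();
        }
    }
}
```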