safekeeper: flush WAL on transaction commit #9694

Draft · erikgrinaker wants to merge 2 commits into base: main
Conversation

erikgrinaker
Contributor

Problem

Currently, if the Safekeeper is processing pipelined WAL, it will only flush the WAL and respond with an AppendResponse once per second. If a transaction commits in the middle of this pipeline, the transaction will see up to 1 second of latency.

Resolves #9690.
Requires #9692.

Summary of changes

Flush the WAL and respond with AppendResponse when ingesting a transaction commit WAL record. This provides the same behavior as Postgres, which immediately flushes the WAL on transaction commits.
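
Roughly, the commit detection boils down to checking the record header's resource manager and info bits against the xact constants from Postgres. The sketch below is illustrative only: the constants match the Postgres headers (rmgrlist.h, xact.h), but the surrounding ingest hook and flush call are hypothetical stand-ins, not the safekeeper's actual API.

```rust
// Sketch only. Constants match Postgres (rmgrlist.h / xact.h); the ingest
// hook described in the comments is a hypothetical stand-in for the
// safekeeper's real WAL decoder.
const RM_XACT_ID: u8 = 1;
const XLOG_XACT_OPMASK: u8 = 0x70;
const XLOG_XACT_COMMIT: u8 = 0x00;
const XLOG_XACT_COMMIT_PREPARED: u8 = 0x30;

/// Returns true if an XLogRecord header (xl_rmid, xl_info) describes a
/// transaction commit, including COMMIT PREPARED.
fn is_commit_record(xl_rmid: u8, xl_info: u8) -> bool {
    xl_rmid == RM_XACT_ID
        && matches!(
            xl_info & XLOG_XACT_OPMASK,
            XLOG_XACT_COMMIT | XLOG_XACT_COMMIT_PREPARED
        )
}

// In the ingest path (pseudocode): when is_commit_record(rec.rmid, rec.info)
// is true, fsync the WAL up to this record and send an AppendResponse
// immediately instead of waiting for the periodic flush.
```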

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@erikgrinaker erikgrinaker requested a review from arssher November 8, 2024 15:42
@erikgrinaker
Contributor Author

@arssher Submitting this for an early look at the overall approach. Needs tweaks, tests and benchmarks.


github-actions bot commented Nov 8, 2024

5355 tests run: 5132 passed, 1 failed, 222 skipped (full report)


Failures on Postgres 17

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_default_locales[debug-pg17]"
Flaky tests (2)

Postgres 17

Postgres 14

Test coverage report is not available

The comment gets automatically updated with the latest test results
a81df96 at 2024-11-11T17:56:19.242Z :recycle:

@erikgrinaker erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch 2 times, most recently from 10223d1 to 1cf5b69 Compare November 9, 2024 15:40
@erikgrinaker
Contributor Author

Note to self: don't do this when the ingested LSN lags the commit LSN. That would imply the local node is catching up and is not on the commit critical path.
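
A hypothetical expression of that guard (the names and plain-u64 LSNs are illustrative, not the safekeeper's real types):

```rust
/// Illustrative guard for the note above: only flush eagerly when this node
/// is actually on the commit critical path, i.e. its ingest position has
/// caught up to the commit LSN advertised by the walproposer.
fn should_flush_on_commit(record_end_lsn: u64, advertised_commit_lsn: u64) -> bool {
    // If we are still ingesting WAL below the already-committed LSN, we are
    // catching up and our ack cannot unblock a waiting transaction, so let
    // the periodic flush handle it instead.
    record_end_lsn >= advertised_commit_lsn
}
```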

@erikgrinaker erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch from 1cf5b69 to 71f04f9 Compare November 11, 2024 10:28
@erikgrinaker erikgrinaker force-pushed the erik/respond-xact-commit branch from 97a9cd3 to a81df96 Compare November 11, 2024 16:48
@arssher
Contributor

arssher commented Nov 12, 2024

No objections, approach is ok.

Note though that Postgres replication doesn't do this (or flush on segment boundaries), AFAIR, and doesn't suffer much. The combination of a dense stream (so dense that no fsync and reply is issued) and commit-latency sensitivity is probably rare.

@erikgrinaker
Contributor Author

> Note though that Postgres replication doesn't do this (or flush on segment boundaries), AFAIR, and doesn't suffer much. The combination of a dense stream (so dense that no fsync and reply is issued) and commit-latency sensitivity is probably rare.

Interesting. With synchronous_commit, it can't commit a txn on the primary until the secondary has flushed it to disk, so there must be some policy for when to flush. But I can dig around a bit; it would be interesting to find out.

Postgres also has group_commit to amortize fsync costs for many commits (off by default). We may want to do something similar -- but the batching in #9689 will also implicitly group all commits per batch (~1 MB?), so maybe that's enough.

@arssher
Contributor

arssher commented Nov 12, 2024

> so there must be some policy for when to flush.

AFAIR it is the same as what we have: write as long as there is data in the socket, and flush once there is no data. It is a natural amortizer: as long as the sender can send, we try not to block it (but cap it with a timeout). If it is stuck waiting for the receiver, it likely won't be able to write/send. Except when this doesn't work, but again, that seems to be rare.
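
Schematically, that amortizer looks something like the loop below. This is a sketch under an assumed interface (`try_read_next`, `flush_and_reply`, etc. are invented names), not walreceiver or safekeeper code:

```rust
use std::io;
use std::time::{Duration, Instant};

/// Assumed interface over the replication connection and WAL file; the
/// method names are invented for this sketch.
trait WalStream {
    /// Returns Some(chunk) if data is immediately available, None if the
    /// socket is currently drained.
    fn try_read_next(&mut self) -> io::Result<Option<Vec<u8>>>;
    fn write_wal(&mut self, chunk: &[u8]) -> io::Result<()>;
    /// fsync the WAL and send the feedback/AppendResponse message.
    fn flush_and_reply(&mut self) -> io::Result<()>;
    /// Block until more data arrives.
    fn wait_readable(&mut self) -> io::Result<()>;
}

/// "Flush when the stream goes idle" amortizer: keep writing as long as the
/// sender keeps the socket full, fsync + reply once there is nothing left to
/// read, and cap how long data can sit unflushed with a timeout.
fn receive_loop(conn: &mut dyn WalStream) -> io::Result<()> {
    const MAX_UNFLUSHED: Duration = Duration::from_secs(1);
    let mut last_flush = Instant::now();
    loop {
        match conn.try_read_next()? {
            Some(chunk) => {
                conn.write_wal(&chunk)?; // buffered write, no fsync yet
                if last_flush.elapsed() >= MAX_UNFLUSHED {
                    conn.flush_and_reply()?; // timeout cap on unflushed data
                    last_flush = Instant::now();
                }
            }
            None => {
                // Socket drained: the sender is likely waiting on us, so
                // fsync and acknowledge now.
                conn.flush_and_reply()?;
                last_flush = Instant::now();
                conn.wait_readable()?;
            }
        }
    }
}
```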

@arssher
Contributor

arssher commented Nov 12, 2024

> We may want to do something similar

Postgres group commit is different: it optimizes a single fsync for many backends, but the write to the safekeeper is single-threaded.

I doubt batching on the safekeeper side is much use, because very similar batching already happens on the compute. If the compute writes a small xact and commits it, it wants the reply ASAP, so we send this little piece immediately and want to flush it ASAP. If OTOH it writes a lot of WAL and doesn't need to commit, we send it in large chunks, up to 128 KB. This value is easy to increase if wanted (it indeed might give some benefit).

@erikgrinaker
Contributor Author

> I doubt batching on the safekeeper side is much use, because very similar batching already happens on the compute. If the compute writes a small xact and commits it, it wants the reply ASAP, so we send this little piece immediately and want to flush it ASAP. If OTOH it writes a lot of WAL and doesn't need to commit, we send it in large chunks, up to 128 KB. This value is easy to increase if wanted (it indeed might give some benefit).

This is pretty workload-dependent though. I was doing some testing with a script that used pg_logical_emit_message(), and there was no batching on the compute -- the Safekeeper received tiny AppendRequests, which killed throughput. The same would be true with many concurrent writers.

You're right, though, that it doesn't matter where we do this batching. But I guess any batching in Postgres would be per backend, and not across all backends?

@arssher
Contributor

arssher commented Nov 12, 2024

No, it's across all backends, because a separate walproposer process sends everything. Each time it tries to send something, it sends as much as possible (limited by these 128 KB).

@erikgrinaker
Contributor Author

> No, it's across all backends, because a separate walproposer process sends everything. Each time it tries to send something, it sends as much as possible (limited by these 128 KB).

Ok, that's not what I saw with this script at least -- the Safekeeper did lots of small 10KB appends.

Anyway, I have a prototype for Safekeeper batching, and it gives a ~1500% throughput increase for small appends. So we should ensure we properly batch things somewhere at least. I'll open a draft PR and we can move this discussion there.
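
For illustration only (this is not the prototype mentioned above), a safekeeper-side batcher of this kind can be as simple as accumulating append payloads up to a byte budget and doing one write + fsync per batch; the names and the ~1 MB budget are assumptions taken from the discussion:

```rust
/// Illustrative batcher, not the prototype referenced above: collect small
/// append payloads and flush them with a single write + fsync once a byte
/// budget is reached.
struct AppendBatcher {
    buf: Vec<u8>,
    max_bytes: usize, // e.g. ~1 MB, the batch size floated for #9689
}

impl AppendBatcher {
    fn new(max_bytes: usize) -> Self {
        Self { buf: Vec::with_capacity(max_bytes), max_bytes }
    }

    /// Buffer one payload; returns true when the batch should be written
    /// and fsynced as a single unit.
    fn push(&mut self, payload: &[u8]) -> bool {
        self.buf.extend_from_slice(payload);
        self.buf.len() >= self.max_bytes
    }

    /// Take the accumulated bytes for a single write + fsync, leaving the
    /// batcher empty. Would also be drained on commit records or when the
    /// stream goes idle.
    fn take(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buf)
    }
}
```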

@erikgrinaker erikgrinaker force-pushed the erik/respond-segment-flush-lsn branch from 4bfdf46 to e7e2683 Compare November 16, 2024 14:50
Base automatically changed from erik/respond-segment-flush-lsn to main November 17, 2024 17:19
@arssher arssher removed their request for review November 18, 2024 14:56