
make postgres a continuously checkpointing datastore #2247

Closed
wants to merge 6 commits into from

Conversation

vroldanbet
Contributor

@vroldanbet vroldanbet commented Feb 20, 2025

Context

Postgres datastore does not offer a continuous checkpointing Watch API because Postgres revisions are not time-based.
In the other databases, any timestamp represents a fully fledged revision, while in the Postgres datastore, a timestamp does not necessarily represent a transaction.

This is only relevant for consumers of the Watch API, and more particularly, consumers of the datastore Watch API, which exposes the concept of Watch Checkpoints. Consumers of the Postgres Watch API don't observe the passing of time, and this is important for any system that needs to report lag as a health metric.

The Fix

This PR introduces a new way to report checkpoints in Postgres Watch by running an actual transaction. When no new application revisions are detected, a method is invoked that creates a new synthetic transaction; it is not written to the transactions table, but simply returned as the new checkpoint. Because we know there were no new revisions at the time the query was evaluated, we can safely return a new revision. This is essentially what HeadRevision already does, with one difference: the transaction is actually assigned an ID.

The new logic leverages the pg_current_xact_id() function, which assigns a transaction ID to the currently running transaction, even if it never commits. The transaction XID, the snapshot, and the timestamp are all computed within the same transaction that checks for new revisions, so there is no race. The operation runs only when no new application revisions were detected.
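Conceptually, the statement that produces the synthetic checkpoint has roughly the following shape. This is a sketch only; the column names and formatting are assumptions, not the PR's actual SQL:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch of the statement behind the synthetic checkpoint; the
// actual query in the PR may differ. All three values are read inside the
// same transaction that verified there are no new application revisions, so
// they cannot race with a concurrent writer.
const syntheticCheckpointSQL = `
SELECT pg_current_xact_id()  AS xid,       -- assigns an XID8 to this transaction, even if it never commits
       pg_current_snapshot() AS snapshot,  -- the snapshot at which the query is evaluated
       now()                 AS ts;
`

// mentions reports whether the sketch references a given SQL function.
func mentions(fn string) bool { return strings.Contains(syntheticCheckpointSQL, fn) }

func main() {
	fmt.Print(syntheticCheckpointSQL)
}
```

The key point is that pg_current_xact_id(), unlike pg_current_snapshot(), actually consumes an XID8, which is what makes the returned value behave like a real revision.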

The main concern with this approach is consumption of the XID8 space. XID8 is a 64-bit ID space, so even at the entirely unrealistic rate of one XID per nanosecond it would take 584.5 years to exhaust. In practice, the Watch API poll interval is expected to be at most one poll per second.
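As a quick sanity check on that figure, exhausting a 64-bit ID space at one XID per nanosecond works out to:

```go
package main

import "fmt"

// Back-of-the-envelope check of the XID8 exhaustion claim: a 64-bit ID
// space consumed at one XID per nanosecond (far beyond any realistic
// Watch poll rate) lasts roughly 584.5 years.
func yearsToExhaustXID8(xidsPerSecond float64) float64 {
	const xidSpace = 1 << 64                  // 2^64 possible XID8 values
	const secondsPerYear = 365.25 * 24 * 3600 // Julian year
	return float64(xidSpace) / xidsPerSecond / secondsPerYear
}

func main() {
	fmt.Printf("%.1f years\n", yearsToExhaustXID8(1e9)) // one XID per nanosecond → 584.5 years
}
```

At the expected rate of one poll per second, the estimate grows by a factor of 10^9, so exhaustion is a non-issue.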

The other concern is that this makes HeadRevision move continuously, which in turn makes the selected optimized revision move continuously and hurts cache hit rates. The same issue was already observed in SpiceDB deployments where other workloads run in the same Postgres instance and cause HeadRevision to advance continuously. The fix for this problem is described in detail in the next section.

Finally, this change fixes a latent bug identified while reasoning about all the pieces involved in supporting the Postgres datastore Watch API. Commit 74dc7b2 introduced a change to address the lack of checkpoints when the Postgres datastore's pg_current_snapshot() moved outside of application changes (e.g. VACUUM operations, or other databases in the same Postgres instance). The inconsistency is that pg_current_snapshot() is emitted as a checkpoint, but it is entirely possible for a new transaction to come in and be evaluated at that same snapshot: pg_current_snapshot() returns the snapshot at which the query is evaluated, and does not consume it when no transaction ID was assigned (e.g. nothing was committed). Consequently, a client could observe a checkpoint at revision T and then a change at that very same revision T, which violates the semantics of the checkpoint as a high watermark (the client must not observe pop-in after the checkpoint).

This commit fixes that because the returned checkpoint is now part of an actual transaction, so no pop-in is possible. One phenomenon to bear in mind is that a client might call the Watch API with a revision obtained from a previous call to HeadRevision. If that revision was never committed, creating the new synthetic revision will consume it. This is fine, but the code must ensure that checkpoint is not emitted via the Watch API, because the API contract is that only revisions above the provided argument revision are emitted.
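A minimal sketch of that guard, with a simplified revision type standing in for the real Postgres revision (which carries a full snapshot, not a bare XID):

```go
package main

import "fmt"

// revision is a simplified stand-in for the datastore's Postgres revision;
// the real type carries a full snapshot, not just an XID.
type revision struct{ xid uint64 }

// shouldEmitCheckpoint mirrors the contract described above: the Watch API
// must only emit revisions strictly greater than the caller's argument
// revision, even when creating the synthetic checkpoint consumed that
// very XID.
func shouldEmitCheckpoint(checkpoint, after revision) bool {
	return checkpoint.xid > after.xid
}

func main() {
	after := revision{xid: 100}
	fmt.Println(shouldEmitCheckpoint(revision{xid: 100}, after)) // false: same revision, suppressed
	fmt.Println(shouldEmitCheckpoint(revision{xid: 101}, after)) // true: strictly newer
}
```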

Changes to OptimizedRevision

This PR also changes how Postgres optimized revisions are computed, so that revisions can be quantized even when no application writes are happening and something else in the Postgres instance is causing the current snapshot to move (e.g. VACUUM or non-SpiceDB workloads).

The changes address the impact of commit 3f22176, which made the Postgres Watch API generate artificial new transaction IDs as a checkpoint heartbeat. The fix in this PR mitigates the effect that has on the cache, which would otherwise thrash because a new optimized revision is selected every time one is computed. It also fixes cache thrashing caused by sharing a Postgres database between multiple SpiceDB instances, or with unrelated workloads.

@github-actions github-actions bot added the area/datastore Affects the storage system label Feb 20, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from 947e9ef to 7c2054d Compare February 21, 2025 16:16
@github-actions github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Feb 21, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from b0d05a2 to 28ab8fb Compare February 21, 2025 17:47
@vroldanbet vroldanbet marked this pull request as ready for review February 24, 2025 16:18
@vroldanbet vroldanbet requested a review from a team as a code owner February 24, 2025 16:18
@vroldanbet vroldanbet self-assigned this Feb 24, 2025
This commit changes how Postgres optimized revisions are computed so that revisions can be quantized even when no application writes are happening and something else in the Postgres instance is causing the current snapshot to move (e.g. VACUUM or non-SpiceDB workloads).

This commit addresses the side effects of commit 3f22176, where we made the PG Watch API generate artificial new transaction IDs as a checkpoint heartbeat. The fix in this commit mitigates the cache thrashing that change would otherwise have caused.

It also fixes cache thrashing caused by sharing a Postgres database between multiple SpiceDB instances, or with other unrelated workloads.

The query does the same thing as before, but now COALESCEs between the selected optimized revision and the latest revision available in the datastore. This means that, as time passes with no writes, the same last revision keeps being selected.

The LEFT JOIN is added so that COALESCE does not fail when the "optimized" CTE returns an empty result: it needs a NULL value to fall back from.
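A hypothetical sketch of the query shape being described; the table name, the window size, and the CTE layout are illustrative assumptions, not the PR's actual SQL:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative shape of the described query; the table name and the 5-second
// window are assumptions. The "optimized" CTE may return zero rows, so it is
// LEFT JOINed in: the resulting NULL lets COALESCE fall back to the latest
// revision in the datastore, keeping the selection stable when no writes occur.
const optimizedRevisionSQL = `
WITH optimized AS (
    SELECT xid FROM relation_tuple_transaction
    WHERE timestamp <= now() - interval '5 seconds'  -- quantization cutoff (assumed)
    ORDER BY xid DESC
    LIMIT 1
), head AS (
    SELECT MAX(xid) AS xid FROM relation_tuple_transaction
)
SELECT COALESCE(optimized.xid, head.xid)
FROM head LEFT JOIN optimized ON true;
`

// hasClause reports whether the sketch contains a given SQL fragment.
func hasClause(c string) bool { return strings.Contains(optimizedRevisionSQL, c) }

func main() {
	fmt.Print(optimizedRevisionSQL)
}
```

Without the LEFT JOIN, an empty "optimized" CTE would yield zero result rows overall, rather than a NULL for COALESCE to skip over.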
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from cee5aec to f2f2a11 Compare February 24, 2025 16:27
The new logic fails to select an optimized revision that correctly captures all overlapping snapshots within a quantization window, causing it to miss updates if no newer writes have taken place.
This fixes the failing test case in the previous commit, which demonstrated that the new query to determine the optimized revision could not correctly compute the revision when concurrent snapshots existed in the same quantization window.

This fix changes the query so:
- if an optimized revision exists, it is returned
- if it does not, then all the revisions
  from the last quantization window are returned

Now optimizedRevisionFunc needs to handle both scenarios:
- if an optimized revision was found, a single row is returned, and the revision is returned as is
- if an optimized revision was not found, we will get one or more rows; in either case, we create a synthetic revision that captures all the XIDs in the quantization window

The tradeoff is that multiple rows are returned. This should only happen when no writes take place and the quantization window has passed. In that case, it seems reasonable to trade returning multiple transaction rows for a deterministic optimized revision that improves cache hit ratios.
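The two scenarios above can be sketched as follows, with a max over XIDs standing in for the real code's full snapshot construction (the types and names here are illustrative, not the PR's):

```go
package main

import "fmt"

// txRow is a simplified stand-in for a row returned by the optimized-revision
// query: either the single pre-existing optimized revision, or one row per
// transaction in the last quantization window.
type txRow struct {
	xid       uint64
	optimized bool
}

// optimizedRevision sketches the two scenarios: a single optimized row is
// returned as-is, while one or more window rows are collapsed into a
// synthetic revision covering every XID seen. (The real code builds a full
// snapshot; taking the max XID stands in for that here.)
func optimizedRevision(rows []txRow) uint64 {
	if len(rows) == 1 && rows[0].optimized {
		return rows[0].xid
	}
	var maxXID uint64
	for _, r := range rows {
		if r.xid > maxXID {
			maxXID = r.xid
		}
	}
	return maxXID
}

func main() {
	fmt.Println(optimizedRevision([]txRow{{xid: 42, optimized: true}}))      // single optimized row: 42
	fmt.Println(optimizedRevision([]txRow{{xid: 40}, {xid: 43}, {xid: 41}})) // window rows collapsed: 43
}
```

Because the synthetic revision is derived deterministically from the window's contents, repeated computations select the same revision, which is what restores cache hit ratios.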
@github-actions github-actions bot added the area/dependencies Affects dependencies label Feb 25, 2025
},
{
"QuantizedRecentWithFollowerReadDelay",
500 * time.Millisecond,
2 * time.Second,
[]time.Duration{-4 * time.Second, -2 * time.Second, 0},
1, 2,
2, 1,
Contributor

Is this because -2s lands exactly on a 500ms window? Maybe we can update this test to be clearer?

Contributor Author

It's because there was a bug in the original test: it wasn't calling pgSnapshot.complete() like the actual optimizeRevisionFunc code was doing. So even though it was returning the revision at -2 as the optimized one, it wasn't observing its own transaction, and that led to only observing 1 revision. The test was wrong from the beginning. The other test cases were affected as well.

@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from 558c102 to 2bc85c9 Compare February 25, 2025 15:56
@vroldanbet vroldanbet marked this pull request as draft February 26, 2025 12:17
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
@vroldanbet
Contributor Author

Closed in favor of #2252

@vroldanbet vroldanbet closed this Feb 26, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Feb 26, 2025