
make postgres a continuously checkpointing datastore #2247

Closed
wants to merge 6 commits into from

Conversation

vroldanbet
Contributor

@vroldanbet vroldanbet commented Feb 20, 2025

Context

Postgres datastore does not offer a continuous checkpointing Watch API because Postgres revisions are not time-based.
In the other databases, any timestamp represents a fully fledged revision, while in the Postgres datastore, a timestamp does not necessarily represent a transaction.

This is only relevant for consumers of the Watch API, and more particularly, consumers of the datastore Watch API, which exposes the concept of Watch Checkpoints. Consumers of the Postgres Watch API don't observe the passing of time, and this is important for any system that needs to report lag as a health metric.

The Fix

This PR introduces a new way to report checkpoints in Postgres Watch by running an actual transaction. When no new application revisions are detected, a method is invoked that creates a new synthetic transaction; it is not written to the transactions table, but simply returned as the new checkpoint. Because we know there were no new revisions at the time the query was evaluated, we can safely return a new revision. This is essentially what HeadRevision already does, with one difference: the transaction is actually assigned an ID.

The new logic leverages the pg_current_xact_id() function, which assigns a transaction ID to the currently running transaction, even if it never commits. The transaction XID, the snapshot, and the timestamp are all computed within the same transaction that checks for new revisions, so there is no race. The operation runs only when no new application revisions were detected.
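Conceptually, the statement that produces the synthetic checkpoint has roughly the following shape. This is a sketch only; the column names and formatting are assumptions, not the PR's actual SQL:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch of the statement behind the synthetic checkpoint; the
// actual query in the PR may differ. All three values are read inside the
// same transaction that verified there are no new application revisions, so
// they cannot race with a concurrent writer.
const syntheticCheckpointSQL = `
SELECT pg_current_xact_id()  AS xid,       -- assigns an XID8 to this transaction, even if it never commits
       pg_current_snapshot() AS snapshot,  -- the snapshot at which the query is evaluated
       now()                 AS ts;
`

// mentions reports whether the sketch references a given SQL function.
func mentions(fn string) bool { return strings.Contains(syntheticCheckpointSQL, fn) }

func main() {
	fmt.Print(syntheticCheckpointSQL)
}
```

The key point is that pg_current_xact_id(), unlike pg_current_snapshot(), actually consumes an XID8, which is what makes the returned value behave like a real revision.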

The main concern with this approach is consumption of the XID8 space. XID8 is a 64-bit ID space, so even at the entirely unrealistic rate of one XID per nanosecond it would take 584.5 years to exhaust. In practice, the Watch API poll interval is expected to be at most one poll per second.
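As a quick sanity check on that figure, exhausting a 64-bit ID space at one XID per nanosecond works out to:

```go
package main

import "fmt"

// Back-of-the-envelope check of the XID8 exhaustion claim: a 64-bit ID
// space consumed at one XID per nanosecond (far beyond any realistic
// Watch poll rate) lasts roughly 584.5 years.
func yearsToExhaustXID8(xidsPerSecond float64) float64 {
	const xidSpace = 1 << 64                  // 2^64 possible XID8 values
	const secondsPerYear = 365.25 * 24 * 3600 // Julian year
	return float64(xidSpace) / xidsPerSecond / secondsPerYear
}

func main() {
	fmt.Printf("%.1f years\n", yearsToExhaustXID8(1e9)) // one XID per nanosecond → 584.5 years
}
```

At the expected rate of one poll per second, the estimate grows by a factor of 10^9, so exhaustion is a non-issue.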

The other concern is that this makes HeadRevision move continuously, which in turn makes the selected optimized revision move continuously and hurts cache hit rates. The same issue was already observed in SpiceDB deployments where other workloads run in the same Postgres instance and cause HeadRevision to advance continuously. The fix for this problem is described in detail in the next section.

Finally, this change fixes a latent bug identified while reasoning about all the pieces involved in supporting the Postgres datastore Watch API. Commit 74dc7b2 introduced a change to address the lack of checkpoints when the Postgres datastore's pg_current_snapshot() moved outside of application changes (e.g. VACUUM operations, or other databases in the same Postgres instance). The inconsistency is that pg_current_snapshot() is emitted as a checkpoint, but it is entirely possible for a new transaction to come in and be evaluated at that same snapshot: pg_current_snapshot() returns the snapshot at which the query is evaluated, and does not consume it when no transaction ID was assigned (e.g. nothing was committed). Consequently, a client could observe a checkpoint at revision T and then a change at that very same revision T, which violates the semantics of the checkpoint as a high watermark (the client must not observe pop-in after the checkpoint).

This commit fixes that because the returned checkpoint is now part of an actual transaction, so no pop-in is possible. One phenomenon to bear in mind is that a client might call the Watch API with a revision obtained from a previous call to HeadRevision. If that revision was never committed, creating the new synthetic revision will consume it. This is fine, but the code must ensure that checkpoint is not emitted via the Watch API, because the API contract is that only revisions above the provided argument revision are emitted.
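A minimal sketch of that guard, with a simplified revision type standing in for the real Postgres revision (which carries a full snapshot, not a bare XID):

```go
package main

import "fmt"

// revision is a simplified stand-in for the datastore's Postgres revision;
// the real type carries a full snapshot, not just an XID.
type revision struct{ xid uint64 }

// shouldEmitCheckpoint mirrors the contract described above: the Watch API
// must only emit revisions strictly greater than the caller's argument
// revision, even when creating the synthetic checkpoint consumed that
// very XID.
func shouldEmitCheckpoint(checkpoint, after revision) bool {
	return checkpoint.xid > after.xid
}

func main() {
	after := revision{xid: 100}
	fmt.Println(shouldEmitCheckpoint(revision{xid: 100}, after)) // false: same revision, suppressed
	fmt.Println(shouldEmitCheckpoint(revision{xid: 101}, after)) // true: strictly newer
}
```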

Changes to OptimizedRevision

This PR also changes how Postgres optimized revisions are computed, so that revisions can be quantized even when no application writes are happening and something else in the Postgres instance is causing the current snapshot to move (e.g. VACUUM or non-SpiceDB workloads).

The changes address the impact of commit 3f22176, which made the Postgres Watch API generate artificial new transaction IDs as a checkpoint heartbeat. The fix in this PR mitigates the effect that has on the cache, which would otherwise thrash because a new optimized revision is selected every time one is computed. It also fixes cache thrashing caused by sharing a Postgres database between multiple SpiceDB instances, or with unrelated workloads.

@github-actions github-actions bot added the area/datastore Affects the storage system label Feb 20, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from 947e9ef to 7c2054d Compare February 21, 2025 16:16
@github-actions github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Feb 21, 2025
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch 2 times, most recently from b0d05a2 to 28ab8fb Compare February 21, 2025 17:47
@vroldanbet vroldanbet marked this pull request as ready for review February 24, 2025 16:18
@vroldanbet vroldanbet requested a review from a team as a code owner February 24, 2025 16:18
@vroldanbet vroldanbet self-assigned this Feb 24, 2025
This commit changes how Postgres optimized revisions are computed so that revisions can be quantized even when no application writes are happening and something else in the Postgres instance is causing the current snapshot to move (e.g. VACUUM or non-SpiceDB workloads).

This commit addresses the side effects of commit 3f22176, where we made the PG Watch API generate artificial new transaction IDs as a checkpoint heartbeat. The fix in this commit mitigates the cache thrashing that change would otherwise have caused.

It also fixes cache thrashing caused by sharing a Postgres database between multiple SpiceDB instances, or with other unrelated workloads.

The query does the same thing as before, but now COALESCEs between the selected optimized revision and the latest revision available in the datastore. This means that, as time passes with no writes, the same last revision keeps being selected.

The LEFT JOIN is added so that COALESCE does not fail when the "optimized" CTE returns an empty result: it needs a NULL value to fall back from.
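A hypothetical sketch of the query shape being described; the table name, the window size, and the CTE layout are illustrative assumptions, not the PR's actual SQL:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative shape of the described query; the table name and the 5-second
// window are assumptions. The "optimized" CTE may return zero rows, so it is
// LEFT JOINed in: the resulting NULL lets COALESCE fall back to the latest
// revision in the datastore, keeping the selection stable when no writes occur.
const optimizedRevisionSQL = `
WITH optimized AS (
    SELECT xid FROM relation_tuple_transaction
    WHERE timestamp <= now() - interval '5 seconds'  -- quantization cutoff (assumed)
    ORDER BY xid DESC
    LIMIT 1
), head AS (
    SELECT MAX(xid) AS xid FROM relation_tuple_transaction
)
SELECT COALESCE(optimized.xid, head.xid)
FROM head LEFT JOIN optimized ON true;
`

// hasClause reports whether the sketch contains a given SQL fragment.
func hasClause(c string) bool { return strings.Contains(optimizedRevisionSQL, c) }

func main() {
	fmt.Print(optimizedRevisionSQL)
}
```

Without the LEFT JOIN, an empty "optimized" CTE would yield zero result rows overall, rather than a NULL for COALESCE to skip over.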
@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from cee5aec to f2f2a11 Compare February 24, 2025 16:27
The new logic fails to select an optimized revision that correctly captures all overlapping snapshots within a quantization window, causing it to miss updates if no newer writes have taken place.
This fixes the failing test case in the previous commit, which demonstrated that the new query to determine the optimized revision could not correctly compute the revision when concurrent snapshots existed in the same quantization window.

This fix changes the query so:
- if an optimized revision exists, it is returned
- if it does not, then all the revisions
  from the last quantization window are returned

Now optimizedRevisionFunc needs to handle both scenarios:
- if an optimized revision was found, a single row is returned, and the revision is returned as is
- if an optimized revision was not found, we will get one or more rows; in either case, we create a synthetic revision that captures all the XIDs in the quantization window

The tradeoff is that multiple rows are returned. This should only happen when no writes take place and the quantization window has passed. In that case, it seems reasonable to trade returning multiple transaction rows for a deterministic optimized revision that improves cache hit ratios.
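The two scenarios above can be sketched as follows, with a max over XIDs standing in for the real code's full snapshot construction (the types and names here are illustrative, not the PR's):

```go
package main

import "fmt"

// txRow is a simplified stand-in for a row returned by the optimized-revision
// query: either the single pre-existing optimized revision, or one row per
// transaction in the last quantization window.
type txRow struct {
	xid       uint64
	optimized bool
}

// optimizedRevision sketches the two scenarios: a single optimized row is
// returned as-is, while one or more window rows are collapsed into a
// synthetic revision covering every XID seen. (The real code builds a full
// snapshot; taking the max XID stands in for that here.)
func optimizedRevision(rows []txRow) uint64 {
	if len(rows) == 1 && rows[0].optimized {
		return rows[0].xid
	}
	var maxXID uint64
	for _, r := range rows {
		if r.xid > maxXID {
			maxXID = r.xid
		}
	}
	return maxXID
}

func main() {
	fmt.Println(optimizedRevision([]txRow{{xid: 42, optimized: true}}))      // single optimized row: 42
	fmt.Println(optimizedRevision([]txRow{{xid: 40}, {xid: 43}, {xid: 41}})) // window rows collapsed: 43
}
```

Because the synthetic revision is derived deterministically from the window's contents, repeated computations select the same revision, which is what restores cache hit ratios.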
@github-actions github-actions bot added the area/dependencies Affects dependencies label Feb 25, 2025
},
{
"QuantizedRecentWithFollowerReadDelay",
500 * time.Millisecond,
2 * time.Second,
[]time.Duration{-4 * time.Second, -2 * time.Second, 0},
1, 2,
2, 1,
Contributor

Is this because -2s lands exactly on a 500ms window? Maybe we can update this test to be clearer?

Contributor Author

It's because there was a bug in the original test: it wasn't calling pgSnapshot.complete() like the actual optimizeRevisionFunc code was doing. So even though it was returning the revision at -2 as the optimized one, it wasn't observing its own transaction, and that led to only observing 1 revision. The test was wrong from the beginning. The other test cases were affected as well.

@vroldanbet vroldanbet force-pushed the continuous-checkpointing-pg branch from 558c102 to 2bc85c9 Compare February 25, 2025 15:56
@vroldanbet vroldanbet marked this pull request as draft February 26, 2025 12:17
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
vroldanbet added a commit that referenced this pull request Feb 26, 2025
@vroldanbet
Contributor Author

Closed in favor of #2252

@vroldanbet vroldanbet closed this Feb 26, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Feb 26, 2025