make postgres a continuously checkpointing datastore #2247
The postgres datastore does not offer a continuously checkpointing Watch API, because postgres revisions are not time-based. In the other databases, any timestamp represents a fully fledged revision, while in the postgres datastore a timestamp does not necessarily represent a transaction. This is only relevant for consumers of the Watch API, and more particularly consumers of the datastore Watch API, which exposes the concept of Watch Checkpoints. Consumers of the postgres Watch API don't observe the passing of time, which matters for any system that needs to report lag as a health metric.

This change introduces a new way to report checkpoints in postgres watch by running an actual transaction. When no new application revisions are detected, a method is invoked to create a new synthetic transaction, which is not written to the transactions table; it is just returned as the new checkpoint. Because we know there were no new revisions at the time the query was evaluated, we can return a new revision. This is no different from what `HeadRevision` does; it is in fact the same, except that here the transaction is actually assigned an ID. The new logic leverages the `pg_current_xact_id()` function, which assigns a transaction ID to the currently running transaction, even if it is not committed. We compute the transaction ID, the snapshot, and the timestamp all within the same transaction that determines whether there are new revisions, to make sure there isn't a race. The operation runs only if no new application revisions were found.

One concern with this approach is consumption of the XID8 space. XID8 is a 64-bit ID space, so it would take 584.5 years to exhaust even if we could generate 1 XID per nanosecond, which is totally unrealistic. The expectation is that the Watch API polls at most once per second.
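To make the polling behavior concrete, here is a minimal Go sketch. All names here are hypothetical stand-ins for illustration, not SpiceDB's actual types; the real datastore performs this inside a single Postgres transaction using `pg_current_xact_id()` rather than an in-process counter.

```go
package main

import "fmt"

// revision is a simplified stand-in for a Postgres datastore revision:
// a transaction ID plus the snapshot visible when it was created.
type revision struct {
	xid      uint64
	snapshot uint64
}

// checkpointer simulates the approach described above: when a poll finds
// no new application revisions, open a transaction, let Postgres assign
// it an XID, and return that as a synthetic checkpoint without writing
// a row to the transactions table.
type checkpointer struct {
	nextXID uint64 // stands in for Postgres's monotonically increasing XID8 counter
}

func (c *checkpointer) poll(newAppRevisions []revision) revision {
	if n := len(newAppRevisions); n > 0 {
		// Real writes happened: the latest one is the checkpoint.
		return newAppRevisions[n-1]
	}
	// No writes: consume one XID for the synthetic checkpoint, mirroring
	// how pg_current_xact_id() assigns an ID to the running transaction.
	c.nextXID++
	return revision{xid: c.nextXID, snapshot: c.nextXID}
}

func main() {
	c := &checkpointer{nextXID: 100}
	fmt.Println(c.poll(nil))                                   // synthetic checkpoint, consumes an XID
	fmt.Println(c.poll([]revision{{xid: 200, snapshot: 200}})) // real revision wins
}
```

Note how the synthetic path consumes one XID per idle poll, which is why the XID8-exhaustion arithmetic above matters.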
The other concern is that this makes `HeadRevision` move continuously, which impacts selection of the optimized revision, which in turn keeps moving and hurts cache hit rates. This will be addressed by changing how `OptimizedRevision` is computed, making sure it quantizes only application revisions, and not any out-of-band transaction generated in Postgres. This was in fact already identified as an issue for SpiceDB deployments where other workloads run on the same database, causing `HeadRevision` to move continuously.

Finally, this change fixes a latent bug identified while reasoning about all the different pieces in place to support the postgres datastore Watch API. In commit 74dc7b2 a change was introduced to address the lack of checkpoints when the postgres datastore moved its `pg_current_snapshot()` outside of application changes (e.g. VACUUM operations, or other local databases in the same postgres instance). The inconsistency identified is that `pg_current_snapshot()` is emitted as a checkpoint, but it's entirely possible for a new transaction to come in and be evaluated at that same snapshot, because `pg_current_snapshot()` returns _the snapshot at which the query is evaluated_, and evaluating it does not consume an ID if no transaction ID was assigned (e.g. there was no commit involved). As a consequence, a client could observe one checkpoint at revision T, and then a change that comes after it at the exact same revision T, which violates the semantics of the checkpoint acting as a high watermark (the client must not observe pop-in after the checkpoint). This commit fixes that, because the returned checkpoint is now part of an actual transaction, so no pop-in is possible.

One phenomenon to bear in mind is that a client might call the Watch API with a revision obtained from a previous call to `HeadRevision`. If that revision was never committed, then creating the new synthetic revision will actually consume it. This is fine, but we need to make sure that checkpoint is not emitted via the Watch API, because the client expects only revisions above the provided argument revision to be emitted.
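The guard described here boils down to a strict comparison. A minimal sketch, with a hypothetical helper name and bare integers in place of full postgres revisions:

```go
package main

import "fmt"

// shouldEmitCheckpoint reports whether a checkpoint may be sent to a
// watcher that started at `start`. If the synthetic transaction consumed
// the very revision the client passed in, the strict comparison fails and
// the checkpoint is suppressed, preserving the contract that only
// revisions above the argument revision are emitted.
func shouldEmitCheckpoint(start, checkpoint uint64) bool {
	return checkpoint > start
}

func main() {
	fmt.Println(shouldEmitCheckpoint(42, 42)) // false: same revision, suppress
	fmt.Println(shouldEmitCheckpoint(42, 43)) // true: strictly above, emit
}
```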
this commit changes how postgres optimized revisions are computed, so that revisions can be quantized even when no application writes are happening and something else in the postgres instance is causing the current snapshot to move (e.g. VACUUM or non-SpiceDB workloads). It addresses the side effects of commit 3f22176, where we made the PG Watch API generate artificial new transaction IDs as a checkpoint heartbeat; the fix mitigates the cache thrashing that would have caused. It also fixes cache thrashing caused by sharing a postgres database between multiple SpiceDB instances, or with other unrelated workloads. The query does the same thing as before, but now COALESCEs between the optimized revision selected and the latest revision available in the datastore. This means that as time passes, the same last revision will always be selected. The LEFT JOIN is added so that COALESCE does not fail with an empty result from the "optimized" CTE; it needs a null result.
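In Go terms, the COALESCE amounts to a null-aware fallback. A sketch with hypothetical names (the real logic lives in SQL, where the missing row arrives as NULL via the LEFT JOIN):

```go
package main

import "fmt"

// pickRevision mirrors COALESCE(optimized, latest): prefer the revision
// selected by the quantization window; when the window produced no row
// (nil here, NULL in SQL), fall back to the latest revision in the
// datastore, so the same revision keeps being chosen as time passes and
// cache keys stay stable.
func pickRevision(optimized *uint64, latest uint64) uint64 {
	if optimized != nil {
		return *optimized
	}
	return latest
}

func main() {
	opt := uint64(17)
	fmt.Println(pickRevision(&opt, 25)) // 17: the window produced a revision
	fmt.Println(pickRevision(nil, 25))  // 25: empty window, fall back to latest
}
```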
The new logic fails to select an optimized revision that correctly captures all overlapping snapshots within a quantization window, causing it to miss updates when no newer writes have taken place.
This fixes the failing test case in the previous commit, which demonstrated that the new query to determine the optimized revision could not correctly compute the revision if concurrent snapshots existed in the same quantization window. The fix changes the query so that:

- if an optimized revision exists, it is returned
- if it does not, all the revisions from the last quantization window are returned

Now the optimizedRevisionFunc needs to handle both scenarios:

- if an optimized revision was found, one row is returned, and the revision is returned as is
- if an optimized revision was not found, we get one or more rows, and in either case we create a synthetic revision that captures all the XIDs in the quantization window

The tradeoff is that multiple rows may be returned. This should only happen when no writes take place and the quantization window has passed. In that case, it seems reasonable to trade returning multiple transaction rows for a deterministic optimized revision that improves cache hit ratios.
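A rough sketch of the two-scenario handling, using bare XIDs instead of the full snapshot values the real code folds together:

```go
package main

import "fmt"

// reduceRows models the behavior described above: one row means the
// optimized revision was found and is used as is; multiple rows mean we
// got every revision from the last quantization window, and we fold them
// into one synthetic revision that sees all of their XIDs (here, simply
// the maximum; the real code builds a snapshot covering all of them).
func reduceRows(xids []uint64) uint64 {
	max := xids[0]
	for _, x := range xids[1:] {
		if x > max {
			max = x
		}
	}
	return max
}

func main() {
	fmt.Println(reduceRows([]uint64{9}))       // single row: returned as is
	fmt.Println(reduceRows([]uint64{7, 9, 8})) // synthetic revision covering the window
}
```

Because the fold is deterministic for a given window, repeated computations land on the same revision, which is what improves the cache hit ratio.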
```go
},
{
	"QuantizedRecentWithFollowerReadDelay",
	500 * time.Millisecond,
	2 * time.Second,
	[]time.Duration{-4 * time.Second, -2 * time.Second, 0},
	1, 2,
	2, 1,
```
Is this because `-2s` lands exactly on a `500ms` window? Maybe we can update this test to be clearer?
It's because there was a bug in the original test: it wasn't calling `pgSnapshot.complete()` like the actual `optimizeRevisionFunc` code was doing. So even though it was returning the revision at `-2` as the optimized one, it wasn't observing its own transaction, and that led to only observing `1` revision. The test was wrong from the beginning. The other test cases were affected as well.
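The visibility rule at play here can be sketched with a simplified MVCC snapshot model. This is an illustration only: the datastore's `pgSnapshot` tracks xmin/xmax and in-progress XIDs along these lines, and its `complete()` marks an XID as committed so the snapshot observes its own transaction.

```go
package main

import "fmt"

// snapshot is a simplified MVCC snapshot: XIDs below xmin are visible,
// XIDs at or above xmax are not, and XIDs in between are visible unless
// still listed as in progress.
type snapshot struct {
	xmin, xmax uint64
	inProgress map[uint64]bool
}

func (s snapshot) visible(xid uint64) bool {
	if xid < s.xmin {
		return true
	}
	if xid >= s.xmax {
		return false
	}
	return !s.inProgress[xid]
}

// complete returns a copy of the snapshot that treats xid as committed,
// loosely mirroring what marking a transaction complete does: the
// snapshot now observes that transaction even if it was in progress.
func (s snapshot) complete(xid uint64) snapshot {
	out := snapshot{xmin: s.xmin, xmax: s.xmax, inProgress: map[uint64]bool{}}
	for k, v := range s.inProgress {
		if k != xid {
			out.inProgress[k] = v
		}
	}
	if xid >= out.xmax {
		out.xmax = xid + 1
	}
	return out
}

func main() {
	s := snapshot{xmin: 5, xmax: 10, inProgress: map[uint64]bool{8: true}}
	fmt.Println(s.visible(8))             // false: still in progress
	fmt.Println(s.complete(8).visible(8)) // true: observed after completion
}
```

Without the `complete()` call, the test's snapshot could not see its own transaction, which is exactly why it counted one revision too few.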
alternative implementation to #2247
Closed in favor of #2252
Context
Postgres datastore does not offer a continuous checkpointing Watch API because Postgres revisions are not time-based.
In the other databases, any timestamp represents a fully fledged revision, while in the Postgres datastore, a timestamp does not necessarily represent a transaction.
This is only relevant for consumers of the Watch API, and more particularly, consumers of the datastore Watch API, which exposes the concept of Watch Checkpoints. Consumers of the Postgres Watch API don't observe the passing of time, and this is important for any system that needs to report lag as a health metric.
The Fix
This PR introduces a new way to report checkpoints in Postgres `Watch` by running an actual transaction. When no new application revisions are detected, a method will be invoked to create a new synthetic transaction, which is not written to the transactions table; it's just returned as the new checkpoint. Because we know there were no new revisions at the time of the evaluation of the query, we can return a new revision. This is not different from what `HeadRevision` does: it is the same, but the difference is that the transaction is assigned an ID.

The new logic leverages the `pg_current_xact_id()` function, which assigns a transaction ID to the current running transaction, even if not committed. We compute the transaction XID, the snapshot, and the timestamp, all within the same transaction that determines if there are new revisions, to make sure there isn't a race. The operation will be run only if no new application revisions are determined.

The concern of this approach is the consumption of the XID8 space. XID8 is a 64-bit ID space, and thus it would take 584.5 years to consume the space if we were able to generate 1 XID per nanosecond, which is totally unrealistic. The expectation is that the watch API poll interval happens at a maximum of 1 poll per second.
The other concern is that this makes `HeadRevision` move continuously, which impacts the selection of the optimized revision, which keeps moving continuously and affects cache hit rates. This was also identified as an issue for SpiceDB deployments where other workloads run in the same Postgres instance, causing the `HeadRevision` to move continuously. The fix for this problem is described in detail in the next section.

Finally, this change fixes a latent bug identified while reasoning about all the different pieces in place to support the Postgres datastore watch API. In commit 74dc7b2 a change was introduced to address the lack of checkpoints when the Postgres datastore moved its `pg_current_snapshot()` outside of application changes (e.g. `VACUUM` operations, or other local databases in the same Postgres instance). The inconsistency identified is that `pg_current_snapshot()` will be emitted as a checkpoint, but it's entirely possible for a new transaction to come in and be evaluated at that snapshot, because `pg_current_snapshot()` returns _the snapshot at which the query is evaluated_, and evaluating it does not consume an ID if no transaction ID was assigned (e.g. there was no commit involved). Consequently, it's possible to observe one checkpoint at revision T and then a change that comes after it at the same revision T, which violates the semantics of the checkpoint acting as a high watermark (the client must not observe pop-in after the checkpoint).

This commit fixes it because the returned checkpoint is now part of an actual transaction, so no pop-in is possible. One phenomenon to remember is that a client might call the Watch API with a revision obtained from a previous call to `HeadRevision`. If that revision was never committed, then the creation of the new revision will consume it. This is fine, but the code needs to ensure the checkpoint is not emitted via the `Watch` API, because the API contract is that only revisions above the provided argument revision will be emitted.

Changes to OptimizedRevision
This PR also changes how Postgres optimized revisions are computed, so that revisions can be quantized even when no application writes are happening and something else in the Postgres instance is causing the current snapshot to move (e.g., VACUUM or non-SpiceDB workloads).

The changes address the impact of commit 3f22176, where we made the PG watch API generate artificial new transaction IDs as a checkpoint heartbeat. The fix in this PR also mitigates the impact this has on the cache, which would otherwise cause cache thrashing because a new optimized revision would be selected each time it is computed. It also fixes cache thrashing caused by sharing a Postgres database between multiple SpiceDB instances or sharing with unrelated workloads.