Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-postgres-batch: Add /_meta/row_id and /_meta/op #2450

Merged
merged 6 commits into from
Feb 27, 2025

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Feb 25, 2025

Description:

This PR implements #2382 for source-postgres-batch:

  • Adds /_meta/row_id and /_meta/op properties to all documents, plus logic to fill them out as appropriate.
  • Adds a feature flag keyless_row_id which when set causes tables without a PK in the source DB to be discovered with collection key [/_meta/row_id] instead of the old [/_meta/polled, /_meta/index] behavior.
  • Adds special case logic for inferring deletions when a collection's key is row_id and it's a full-refresh binding.
  • Adds tests exercising the new logic and both states of the keyless_row_id feature flag.

Notes:

  • In the course of making this change I opted to change the document sanitization logic so that /_meta/index is usually untouched, and only tests which run a capture in parallel with DB writes now perform that sanitization (since the breakdown of changes into polling intervals isn't consistent for those tests but is for any others). This causes a bunch of test snapshot churn, but it makes it slightly easier to verify correct operation.

This change is Reviewable

@willdonnelly willdonnelly added the change:planned This is a planned change label Feb 25, 2025
@willdonnelly willdonnelly requested a review from a team February 25, 2025 16:53
@willdonnelly willdonnelly force-pushed the wgd/batch-postgres-row-id-20250224 branch from 60f5377 to 18da326 Compare February 25, 2025 16:58
Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This commit adds the properties `/_meta/row_id` and `/_meta/op`
to the document metadata for this capture. The Row ID property
will be present on all documents going forward and will count
up from zero at the start of an initial backfill (that is, a
polling cycle with no prior cursor state to resume from).

(As a special case, preexisting captures will also count up
from zero starting with the first polling cycle after this
change goes live. This should be fine, as preexisting captures
also can't be using the Row ID already).

Currently the `/_meta/op` property will only be set on bindings
whose output collection uses `/_meta/row_id` as the key. When
the collection key is the row ID, a create/update event type
can be inferred for each document captured and deletions can
be inferred whenever a full-refresh binding receives fewer rows
in the latest polling cycle compared to the previous one.
When true, keyless tables will be discovered with collection key
`[/_meta/row_id]` instead of the old fallback `[polled, index]`.
Includes a test which explicitly sets `keyless_row_id` to on and
off for a keyless full-refresh table in order to demonstrate the
difference in behavior.
@willdonnelly willdonnelly force-pushed the wgd/batch-postgres-row-id-20250224 branch from 18da326 to 24808e2 Compare February 26, 2025 23:24
@willdonnelly willdonnelly merged commit e3a5c56 into main Feb 27, 2025
53 of 56 checks passed
@willdonnelly willdonnelly deleted the wgd/batch-postgres-row-id-20250224 branch February 27, 2025 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants