
Compute release 2025-02-14 #10821

Closed · wants to merge 74 commits
Conversation


@vipvap vipvap commented Feb 14, 2025

Compute release 2025-02-14

Please merge this Pull Request using the 'Create a merge commit' button

alexanderlaw and others added 30 commits February 7, 2025 08:56
We need to disable the detection of memory leaks when running
`neon_local init` for builds with sanitizers, to avoid an error thrown by
AddressSanitizer.
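A minimal sketch of how this can be done, assuming the sanitized build respects the standard `ASAN_OPTIONS` environment variable (the exact place this is set in the repository may differ):

```
use std::process::Command;

fn main() {
    // Hypothetical sketch: disable leak detection for `neon_local init` when the
    // binary was built with sanitizers. `detect_leaks=0` is the standard
    // AddressSanitizer/LeakSanitizer switch for this.
    let status = Command::new("neon_local")
        .arg("init")
        .env("ASAN_OPTIONS", "detect_leaks=0")
        .status()
        .expect("failed to run neon_local init");
    assert!(status.success());
}
```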
## Problem

The RFC pointed out the need for reconciliation, but did not detail how it
can be done.

## Summary of changes

Add these details.
## Problem
Ref: #10632

We use the DNS name `*.localtest.me` in our tests. That domain is well known
and widely used for this purpose: all of its records resolve to localhost,
both IPv4 and IPv6 (`127.0.0.1` and `::1`).

In some cases these addresses resolve only to IPv6 on our runners, so
components fail to connect when the runner doesn't have an IPv6 address.
We suspect an issue in systemd-resolved here (systemd/systemd#17745).
To work around that and improve test stability, we introduced our own
domain `*.local.neon.build` with the IPv4 address `127.0.0.1` only.

See full details and the troubleshooting log in the referenced issue.

P.S.
If you're a FritzBox user, don't forget to add the domain
`local.neon.build` to the `DNS Rebind Protection` section under `Home
Network -> Network -> Network Settings`, otherwise FritzBox will block
names that resolve to local addresses.
For other devices/vendors, please check the corresponding documentation
if resolving `local.neon.build` produces an empty answer for you.

## Summary of changes
Replace all the occurrences of `localtest.me` with `local.neon.build`
## Problem

This is tech debt. When we introduced generations for tenants, some
legacy situations without generations still needed to delete things inline
(an async operation) instead of enqueuing them (a sync operation).

## Summary of changes

- Remove the async code, replace calls with the sync variant, and assert
that the generation is always set
## Problem

The endpoint kept running while the timeline was deleted, causing
'forbidden' warnings on the pageserver when the tenant was not found.

## Summary of changes

- Explicitly stop the endpoint before the end of the test, so that it
isn't trying to talk to the pageserver in the background while things
are torn down
Make timeline deletion print the sub-steps, so that we can narrow down
some stuck timeline deletion issues we are observing.

https://neondb.slack.com/archives/C08C2G15M6U/p1738930694716009
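A minimal sketch of the kind of step-by-step logging meant here, using `tracing`; the step names are illustrative, not the actual sub-steps of the deletion code path:

```
use tracing::info;

// Hypothetical sketch: log each phase so a stuck deletion can be narrowed down
// to the phase it is hanging in.
async fn delete_timeline_with_progress_logging() {
    info!("timeline deletion: stopping background tasks");
    // ... stop flush and compaction loops ...
    info!("timeline deletion: deleting remote index and layer files");
    // ... remove objects from remote storage ...
    info!("timeline deletion: removing local layer files");
    // ... remove the local timeline directory ...
    info!("timeline deletion: done");
}
```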
## Problem

L0 compaction can get starved by other background tasks. It needs to be
responsive to avoid read amp blowing up during heavy write workloads.

Touches #10694.

## Summary of changes

Add a separate semaphore for compaction, configurable via
`use_compaction_semaphore` (disabled by default). This is primarily for
testing in staging; it needs further work (in particular to split
image/L0 compaction jobs) before it can be enabled.
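A rough sketch of what a dedicated compaction semaphore can look like, using a tokio `Semaphore` gated by the config flag; apart from the flag name taken from the text above, the types, permit count, and structure are illustrative:

```
use once_cell::sync::Lazy;
use tokio::sync::Semaphore;

// Hypothetical sketch: a semaphore reserved for compaction, separate from the
// shared background-task semaphore, so L0 compaction cannot be starved by
// other background work.
static COMPACTION_SEMAPHORE: Lazy<Semaphore> = Lazy::new(|| Semaphore::new(4));

async fn compaction_iteration(use_compaction_semaphore: bool) {
    let _permit = if use_compaction_semaphore {
        Some(
            COMPACTION_SEMAPHORE
                .acquire()
                .await
                .expect("semaphore is never closed"),
        )
    } else {
        None // fall back to the shared background-task permit (not shown)
    };
    // ... run one round of compaction while holding the permit ...
}
```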
## Problem

There are a couple of log warnings tripping up
`test_timeline_archival_chaos`

- `stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker` ([test report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021))
- `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3`

Related: #10389

## Summary of changes

- Downgrade the 'ignoring attempt to restart' to info -- there's nothing
in the design that forbids this happening, i.e. someone calling
maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks from outliving tenants by carrying a
gate guard (see the sketch below). This logically makes sense because the
deletion process does call into Tenant to update manifests.
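A rough sketch of the 'carry a gate guard' idea; the `Gate` here is a simplified stand-in built on an mpsc channel, not the utility type the pageserver actually uses, and is shown only to illustrate why the deletion task cannot outlive the tenant:

```
use tokio::sync::mpsc;

// Hypothetical, simplified stand-in: each spawned task holds a guard (a cloned
// Sender); tenant shutdown waits until every guard has been dropped.
struct Gate {
    guard_tx: mpsc::Sender<()>,
    guard_rx: mpsc::Receiver<()>,
}

impl Gate {
    fn new() -> Self {
        let (guard_tx, guard_rx) = mpsc::channel(1);
        Self { guard_tx, guard_rx }
    }

    fn enter(&self) -> mpsc::Sender<()> {
        self.guard_tx.clone()
    }

    async fn close(mut self) {
        drop(self.guard_tx);
        // recv() returns None only once every outstanding guard is dropped.
        let _ = self.guard_rx.recv().await;
    }
}

fn spawn_timeline_deletion(gate: &Gate) {
    let guard = gate.enter();
    tokio::spawn(async move {
        let _guard = guard; // held for the lifetime of the deletion task
        // ... delete the timeline, including Tenant manifest updates ...
    });
}
```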
I plan to use this when launching a fast_import job in a VM. There's
currently no good way for an executable running in a NeonVM to exit
gracefully and have the VM shut down. The inittab we use always respawns
the payload command. The idea is that the control plane can use
"fast_import ... && poweroff" as the command, so that when fast_import
completes successfully, the VM is terminated, and the k8s Pod and
VirtualMachine object are marked as completed successfully.

I'm working on bigger changes to how we launch VMs, and will try to come
up with a nicer system for that, but in the meanwhile, this quick hack
allows us to proceed with using VMs for one-off jobs like fast_import.
…10719)

## Problem

This test would sometimes fail its assertion that a timeline does not
revert to active once archived. That's because it was using the
in-memory offload state, not the persistent state, so this was sometimes
lost across a pageserver restart.

Closes: #10389

## Summary of changes

- When reading offload status, read from pageserver API _and_ remote
storage before considering the timeline offloaded
We've been seeing some regressions in staging since the AWS SDK updates:
#10695 . We aren't sure the
regression was caused by the SDK update, but the issues do involve S3,
so it's not unlikely. By reverting the SDK update we find out whether it
was really the SDK update, or something else.

Reverts the two PRs:

* #10588
* #10699

https://neondb.slack.com/archives/C08C2G15M6U/p1738576986047179
…_id` label (#10680)

# Problem

Before this PR, the `shard_id` field was missing when page_service logs
a reconstruct error.

This was caused by batching-related refactorings.

Example from staging:

```
2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key  ...
```

# Changes

Delay creation of the handler-specific span until after shard routing

This also avoids the need for the record() call in the pagestream hot
path.
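A rough sketch of the 'create the span after shard routing' change, using `tracing`; the span name, field names, and function signature are illustrative rather than the exact ones in page_service:

```
use tracing::{info_span, Instrument};

// Hypothetical sketch: resolve the shard first, then create the span with the
// shard_id field already populated, instead of record()-ing it later in the
// hot path.
async fn handle_get_page_request(tenant_id: &str, timeline_id: &str, shard_id: &str) {
    let span = info_span!(
        "handle_get_page_at_lsn_request_batched",
        tenant_id = %tenant_id,
        timeline_id = %timeline_id,
        shard_id = %shard_id,
    );
    async {
        // ... read the page; any error logged in here carries shard_id ...
    }
    .instrument(span)
    .await;
}
```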

# Testing

Manual testing with a failpoint that is part of this PR's history but
will be squashed away.


# Refs

- fixes #10599
…ush time of preceding requests (#10728)

Before this PR, if a batch contains N responses, the smgr op latency
reported for response (N-i) would include the time we spent flushing
the preceding requests.

refs:
- fixup of #10042
- fixes #10674
This patch does a bunch of superficial cleanups of `tenant::tasks` to
avoid noise in subsequent PRs. There are no functional changes.

PS: enable "hide whitespace" when reviewing, due to the unindentation of
large async blocks.
# Problem

walredo shutdown is done in the compaction task. Let's move it to tenant
housekeeping.

# Summary of changes

* Rename "ingest housekeeping" to "tenant housekeeping".
* Move walredo shutdown into tenant housekeeping.
* Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x the
compaction threshold); a sketch follows below.
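A minimal sketch of this housekeeping step, assuming a simple last-used timestamp on the walredo manager; the struct, field, and method names are illustrative:

```
use std::time::{Duration, Instant};

// Hypothetical constant mirroring the one described above.
const WALREDO_IDLE_TIMEOUT: Duration = Duration::from_secs(3 * 60);

struct WalRedoManager {
    last_used: Instant,
    process_running: bool,
}

impl WalRedoManager {
    /// Called from the tenant housekeeping loop rather than the compaction task.
    fn maybe_shutdown_idle(&mut self) {
        if self.process_running && self.last_used.elapsed() >= WALREDO_IDLE_TIMEOUT {
            // ... tear down the walredo process; it is respawned on demand ...
            self.process_running = false;
        }
    }
}
```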
## Problem
The upgrade test for pg_jwt does not work correctly.
## Summary of changes
The script for the upgrade test is modified to use the database
`contrib_regression`.
Use expire() op to set TTL for Redis cancellation key
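A rough sketch of setting such a TTL with the `redis` crate's `EXPIRE` command (recent crate versions take the TTL as `i64` seconds); the connection type, key, and TTL value are illustrative:

```
use redis::AsyncCommands;

// Hypothetical sketch: give the cancellation key a TTL so stale entries expire
// on their own instead of lingering in Redis.
async fn set_cancel_key_ttl(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
) -> redis::RedisResult<()> {
    const CANCEL_KEY_TTL_SECS: i64 = 600; // illustrative value
    let _: bool = conn.expire(key, CANCEL_KEY_TTL_SECS).await?;
    Ok(())
}
```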
## Problem

Parameterising the `build-neon` job with `test-cfg` makes it build
exactly the same thing several times.

See
-
https://github.com/neondatabase/neon/blob/874accd6ede7231e5e4e1f562a83862e2286f6cd/.github/workflows/_build-and-test-locally.yml#L51-L52
-
https://github.com/neondatabase/neon/actions/runs/13215068271/job/36893373038

## Summary of changes
- Extract `sanitizers` to a separate input from `test-cfg` and set it
separately
- Don't parametrise `build-neon` with `test-cfg`
## Problem

Reduce the read amplification when doing `repartition`.

## Summary of changes

Compute the L0-L1 boundary LSN and repartition at that LSN.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

helps investigate #10482

## Summary of changes

In debug mode and testing mode, we will record all files visited by a
read operation, and print them out when the read errors.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

Make it possible to dump WAL records in a format recognised by the walredo
process.
Intended usage:

```
pg_waldump -R 1663/5/16396  -B 771727 000000010000000100000034 --save-records=/tmp/walredo.records
postgres --wal-redo < /tmp/walredo.records > /tmp/page.img
```

## Summary of changes

Related Postgres PRs:
neondatabase/postgres#575
neondatabase/postgres#572

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Handle errors from the local proxy by parsing the HTTP response in the auth
broker code

Closes [#19476](neondatabase/cloud#19476)
The compute_id will be used when verifying claims sent by the control
plane.

Signed-off-by: Tristan Partin <tristan@neon.tech>

Signed-off-by: Tristan Partin <tristan@neon.tech>
…0733)

The awscli was downloaded at the last stages of the overall compute
image build, which meant that if you modified any part of the build, it
would trigger a re-download of the awscli. That's a bit annoying when
developing locally and rebuilding the compute image repeatedly. Move it
to a separate layer, to cache separately and to avoid the spurious
rebuilds.
Building the compute Rust binaries from scratch is pretty slow; it takes
between 4 and 15 minutes on my laptop, depending on which compiler flags and
other tricks I use. A cache mount allows caching the dependencies and
incremental builds, which speeds up rebuilding significantly when you
only make a small change in a source file.
## Problem

The compaction loop currently runs periodically, which can cause it to
wait for up to 20 seconds before starting L0 compaction by default.

Also, when we later separate the semaphores for L0 compaction and image
compaction, we want to give up waiting for the image compaction
semaphore if L0 compaction is needed on any timeline.

Touches #10694.

## Summary of changes

Notify the compaction loop when an L0 flush (on any timeline) exceeds
`compaction_threshold`.

Also do some opportunistic cleanups in the area.
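A rough sketch of the notification mechanism, using `tokio::sync::Notify`; the threshold check and the names are illustrative:

```
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

// Hypothetical sketch: the flush path wakes the compaction loop as soon as the
// L0 layer count on any timeline reaches the compaction threshold, instead of
// waiting for the next periodic tick.
struct CompactionWakeup {
    notify: Arc<Notify>,
}

impl CompactionWakeup {
    fn on_l0_flush(&self, l0_count: usize, compaction_threshold: usize) {
        if l0_count >= compaction_threshold {
            self.notify.notify_one();
        }
    }

    async fn compaction_loop(&self, period: Duration) {
        loop {
            // Wake on the periodic tick or on an L0 flush notification,
            // whichever comes first.
            tokio::select! {
                _ = tokio::time::sleep(period) => {}
                _ = self.notify.notified() => {}
            }
            // ... run one compaction iteration ...
        }
    }
}
```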
## Problem

close #10213

`range_search` only returns the top-most layers that may satisfy the
search, so it doesn't include all layers that might be accessed (the
caller needs to call this function recursively). We need to retrieve the
full layer map and find overlaps in order to have a correct heuristic
for the job split.

## Summary of changes

Retrieve all layers and find overlaps instead of doing `range_search`.
The patch also reduces the time holding the layer map read guard.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Since the merge of #10523,
forward compatibility tests have been broken everywhere.

Signed-off-by: Tristan Partin <tristan@neon.tech>
compute_ctl is mostly written in synchronous fashion, intended to run in
a single thread. However various parts had become async, and they
launched their own tokio runtimes to run the async code. For example, VM
monitor ran in its own multi-threaded runtime, and apply_spec_sql()
launched another multi-threaded runtime to run the per-database SQL
commands in parallel. In addition to that, a few places used a
current-thread runtime to run async code in the main thread, or launched
a current-thread runtime in a *different* thread to run background
tasks.

Unify the runtimes so that there is only one tokio runtime. It's created
very early at process startup, and the main thread "enters" the runtime,
so that it's always available for tokio::spawn() and runtime.block_on()
calls. All code that needs to run async code uses the same runtime.

The main thread still mostly runs in a synchronous fashion. When it
needs to run async code, it uses rt.block_on().

Spawn fewer additional threads, prefer to spawn tokio tasks instead.
Convert some code that ran synchronously in background threads into
async. I didn't go all the way, though, some background threads are
still spawned.
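A rough sketch of the pattern described above, with an otherwise synchronous `main` that creates and enters a single shared runtime at startup; the structure is illustrative, not compute_ctl's actual startup code:

```
use tokio::runtime::Runtime;

fn main() -> std::io::Result<()> {
    // One runtime for the whole process, created very early at startup.
    let runtime = Runtime::new()?;
    // Entering the runtime makes tokio::spawn() usable from this synchronous
    // main thread.
    let _rt_guard = runtime.enter();

    // Background work becomes tasks on the shared runtime, not extra runtimes
    // or threads.
    tokio::spawn(async {
        // ... e.g. the VM monitor ...
    });

    // The main thread stays synchronous and uses block_on() whenever it needs
    // to run async code inline.
    runtime.block_on(async {
        // ... e.g. apply per-database SQL statements in parallel ...
    });

    Ok(())
}
```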
…turned value (#10273)

## Problem

/database_schema endpoint returns incomplete output from `pg_dump`

## Summary of changes

The Tokio child process was not handled properly. The returned stream does not
include `process::Child`, and the process is scheduled to be killed
immediately after the `get_database_schema` call, when `cmd` goes out of
scope.

The solution in this PR is to return a special Stream implementation
that retains `process::Child`.
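A rough sketch of the fix described above: a stream type that owns the `tokio::process::Child`, so the process stays alive for as long as the caller reads the dump. The type and helper names are illustrative:

```
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Bytes;
use futures::Stream;
use tokio::process::{Child, ChildStdout, Command};
use tokio_util::io::ReaderStream;

// Hypothetical sketch: keep the Child inside the returned stream so it is not
// killed as soon as the spawning function returns.
struct SchemaDumpStream {
    // Held only to keep the process alive for the lifetime of the stream.
    _child: Child,
    stdout: ReaderStream<ChildStdout>,
}

impl Stream for SchemaDumpStream {
    type Item = std::io::Result<Bytes>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        Pin::new(&mut self.get_mut().stdout).poll_next(cx)
    }
}

fn spawn_pg_dump(database_url: &str) -> std::io::Result<SchemaDumpStream> {
    let mut child = Command::new("pg_dump")
        .arg("--schema-only")
        .arg(database_url)
        .stdout(std::process::Stdio::piped())
        .kill_on_drop(true)
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    Ok(SchemaDumpStream {
        _child: child,
        stdout: ReaderStream::new(stdout),
    })
}
```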
jcsp and others added 10 commits February 13, 2025 16:23
## Problem

This test occasionally fails while the test teardown tries to do a
graceful shutdown, because the test has quickly written lots of data
into the pageserver.

Closes: #10654 

## Summary of changes

- Call `post_checks` at the end of `test_isolation`, as we already do
for test_pg_regress -- this improves our detection of issues, and as a
nice side effect flushes the pageserver.
- Ignore pg_notify files when validating state at end of test, these are
not expected to be the same
pg_search is 46ish MB. All other remote extensions are around hundreds of
KB. 3 seconds is not long enough to download the tarball if the S3
gateway cache doesn't already contain a copy. According to our setup,
the cache is limited to 10 GB in size and anything that has not been
accessed for an hour is purged.

This is really bad for scaling to 0, even more so if you're the only
project actively using the extension in a production Kubernetes cluster.

Signed-off-by: Tristan Partin <tristan@neon.tech>
)

There is now a compute_ctl_config field in the response that currently
only contains a JSON Web Key set. compute_ctl currently doesn't do
anything with the keys, but will in the future.

The reason for the new field is the nature of empty computes.
When an empty compute is created, it does not have a tenant. A compute
spec is the primary means of communicating the details of an attached
tenant. In the empty compute state, there is no spec. Instead we wait
for the control plane to pass us one via /configure. If we were to
include the jwks field in the compute spec, we would have a partial
compute spec, which doesn't logically make sense.

Instead, we can have two means of passing settings to the compute:

- spec: tenant specific config details
- compute_ctl_config: compute specific settings

For instance, the JSON Web Key set passed to the compute is independent
of any tenant. It is a setting of the compute whether it is attached or
not.
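A rough sketch of the response shape described above, using `serde`; the exact field and type names in the control plane API may differ:

```
use serde::Deserialize;

// Hypothetical sketch of the two-part response: a tenant-specific spec (absent
// for an empty compute) plus compute-specific settings that are always present.
#[derive(Deserialize)]
struct ControlPlaneSpecResponse {
    /// Tenant-specific configuration; `None` for an empty compute that has no
    /// tenant attached yet.
    spec: Option<serde_json::Value>,
    /// Compute-specific settings, independent of any tenant.
    compute_ctl_config: ComputeCtlConfig,
}

#[derive(Deserialize)]
struct ComputeCtlConfig {
    /// JSON Web Key set used to verify claims sent by the control plane.
    jwks: serde_json::Value,
}
```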

Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem

The test is flaky: WAL in remote storage appears to be corrupted. One of the
hypotheses so far is that the corruption is the result of the local fs
implementation being non-atomic, and that safekeepers may concurrently PUT
the same segment. That's dubious, though, because looking at the local_fs
impl I'd expect an early EOF on segment read rather than the zeros observed
in test failures, but other directions seem even less probable.

## Summary of changes

Let's add the s3 backend as well and see if it is also flaky. Also add some
more logging around segment uploads.

ref #10761
## Problem
We do not test `pgtap`, which is shipped with Neon
## Summary of changes
Test and binaries for `pgtap` are added.
## Problem

We didn't catch all client errors causing alerts.

## Summary of changes

Client errors should be wrapped with ClientError so that they don't fire
alerts.
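A minimal sketch of the wrapping pattern; `ClientError` is named in the text above, but the surrounding types and the alerting check are illustrative:

```
use thiserror::Error;

// Hypothetical sketch: client-caused failures get a dedicated wrapper so
// alerting can filter them out, while genuine server errors still fire.
#[derive(Debug, Error)]
#[error("client error: {0}")]
struct ClientError(String);

#[derive(Debug, Error)]
enum RequestError {
    #[error(transparent)]
    Client(#[from] ClientError),
    #[error("server error: {0}")]
    Server(String),
}

fn should_alert(err: &RequestError) -> bool {
    // Only server-side errors should fire alerts.
    !matches!(err, RequestError::Client(_))
}
```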

Signed-off-by: Alex Chi Z <chi@neon.tech>
…#10786)

Before this PR, an IO error returned from the kernel, e.g., due to a bad
disk, would get bubbled up, all the way to a user-visible query failing.

This is against the IO error handling policy that we have established
and is hence being rectified in this PR.
[[(internal Policy document
link)]](https://github.com/neondatabase/docs/blob/bef44149f746d6705c709b6d9c5e342c0ecac49c/src/storage/handling_io_and_logical_errors.md#L33-L35)

The practice on the write path seems to be that we call
`maybe_fatal_err()` or `fatal_err()` fairly high up the stack.
That is, regardless of whether std::fs, tokio::fs, or VirtualFile is
used to perform the IO.

For the read path, I choose a centralized approach in this PR by
checking for errors as close to the kernel interface as possible.
I believe this is better for long-term consistency.

To mitigate the problem of missing context if we abort so far down in
the stack, the `on_fatal_io_error` now captures and logs a backtrace.
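A minimal sketch of what capturing the backtrace at the abort point can look like; the function name matches the one mentioned above, but the body is illustrative:

```
use std::backtrace::Backtrace;

// Hypothetical sketch: since the check lives far down the stack, close to the
// kernel interface, capture a backtrace to compensate for the missing request
// context before aborting.
fn on_fatal_io_error(error: &std::io::Error, context: &str) -> ! {
    let backtrace = Backtrace::force_capture();
    tracing::error!("fatal I/O error: {context}: {error}\n{backtrace}");
    std::process::abort();
}
```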

I grepped the pageserver code base for `fs::read` to convince myself
that all non-VirtualFile reads already handle IO errors according to
policy.

Refs

- fixes #10454
There was a typo in the name of the utilization endpoint URL; fix it.
Also, ensure that the heartbeat mechanism actually works.

Related: #10583, #10429

Part of #9011
## Problem

`benchmarks` is a long-running and non-blocking job. If, on Staging, a
deploy-blocking job fails, restarting it requires cancelling any running
`benchmarks` jobs, which is a waste of CI resources and requires a
couple of extra clicks for a human to do.

Ref: https://neondb.slack.com/archives/C059ZC138NR/p1739292995400899

## Summary of changes
- Run `benchmarks` after the `deploy` job
- Handle `benchmarks` runs in PRs that have the `run-benchmarks` label but no
`deploy` job.
@vipvap vipvap requested review from a team as code owners February 14, 2025 07:00
@vipvap vipvap requested review from ololobus, myrrc, piercypixel, nikitakalyanov, problame and conradludgate and removed request for a team February 14, 2025 07:00

7491 tests run: 7103 passed, 0 failed, 388 skipped (full report)


Code coverage* (full report)

  • functions: 33.0% (8596 of 26020 functions)
  • lines: 49.0% (72545 of 148029 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results:
ed21a65 at 2025-02-14T07:58:40.468Z

Contributor

@MMeent MMeent left a comment


Given the issues with PG 17.3, we'd better wait for PG 17.4 next week:

https://www.postgresql.org/message-id/173949822394.197401.17211408919440664181%40wrigleys.postgresql.org

@problame problame removed their request for review February 14, 2025 17:44
@ololobus
Member

Closing as @hlinnaka is going to prepare a new release with 17.4 and co.

@ololobus ololobus closed this Feb 18, 2025