Proxy release 2024-12-26 #10243

vipvap · 2024-12-26T06:02:08Z

Proxy release 2024-12-26

Please merge this Pull Request using 'Create a merge commit' button

## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1733180965970089 Replication state is checkpointed only by shutdown checkpoint. It means that replication snapshots are not removed till compute shutdown. ## Summary of changes Checkpoint replication state during online checkpoint Related Postgres PR: neondatabase/postgres#546 Co-authored-by: Konstantin Knizhnik <[email protected]>

Add a `safekeepers` subcommand to `storcon_cli` that allows listing the safekeepers. ``` $ curl -X POST --url http://localhost:1234/control/v1/safekeeper/42 --data \ '{"active":true, "id":42, "created_at":"2023-10-25T09:11:25Z", "updated_at":"2024-08-28T11:32:43Z","region_id":"neon_local","host":"localhost","port":5454,"http_port":0,"version":123,"availability_zone_id":"us-east-2b"}' $ cargo run --bin storcon_cli -- --api http://localhost:1234 safekeepers Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.38s Running `target/debug/storcon_cli --api 'http://localhost:1234' safekeepers` +----+---------+-----------+------+-----------+------------+ | Id | Version | Host | Port | Http Port | AZ Id | +==========================================================+ | 42 | 123 | localhost | 5454 | 0 | us-east-2b | +----+---------+-----------+------+-----------+------------+ ``` Also: * Don't return the raw `SafekeeperPersistence` struct that contains the raw database presentation, but instead a new `SafekeeperDescribeResponse` struct. * The `SafekeeperPersistence` struct leaves out the `active` field on purpose because we want to deprecate it and replace it with a `scheduling_policy` one. Part of #9981

…tests (#10164) ## Problem part of #9114 In #10127 we fixed the race, but we didn't add the errors to the allowlist. ## Summary of changes * Allow repartition errors in the gc-compaction smoke test. I think it might be worth to refactor the code to allow multiple threads getting a copy of repartition status (i.e., using Rcu) in the future. Signed-off-by: Alex Chi Z <[email protected]>

When neondatabase/cloud#21856 is merged, this flag is no longer necessary.

…0185) ## Problem `neon_local` has always been unsafe to run concurrently with itself: it uses simple text files for persistent state, and concurrent runs will step on each other. In some test environments we intentionally handle this with mutexes in python land, but it's fragile to try and always remember to do that. ## Summary of changes - Add a `flock` based mutex around the `main` function of neon_local, using the repo directory as the file to lock - Clean up an Option<> around control_plane_api, this is a drive-by change because it was one of the fields that had a weird effect when previous concurrent stuff stamped on it.

## Problem We cannot get the size of the compaction queue and access the info. Part of #9114 ## Summary of changes * Add an API endpoint to get the compaction queue. * gc_compaction test case now waits until the compaction finishes. --------- Signed-off-by: Alex Chi Z <[email protected]>

## Problem In #8103 we changed the test case to have more test coverage of gc_compaction. Now that we have `test_gc_compaction_smoke`, we can revert this test case to serve its original purpose and revert the parameter changes. part of #9114 ## Summary of changes * Revert pitr_interval from 60s to 10s. * Assert the physical/logical size ratio in the benchmark. --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Arpad Müller <[email protected]>

## Problem See #10037 test_physical_and_logical_replication.py sometimes failed. ## Summary of changes Add `wait_replica_caughtup` to wait for replica sync Co-authored-by: Konstantin Knizhnik <[email protected]>

## Problem ref #10170 ref #9994 The psql command will block the main thread, causing other async tasks to timeout (i.e., HTTP connect). Therefore, we need to move it to an I/O executor thread. ## Summary of changes * run psql connection in a thread --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: John Spray <[email protected]>

## Problem s5cmd doesn't pick up the pod service account ``` 2024/12/16 16:26:01 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil> ERROR "ls s3://neon-dev-bulk-import-us-east-2/import-pgdata/fast-import/v1/br-wandering-hall-w2xobawv": NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors ``` ## Summary of changes Switch to offical CLI. ## Testing Tested the pre-merge image in staging, using `job_image` override in project settings. https://neondb.slack.com/archives/C033RQ5SPDH/p1734554944391949?thread_ts=1734368383.258759&cid=C033RQ5SPDH ## Future Work Switch back to s5cmd once peak/s5cmd#769 gets merged. ## Refs - fixes neondatabase/cloud#21876 --------- Co-authored-by: Gleb Novikov <[email protected]>

…10k) (#10172) ## Problem We want to verify how much / if pgbench throughput and latency on Neon suffers if the database contains many other relations, too. ## Summary of changes Modify the benchmarking.yml pgbench-compare job to - create an addiitional project at scale factor 10 GiB - before running pgbench add n tables (initially 10k) to the database - then compare the pgbench throughput and latency to the existing pgbench-compare at 10 Gib scale factor We use a realistic template for the n relations that is a partitioned table with some realistic data types, indexes and constraints - similar to a table that we use internally. Example run: https://github.com/neondatabase/neon/actions/runs/12377565956/job/34547386959

The final SASL complete message can be bundled with the remainder of the auth flow messages until ReadyForQuery. neondatabase/cloud#19184

…fective_io_concurrency=2` in tests by default (#10114) ## Problem `test_pgdata_import_smoke` writes two gigabytes of pages and then reads them back serially. This is CPU bottlenecked and results in a long runtime, and sensitivity to CPU load from other tests on the same machine. Closes: #10071 ## Summary of changes - Use effective_io_concurrency=32 when doing sequential scans through 2GiB of pages in test_pgdata_import_smoke. This is a ~10x runtime decrease in the parts of the test that do sequential scans. - Also set `effective_io_concurrency=2` for tests, as I noticed while debugging that we were doing all getpage requests serially, which is bad for checking the stability of the batching code.

…10198) ## Problem test_timeline_archival_chaos does timeline creation with failure injection, and thereby sometimes leaves timelines in a part created state. This was being reported as corruption by the scrubber on test teardown, because it considered a layer without an index to be an invalid state. This was incorrect: the scrubber should accept this state, it occurs legitimately during timeline creation. Closes: #9988 ## Summary of changes - Report a timeline with layers but no index as Relic rather than MissingIndexPart. - We retain the MissingIndexPart variant for the case where an index _was_ found in the listing, but was not found by a subsequent GET, i.e. racing with deletion.

## Problem The benchmarking utilities are also useful for testing. We want to write tests in the safekeeper crate. ## Summary of changes This commit lifts the utils to the safekeeper crate. They are compiled if the benchmarking features is enabled or if in test mode.

## Problem Safekeeper may currently send a batch to the pageserver even if it hasn't decoded a new record. I think this is quite unlikely in the field, but worth adressing. ## Summary of changes Don't send anything if we haven't decoded a full record. Once this merges and releases, the `InterpretedWalRecords` struct can be updated to remove the Option wrapper for `next_record_lsn`.

## Problem See #10184 and https://neondb.slack.com/archives/C04DGM6SMTM/p1733997259898819 Reloading config file inside parallel worker cause it's termination ## Summary of changes Remove call of `HandleMainLoopInterrupts()` Update of page server URL is propagated by postmaster through shared memory and we should not reload config for it. Co-authored-by: Konstantin Knizhnik <[email protected]>

…#10044) ## Problem In #9897 we temporarily disabled the layer valid check because the current one only considers the end result of all compaction algorithms, but partial gc-compaction would temporarily produce an "invalid" layer map. part of #9114 ## Summary of changes Allow LSN splits to overlap in the slow path check. Currently, the valid check is only used in storage scrubber (background job) and during gc-compaction (without taking layer lock). Therefore, it's fine for such checks to be a little bit inefficient but more accurate. --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Arpad Müller <[email protected]>

#10206) ## Problem Currently default value of storage controller heartbeat interval is 100msec. It means that 10 times per second it establish connection to PS. And it seems to be quite expensive. At MacOS right now storage_controller consumes 70% CPU and trusts - 30%. So together they completely utilize one core. A lot of us has Macs. Let's save environment a little bit and do not waste electricity and contribute to global warming. By the way, on prod we have interval 10seconds ## Summary of changes Increase heartbeat interval from 100msec to 1 second. Co-authored-by: Konstantin Knizhnik <[email protected]>

…10209) ## Problem close #10208 part of #9114 ## Summary of changes * Ensure remote `latest_gc_cutoff` is up-to-date before removing any files for gc-compaction. Signed-off-by: Alex Chi Z <[email protected]>

github-actions · 2024-12-26T07:04:35Z

7095 tests run: 6798 passed, 0 failed, 297 skipped (full report)

Flaky tests (5)

Postgres 17

test_physical_replication_config_mismatch_too_many_known_xids: debug-x86-64
test_timeline_archival_chaos: release-x86-64, release-arm64

Postgres 15

test_auth_broker_happy: release-arm64
test_scrubber_physical_gc_ancestors[None]: release-arm64

Code coverage* (full report)

functions: 31.2% (8399 of 26877 functions)
lines: 48.0% (66683 of 139059 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
7afffdc at 2024-12-26T07:04:34.749Z :recycle:}

danieltprice · 2024-12-27T14:34:06Z

reviewed for changelog

knizhnik and others added 22 commits December 18, 2024 09:34

safekeeper: fix typo in allowlist for /profile/heap (#10186)

1668d39

chore(proxy): fully remove allow-self-signed-compute flag (#10168)

d63602c

When neondatabase/cloud#21856 is merged, this flag is no longer necessary.

Fix flukyness of test_physical_and_logical_replicaiton.py (#10176)

61fcf64

## Problem See #10037 test_physical_and_logical_replication.py sometimes failed. ## Summary of changes Add `wait_replica_caughtup` to wait for replica sync Co-authored-by: Konstantin Knizhnik <[email protected]>

proxy: Delay SASL complete message until auth is done (#10189)

b135194

The final SASL complete message can be bundled with the remainder of the auth flow messages until ReadyForQuery. neondatabase/cloud#19184

fix(pageserver): update remote latest_gc_cutoff after gc-compaction (#…

9c53b41

…10209) ## Problem close #10208 part of #9114 ## Summary of changes * Ensure remote `latest_gc_cutoff` is up-to-date before removing any files for gc-compaction. Signed-off-by: Alex Chi Z <[email protected]>

Proxy release 2024-12-26

7afffdc

vipvap requested review from a team as code owners December 26, 2024 06:02

vipvap requested review from knizhnik and jcsp and removed request for a team December 26, 2024 06:02

vipvap requested review from conradludgate and removed request for a team December 26, 2024 06:02

cloneable closed this Dec 26, 2024

cloneable deleted the rc/release-proxy/2024-12-26 branch December 26, 2024 08:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxy release 2024-12-26 #10243

Proxy release 2024-12-26 #10243

vipvap commented Dec 26, 2024

github-actions bot commented Dec 26, 2024

Postgres 17

Postgres 15

danieltprice commented Dec 27, 2024

Proxy release 2024-12-26 #10243

Proxy release 2024-12-26 #10243

Conversation

vipvap commented Dec 26, 2024