Epic: async walredo #6628

problame · 2024-02-05T11:45:07Z

Problem

walredo using block poll() system call, stalling the executor if the walredo process doesn't answer requests.

DoD

We use tokio to read from / write to the walredo pipes.

Impl

Tasks

Give feedback

feat(walredo): various observability improvements #6573
fix(walredo): walredo process that causes errors is never killed #6568
walredo bench should be switched to a single multi-threaded runtime #6648

a/performance c/storage/pageserver m/good_first_issue
walredo benchmark: throughput-oriented rewrite #7190
pageserver: use a single tokio runtime #6555
Revert "pageserver: use a single tokio runtime (#6555)" #7246
fix(walredo spawn): coalescing stalls other executors std::sync::RwLock #7310
de-base this work from single runtime, redo benchmarks
add async walredo mode (disabled-by-default, opt-in via config) #6548
gradual rollout
https://github.com/neondatabase/aws/pull/1288
https://github.com/neondatabase/aws/pull/1299
https://github.com/neondatabase/aws/pull/1309
remove sync walredo impl & clean up production ansible #7753

7 of 7
Options

Follow_ups

revise tracing span fields of apply_wal_records, tenant_id / timeline_id (they are probably dupes)

The text was updated successfully, but these errors were encountered:

koivunej · 2024-02-26T08:44:54Z

How many bytes of RAM do we allow for single page reconstruction? I'm pausing #6881 because it would had made pageserver OOM more likely by yielding while applying, but of course, we could had already been near-OOM in the situations with logical replication slots before they were "fixed" by #6873.

What kind of concurrency limiter could we even use here... Because we cannot know how much get value reconstruct data would allocate before we actually do the search, should we just have a hard limit of 16MB per key? Well, this is probably the wrong issue to talk about it.

problame · 2024-02-26T10:03:04Z

Initial concurrency limit I'm thinking of is number of vCPUs = executor threads.

jcsp · 2024-03-11T11:15:04Z

Action this week: benchmark changes and progress PRs.

jcsp · 2024-03-18T11:16:57Z

Merging these changes will block on writing an improved benchmark

See the updated `bench_walredo.rs` module comment. tl;dr: we measure avg latency of single redo operations issues against a single redo manager from N tokio tasks. part of #6628

Before this PR, each core had 3 executor threads from 3 different runtimes. With this PR, we just have one runtime, with one thread per core. Switching to a single tokio runtime should reduce that effective over-commit of CPU and in theory help with tail latencies -- iff all tokio tasks are well-behaved and yield to the runtime regularly. Are All Tasks Well-Behaved? Are We Ready? ----------------------------------------- Sadly there doesn't seem to be good out-of-the box tokio tooling to answer this question. We *believe* all tasks are well behaved in today's code base, as of the switch to `virtual_file_io_engine = "tokio-epoll-uring"` in production (neondatabase/infra#1121). The only remaining executor-thread-blocking code is walredo and some filesystem namespace operations. Filesystem namespace operations work is being tracked in #6663 and not considered likely to actually block at this time. Regarding walredo, it currently does a blocking `poll` for read/write to the pipe file descriptors we use for IPC with the walredo process. There is an ongoing experiment to make walredo async (#6628), but it needs more time because there are surprisingly tricky trade-offs that are articulated in that PR's description (which itself is still WIP). What's relevant for *this* PR is that 1. walredo is always CPU-bound 2. production tail latencies for walredo request-response (`pageserver_wal_redo_seconds_bucket`) are - p90: with few exceptions, low hundreds of micro-seconds - p95: except on very packed pageservers, below 1ms - p99: all below 50ms, vast majority below 1ms - p99.9: almost all around 50ms, rarely at >= 70ms - [Dashboard Link](https://neonprod.grafana.net/d/edgggcrmki3uof/2024-03-walredo-latency?orgId=1&var-ds=ZNX49CDVz&var-pXX_by_instance=0.9&var-pXX_by_instance=0.99&var-pXX_by_instance=0.95&var-adhoc=instance%7C%21%3D%7Cpageserver-30.us-west-2.aws.neon.tech&var-per_instance_pXX_max_seconds=0.0005&from=1711049688777&to=1711136088777) The ones below 1ms are below our current threshold for when we start thinking about yielding to the executor. The tens of milliseconds stalls aren't great, but, not least because of the implicit overcommit of CPU by the three runtimes, we can't be sure whether these tens of milliseconds are inherently necessary to do the walredo work or whether we could be faster if there was less contention for CPU. On the first item (walredo being always CPU-bound work): it means that walredo processes will always compete with the executor threads. We could yield, using async walredo, but then we hit the trade-offs explained in that PR. tl;dr: the risk of stalling executor threads through blocking walredo seems low, and switching to one runtime cleans up one potential source for higher-than-necessary stall times (explained in the previous paragraphs). Code Changes ------------ - Remove the 3 different runtime definitions. - Add a new definition called `THE_RUNTIME`. - Use it in all places that previously used one of the 3 removed runtimes. - Remove the argument from `task_mgr`. - Fix failpoint usage where `pausable_failpoint!` should have been used. We encountered some actual failures because of this, e.g., hung `get_metric()` calls during test teardown that would client-timeout after 300s. As indicated by the comment above `THE_RUNTIME`, we could take this clean-up further. But before we create so much churn, let's first validate that there's no perf regression. Performance ----------- We will test this in staging using the various nightly benchmark runs. However, the worst-case impact of this change is likely compaction (=>image layer creation) competing with compute requests. Image layer creation work can't be easily generated & repeated quickly by pagebench. So, we'll simply watch getpage & basebackup tail latencies in staging. Additionally, I have done manual benchmarking using pagebench. Report: https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4 Tail latencies and throughput are marginally better (no regression = good). Except in a workload with 128 clients against one tenant. There, the p99.9 and p99.99 getpage latency is about 2x worse (at slightly lower throughput). A dip in throughput every 20s (compaction_period_ is clearly visible, and probably responsible for that worse tail latency. This has potential to improve with async walredo, and is an edge case workload anyway. Future Work ----------- 1. Once this change has shown satisfying results in production, change the codebase to use the ambient runtime instead of explicitly referencing `THE_RUNTIME`. 2. Have a mode where we run with a single-threaded runtime, so we uncover executor stalls more quickly. 3. Switch or write our own failpoints library that is async-native: #7216

problame · 2024-03-25T10:00:46Z

Update:

benchmark that measures single-tenant scalability is merged
- async walredo improves up to 50% here: https://github.com/neondatabase/neon/pull/6548/files#diff-fbcb6117f0e507ed04478e6ef073af0b1a976f87111b04af1a9bf02e1bbb0106R33-R48
however, multi-tenant scalability and system behavior when we're CPU bottlenecked is still hard to understand and not quantified by that benchmark
=> shipping pageserver to single runtime to eliminate some uncertainty and make things simpler overall: pageserver: use a single tokio runtime #6555

This week:

observe single runtime change in prod
rebase async walredo PR, figure out how to quantify multi-tenant scalability effects

problame · 2024-03-28T12:01:08Z

Update:

observe single runtime change in prod

We reverted it (#7246) after prod-like cloudbench shows a ramp-up in walredo processes, presumably because compaction_loop didn't run. Some hypotheses / investigations in this thread https://neondb.slack.com/archives/C06K38EB05D/p1711451737661299?thread_ts=1711445451.915969&cid=C06K38EB05D

rebase async walredo PR

Didn't happen.

figure out how to quantify multi-tenant scalability effects

Did comparison benchmark between sync and async: https://www.notion.so/neondatabase/2024-03-25-async-walredo-benchmarking-2586fdb14d504996bc2192c1bba2f1ab?pvs=4

throughput: async delivers 8% better throughput (21.2k vs 22.9k)
tail latencies at well-defined throughput that is slightly below sweet spot of both implementations (20k)
- Results
- async has slightly higher tail latencies (p99.99 is 16.68ms vs 15.41ms with sync)
- Overall getpage latency is way more muddy with Sync than with Async (see the histogram screenshots)
- but walredo request-response latency is much cleaner with Sync (because there’s no competition for CPU)

…ck (#7310) part of #6628 Before this PR, we used a std::sync::RwLock to coalesce multiple callers on one walredo spawning. One thread would win the write lock and others would queue up either at the read() or write() lock call. In a scenario where a compute initiates multiple getpage requests from different Postgres backends (= different page_service conns), and we don't have a walredo process around, this means all these page_service handler tasks will enter the spawning code path, one of them will do the spawning, and the others will stall their respective executor thread because they do a blocking read()/write() lock call. I don't know exactly how bad the impact is in reality because posix_spawn uses CLONE_VFORK under the hood, which means that the entire parent process stalls anyway until the child does `exec`, which in turn resumes the parent. But, anyway, we won't know until we fix this issue. And, there's definitely a future way out of stalling the pageserver on posix_spawn, namely, forking template walredo processes that fork again when they need to be per-tenant. This idea is tracked in #7320. Changes ------- This PR fixes that scenario by switching to use `heavier_once_cell` for coalescing. There is a comment on the struct field that explains it in a bit more nuance. ### Alternative Design An alternative would be to use tokio::sync::RwLock. I did this in the first commit in this PR branch, before switching to `heavier_once_cell`. Performance ----------- I re-ran the `bench_walredo` and updated the results, showing that the changes are neglible. For the record, the earlier commit in this PR branch that uses `tokio::sync::RwLock` also has updated benchmark numbers, and the results / kinds of tiny regression were equivalent to `heavier_once_cell`. Note that the above doesn't measure performance on the cold path, i.e., when we need to launch the process and coalesce. We don't have a benchmark for that, and I don't expect any significant changes. We have metrics and we log spawn latency, so, we can monitor it in staging & prod. Risks ----- As "usual", replacing a std::sync primitive with something that yields to the executor risks exposing concurrency that was previously implicitly limited to the number of executor threads. This would be the first one for walredo. The risk is that we get descheduled while the reconstruct data is already there. That could pile up reconstruct data. In practice, I think the risk is low because once we get scheduled again, we'll likely have a walredo process ready, and there is no further await point until walredo is complete and the reconstruct data has been dropped. This will change with async walredo PR #6548, and I'm well aware of it in that PR.

problame · 2024-04-04T19:21:14Z

Update for this week:

Single runtime change revert investigation will continue in this issue
=> #7312
Spent 2h on it.
Could work with peter to repro on a setup that I control, but, it'd likely consume multiple days.
Alternative is to wait for prodlike cloud bench to work in us-east-2, then run a pre-merge build with just one runtime.
Other alternative to explore is to make one-runtime a pageserver-config configurable.
Would make it easier to play around with it, esp in staging.

While reviewing code, found a usage of std::sync::RwLock in walredo.rs that has potential to stall all executors while spawn is ongoing.
=> PR'ed (already merged): #7310

Implemented runtime-configurability for sync vs async walredo: #6548 (comment)
Its not maintainable because unlike the previous PR that switched from sync to async, the runtime-configurability is achieved by restoring the old code; so now we have two copies of it that vary only in

sync vs async
has walredo timeout vs has not

I suppose we could merge it and use the code for a few weeks to do experimentation / gradual rollout.
I could add myself as required reviewer for CODEOWNERS to ensure it doesn't get out of sync.
But after a few weeks, we should throw out the old code or revert back to sync.

Not happy with the progress on this, generally, I think we're being overly cautious.

jcsp · 2024-04-15T10:05:44Z

Ongoing: gradual rollout of sync/async configurable code + switching on the async mode.

EOW goal: deploy across staging (a prod region next week)

) Before this PR, the `nix::poll::poll` call would stall the executor. This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC. The `sync` variant remains the default for now; we'll do more testing in staging and gradual rollout to prod using the config variable. Performance ----------- I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1...128 tokio tasks has lower latency and higher throughput. I further did manual less-micro-benchmarking in the real pageserver binary. Methodology & results are published here: https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4 tl;dr: - use pagebench against a pageserver patched to answer getpage request & small-enough working set to fit into PS PageCache / kernel page cache. - compare knee in the latency/throughput curve - N tenants, each 1 pagebench clients - sync better throughput at N < 30, async better at higher N - async generally noticable but not much worse p99.X tail latencies - eyeballing CPU efficiency in htop, `async` seems significantly more CPU efficient at ca N=[0.5*ncpus, 1.5*ncpus], worse than `sync` outside of that band Mental Model For Walredo & Scheduler Interactions ------------------------------------------------- Walredo is CPU-/DRAM-only work. This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable. To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process thread are just `struct task_struct`, and it will divide CPU time fairly among them. In `sync` mode, there are always `$ncpus` runnable `struct task_struct` because the executor thread blocks while `walredo` runs, and the executor thread becomes runnable when the `walredo` process is done handling the request. In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver. The above means that in `sync` mode, there is an implicit concurrency limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`). And executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable-ping-pong. In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler. If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync` because one executor thread can continue processing requests while a walredo request is in flight. If we're CPU-bound, under a fair CPU scheduler, the *fixed* number of executor threads has to share CPU time with the aggregate of walredo processes. It's trivial to reason about this in `sync` mode due to the blocked-runnable-ping-pong. In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optiomal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo process. Why `async` mode Doesn't Limit Walredo Concurrency -------------------------------------------------- To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests. However, the placement of such a semaphore is non-trivial because it means that tasks queuing up behind it hold on to their request-scoped allocations. In the case of walredo, that might be the entire reconstruct data. We don't limit the number of total inflight Timeline::get (we only throttle admission). So, that queue might lead to an OOM. The alternative is to acquire the semaphore permit *before* collecting reconstruct data. However, what if we need to on-demand download? A combination of semaphores might help: one for reconstruct data, one for walredo. The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit. This scheme effectively enables both a limit on in-flight reconstruct data and walredo concurrency. However, sizing the amount of permits for the semaphores is tricky: - Reconstruct data retrieval is a mix of disk IO and CPU work. - If we need to do on-demand downloads, it's network IO + disk IO + CPU work. - At this time, we have no good data on how the wall clock time is distributed. It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now. Future Work ----------- We will do more testing of `async` mode and gradual rollout to prod using the config flag. Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR. The flag will be removed. The `wait()` for the child process to exit is still synchronous; the comment [here]( https://github.com/neondatabase/neon/blob/655d3b64681b6562530665c9ab5f2f806f30ad01/pageserver/src/walredo.rs#L294-L306) is still a valid argument in favor of that. The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget. But with `async` mode, that's no longer the case -- to tokio, the writes to the child process pipe look like IO. We could/should inform tokio about the CPU time budget consumed by the task to achieve fairness similar to `sync`. However, the [runtime function for this is `tokio_unstable`](`https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html). Refs ---- refs #6628 refs #2975

jcsp · 2024-04-22T13:35:41Z

Same plan this week: config change in preprod after the release, then gradually in prod.

problame · 2024-05-06T14:04:27Z

Rollout to remaining regions happening this week.

problame · 2024-05-13T13:26:19Z

Rollout went without problems.

This week: address the remaining task items, i.e.,

change default to async
make staging & prod use default
remove config option
remove sync impl

spotted during reviews of async walredo work in #6628

problame self-assigned this Feb 5, 2024

problame added the c/storage/pageserver Component: storage: pageserver label Feb 5, 2024

This was referenced Feb 5, 2024

add async walredo mode (disabled-by-default, opt-in via config) #6548

Merged

Blocking throughout pageserver in async contexts #2975

Closed

This was referenced Feb 16, 2024

Use multi-threaded runtime instead of single-threaded ones #6743

Merged

walredo: further hardening ideas #6850

Open

koivunej mentioned this issue Feb 26, 2024

fix(walredo): yield while applying neon records #6881

Draft

problame mentioned this issue Mar 20, 2024

walredo benchmark: throughput-oriented rewrite #7190

Merged

problame linked a pull request Mar 20, 2024 that will close this issue

walredo benchmark: throughput-oriented rewrite #7190

Merged

problame removed a link to a pull request Mar 20, 2024

walredo benchmark: throughput-oriented rewrite #7190

Merged

problame mentioned this issue Mar 22, 2024

pageserver: use a single tokio runtime #6555

Merged

This was referenced Apr 4, 2024

fix(walredo spawn): coalescing stalls other executors std::sync::RwLock #7310

Merged

figure out why "use a single tokio runtime" PR had problems under prod-like cloudbench #7312

Open

problame closed this as completed Jun 5, 2024

problame added a commit that referenced this issue Jun 5, 2024

chore(walredo): avoid duplicate tenant_id and shard_slug fields

d77484b

spotted during reviews of async walredo work in #6628

problame mentioned this issue Jun 5, 2024

chore(walredo): avoid duplicate tenant_id and shard_slug fields #7977

Merged

problame added a commit that referenced this issue Jun 6, 2024

chore(walredo): avoid duplicate tenant_id and shard_slug fields (#7977)

0a65333

spotted during reviews of async walredo work in #6628

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: async walredo #6628

Epic: async walredo #6628

problame commented Feb 5, 2024 •

edited

Loading

Tasks

koivunej commented Feb 26, 2024

problame commented Feb 26, 2024

jcsp commented Mar 11, 2024

jcsp commented Mar 18, 2024

problame commented Mar 25, 2024

problame commented Mar 28, 2024

problame commented Apr 4, 2024

jcsp commented Apr 15, 2024

jcsp commented Apr 22, 2024

problame commented May 6, 2024

problame commented May 13, 2024 •

edited

Loading

Epic: async walredo #6628

Epic: async walredo #6628

Comments

problame commented Feb 5, 2024 • edited Loading

Problem

DoD

Related

Impl

Tasks

Follow_ups

koivunej commented Feb 26, 2024

problame commented Feb 26, 2024

jcsp commented Mar 11, 2024

jcsp commented Mar 18, 2024

problame commented Mar 25, 2024

problame commented Mar 28, 2024

problame commented Apr 4, 2024

jcsp commented Apr 15, 2024

jcsp commented Apr 22, 2024

problame commented May 6, 2024

problame commented May 13, 2024 • edited Loading

problame commented Feb 5, 2024 •

edited

Loading

problame commented May 13, 2024 •

edited

Loading