add async walredo mode (disabled-by-default, opt-in via config) #6548

Conversation

@problame (Contributor) commented Jan 31, 2024

Before this PR, the blocking nix::poll::poll call in the walredo code would stall the executor thread.

This PR refactors the walredo::process module to allow for different implementations, and adds a new async implementation that uses tokio::process::ChildStd{in,out} for IPC.

The sync variant remains the default for now; we'll do more testing in staging and a gradual rollout to prod using the config variable.
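
For illustration, here is a simplified sketch of the async IPC shape (not the actual code from this PR; the struct name, the `--wal-redo` invocation, and the 8KiB page size are assumptions):

```rust
use std::process::Stdio;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::{Child, ChildStdin, ChildStdout, Command};

/// Hypothetical wrapper around one walredo child process.
struct AsyncWalRedoProcess {
    _child: Child,
    stdin: ChildStdin,
    stdout: ChildStdout,
}

impl AsyncWalRedoProcess {
    fn spawn(walredo_bin: &str) -> std::io::Result<Self> {
        let mut child = Command::new(walredo_bin)
            .arg("--wal-redo") // assumed invocation, for illustration only
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let stdin = child.stdin.take().expect("stdin is piped");
        let stdout = child.stdout.take().expect("stdout is piped");
        Ok(Self { _child: child, stdin, stdout })
    }

    /// Write one redo request and await the 8KiB page the child sends back.
    /// Unlike the sync path, awaiting here yields the executor thread to
    /// other tokio tasks instead of blocking it in nix::poll::poll.
    async fn apply(&mut self, request: &[u8]) -> std::io::Result<[u8; 8192]> {
        self.stdin.write_all(request).await?;
        self.stdin.flush().await?;
        let mut page = [0u8; 8192];
        self.stdout.read_exact(&mut page).await?;
        Ok(page)
    }
}
```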

Performance

I updated bench_walredo.rs; it demonstrates that a single async-based walredo manager shared by N=1...128 tokio tasks has lower latency and higher throughput than the sync variant.
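
Roughly, the benchmark has the following shape (a sketch, not the actual bench_walredo.rs code; the manager type and request are stand-ins): N tokio tasks issue requests against one shared manager, and the elapsed wall-clock time for a fixed number of requests gives latency and throughput.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Stand-in for the async walredo manager; the real one lives in the pageserver.
struct WalRedoManager;

impl WalRedoManager {
    async fn request_redo(&self) {
        // apply WAL records to a base page image via the walredo process
    }
}

/// Drive `n_tasks` concurrent clients against one shared manager.
async fn run(n_tasks: usize, requests_per_task: usize) -> Duration {
    let mgr = Arc::new(WalRedoManager);
    let start = Instant::now();
    let mut handles = Vec::with_capacity(n_tasks);
    for _ in 0..n_tasks {
        let mgr = Arc::clone(&mgr);
        handles.push(tokio::spawn(async move {
            for _ in 0..requests_per_task {
                mgr.request_redo().await;
            }
        }));
    }
    for h in handles {
        h.await.expect("benchmark task panicked");
    }
    start.elapsed()
}
```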

I also did manual, less-micro benchmarking in the real pageserver binary.
Methodology & results are published here:
https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4

tl;dr:

  • use pagebench against a pageserver patched to answer getpage requests, with a working set small enough to fit into the PS PageCache / kernel page cache.
  • compare the knee in the latency/throughput curve:
    • N tenants, 1 pagebench client each
    • sync has better throughput at N < 30; async is better at higher N
    • async generally has noticeably, but not much, worse p99.X tail latencies
    • eyeballing CPU efficiency in htop, async seems significantly more CPU efficient at roughly N = [0.5*ncpus, 1.5*ncpus], and worse than sync outside of that band

Mental Model For Walredo & Scheduler Interactions

Walredo is CPU-/DRAM-only work.
This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable.

To the Linux kernel scheduler, the $ncpus executor threads and the walredo process thread are just struct task_struct, and it will divide CPU time fairly among them.

In sync mode, there are always $ncpus runnable struct task_struct because the executor thread blocks while walredo runs, and the executor thread becomes runnable when the walredo process is done handling the request.
In async mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver.

The above means that, in sync mode, there is an implicit limit on the number of concurrent walredo requests ($num_runtimes * $num_executor_threads_per_runtime), and executor threads do not compete with the walredo processes for CPU time in the Linux kernel scheduler, due to the blocked-runnable-ping-pong.
In async mode, there is no such concurrency limit, and the walredo processes compete with the executor threads for CPU time in the kernel scheduler.

If we're not CPU-bound, async has a pipelining and hence throughput advantage over sync because one executor thread can continue processing requests while a walredo request is in flight.

If we're CPU-bound, then under a fair CPU scheduler, the fixed number of executor threads has to share CPU time with the aggregate of walredo processes.
This is trivial to reason about in sync mode, due to the blocked-runnable-ping-pong.
In async mode, at 100% CPU, the system arrives at some (potentially sub-optimal) equilibrium where the executor threads get just enough CPU time to keep the remaining CPU time filled with runnable walredo processes.

Why Async Mode Doesn't Limit Walredo Concurrency

To control that equilibrium in async mode, one could add a tokio semaphore to limit the number of in-flight walredo requests.
However, the placement of such a semaphore is non-trivial, because tasks queuing up behind it hold on to their request-scoped allocations.
In the case of walredo, that might be the entire reconstruct data.
We don't limit the total number of inflight Timeline::get calls (we only throttle admission).
So, that queue might lead to an OOM.

The alternative is to acquire the semaphore permit before collecting reconstruct data.
However, what if we need to on-demand download?

A combination of semaphores might help: one for reconstruct data, one for walredo.
The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit.
This scheme effectively enables both a limit on in-flight reconstruct data and walredo concurrency.
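
A minimal sketch of that two-semaphore scheme (names, types, and permit counts are illustrative assumptions, not code from this PR):

```rust
use tokio::sync::Semaphore;

struct Page;

// Stand-ins for the real pageserver code paths.
async fn collect_reconstruct_data() -> Vec<u8> { Vec::new() }
async fn apply_walredo(_reconstruct_data: Vec<u8>) -> Page { Page }

struct Limits {
    // Bounds how many requests may hold reconstruct data in memory at once.
    reconstruct_data: Semaphore,
    // Bounds how many walredo requests are in flight at once.
    walredo: Semaphore,
}

async fn get_page(limits: &Limits) -> Page {
    // 1. Take the reconstruct-data permit before collecting the data.
    let data_permit = limits.reconstruct_data.acquire().await.expect("semaphore not closed");
    let reconstruct_data = collect_reconstruct_data().await; // disk IO, maybe on-demand download

    // 2. Take the walredo permit, and only then release the reconstruct-data
    //    permit: the memory we already hold stays counted while we queue.
    let _walredo_permit = limits.walredo.acquire().await.expect("semaphore not closed");
    drop(data_permit);

    apply_walredo(reconstruct_data).await
}
```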

However, sizing the number of permits for the semaphores is tricky:

  • Reconstruct data retrieval is a mix of disk IO and CPU work.
  • If we need to do on-demand downloads, it's network IO + disk IO + CPU work.
  • At this time, we have no good data on how the wall clock time is distributed.

It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now.

Future Work

We will do more testing of async mode and gradually roll it out to prod using the config flag.
Once that is done, we'll remove sync mode (and the config flag with it) to avoid keeping the temporary code duplication introduced by this PR.

The wait() for the child process to exit is still synchronous; the comment here is still a valid argument in favor of that.

Sync mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget.
With async mode, that's no longer the case: to tokio, the writes to the child process pipe look like IO.
We could/should inform tokio about the CPU time budget consumed by the task, to achieve fairness similar to sync mode.
However, the runtime function for this is behind tokio_unstable.
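
A hypothetical sketch of a crude approximation (assuming tokio::task::consume_budget is available in the tokio version in use; it charges a single unit of coop budget rather than an amount proportional to the CPU time walredo actually consumed, which is what the tokio_unstable function would be needed for):

```rust
struct Page;

// Stand-in for the actual async walredo IPC roundtrip.
async fn walredo_roundtrip(_request: &[u8]) -> Page { Page }

async fn redo_with_coop_accounting(request: &[u8]) -> Page {
    let page = walredo_roundtrip(request).await;
    // Charge one unit of this task's cooperative-scheduling budget; if the
    // budget is exhausted, this await yields back to the scheduler.
    tokio::task::consume_budget().await;
    page
}
```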

Refs

refs #6628
refs #2975

@problame problame marked this pull request as ready for review January 31, 2024 12:15
@problame problame requested a review from a team as a code owner January 31, 2024 12:15

github-actions bot commented Jan 31, 2024

2760 tests run: 2642 passed, 0 failed, 118 skipped (full report)


Code coverage* (full report)

  • functions: 28.0% (6437 of 23015 functions)
  • lines: 46.5% (45063 of 96823 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cecc9bc at 2024-04-15T19:58:53.464Z :recycle:

@koivunej (Member) left a comment

We should add to output.pending_responses a way to mark that, for cancelled requests, this response will never be read... Sadly, this matters for requests which are still waiting on self.stdout.lock().await, so they cannot just "push" something there.

Alternatives are fine. My http testing endpoint is probably the only one which could be dropped right now... actually no, it's not, because we spawn all requests. Alternatives such as "log an error from scopeguard".

Similarly, it would be really bad to have the stdin write be cancelled midway.
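
For illustration, a minimal sketch of the "log an error from scopeguard" idea (assumed code, not from this PR): arm a guard before the cancellation-unsafe stdin write and defuse it only once the write has completed.

```rust
use scopeguard::ScopeGuard;
use tokio::io::AsyncWriteExt;

async fn write_request(
    stdin: &mut tokio::process::ChildStdin,
    buf: &[u8],
) -> std::io::Result<()> {
    // Runs if this scope is left before we defuse the guard, e.g. because the
    // future was dropped (cancelled) in the middle of the write.
    let guard = scopeguard::guard((), |()| {
        tracing::error!("walredo stdin write did not complete; the pipe may be poisoned");
    });
    stdin.write_all(buf).await?;
    stdin.flush().await?;
    // Write completed normally: defuse the guard so nothing is logged.
    ScopeGuard::into_inner(guard);
    Ok(())
}
```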

Otherwise this is looking good, please ping me when ready.

@problame (Contributor, Author)

Good catch with the cancellation problem.

Pushed changes, please have another look.

@problame problame requested a review from koivunej January 31, 2024 16:27
libs/utils/src/poison.rs (review thread: outdated, resolved)
@problame problame mentioned this pull request Feb 5, 2024
@problame (Contributor, Author) commented Feb 5, 2024

Joonas found the performance of this PR to be much worse; DM: https://neondb.slack.com/archives/D049K7HJ9JM/p1707009731704869

short/short/1           time:   [11.417 µs 11.635 µs 11.846 µs]
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) low severe
  5 (5.00%) low mild
  9 (9.00%) high mild
short/short/2           time:   [33.945 µs 34.371 µs 34.795 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
short/short/4           time:   [55.483 µs 55.901 µs 56.294 µs]
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
short/short/8           time:   [126.11 µs 126.67 µs 127.22 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
short/short/16          time:   [257.16 µs 258.27 µs 259.40 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

medium/medium/1         time:   [76.750 µs 78.309 µs 79.861 µs]
medium/medium/2         time:   [169.51 µs 170.02 µs 170.61 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high severe
medium/medium/4         time:   [305.85 µs 308.94 µs 312.44 µs]
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe
medium/medium/8         time:   [555.67 µs 560.14 µs 564.97 µs]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
medium/medium/16        time:   [1.1525 ms 1.1644 ms 1.1773 ms]
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

these are very much in line with what I remember, but yeah perhaps the single thread version used to be below 10us, maybe 7us. my shmempipe got to 2-3us for the short, which I think is the limit I guess. that was with probably way too much spinning :) (edited)

…kio-epoll-uring/benchmarking/2024-01-31-prs/async-walredo

Tricky to merge because I had split up walredo.rs in the meantime.
@problame problame changed the title async walredo & remove wal_redo_timeout async walredo Apr 12, 2024
@problame problame changed the title async walredo add async walredo mode (disabled-by-default, opt-in via config) Apr 12, 2024
@problame problame marked this pull request as ready for review April 12, 2024 20:50
@problame (Contributor, Author)

@koivunej the PR changed significantly since your last review, re-requesting it. I think we have established context in various meetings / calls in the last couple of weeks.

@problame problame requested a review from koivunej April 12, 2024 20:51
@problame (Contributor, Author)

Addressed your review comments, see latest batch of pushes.

I wonder if you explicitly reviewed the diff of process_sync.rs vs process_async.rs.

I just did, and noticed one forgotten difference that might impact the performance evaluation: process_async.rs has #[instrument(skip_all, level = tracing::Level::DEBUG)], while process_sync.rs doesn't.

I'll do a quick test with bench_walredo.rs to see whether there's a major difference, but I won't do a full evaluation run again.

@koivunej (Member)

Ack, going through the diff now.

@koivunej (Member) left a comment

Approving again, diff looks reasonable.

@problame problame merged commit 2d5a846 into main Apr 15, 2024
53 checks passed
@problame problame deleted the problame/integrate-tokio-epoll-uring/benchmarking/2024-01-31-prs/async-walredo branch April 15, 2024 20:14