
walredo spawn: avoid stalling pageserver process on CLONE_VFORK #7320

Closed
1 of 3 tasks
Tracked by #6581
problame opened this issue Apr 4, 2024 · 2 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

problame commented Apr 4, 2024

Problem

posix_spawn uses CLONE_VFORK under the hood, which freezes the entire parent process until the child does the exec.

This issue tracks the need to avoid the latency impact of CLONE_VFORK on the parent process.

Quantifying The Impact

Before jumping to implementations, verify that the problem described above actually has measurable impact.

Idea to do that: write a test program that

  • serves a static HTTP GET response using tokio + hyper
  • on the same tokio runtime, uses std::process::Command to posix_spawn the /bin/true program at a steady, configurable rate

Throw wrk against the GET endpoint to measure tail latencies.

Play around with the rate of posix_spawn & observe impact on tail latencies.

We'd expect tail latencies to get worse as the spawn rate increases.
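
To make this concrete, here is a minimal sketch of such a test program. It uses plain tokio instead of hyper to keep it short; the listen address and spawn rate are illustrative, not prescribed by this issue.

```rust
// Sketch only: static HTTP responder + steady posix_spawn load on one runtime.
// Assumes: tokio = { version = "1", features = ["full"] }
use std::process::Command;
use std::time::Duration;

use tokio::io::AsyncWriteExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Spawn /bin/true at a steady, configurable rate on the same runtime.
    // On Linux, std::process::Command uses posix_spawn in the common case,
    // so each spawn should exhibit the CLONE_VFORK stall described above.
    tokio::spawn(async {
        let mut tick = tokio::time::interval(Duration::from_millis(10)); // ~100/s
        loop {
            tick.tick().await;
            let mut child = Command::new("/bin/true").spawn().expect("spawn");
            // Reap the child off the runtime so zombies don't accumulate.
            tokio::task::spawn_blocking(move || child.wait());
        }
    });

    // Serve a static GET response; throw `wrk` at this endpoint.
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (mut sock, _) = listener.accept().await?;
        tokio::spawn(async move {
            let _ = sock
                .write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\nconnection: close\r\n\r\nok")
                .await;
        });
    }
}
```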

Solution Proposal, Assuming The Impact Is Material

High-level:

  • Instead of spawning walredo processes directly, have a spawner process running next to pageserver.
  • When the pageserver needs a walredo process, it asks the spawner to create that process.

Implementation:

  • PS <> Spawner process interaction:
    • PS spawns it as a child process during startup (/proc/self/exe --mode Spawner)
    • PS and Spawner are connected via unnamed unix domain socket (socketpair())
    • Spawner has no stdio, it is for all intents and purposes a function running in a separate address space.
    • Spawner terminates if there is an error interacting with socketpair() connection.
    • When PS encounters an error on the socketpair() connection, PS SIGKILLs & waits on the old Spawner and starts a new one.
    • There is only one Spawner in existence at any time.
  • PS <> Spawner protocol:
    • PS requests Spawner to launch a process (the request specifies binary path, argv, envp)
    • Spawner creates 3 unnamed pipes for stdin, stdout, stderr of the (grand)child
    • Spawner spawns the grandchild using CLONE_PIDFD
      • Ensures only the 3 pipe file descriptors intended for the child make it into the child
    • Spawner responds to PS with
      • the pidfd of the child process
      • its end of the stdin, stdout, stderr file descriptors
    • (The file descriptors can be sent over the socketpair using cmsg; it looks esoteric, but quite a lot of software uses it. See the sketch after this list.)
  • PS interacts with the child using the pidfd + stdin/stdout/stderr pipe ends, as it does today.
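
For illustration, here is roughly what the Spawner's reply (the cmsg step mentioned above) could look like. This is a sketch, not the actual protocol: it assumes the nix crate, and the function name, payload, and error handling are placeholders.

```rust
// Sketch: pass the grandchild's pidfd and pipe ends back to the pageserver
// over the socketpair using SCM_RIGHTS. Assumes the `nix` crate (~0.27 API).
use std::io::IoSlice;
use std::os::fd::RawFd;

use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags};

fn send_spawn_reply(
    sock: RawFd,     // Spawner's end of the socketpair
    pidfd: RawFd,    // from the CLONE_PIDFD spawn of the grandchild
    stdin_w: RawFd,  // write end of the child's stdin pipe
    stdout_r: RawFd, // read end of the child's stdout pipe
    stderr_r: RawFd, // read end of the child's stderr pipe
) -> nix::Result<usize> {
    let fds = [pidfd, stdin_w, stdout_r, stderr_r];
    // SCM_RIGHTS duplicates the descriptors into the receiving process.
    let cmsgs = [ControlMessage::ScmRights(&fds)];
    // One-byte payload; a real protocol would serialize a reply header here.
    let iov = [IoSlice::new(b"ok")];
    sendmsg::<()>(sock, &iov, &cmsgs, MsgFlags::empty(), None)
}
```

On the pageserver side, a recvmsg with a control-message buffer (nix's cmsg_space! macro) recovers the four descriptors.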

Work

Preliminary

Work
problame added the c/storage/pageserver label Apr 4, 2024
problame added a commit that referenced this issue Apr 4, 2024
…ck (#7310)

part of #6628

Before this PR, we used a std::sync::RwLock to coalesce multiple
callers into one walredo spawn attempt. One thread would win the write
lock and the others would queue up at either the read() or write() lock call.

In a scenario where a compute initiates multiple getpage requests
from different Postgres backends (= different page_service conns),
and we don't have a walredo process around, this means all these
page_service handler tasks will enter the spawning code path,
one of them will do the spawning, and the others will stall their
respective executor thread because they do a blocking
read()/write() lock call.

I don't know exactly how bad the impact is in reality because
posix_spawn uses CLONE_VFORK under the hood, which means that the
entire parent process stalls anyway until the child does `exec`,
which in turn resumes the parent.

But, anyway, we won't know until we fix this issue.
And, there's definitely a future way out of stalling the
pageserver on posix_spawn, namely, forking template walredo processes
that fork again when they need to be per-tenant.
This idea is tracked in
#7320.

Changes
-------

This PR fixes that scenario by switching to use `heavier_once_cell`
for coalescing. There is a comment on the struct field that explains
it in a bit more nuance.
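
For illustration, the coalescing pattern looks roughly like this, sketched here with tokio::sync::OnceCell instead of the actual `heavier_once_cell` (whose API differs, e.g. around taking the value back out when a process dies); the type names are stand-ins:

```rust
// Sketch of async coalescing: many tasks may request the walredo process
// concurrently, but only one runs the spawn; the rest await its completion
// instead of blocking an executor thread on a std::sync lock.
use tokio::sync::OnceCell;

struct WalRedoProcess; // stand-in for the real process handle

struct WalRedoManager {
    process: OnceCell<WalRedoProcess>,
}

impl WalRedoManager {
    async fn process(&self) -> &WalRedoProcess {
        self.process
            .get_or_init(|| async {
                // spawn the walredo process here (elided)
                WalRedoProcess
            })
            .await
    }
}
```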

### Alternative Design

An alternative would be to use tokio::sync::RwLock.
I did this in the first commit in this PR branch,
before switching to `heavier_once_cell`.

Performance
-----------

I re-ran `bench_walredo` and updated the results, showing that
the changes are negligible.

For the record, the earlier commit in this PR branch that uses
`tokio::sync::RwLock` also has updated benchmark numbers, and the
results / kinds of tiny regression were equivalent to
`heavier_once_cell`.

Note that the above doesn't measure performance on the cold path, i.e.,
when we need to launch the process and coalesce. We don't have a
benchmark for that, and I don't expect any significant changes. We have
metrics and we log spawn latency, so we can monitor it in staging & prod.

Risks
-----

As "usual", replacing a std::sync primitive with something that yields
to
the executor risks exposing concurrency that was previously implicitly
limited to the number of executor threads.

This would be the first one for walredo.

The risk is that we get descheduled while the reconstruct data is
already there, which could cause reconstruct data to pile up in memory.

In practice, I think the risk is low: once we get scheduled again,
we'll likely have a walredo process ready, and there is no further
await point until walredo is complete and the reconstruct data has
been dropped.

This will change with async walredo PR #6548, and I'm well aware of it
in that PR.
problame (Contributor, Author) commented:

Quantifying The Impact

Started doing that using my sample project.

The tail latency impact is clearly visible with 500k files open (3ms on my machine).
With 0 files open, it's ~300us.

experiment-posix-spawn-tail-latency-impact.zip

problame (Contributor, Author) commented:

However, these numbers don't justify investing time into pre-spawned walredo right now.
Closing this issue.

problame closed this as not planned Apr 30, 2024