
walredo spawn: avoid stalling pageserver process on CLONE_VFORK #7320

Closed
1 of 3 tasks
Tracked by #6581
problame opened this issue Apr 4, 2024 · 2 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

problame commented Apr 4, 2024

Problem

posix_spawn uses CLONE_VFORK under the hood, which freezes the entire parent process until the child does the exec.

This issue tracks the need to avoid the latency impact of CLONE_VFORK on the parent process.

Quantifying The Impact

Before jumping to implementations, verify that the problem described above actually has measurable impact.

Idea to do that: write a test program that

  • serves a static HTTP GET response using tokio + hyper
  • on the same tokio runtime, uses std::process::Command to posix_spawn the /bin/true program at a steady, configurable rate

Throw wrk against the GET endpoint to measure tail latencies.

Play around with the rate of posix_spawn & observe impact on tail latencies.

We'd expect tail latencies to get worse as the spawn rate increases.
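
To make this concrete, here is a minimal sketch of such a test program. It uses plain tokio instead of hyper to keep it short; the listen address and spawn rate are illustrative, not prescribed by this issue.

```rust
// Sketch only: static HTTP responder + steady posix_spawn load on one runtime.
// Assumes: tokio = { version = "1", features = ["full"] }
use std::process::Command;
use std::time::Duration;

use tokio::io::AsyncWriteExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Spawn /bin/true at a steady, configurable rate on the same runtime.
    // On Linux, std::process::Command uses posix_spawn in the common case,
    // so each spawn should exhibit the CLONE_VFORK stall described above.
    tokio::spawn(async {
        let mut tick = tokio::time::interval(Duration::from_millis(10)); // ~100/s
        loop {
            tick.tick().await;
            let mut child = Command::new("/bin/true").spawn().expect("spawn");
            // Reap the child off the runtime so zombies don't accumulate.
            tokio::task::spawn_blocking(move || child.wait());
        }
    });

    // Serve a static GET response; throw `wrk` at this endpoint.
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (mut sock, _) = listener.accept().await?;
        tokio::spawn(async move {
            let _ = sock
                .write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\nconnection: close\r\n\r\nok")
                .await;
        });
    }
}
```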

Solution Proposal, Assuming The Impact Is Material

High-level:

  • Instead of spawning walredo processes directly, have a spawner process running next to pageserver.
  • When the pageserver needs a walredo process, it asks the spawner to create that process.

Implementation:

  • PS <> Spawner process interaction:
    • PS spawns it as a child process during startup (/proc/self/exe --mode Spawner)
    • PS and Spawner are connected via unnamed unix domain socket (socketpair())
    • Spawner has no stdio, it is for all intents and purposes a function running in a separate address space.
    • Spawner terminates if there is an error interacting with socketpair() connection.
    • When PS encounters an error on the socketpair() connection, PS SIGKILLs & waits on the old Spawner and starts a new one.
    • There is only one Spawner in existence at any time.
  • PS <> Spawner protocol:
    • PS requests Spawner to launch a process (the request specifies binary path, argv, envp)
    • Spawner creates 3 unnamed pipes for stdin, stdout, stderr of the (grand)child
    • Spawner spawns the grandchild using CLONE_PIDFD
      • Ensures only the 3 pipe file descriptors intended for the child make it into the child
    • Spawner responds to PS with
      • the pidfd of the child process
      • its end of the stdin, stdout, stderr file descriptors
    • (The file descriptors can be sent over the socketpair using cmsg; it looks esoteric, but quite a lot of software uses it. See the sketch after this list.)
  • PS interacts with the child using the pidfd + stdin/stdout/stderr pipe ends, as it does today.
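
For illustration, here is roughly what the Spawner's reply (the cmsg step mentioned above) could look like. This is a sketch, not the actual protocol: it assumes the nix crate, and the function name, payload, and error handling are placeholders.

```rust
// Sketch: pass the grandchild's pidfd and pipe ends back to the pageserver
// over the socketpair using SCM_RIGHTS. Assumes the `nix` crate (~0.27 API).
use std::io::IoSlice;
use std::os::fd::RawFd;

use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags};

fn send_spawn_reply(
    sock: RawFd,     // Spawner's end of the socketpair
    pidfd: RawFd,    // from the CLONE_PIDFD spawn of the grandchild
    stdin_w: RawFd,  // write end of the child's stdin pipe
    stdout_r: RawFd, // read end of the child's stdout pipe
    stderr_r: RawFd, // read end of the child's stderr pipe
) -> nix::Result<usize> {
    let fds = [pidfd, stdin_w, stdout_r, stderr_r];
    // SCM_RIGHTS duplicates the descriptors into the receiving process.
    let cmsgs = [ControlMessage::ScmRights(&fds)];
    // One-byte payload; a real protocol would serialize a reply header here.
    let iov = [IoSlice::new(b"ok")];
    sendmsg::<()>(sock, &iov, &cmsgs, MsgFlags::empty(), None)
}
```

On the pageserver side, a recvmsg with a control-message buffer (nix's cmsg_space! macro) recovers the four descriptors.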

Work

Preliminary

Work
problame added the c/storage/pageserver label Apr 4, 2024
problame added a commit that referenced this issue Apr 4, 2024
…ck (#7310)

part of #6628

Before this PR, we used a std::sync::RwLock to coalesce multiple
callers into one walredo spawn attempt. One thread would win the write
lock and the others would queue up at either the read() or write() lock call.

In a scenario where a compute initiates multiple getpage requests
from different Postgres backends (= different page_service conns),
and we don't have a walredo process around, this means all these
page_service handler tasks will enter the spawning code path,
one of them will do the spawning, and the others will stall their
respective executor thread because they do a blocking
read()/write() lock call.

I don't know exactly how bad the impact is in reality because
posix_spawn uses CLONE_VFORK under the hood, which means that the
entire parent process stalls anyway until the child does `exec`,
which in turn resumes the parent.

But, anyway, we won't know until we fix this issue.
And, there's definitely a future way out of stalling the
pageserver on posix_spawn, namely, forking template walredo processes
that fork again when they need to be per-tenant.
This idea is tracked in
#7320.

Changes
-------

This PR fixes that scenario by switching to use `heavier_once_cell`
for coalescing. There is a comment on the struct field that explains
it in a bit more nuance.
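
For illustration, the coalescing pattern looks roughly like this, sketched here with tokio::sync::OnceCell instead of the actual `heavier_once_cell` (whose API differs, e.g. around taking the value back out when a process dies); the type names are stand-ins:

```rust
// Sketch of async coalescing: many tasks may request the walredo process
// concurrently, but only one runs the spawn; the rest await its completion
// instead of blocking an executor thread on a std::sync lock.
use tokio::sync::OnceCell;

struct WalRedoProcess; // stand-in for the real process handle

struct WalRedoManager {
    process: OnceCell<WalRedoProcess>,
}

impl WalRedoManager {
    async fn process(&self) -> &WalRedoProcess {
        self.process
            .get_or_init(|| async {
                // spawn the walredo process here (elided)
                WalRedoProcess
            })
            .await
    }
}
```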

### Alternative Design

An alternative would be to use tokio::sync::RwLock.
I did this in the first commit in this PR branch,
before switching to `heavier_once_cell`.

Performance
-----------

I re-ran `bench_walredo` and updated the results, showing that
the changes are negligible.

For the record, the earlier commit in this PR branch that uses
`tokio::sync::RwLock` also has updated benchmark numbers, and the
results / kinds of tiny regression were equivalent to
`heavier_once_cell`.

Note that the above doesn't measure performance on the cold path, i.e.,
when we need to launch the process and coalesce. We don't have a
benchmark for that, and I don't expect any significant changes. We have
metrics and we log spawn latency, so we can monitor it in staging & prod.

Risks
-----

As "usual", replacing a std::sync primitive with something that yields
to
the executor risks exposing concurrency that was previously implicitly
limited to the number of executor threads.

This would be the first one for walredo.

The risk is that we get descheduled while the reconstruct data is
already there, which could cause reconstruct data to pile up in memory.

In practice, I think the risk is low: once we get scheduled again,
we'll likely have a walredo process ready, and there is no further
await point until walredo is complete and the reconstruct data has
been dropped.

This will change with async walredo PR #6548, and I'm well aware of it
in that PR.
problame (Contributor, Author) commented:

Quantifying The Impact

Started doing that using my sample project.

The tail latency impact is clearly visible with 500k files open (3ms on my machine).
With 0 files open, it's ~300us.

experiment-posix-spawn-tail-latency-impact.zip

problame (Contributor, Author) commented:

However, these numbers don't justify investing time into pre-spawned walredo right now.
Closing this issue.

problame closed this as not planned Apr 30, 2024