walredo spawn: avoid stalling pageserver process on CLONE_VFORK #7320
Status: Closed
Tasks: 1 of 3
Tracked by: #6581
Labels: c/storage/pageserver (Component: storage: pageserver)

Comments
problame added a commit that referenced this issue on Apr 4, 2024:
…ck (#7310)

part of #6628

Before this PR, we used a std::sync::RwLock to coalesce multiple callers on one walredo spawning. One thread would win the write lock and others would queue up either at the read() or write() lock call.

In a scenario where a compute initiates multiple getpage requests from different Postgres backends (= different page_service conns), and we don't have a walredo process around, this means all these page_service handler tasks will enter the spawning code path, one of them will do the spawning, and the others will stall their respective executor thread because they do a blocking read()/write() lock call.

I don't know exactly how bad the impact is in reality because posix_spawn uses CLONE_VFORK under the hood, which means that the entire parent process stalls anyway until the child does `exec`, which in turn resumes the parent. But, anyway, we won't know until we fix this issue. And, there's definitely a future way out of stalling the pageserver on posix_spawn, namely, forking template walredo processes that fork again when they need to be per-tenant. This idea is tracked in #7320.

Changes
-------

This PR fixes that scenario by switching to use `heavier_once_cell` for coalescing. There is a comment on the struct field that explains it in a bit more nuance.

### Alternative Design

An alternative would be to use tokio::sync::RwLock. I did this in the first commit in this PR branch, before switching to `heavier_once_cell`.

Performance
-----------

I re-ran the `bench_walredo` benchmark and updated the results, showing that the changes are negligible. For the record, the earlier commit in this PR branch that uses `tokio::sync::RwLock` also has updated benchmark numbers, and the results / kinds of tiny regression were equivalent to `heavier_once_cell`.

Note that the above doesn't measure performance on the cold path, i.e., when we need to launch the process and coalesce. We don't have a benchmark for that, and I don't expect any significant changes. We have metrics and we log spawn latency, so, we can monitor it in staging & prod.

Risks
-----

As "usual", replacing a std::sync primitive with something that yields to the executor risks exposing concurrency that was previously implicitly limited to the number of executor threads. This would be the first one for walredo. The risk is that we get descheduled while the reconstruct data is already there. That could pile up reconstruct data.

In practice, I think the risk is low because once we get scheduled again, we'll likely have a walredo process ready, and there is no further await point until walredo is complete and the reconstruct data has been dropped. This will change with async walredo PR #6548, and I'm well aware of it in that PR.
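`heavier_once_cell` is a pageserver-internal utility, but the coalescing idea the commit describes can be illustrated with a plain `tokio::sync::Mutex`. The sketch below is a simplified stand-in, not the actual neon code; `WalRedoManager`, `WalRedoProcess`, and `spawn_walredo` are hypothetical names used only for illustration:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

/// Stand-in for the real walredo process handle (pipes, pid, ...).
struct WalRedoProcess;

/// Placeholder for the expensive posix_spawn + handshake.
async fn spawn_walredo() -> Arc<WalRedoProcess> {
    Arc::new(WalRedoProcess)
}

#[derive(Default)]
struct WalRedoManager {
    /// `None` until the first caller wins the race and spawns the process.
    process: Mutex<Option<Arc<WalRedoProcess>>>,
}

impl WalRedoManager {
    /// Many page_service handlers can call this concurrently; exactly one of
    /// them performs the spawn, the rest await the async lock. Unlike a
    /// blocking std::sync::RwLock, awaiting here suspends the task and frees
    /// the executor thread instead of stalling it.
    async fn get_or_spawn(&self) -> Arc<WalRedoProcess> {
        let mut slot = self.process.lock().await;
        if let Some(p) = slot.as_ref() {
            return Arc::clone(p);
        }
        let p = spawn_walredo().await;
        *slot = Some(Arc::clone(&p));
        p
    }
}
```

An async mutex around an `Option` glosses over details the real code has to care about (for example, respawning after the process dies), but it shows why yielding to the executor avoids the stall described in the commit message.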
Started doing that using my sample project. The tail latency impact is clearly visible with 500k files open (3ms on my machine).

However, these numbers don't justify investing time into pre-spawned walredo right now.
Problem
`posix_spawn` uses `CLONE_VFORK` under the hood, which freezes the entire parent process anyway until the child does the exec.

This issue tracks the need to avoid the latency impact of `CLONE_VFORK` on the parent process.

Quantifying The Impact
Before jumping to implementations, make sure the problem above is actually valid.
Idea to do that: write a test program that

* uses `std::process::Command` to `posix_spawn` the `true` program at a steady, configurable rate, and serves a trivial HTTP GET endpoint.
* Throw `wrk` against the GET endpoint to measure tail latencies.
* Play around with the rate of `posix_spawn` & observe impact on tail latencies.

We'd expect tail latencies to get worse as the spawn rate increases.
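A minimal sketch of what such a test program could look like, using only the standard library. This is not code from this issue's actual experiments; the port, the rate knob taken from the first CLI argument, and the thread-per-connection server are arbitrary choices for illustration:

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::process::Command;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Spawn-rate knob: spawns per second, from the first CLI argument.
    let spawns_per_sec: u64 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(100)
        .max(1);

    // Background thread: spawn `true` at a steady rate. With no pre_exec
    // hooks, std::process::Command typically goes through posix_spawn on
    // Linux, i.e., the CLONE_VFORK path this issue is about.
    std::thread::spawn(move || loop {
        let _ = Command::new("true").status();
        std::thread::sleep(Duration::from_micros(1_000_000 / spawns_per_sec));
    });

    // Trivial GET endpoint for `wrk` to hammer: read (part of) the request,
    // reply with a fixed 200, close the connection.
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        std::thread::spawn(move || {
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            let _ = stream.write_all(
                b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok",
            );
        });
    }
    Ok(())
}
```

Running something like `wrk -c 64 -d 30s --latency http://127.0.0.1:8080/` at different spawn rates would then show whether the p99/p999 latencies degrade as the spawn rate increases.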
Solution Proposal, Assuming The Impact Is Material
High-level:
Implementation:
* PS starts a Spawner process (`/proc/self/exe --mode Spawner`)
* PS and the Spawner communicate over a `socketpair()`
* Requests to spawn walredo processes go over that `socketpair()` connection.
* If anything goes wrong with the `socketpair()` connection, PS SIGKILLs & waits the old Spawner and starts a new one.
* The new process is created with `CLONE_PIDFD`, and the file descriptors are passed back via `cmsg`; it looks esoteric but is used by quite a lot of software.
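To make the first two bullets concrete, here is a rough sketch of how the PS side might start such a Spawner and keep one end of the `socketpair()` for itself. It is an illustration only, assuming the `libc` crate; parking the child's end on fd 3 is an arbitrary convention of this sketch, and the `CLONE_PIDFD`/`cmsg` plumbing for handing spawned walredo processes back is not shown:

```rust
use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixStream;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Start the Spawner by re-exec'ing ourselves and hand it one end of a
/// socketpair; the pageserver keeps the other end for spawn requests.
fn start_spawner() -> std::io::Result<(Child, UnixStream)> {
    // UnixStream::pair() is a thin wrapper around socketpair(AF_UNIX, SOCK_STREAM).
    let (ps_end, spawner_end) = UnixStream::pair()?;
    let spawner_fd = spawner_end.as_raw_fd();

    let mut cmd = Command::new("/proc/self/exe");
    cmd.args(["--mode", "Spawner"]);
    unsafe {
        // In the child, move the socketpair end to a well-known fd. dup2 also
        // clears O_CLOEXEC on the new descriptor, so fd 3 survives the exec;
        // the Spawner can then reconstruct it with UnixStream::from_raw_fd(3).
        cmd.pre_exec(move || {
            if unsafe { libc::dup2(spawner_fd, 3) } == -1 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    let child = cmd.spawn()?;
    drop(spawner_end); // the parent keeps only its own end
    Ok((child, ps_end))
}
```

Note that setting a `pre_exec` hook makes std fall back from posix_spawn to fork+exec, which is fine here: the Spawner is started once at startup, and the latency-critical work is the Spawner forking walredo children afterwards and handing their descriptors back over the connection, presumably as SCM_RIGHTS control messages (the `cmsg` mentioned above).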
Work

Preliminary Work