add async walredo mode (disabled-by-default, opt-in via config) #6548
Conversation
2760 tests run: 2642 passed, 0 failed, 118 skipped (full report)

Code coverage* (full report)

* collected from Rust tests only. The comment gets automatically updated with the latest test results.

cecc9bc at 2024-04-15T19:58:53.464Z
We should add to `output.pending_responses` a way to mark that a response will never be read, for cancelled requests... Sadly, this matters for those requests which are still waiting on `self.stdout.lock().await`, so they cannot just "push" something there. Alternatives are fine, such as "log an error from scopeguard". (My http testing endpoint is probably the only one which could be dropped right now; actually no, it's not, because we spawn all requests.)

Similarly, it would be really bad to have the stdin writing be cancelled midway.

Otherwise this is looking good, please ping me when ready.
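For illustration only (a hypothetical helper, not code from this PR): the "log an error from scopeguard" alternative could look roughly like the following, assuming a mutex-guarded `ChildStdin` and the `scopeguard`/`tracing` crates.

```rust
use scopeguard::ScopeGuard;
use tokio::io::AsyncWriteExt;
use tokio::process::ChildStdin;
use tokio::sync::Mutex;

// If this future is dropped (i.e. the request is cancelled) at either .await
// point, the guard's closure runs on drop and logs the problem instead of
// silently leaving the pipe protocol in a torn state.
async fn write_request(stdin: &Mutex<ChildStdin>, request: &[u8]) -> std::io::Result<()> {
    let guard = scopeguard::guard((), |()| {
        tracing::error!("walredo request cancelled mid-write; pipe may be in a torn state");
    });
    let result = stdin.lock().await.write_all(request).await;
    // Disarm the guard: we ran to completion (success or IO error), we were
    // not cancelled.
    ScopeGuard::into_inner(guard);
    result
}
```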
Good catch with the cancellation problem. Pushed changes, please have another look.
Joonas has found the performance of this PR to be much worse; DM: https://neondb.slack.com/archives/D049K7HJ9JM/p1707009731704869
…kio-epoll-uring/benchmarking/2024-01-31-prs/async-walredo Tricky to merge because I had split up walredo.rs in the meantime.
…rtial-revert this if runtime reconfig is needed for benching)
@koivunej the PR changed significantly since your last review, re-requesting it. I think we have established context in various meetings / calls in the last couple of weeks.
Addressed your review comments, see the latest batch of pushes. I wonder if you explicitly reviewed the diff of …? I just did, and noticed one forgotten difference that might impact the performance evaluation: the … I'll do a quick test with …
Ack, going through the diff now.
Approving again, diff looks reasonable.
…rking/2024-01-31-prs/async-walredo
Before this PR, the `nix::poll::poll` call would stall the executor.

This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC.

The `sync` variant remains the default for now; we'll do more testing in staging and a gradual rollout to prod using the config variable.
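For illustration (this is not the PR's actual implementation, which lives in the refactored `walredo::process` module), a minimal sketch of the async IPC pattern with `tokio::process`:

```rust
use std::process::Stdio;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::{Child, Command};

// Minimal sketch: spawn a child with piped stdio, write a request to its
// stdin, and read a fixed-size response from its stdout. Each .await yields
// back to the executor instead of stalling it, unlike poll(2)-based code.
async fn roundtrip(
    command: &mut Command,
    request: &[u8],
    response: &mut [u8],
) -> std::io::Result<Child> {
    let mut child = command
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = child.stdin.take().expect("stdin is piped");
    let mut stdout = child.stdout.take().expect("stdout is piped");
    stdin.write_all(request).await?;
    stdin.flush().await?;
    stdout.read_exact(response).await?;
    Ok(child)
}
```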
## Performance

I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1..128 tokio tasks has lower latency and higher throughput (a schematic version of the harness is sketched below).

I further did manual, less-micro benchmarking in the real pageserver binary.
Methodology & results are published here:
https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4

tl;dr: `async` seems significantly more CPU-efficient at ca. `N = [0.5 * ncpus, 1.5 * ncpus]`, but worse than `sync` outside of that band.
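A hypothetical, self-contained sketch of the benchmark's shape (the real harness is pageserver's `bench_walredo.rs`; `Manager` and the iteration counts here are made up): N tokio tasks share one manager, and we report aggregate throughput per N.

```rust
use std::sync::Arc;
use std::time::Instant;

struct Manager;

impl Manager {
    async fn request(&self) {
        // Stand-in for one walredo roundtrip.
        tokio::task::yield_now().await;
    }
}

#[tokio::main]
async fn main() {
    for n in [1usize, 2, 4, 8, 16, 32, 64, 128] {
        let mgr = Arc::new(Manager);
        let start = Instant::now();
        let handles: Vec<_> = (0..n)
            .map(|_| {
                let mgr = Arc::clone(&mgr);
                tokio::spawn(async move {
                    for _ in 0..10_000 {
                        mgr.request().await;
                    }
                })
            })
            .collect();
        for h in handles {
            h.await.unwrap();
        }
        let rps = (n * 10_000) as f64 / start.elapsed().as_secs_f64();
        println!("N={n}: {rps:.0} requests/s");
    }
}
```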
## Mental Model For Walredo & Scheduler Interactions

Walredo is CPU-/DRAM-only work.
This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable.

To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process threads are just `struct task_struct`s, and it will divide CPU time fairly among them.

In `sync` mode, there are always `$ncpus` runnable `struct task_struct`s, because the executor thread blocks while `walredo` runs, and becomes runnable again when the `walredo` process is done handling the request.

In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver.

The above means that in `sync` mode, there is an implicit limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`). And executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable ping-pong.

In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler.

If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync`, because one executor thread can continue processing requests while a walredo request is in flight.

If we're CPU-bound, under a fair CPU scheduler, the fixed number of executor threads has to share CPU time with the aggregate of walredo processes.
It's trivial to reason about this in `sync` mode due to the blocked-runnable ping-pong.
In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optimal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo processes.

## Why `async` mode Doesn't Limit Walredo Concurrency

To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests.
However, the placement of such a semaphore is non-trivial, because it means that tasks queuing up behind it hold on to their request-scoped allocations.
In the case of walredo, that might be the entire reconstruct data.
We don't limit the total number of inflight `Timeline::get` calls (we only throttle admission).
So, that queue might lead to an OOM.

The alternative is to acquire the semaphore permit before collecting reconstruct data.
However, what if we need to on-demand download?

A combination of semaphores might help: one for reconstruct data, one for walredo.
The reconstruct-data semaphore permit is dropped after acquiring the walredo semaphore permit (see the sketch below).
This scheme effectively enables both a limit on in-flight reconstruct data and a limit on walredo concurrency.
However, sizing the number of permits for the semaphores is tricky.
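A minimal sketch of that combined scheme, assuming hypothetical names and made-up permit counts (this is not the pageserver's actual API):

```rust
use tokio::sync::Semaphore;

// Two-semaphore scheme: one bounds in-flight reconstruct data, one bounds
// walredo concurrency. The first permit is released only once the second
// is held, so a task never queues on walredo while pinning reconstruct data
// longer than necessary.
async fn get_page(
    reconstruct_sem: &Semaphore,
    walredo_sem: &Semaphore,
) -> Result<(), tokio::sync::AcquireError> {
    // Bound how much reconstruct data may be held in memory at once.
    let reconstruct_permit = reconstruct_sem.acquire().await?;
    // ... collect reconstruct data here, possibly on-demand downloading layers ...

    // Bound walredo concurrency.
    let _walredo_permit = walredo_sem.acquire().await?;
    // Now that we hold the walredo permit, stop counting against the
    // reconstruct-data limit so other tasks can start collecting.
    drop(reconstruct_permit);
    // ... run walredo here ...
    Ok(())
}
```

Sizing `reconstruct_sem` relative to `walredo_sem` is exactly the tricky part mentioned above.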
It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now.
## Future Work

We will do more testing of `async` mode and a gradual rollout to prod using the config flag.
Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR, and the config flag will be removed.

The `wait()` for the child process to exit is still synchronous; the comment here is still a valid argument in favor of that.

The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget.
But with `async` mode, that's no longer the case: to tokio, the writes to the child process pipe look like IO.
We could/should inform tokio about the CPU time budget consumed by the task, to achieve fairness similar to `sync`.
However, the runtime function for this is gated behind `tokio_unstable`.
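Assuming the function meant here is `tokio::task::consume_budget` (my reading, not confirmed by the PR; it requires building with the `tokio_unstable` cfg), the usage would look roughly like:

```rust
// Sketch, assuming tokio::task::consume_budget and a build with
// RUSTFLAGS="--cfg tokio_unstable". After each walredo roundtrip, charge the
// calling task a unit of coop budget so it yields roughly as if it had done
// the redo work on-thread, similar to the old sync mode.
async fn replay_with_budget(do_roundtrip: impl std::future::Future<Output = ()>) {
    do_roundtrip.await;
    tokio::task::consume_budget().await;
}
```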
## Refs

- refs #6628
- refs #2975