Epic: fully eliminate latency impact of walredo spawning #6581
Labels: c/storage/pageserver (Component: storage: pageserver)
The latency impact is easily demonstrable, but not even a mid-term priority.
This is a spin-off from Epic: pageserver: spawning walredo process is slow #6565

Problem
We currently spawn walredo processes lazily, causing the first getpage request to experience a 5ms slowdown, which is the average spawn latency.
If the compute issues getpage requests concurrently (parallel query execution, prefetch), then all these concurrent requests coalesce on the 5ms long spawn, i.e., all experience a 5ms slowdown.
We also shut down walredo after idle time. If the compute is still up, and then issues a request, it experiences the 5ms outlier(s) again.
On moderately busy PSes (mid hundreds to low thousands of GetPage/second), the outliers introduced by this skew the p99.9X latencies.
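To make the coalescing concrete, here is a minimal sketch of the lazy-spawn pattern, with illustrative names (this is not the actual pageserver code): every getpage request that arrives while the spawn is in flight awaits the same initialization and pays the full spawn latency.

```rust
use std::sync::Arc;
use tokio::sync::OnceCell;

// Illustrative stand-in for the real walredo process handle.
struct WalRedoProcess;

// The slow part: fork/exec of the walredo sidecar, ~5ms on average.
async fn spawn_walredo() -> Arc<WalRedoProcess> {
    Arc::new(WalRedoProcess)
}

struct Tenant {
    walredo: OnceCell<Arc<WalRedoProcess>>,
}

impl Tenant {
    async fn get_walredo(&self) -> Arc<WalRedoProcess> {
        // All concurrent getpage requests that need walredo while the
        // spawn is in flight coalesce here: each one waits out the full
        // spawn before it can proceed.
        self.walredo.get_or_init(spawn_walredo).await.clone()
    }
}
```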
Also, there's the impact of CLONE_VFORK (posix_spawn), which freezes the entire parent until the child does exec. See #7320.

All of the above is definitely contributing to tail latencies; the question is how badly, and whether it is easier to just fix it than to quantify it.
Metrics-only solution
Only fix the metrics by counting getpage requests that needed a walredo spawn separately.
This goes in a similar direction as #3797
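A minimal sketch of what that could look like, assuming the prometheus crate; the metric name and label here are made up for illustration:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Getpage latency, partitioned by whether a walredo spawn was on the
// critical path of the request. Name and label are illustrative.
static GETPAGE_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_getpage_seconds",
        "GetPage latency, split by whether the request waited for a walredo spawn",
        &["waited_for_walredo_spawn"]
    )
    .unwrap()
});

fn observe_getpage(seconds: f64, waited_for_spawn: bool) {
    let label = if waited_for_spawn { "true" } else { "false" };
    GETPAGE_LATENCY.with_label_values(&[label]).observe(seconds);
}
```

With the spawn-affected requests in their own label, the headline p99.9X series would no longer mix the two populations.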
Partial Solution
Spawn walredo during basebackup. This solves the outlier on cold starts, and probably makes our p99.9X getpage stats look better.
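Sketched below, reusing the illustrative names from the lazy-spawn sketch above: the spawn is kicked off concurrently with streaming the basebackup, so the ~5ms overlaps with work the compute is waiting on anyway.

```rust
use std::sync::Arc;

// Continues the earlier sketch: `Tenant::get_walredo` is the illustrative
// accessor that spawns the walredo process on first use.
async fn handle_basebackup(tenant: Arc<Tenant>) {
    // Fire off the spawn in the background; an error just means we fall
    // back to lazy spawning on the first getpage request.
    let prewarm = tokio::spawn({
        let tenant = Arc::clone(&tenant);
        async move {
            let _ = tenant.get_walredo().await;
        }
    });

    // ... stream the basebackup to the compute ...

    let _ = prewarm.await;
}
```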
But cold starts + first query is dominated by cold starts, so +5ms on the first query part won't move the needle for actual UX.

Full Solution: pre-spawned pool
Hence, to completely take walredo process spawning off the critical path, we should have a pool of pre-spawned walredo processes for use by tenants.
Note that as a preliminary, #7320 should be done first to avoid API-level conflicts. However, the core ideas are orthogonal.
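For the pool itself, a minimal sketch (again with illustrative names, independent of whatever API shape #7320 settles on): a background task keeps a bounded channel topped up with ready processes, so taking one is a channel recv instead of a fork/exec.

```rust
use tokio::sync::mpsc;

struct WalRedoProcess;

// The slow part we want off the critical path (~5ms fork/exec).
async fn spawn_walredo_process() -> WalRedoProcess {
    WalRedoProcess
}

struct WalRedoPool {
    ready: mpsc::Receiver<WalRedoProcess>,
}

// `capacity` must be >= 1; it bounds how many pre-spawned processes idle.
fn start_pool(capacity: usize) -> WalRedoPool {
    let (tx, rx) = mpsc::channel(capacity);
    // Refill task: blocks on `send` while the pool is full, and spawns a
    // replacement as soon as a process is taken out.
    tokio::spawn(async move {
        loop {
            let proc = spawn_walredo_process().await;
            if tx.send(proc).await.is_err() {
                break; // pool dropped, stop refilling
            }
        }
    });
    WalRedoPool { ready: rx }
}

impl WalRedoPool {
    // A tenant that needs walredo takes a pre-spawned process; in steady
    // state this never waits on a spawn.
    async fn take(&mut self) -> Option<WalRedoProcess> {
        self.ready.recv().await
    }
}
```

A burst of tenants spawning at once would drain the pool, but each one still saves the spawn latency until the pool is empty, and the refill task starts replacing processes immediately.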
Work

Preliminaries:
- walredo.rs #6583

Partial solution: spawn walredo as part of basebackup

Full solution: pre-spawned pool
Related