Epic: fully eliminate latency impact of walredo spawning #6581
Labels: c/storage/pageserver (Component: storage: pageserver)
The latency impact is easily demonstrable, but not even a mid-term priority.
This is a spin-off from Epic: pageserver: spawning walredo process is slow #6565

Problem
We currently spawn walredo processes lazily, causing the first getpage request to experience a 5ms slowdown, which is the average spawn latency.
If the compute issues getpage requests concurrently (parallel query execution, prefetch), then all these concurrent requests coalesce on the 5ms long spawn, i.e., all experience a 5ms slowdown.
We also shut down walredo after idle time. If the compute is still up, and then issues a request, it experiences the 5ms outlier(s) again.
On moderately busy PSes (mid hundreds to low thousands of GetPage/second), the outliers introduced by this skew the p99.9X latencies.
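To make the coalescing concrete, here is a minimal sketch of the lazy-spawn pattern, with illustrative names (this is not the actual pageserver code): every getpage request that arrives while the spawn is in flight awaits the same initialization and pays the full spawn latency.

```rust
use std::sync::Arc;
use tokio::sync::OnceCell;

// Illustrative stand-in for the real walredo process handle.
struct WalRedoProcess;

// The slow part: fork/exec of the walredo sidecar, ~5ms on average.
async fn spawn_walredo() -> Arc<WalRedoProcess> {
    Arc::new(WalRedoProcess)
}

struct Tenant {
    walredo: OnceCell<Arc<WalRedoProcess>>,
}

impl Tenant {
    async fn get_walredo(&self) -> Arc<WalRedoProcess> {
        // All concurrent getpage requests that need walredo while the
        // spawn is in flight coalesce here: each one waits out the full
        // spawn before it can proceed.
        self.walredo.get_or_init(spawn_walredo).await.clone()
    }
}
```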
Also, there's the impact of CLONE_VFORK (posix_spawn), which freezes the entire parent until the child does exec. See #7320.

All of the above is definitely contributing to tail latencies; the question is how badly, and whether it is easier to just fix it than to quantify it.
Metrics-only solution
Only fix the metrics by counting getpage requests that needed a walredo spawn separately.
This goes in a similar direction as #3797
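A minimal sketch of what that could look like, assuming the prometheus crate; the metric name and label here are made up for illustration:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Getpage latency, partitioned by whether a walredo spawn was on the
// critical path of the request. Name and label are illustrative.
static GETPAGE_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_getpage_seconds",
        "GetPage latency, split by whether the request waited for a walredo spawn",
        &["waited_for_walredo_spawn"]
    )
    .unwrap()
});

fn observe_getpage(seconds: f64, waited_for_spawn: bool) {
    let label = if waited_for_spawn { "true" } else { "false" };
    GETPAGE_LATENCY.with_label_values(&[label]).observe(seconds);
}
```

With the spawn-affected requests in their own label, the headline p99.9X series would no longer mix the two populations.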
Partial Solution
Spawn walredo during basebackup. This solves the outlier on cold starts, and probably makes our p99.9X getpage stats look better.
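Sketched below, reusing the illustrative names from the lazy-spawn sketch above: the spawn is kicked off concurrently with streaming the basebackup, so the ~5ms overlaps with work the compute is waiting on anyway.

```rust
use std::sync::Arc;

// Continues the earlier sketch: `Tenant::get_walredo` is the illustrative
// accessor that spawns the walredo process on first use.
async fn handle_basebackup(tenant: Arc<Tenant>) {
    // Fire off the spawn in the background; an error just means we fall
    // back to lazy spawning on the first getpage request.
    let prewarm = tokio::spawn({
        let tenant = Arc::clone(&tenant);
        async move {
            let _ = tenant.get_walredo().await;
        }
    });

    // ... stream the basebackup to the compute ...

    let _ = prewarm.await;
}
```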
But cold starts + first query is dominated by cold starts, so +5ms on the first query part won't move the needle for actual UX.

Full Solution: pre-spawned pool
Hence, to completely take walredo process spawning off the critical path, we should have a pool of pre-spawned walredo processes for use by tenants.
Note that as a preliminary, #7320 should be done first to avoid API-level conflicts. However, the core ideas are orthogonal.
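For the pool itself, a minimal sketch (again with illustrative names, independent of whatever API shape #7320 settles on): a background task keeps a bounded channel topped up with ready processes, so taking one is a channel recv instead of a fork/exec.

```rust
use tokio::sync::mpsc;

struct WalRedoProcess;

// The slow part we want off the critical path (~5ms fork/exec).
async fn spawn_walredo_process() -> WalRedoProcess {
    WalRedoProcess
}

struct WalRedoPool {
    ready: mpsc::Receiver<WalRedoProcess>,
}

// `capacity` must be >= 1; it bounds how many pre-spawned processes idle.
fn start_pool(capacity: usize) -> WalRedoPool {
    let (tx, rx) = mpsc::channel(capacity);
    // Refill task: blocks on `send` while the pool is full, and spawns a
    // replacement as soon as a process is taken out.
    tokio::spawn(async move {
        loop {
            let proc = spawn_walredo_process().await;
            if tx.send(proc).await.is_err() {
                break; // pool dropped, stop refilling
            }
        }
    });
    WalRedoPool { ready: rx }
}

impl WalRedoPool {
    // A tenant that needs walredo takes a pre-spawned process; in steady
    // state this never waits on a spawn.
    async fn take(&mut self) -> Option<WalRedoProcess> {
        self.ready.recv().await
    }
}
```

A burst of tenants spawning at once would drain the pool, but each one still saves the spawn latency until the pool is empty, and the refill task starts replacing processes immediately.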
Work

Preliminaries:
- walredo.rs #6583

Partial solution: spawn walredo as part of basebackup

Full solution: pre-spawned pool
Related