pageserver: use a single tokio runtime #6555

Merged · merged 14 commits into main from problame/one-runtime on Mar 23, 2024

Conversation

problame (Contributor) commented on Jan 31, 2024

Before this PR, each core had 3 executor threads from 3 different runtimes. With this PR, we just have one runtime, with one thread per core. Switching to a single tokio runtime should reduce that effective over-commit of CPU and in theory help with tail latencies -- iff all tokio tasks are well-behaved and yield to the runtime regularly.
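
For illustration, a minimal sketch (not the PR's exact code; names are hypothetical) of what "one runtime, with one thread per core" means in tokio:

```rust
use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch: a single multi-threaded runtime with one worker
// thread per core. tokio defaults worker_threads to the number of cores;
// it is set explicitly here only for clarity.
fn build_single_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(std::thread::available_parallelism()?.get())
        .thread_name("pageserver-worker")
        .enable_all() // enable the timer and I/O drivers
        .build()
}
```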

Are All Tasks Well-Behaved? Are We Ready?

Sadly, there doesn't seem to be good out-of-the-box tokio tooling to answer this question.

We believe all tasks are well behaved in today's code base, as of the switch to virtual_file_io_engine = "tokio-epoll-uring" in production (https://github.com/neondatabase/aws/pull/1121).

The only remaining executor-thread-blocking code is walredo and some filesystem namespace operations.

Work on filesystem namespace operations is tracked in #6663; it is not considered likely to actually block at this time.

Regarding walredo, it currently does a blocking poll for read/write to the pipe file descriptors we use for IPC with the walredo process.
There is an ongoing experiment to make walredo async (#6628), but it needs more time because there are surprisingly tricky trade-offs that are articulated in that PR's description (which itself is still WIP). What's relevant for this PR is that

  1. walredo is always CPU-bound
  2. production tail latencies for walredo request-response (pageserver_wal_redo_seconds_bucket) are
  • p90: with few exceptions, low hundreds of microseconds
  • p95: except on very packed pageservers, below 1ms
  • p99: all below 50ms, vast majority below 1ms
  • p99.9: almost all around 50ms, rarely at >= 70ms
  • Dashboard Link

The latencies below 1ms are below our current threshold for when we start thinking about yielding to the executor.
The tens-of-milliseconds stalls aren't great but, not least because of the implicit overcommit of CPU by the three runtimes, we can't be sure whether these tens of milliseconds are inherently necessary for the walredo work or whether we could be faster if there were less contention for CPU.

On the first item (walredo always being CPU-bound): it means that walredo processes will always compete with the executor threads for CPU.
We could yield, using async walredo, but then we hit the trade-offs explained in that PR.
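
For context, a schematic sketch of the blocking IPC pattern described above (hypothetical names; this is not the actual walredo code):

```rust
use std::io::{Read, Write};
use std::process::{ChildStdin, ChildStdout};

// Hypothetical sketch: a synchronous request/response over the pipes to the
// walredo child process. While read_exact waits for the reply, the calling
// tokio executor thread cannot run any other task.
fn redo_request_blocking(
    stdin: &mut ChildStdin,
    stdout: &mut ChildStdout,
    request: &[u8],
    response: &mut [u8],
) -> std::io::Result<()> {
    stdin.write_all(request)?;
    stdout.read_exact(response)?; // blocks the executor thread
    Ok(())
}
```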

tl;dr: the risk of stalling executor threads through blocking walredo seems low, and switching to one runtime cleans up one potential source for higher-than-necessary stall times (explained in the previous paragraphs).

Code Changes

  • Remove the 3 different runtime definitions.
  • Add a new definition called THE_RUNTIME.
  • Use it in all places that previously used one of the 3 removed runtimes.
  • Remove the runtime argument from task_mgr.
  • Fix failpoint usage where pausable_failpoint! should have been used. We encountered some actual failures because of this, e.g., hung get_metric() calls during test teardown that would client-timeout after 300s.

As indicated by the comment above THE_RUNTIME, we could take this clean-up further.
But before we create so much churn, let's first validate that there's no perf regression.
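
A minimal sketch of what such a single shared runtime definition can look like (assuming once_cell; the PR's actual THE_RUNTIME may differ in details):

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Hypothetical sketch: one process-wide runtime replacing the three
// per-purpose runtime definitions that existed before this PR.
pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to create the pageserver runtime")
});

// Call sites that previously picked one of the three runtimes now all do:
//     THE_RUNTIME.spawn(async move { /* task */ });
```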

Performance

We will test this in staging using the various nightly benchmark runs.

However, the worst-case impact of this change is likely compaction (=>image layer creation) competing with compute requests.
Image layer creation work can't be easily generated & repeated quickly by pagebench.
So, we'll simply watch getpage & basebackup tail latencies in staging.

Additionally, I have done manual benchmarking using pagebench.
Report: https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4
Tail latencies and throughput are marginally better (no regression = good),
except in a workload with 128 clients against one tenant.
There, the p99.9 and p99.99 getpage latencies are about 2x worse (at slightly lower throughput).
A dip in throughput every 20s (the compaction_period) is clearly visible and probably responsible for that worse tail latency.
This has potential to improve with async walredo, and is an edge case workload anyway.

Future Work

  1. Once this change has shown satisfying results in production, change the codebase to use the ambient runtime instead of explicitly referencing THE_RUNTIME (see the sketch after this list).
  2. Have a mode where we run with a single-threaded runtime, so we uncover executor stalls more quickly.
  3. Switch to, or write our own, async-native failpoints library: async-native & less-global failpoints #7216
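
On future-work item 1, a minimal sketch of the difference between explicit and ambient runtime usage (hypothetical names; assuming the caller is already running on the runtime):

```rust
use tokio::runtime::Handle;

// Hypothetical sketch for future-work item 1.
async fn spawn_side_task() {
    // Explicit reference (the state after this PR):
    //     THE_RUNTIME.spawn(async { /* work */ });

    // Ambient runtime: tokio::spawn() and Handle::current() resolve to
    // whichever runtime the calling task is already running on.
    let _ = tokio::spawn(async { /* work */ });
    let _ = Handle::current().spawn(async { /* more work */ });
}
```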

Base automatically changed from problame/integrate-tokio-epoll-uring/benchmarking/2024-01-31-prs/refactor-signals to main January 31, 2024 23:25
problame added a commit that referenced this pull request Jan 31, 2024
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](#6555).
@problame problame added the run-no-ci label ("Don't run any CI for this PR.") Mar 22, 2024

github-actions bot commented Mar 22, 2024

2718 tests run: 2581 passed, 0 failed, 137 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_timeline_size_quota_on_startup: release

Code coverage* (full report)

  • functions: 28.1% (6279 of 22338 functions)
  • lines: 46.9% (44176 of 94105 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a7fd969 at 2024-03-23T16:28:43.276Z :recycle:

@problame problame force-pushed the problame/one-runtime branch from ac88917 to 3426680 on March 22, 2024 18:14
@problame problame removed the run-no-ci label ("Don't run any CI for this PR.") Mar 22, 2024
@problame problame changed the title from "[DO NOT MERGE] only one tokio runtime in pageserver" to "only one tokio runtime in pageserver" Mar 22, 2024
@problame problame changed the title from "only one tokio runtime in pageserver" to "only have one tokio runtime in pageserver" Mar 22, 2024
@problame problame force-pushed the problame/one-runtime branch from 3426680 to 9e2a805 on March 22, 2024 18:56
@problame problame mentioned this pull request Mar 22, 2024
(I suspect something about shutdown signalling is the
root cause for the failure of test_lazy_attach_activation in CI which I
can't reproduce locally)
@problame problame force-pushed the problame/one-runtime branch from 02f851f to 945f3b5 on March 23, 2024 13:50
Before this patch, we would leave the
`timeline-calculate-logical-size-pause` failpoint in `pause` mode at the
end of the test.

With the switch to a single runtime, somehow we'd end up in a place where
the pageserver was half shut down while the failpoint spawn_blocking
thread was waiting for the `off` event that never arrived.

Failures were reproducible quite well in CI: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6555/8396322235/index.html#suites/c39429f093f87547b2a3b0943e2522d9/4dacb1efb232b98/

I couldn't repro it locally.

I managed to repro it once on an i3en.3xlarge, where I then attached
gdb to capture the backtrace.

For posterity: https://www.notion.so/neondatabase/debug-test_lazy_attach_activation-teardown-hang-as-part-of-PR-6555-421cb61dc45d4d4e90220c86567f50da?pvs=4
Possibly the last commit was a red herring (still a good cleanup though).

However, this one here is a definite executor stall at the failpoint.

And it makes sense that it would cause the shutdown timeouts with the
unified runtime, because the walreceiver code previously ran on a
separate runtime.
Methodology:
1. search for `"pause"` and fix as appropriate
2. search for `=pause` => no further fixes found

Future work: audit all uses of fail::fail_point! / auto-convert them all
to use pausable_failpoint! (or implement our own failpoints library)
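
For context, a heavily hedged sketch of the distinction made above (the pageserver's real pausable_failpoint! may be implemented differently; this assumes the tokio and fail crates): the fail crate's "pause" action blocks the calling OS thread until the failpoint is reconfigured, so in async code the wait has to be moved off the executor:

```rust
// Hypothetical sketch of an async-friendly pause failpoint. A plain
// fail::fail_point!($name) with a configured "pause" action blocks the
// calling OS thread, stalling the executor when invoked from async code.
macro_rules! pausable_failpoint {
    ($name:literal) => {
        tokio::task::spawn_blocking(|| {
            // The blocking wait happens on tokio's blocking-thread pool,
            // not on an executor thread.
            fail::fail_point!($name);
        })
        .await
        .expect("spawn_blocking task panicked")
    };
}
```
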
@problame problame force-pushed the problame/one-runtime branch from 945f3b5 to 6dca7c6 on March 23, 2024 14:00
@problame problame changed the title from "only have one tokio runtime in pageserver" to "only use a single tokio runtime in pageserver" Mar 23, 2024
@problame problame changed the title from "only use a single tokio runtime in pageserver" to "pageserver: use a single tokio runtime" Mar 23, 2024
@problame problame marked this pull request as ready for review March 23, 2024 16:44
@problame problame requested a review from a team as a code owner March 23, 2024 16:44
@problame problame requested review from VladLazar and koivunej and removed request for VladLazar March 23, 2024 16:44
@problame problame merged commit 3220f83 into main Mar 23, 2024
54 checks passed
@problame problame deleted the problame/one-runtime branch March 23, 2024 18:25
problame (Contributor, Author) commented:

Nightly pgbench runs in staging look unchanged 👍

problame added a commit that referenced this pull request Mar 26, 2024
The nightly benchmarks looked unchanged but the production-like
cloudbench in eu-west-1 is showing a bunch of not-yet-fully-understood
symptoms:

https://neondb.slack.com/archives/C06K38EB05D/p1711447656987039?thread_ts=1711445451.915969&cid=C06K38EB05D

So, revert the commit and ship a release without it.

This reverts commit 3220f83.
problame added a commit that referenced this pull request Apr 8, 2024
This PR is an off-by-default revision v2 of the (since-reverted) PR
#6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`.

See that PR for details on why running with a single runtime is
desirable and why we should be ready.

We reverted #6555 because it showed regressions in prodlike cloudbench,
see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1`
for more context.

This PR makes it an opt-in choice via an env var.

The default is to use the 4 separate runtimes that we have today, so there
shouldn't be any performance change.

I tested manually that the env var & added metric work.

```
# undefined env var => no change to before this PR, uses 4 runtimes
./target/debug/neon_local start
# defining the env var enables one-runtime mode; the value selects that runtime's configuration
NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start
```
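
A minimal sketch (hypothetical names, not necessarily the PR's exact parsing code) of how such an env-var value could map to a tokio runtime configuration:

```rust
use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch: map "current_thread", "multi_thread:<N>", or
// "multi_thread:default" to a runtime configuration.
fn runtime_from_env_value(value: &str) -> Result<Runtime, String> {
    let mut builder = match value.split_once(':') {
        None if value == "current_thread" => Builder::new_current_thread(),
        Some(("multi_thread", "default")) => Builder::new_multi_thread(),
        Some(("multi_thread", n)) => {
            let n: usize = n.parse().map_err(|e| format!("bad thread count: {e}"))?;
            let mut b = Builder::new_multi_thread();
            b.worker_threads(n);
            b
        }
        _ => return Err(format!("unrecognized runtime spec: {value}")),
    };
    builder.enable_all().build().map_err(|e| e.to_string())
}
```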

I want to use this change to do more manual testing and potentially
testing in staging.

Future Work
-----------

Testing / deployment ergonomics would be better if this were a variable
in `pageserver.toml`.
It can be done, but I don't need it right now, so let's stick with the
env var.