pageserver: use a single tokio runtime #6555

Merged · merged 14 commits into main from problame/one-runtime on Mar 23, 2024

Conversation

problame (Contributor) commented on Jan 31, 2024

Before this PR, each core had 3 executor threads from 3 different runtimes. With this PR, we just have one runtime, with one thread per core. Switching to a single tokio runtime should reduce that effective over-commit of CPU and in theory help with tail latencies -- iff all tokio tasks are well-behaved and yield to the runtime regularly.
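
For illustration, a minimal sketch (not the PR's exact code; names are hypothetical) of what "one runtime, with one thread per core" means in tokio:

```rust
use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch: a single multi-threaded runtime with one worker
// thread per core. tokio defaults worker_threads to the number of cores;
// it is set explicitly here only for clarity.
fn build_single_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(std::thread::available_parallelism()?.get())
        .thread_name("pageserver-worker")
        .enable_all() // enable the timer and I/O drivers
        .build()
}
```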

Are All Tasks Well-Behaved? Are We Ready?

Sadly, there doesn't seem to be good out-of-the-box tokio tooling to answer this question.

We believe all tasks are well behaved in today's code base, as of the switch to virtual_file_io_engine = "tokio-epoll-uring" in production (https://github.com/neondatabase/aws/pull/1121).

The only remaining executor-thread-blocking code is walredo and some filesystem namespace operations.

Work on filesystem namespace operations is tracked in #6663; it is not considered likely to actually block at this time.

Regarding walredo, it currently does a blocking poll for read/write to the pipe file descriptors we use for IPC with the walredo process.
There is an ongoing experiment to make walredo async (#6628), but it needs more time because there are surprisingly tricky trade-offs that are articulated in that PR's description (which itself is still WIP). What's relevant for this PR is that

  1. walredo is always CPU-bound
  2. production tail latencies for walredo request-response (pageserver_wal_redo_seconds_bucket) are
  • p90: with few exceptions, low hundreds of microseconds
  • p95: except on very packed pageservers, below 1ms
  • p99: all below 50ms, vast majority below 1ms
  • p99.9: almost all around 50ms, rarely at >= 70ms
  • Dashboard Link

The latencies below 1ms are below our current threshold for when we start thinking about yielding to the executor.
The tens-of-milliseconds stalls aren't great but, not least because of the implicit overcommit of CPU by the three runtimes, we can't be sure whether these tens of milliseconds are inherently necessary for the walredo work or whether we could be faster if there were less contention for CPU.

On the first item (walredo always being CPU-bound): it means that walredo processes will always compete with the executor threads for CPU.
We could yield, using async walredo, but then we hit the trade-offs explained in that PR.
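
For context, a schematic sketch of the blocking IPC pattern described above (hypothetical names; this is not the actual walredo code):

```rust
use std::io::{Read, Write};
use std::process::{ChildStdin, ChildStdout};

// Hypothetical sketch: a synchronous request/response over the pipes to the
// walredo child process. While read_exact waits for the reply, the calling
// tokio executor thread cannot run any other task.
fn redo_request_blocking(
    stdin: &mut ChildStdin,
    stdout: &mut ChildStdout,
    request: &[u8],
    response: &mut [u8],
) -> std::io::Result<()> {
    stdin.write_all(request)?;
    stdout.read_exact(response)?; // blocks the executor thread
    Ok(())
}
```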

tl;dr: the risk of stalling executor threads through blocking walredo seems low, and switching to one runtime cleans up one potential source for higher-than-necessary stall times (explained in the previous paragraphs).

Code Changes

  • Remove the 3 different runtime definitions.
  • Add a new definition called THE_RUNTIME.
  • Use it in all places that previously used one of the 3 removed runtimes.
  • Remove the runtime argument from task_mgr.
  • Fix failpoint usage where pausable_failpoint! should have been used. We encountered some actual failures because of this, e.g., hung get_metric() calls during test teardown that would client-timeout after 300s.

As indicated by the comment above THE_RUNTIME, we could take this clean-up further.
But before we create so much churn, let's first validate that there's no perf regression.
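
A minimal sketch of what such a single shared runtime definition can look like (assuming once_cell; the PR's actual THE_RUNTIME may differ in details):

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Hypothetical sketch: one process-wide runtime replacing the three
// per-purpose runtime definitions that existed before this PR.
pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to create the pageserver runtime")
});

// Call sites that previously picked one of the three runtimes now all do:
//     THE_RUNTIME.spawn(async move { /* task */ });
```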

Performance

We will test this in staging using the various nightly benchmark runs.

However, the worst-case impact of this change is likely compaction (=>image layer creation) competing with compute requests.
Image layer creation work can't be easily generated & repeated quickly by pagebench.
So, we'll simply watch getpage & basebackup tail latencies in staging.

Additionally, I have done manual benchmarking using pagebench.
Report: https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4
Tail latencies and throughput are marginally better (no regression = good),
except in a workload with 128 clients against one tenant.
There, the p99.9 and p99.99 getpage latencies are about 2x worse (at slightly lower throughput).
A dip in throughput every 20s (the compaction_period) is clearly visible and probably responsible for that worse tail latency.
This has potential to improve with async walredo, and is an edge case workload anyway.

Future Work

  1. Once this change has shown satisfying results in production, change the codebase to use the ambient runtime instead of explicitly referencing THE_RUNTIME (see the sketch after this list).
  2. Have a mode where we run with a single-threaded runtime, so we uncover executor stalls more quickly.
  3. Switch to, or write our own, async-native failpoints library: async-native & less-global failpoints #7216
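
On future-work item 1, a minimal sketch of the difference between explicit and ambient runtime usage (hypothetical names; assuming the caller is already running on the runtime):

```rust
use tokio::runtime::Handle;

// Hypothetical sketch for future-work item 1.
async fn spawn_side_task() {
    // Explicit reference (the state after this PR):
    //     THE_RUNTIME.spawn(async { /* work */ });

    // Ambient runtime: tokio::spawn() and Handle::current() resolve to
    // whichever runtime the calling task is already running on.
    let _ = tokio::spawn(async { /* work */ });
    let _ = Handle::current().spawn(async { /* more work */ });
}
```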

Base automatically changed from problame/integrate-tokio-epoll-uring/benchmarking/2024-01-31-prs/refactor-signals to main January 31, 2024 23:25
problame added a commit that referenced this pull request Jan 31, 2024
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](#6555).
@problame problame added the run-no-ci label ("Don't run any CI for this PR.") Mar 22, 2024

github-actions bot commented Mar 22, 2024

2718 tests run: 2581 passed, 0 failed, 137 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_timeline_size_quota_on_startup: release

Code coverage* (full report)

  • functions: 28.1% (6279 of 22338 functions)
  • lines: 46.9% (44176 of 94105 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a7fd969 at 2024-03-23T16:28:43.276Z :recycle:

@problame problame force-pushed the problame/one-runtime branch from ac88917 to 3426680 on March 22, 2024 18:14
@problame problame removed the run-no-ci label ("Don't run any CI for this PR.") Mar 22, 2024
@problame problame changed the title from "[DO NOT MERGE] only one tokio runtime in pageserver" to "only one tokio runtime in pageserver" Mar 22, 2024
@problame problame changed the title from "only one tokio runtime in pageserver" to "only have one tokio runtime in pageserver" Mar 22, 2024
@problame problame force-pushed the problame/one-runtime branch from 3426680 to 9e2a805 on March 22, 2024 18:56
@problame problame mentioned this pull request Mar 22, 2024
(I suspect something about shutdown signalling is the
root cause for the failure of test_lazy_attach_activation in CI which I
can't reproduce locally)
@problame problame force-pushed the problame/one-runtime branch from 02f851f to 945f3b5 on March 23, 2024 13:50
Before this patch, we would leave the
`timeline-calculate-logical-size-pause` failpoint in `pause` mode at the
end of the test.

With the switch to a single runtime, somehow we'd end up in a place where
the pageserver was half shut down while the failpoint spawn_blocking
thread was waiting for the `off` event that never arrived.

Failures were reproducible quite well in CI: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6555/8396322235/index.html#suites/c39429f093f87547b2a3b0943e2522d9/4dacb1efb232b98/

I couldn't repro it locally.

I managed to repro it once on an i3en.3xlarge, where I then attached
gdb to capture the backtrace.

For posterity: https://www.notion.so/neondatabase/debug-test_lazy_attach_activation-teardown-hang-as-part-of-PR-6555-421cb61dc45d4d4e90220c86567f50da?pvs=4
Possibly the last commit was a red herring (still a good cleanup though).

However, this one here is a definite executor stall at the failpoint.

And it makes sense that it would cause the shutdown timeouts with the
unified runtime, because the walreceiver code previously ran on a
separate runtime.
Methodology:
1. search for `"pause"` and fix as appropriate
2. search for `=pause` => no further fixes found

Future work: audit all uses of fail::fail_point! / auto-convert them all
to use pausable_failpoint! (or implement our own failpoints library)
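
For context, a heavily hedged sketch of the distinction made above (the pageserver's real pausable_failpoint! may be implemented differently; this assumes the tokio and fail crates): the fail crate's "pause" action blocks the calling OS thread until the failpoint is reconfigured, so in async code the wait has to be moved off the executor:

```rust
// Hypothetical sketch of an async-friendly pause failpoint. A plain
// fail::fail_point!($name) with a configured "pause" action blocks the
// calling OS thread, stalling the executor when invoked from async code.
macro_rules! pausable_failpoint {
    ($name:literal) => {
        tokio::task::spawn_blocking(|| {
            // The blocking wait happens on tokio's blocking-thread pool,
            // not on an executor thread.
            fail::fail_point!($name);
        })
        .await
        .expect("spawn_blocking task panicked")
    };
}
```
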
@problame problame force-pushed the problame/one-runtime branch from 945f3b5 to 6dca7c6 on March 23, 2024 14:00
@problame problame changed the title from "only have one tokio runtime in pageserver" to "only use a single tokio runtime in pageserver" Mar 23, 2024
@problame problame changed the title from "only use a single tokio runtime in pageserver" to "pageserver: use a single tokio runtime" Mar 23, 2024
@problame problame marked this pull request as ready for review March 23, 2024 16:44
@problame problame requested a review from a team as a code owner March 23, 2024 16:44
@problame problame requested review from VladLazar and koivunej and removed request for VladLazar March 23, 2024 16:44
@problame problame merged commit 3220f83 into main Mar 23, 2024
54 checks passed
@problame problame deleted the problame/one-runtime branch March 23, 2024 18:25
problame (Contributor, Author) commented:

Nightly pgbench runs in staging look unchanged 👍

problame added a commit that referenced this pull request Mar 26, 2024
The nightly benchmarks looked unchanged but the production-like
cloudbench in eu-west-1 is showing a bunch of not-yet-fully-understood
symptoms:

https://neondb.slack.com/archives/C06K38EB05D/p1711447656987039?thread_ts=1711445451.915969&cid=C06K38EB05D

So, revert the commit and ship a release without it.

This reverts commit 3220f83.
problame added a commit that referenced this pull request Apr 8, 2024
This PR is an off-by-default revision v2 of the (since-reverted) PR
#6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`.

See that PR for details on why running with a single runtime is
desirable and why we should be ready.

We reverted #6555 because it showed regressions in prodlike cloudbench,
see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1`
for more context.

This PR makes it an opt-in choice via an env var.

The default is to use the 4 separate runtimes that we have today, so there
shouldn't be any performance change.

I tested manually that the env var & added metric work.

```
# undefined env var => no change to before this PR, uses 4 runtimes
./target/debug/neon_local start
# defining the env var enables one-runtime mode; the value selects that runtime's configuration
NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start
```
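
A minimal sketch (hypothetical names, not necessarily the PR's exact parsing code) of how such an env-var value could map to a tokio runtime configuration:

```rust
use tokio::runtime::{Builder, Runtime};

// Hypothetical sketch: map "current_thread", "multi_thread:<N>", or
// "multi_thread:default" to a runtime configuration.
fn runtime_from_env_value(value: &str) -> Result<Runtime, String> {
    let mut builder = match value.split_once(':') {
        None if value == "current_thread" => Builder::new_current_thread(),
        Some(("multi_thread", "default")) => Builder::new_multi_thread(),
        Some(("multi_thread", n)) => {
            let n: usize = n.parse().map_err(|e| format!("bad thread count: {e}"))?;
            let mut b = Builder::new_multi_thread();
            b.worker_threads(n);
            b
        }
        _ => return Err(format!("unrecognized runtime spec: {value}")),
    };
    builder.enable_all().build().map_err(|e| e.to_string())
}
```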

I want to use this change to do more manual testing and potentially
testing in staging.

Future Work
-----------

Testing / deployment ergonomics would be better if this were a variable
in `pageserver.toml`.
It can be done, but I don't need it right now, so let's stick with the
env var.