pageserver: add tokio-epoll-uring slots waiters queue depth metrics #9482

yliang412 · 2024-10-22T16:28:04Z

Complements neondatabase/tokio-epoll-uring#56.

Problem

We want to make the tokio-epoll-uring slots waiters queue depth observable via Prometheus.

Summary of changes

Add the pageserver_tokio_epoll_uring_slots_submission_queue_depth metric as a Histogram.
Give each thread-local tokio-epoll-uring system a LocalHistogram to record its observations.
Keep a list of Arc<ThreadLocalMetrics> that is used to flush data to the shared histogram on demand.
Extend Collector::collect to report pageserver_tokio_epoll_uring_slots_submission_queue_depth.
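
The approach summarized above (each thread records locally, a shared list of Arc handles is walked on demand to aggregate) can be sketched with only the standard library. This is a minimal illustration, not the code in this PR: `Registry`, `register`, `observe`, and the atomic counters are hypothetical stand-ins for the prometheus `Histogram`/`LocalHistogram` types, and the bucket bounds are taken from the metrics output posted in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Bucket upper bounds (powers of two, as in the PR's metrics output).
const BUCKETS: [u64; 11] = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024];

/// Per-thread counters; a hypothetical stand-in for a `LocalHistogram`.
#[derive(Default)]
struct ThreadLocalMetrics {
    /// One slot per bucket plus a final "+Inf" slot.
    counts: [AtomicU64; 12],
    sum: AtomicU64,
}

impl ThreadLocalMetrics {
    /// Record one queue-depth observation in the first bucket that fits.
    fn observe(&self, depth: u64) {
        let idx = BUCKETS
            .iter()
            .position(|&le| depth <= le)
            .unwrap_or(BUCKETS.len());
        self.counts[idx].fetch_add(1, Ordering::Relaxed);
        self.sum.fetch_add(depth, Ordering::Relaxed);
    }
}

/// Shared list of all per-thread metrics (hypothetical name).
#[derive(Default)]
struct Registry {
    locals: Mutex<Vec<Arc<ThreadLocalMetrics>>>,
}

impl Registry {
    /// Called once per thread: create its metrics and remember a handle.
    fn register(&self) -> Arc<ThreadLocalMetrics> {
        let m = Arc::new(ThreadLocalMetrics::default());
        self.locals.lock().unwrap().push(Arc::clone(&m));
        m
    }

    /// Aggregate on demand: (cumulative bucket counts incl. +Inf, sum, count).
    fn collect(&self) -> (Vec<u64>, u64, u64) {
        let locals = self.locals.lock().unwrap();
        let mut per_bucket = vec![0u64; BUCKETS.len() + 1];
        let mut sum = 0u64;
        for m in locals.iter() {
            for (i, c) in m.counts.iter().enumerate() {
                per_bucket[i] += c.load(Ordering::Relaxed);
            }
            sum += m.sum.load(Ordering::Relaxed);
        }
        // Prometheus buckets are cumulative, so build a running total.
        let mut running = 0u64;
        let cumulative: Vec<u64> = per_bucket
            .iter()
            .map(|c| {
                running += c;
                running
            })
            .collect();
        (cumulative, sum, running)
    }
}

fn main() {
    let registry = Arc::new(Registry::default());
    let handles: Vec<_> = (0..4u64)
        .map(|t| {
            let registry = Arc::clone(&registry);
            std::thread::spawn(move || {
                let local = registry.register();
                for depth in [0, 1, t] {
                    local.observe(depth);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let (cumulative, sum, count) = registry.collect();
    println!("buckets={cumulative:?} sum={sum} count={count}");
}
```

In this sketch the hot path (`observe`) never takes the lock; only registration and collection do, which is the property the per-thread design is after.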

Alternative Design Considered

Since the overall idea is to let each thread observe the metrics separately and aggregate them on demand, the thread_local crate may seem to be a good fit, because its iterator can be used to collect the thread-local metrics (the Rust standard library's thread-local storage offers no such iterator).

However, the implementation does not free the thread-local storage until the ThreadLocal<T> object is dropped. Even if thread IDs are aggressively reused, we would have at least n thread-local storage entries, where n is the maximum number of active threads over the lifetime of the ThreadLocal<T>. This is not acceptable in our use case, as this number would be num_runtime (4) times the spawn_blocking pool size (512), i.e. up to 2048 entries.

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat the commit message so that it does not include the above checklist.

Signed-off-by: Yuchen Liang <[email protected]>

github-actions · 2024-10-22T17:27:54Z

5256 tests run: 5041 passed, 0 failed, 215 skipped (full report)

Flaky tests (1)

Postgres 17

test_local_only_layers_after_crash: debug-x86-64

Code coverage* (full report)

functions: 31.3% (7689 of 24543 functions)
lines: 48.8% (60444 of 123907 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
143cc19 at 2024-10-25T20:12:55.440Z

pageserver/src/metrics.rs

yliang412 · 2024-10-25T17:28:13Z

Verified metrics output locally:

Output from localhost:9898/metrics

...
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="1"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="2"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="4"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="8"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="16"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="32"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="64"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="128"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="256"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="512"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="1024"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="+Inf"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_sum 37
pageserver_tokio_epoll_uring_slots_submission_queue_depth_count 87261
...
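
A note on reading the output above: Prometheus histogram buckets are cumulative, and here every bucket, including le="1", already holds the full count of 87261. So every observed queue depth was at most 1, and with a sum of 37 the mean depth is roughly 0.0004. The small sketch below (a hypothetical helper, not part of the PR) converts the cumulative counts into per-bucket counts to make that visible.

```rust
/// Convert cumulative Prometheus bucket counts into per-bucket counts.
/// Assumes the input is ordered by increasing `le`, ending with "+Inf".
fn per_bucket(cumulative: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    cumulative
        .iter()
        .map(|&c| {
            // Each bucket's own count is the increase over the previous one.
            let own = c.saturating_sub(prev);
            prev = c;
            own
        })
        .collect()
}

fn main() {
    // Values copied from the output above: 11 finite buckets plus "+Inf".
    let cumulative = [87261u64; 12];
    let counts = per_bucket(&cumulative);
    let mean = 37.0 / 87261.0;
    println!("per-bucket counts: {counts:?}");
    println!("mean queue depth: {mean:.6}");
}
```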

Local grafana:

sum(increase(pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket[$__rate_interval])) by (le)

## Problem

We want to make the slots waiters queue depth observable, so it is easier to tune the submission queue size (also equal to the number of inflight operations as implemented in the system).

## Summary of changes

- Introduced a `PerSystemMetrics` trait that records the slots submission queue depth. In the future, per-system metrics can be added by extending this trait.
- `tokio_epoll_uring::System` now takes the per-system metrics observer as `Arc<M> where M: PerSystemMetrics`.

## Alternative design

See neondatabase/neon#9482 for details.

---------

Signed-off-by: Yuchen Liang <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
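
The trait-based observer described in the commit message above could look roughly like the following. This is only a sketch: the trait name `PerSystemMetrics` and the `Arc<M>` parameter come from the message, while the method name, the trait bounds, the toy `System` struct, and the point at which the depth is observed are assumptions made for illustration and may not match tokio-epoll-uring.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Per-system metrics hook. Method name and bounds are assumed.
trait PerSystemMetrics: Send + Sync + 'static {
    fn observe_slots_submission_queue_depth(&self, queue_depth: u64);
}

/// Toy stand-in for `tokio_epoll_uring::System`, generic over the observer.
struct System<M: PerSystemMetrics> {
    metrics: Arc<M>,
    /// Number of operations currently waiting for a free slot.
    waiters: u64,
}

impl<M: PerSystemMetrics> System<M> {
    fn new(metrics: Arc<M>) -> Self {
        Self { metrics, waiters: 0 }
    }

    /// An operation found no free slot: report the depth it sees, then queue.
    fn enqueue_waiter(&mut self) {
        self.metrics
            .observe_slots_submission_queue_depth(self.waiters);
        self.waiters += 1;
    }

    /// A slot became free and one waiter (if any) was woken up.
    fn slot_freed(&mut self) {
        self.waiters = self.waiters.saturating_sub(1);
    }
}

/// Example observer that only tracks the maximum depth seen.
#[derive(Default)]
struct MaxDepth(AtomicU64);

impl PerSystemMetrics for MaxDepth {
    fn observe_slots_submission_queue_depth(&self, queue_depth: u64) {
        self.0.fetch_max(queue_depth, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = Arc::new(MaxDepth::default());
    let mut system = System::new(Arc::clone(&metrics));
    for _ in 0..3 {
        system.enqueue_waiter();
    }
    system.slot_freed();
    println!("max depth seen: {}", metrics.0.load(Ordering::Relaxed));
}
```

Keeping the observer behind a trait means the library does not depend on any particular metrics crate; the caller (here, the pageserver) decides how observations are stored and exported.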

pageserver: add tokio-epoll-uring slots waiters queue depth metrics

a8318de

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 self-assigned this Oct 22, 2024

yliang412 added the c/storage/pageserver (Component: storage: pageserver) label Oct 22, 2024

fix clippy

3aaedce

Signed-off-by: Yuchen Liang <[email protected]>

pass thread-local metrics observer to tokio-epoll-uring system

c8a3514

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 mentioned this pull request Oct 25, 2024

expose waiters queue depth metrics neondatabase/tokio-epoll-uring#56

Merged

reorg functions; more comments

09dd44a

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 commented Oct 25, 2024

Cargo.toml (outdated, resolved)

yliang412 requested a review from problame October 25, 2024 15:09

fix clippy

c509674

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 mentioned this pull request Oct 25, 2024

tokio-epoll-uring: use thread_local crate for better per-system metrics aggregation #9480

Closed

yliang412 marked this pull request as ready for review October 25, 2024 15:19

yliang412 requested a review from a team as a code owner October 25, 2024 15:19

yliang412 commented Oct 25, 2024

pageserver/src/metrics.rs (resolved)

yliang412 commented Oct 25, 2024

pageserver/src/metrics.rs (outdated, resolved)

problame added 8 commits October 25, 2024 19:52

eliminate possibility of calling register_histogram multiple times and remove unused Default impl

4895dcd

don't need the Weak/Arc anymore, it was an idea from an older iteration of the design

6efbdad

why the Mutex?

a463865

Revert "why the Mutex?": it's required because it's sync

eb8f190

This reverts commit a463865.

doc comment explaining ThreadLocalMetrics a bit

e0d3688

exhaustive destructure so that we get compiler errors when we add another metric

184c9c2

eagerly initialize metrics

8033d21

adjust to changes in tokio-epoll-uring PR

bdf0dcf

problame approved these changes Oct 25, 2024

Cargo.toml (outdated, resolved)

use tokio-epoll-uring main branch

143cc19

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 enabled auto-merge (squash) October 25, 2024 19:20

yliang412 merged commit 85b954f into main Oct 25, 2024
80 checks passed

yliang412 deleted the yuchen/export-tokio-epoll-uring-slots-queue-depth branch October 25, 2024 20:30
