pageserver: add tokio-epoll-uring slots waiters queue depth metrics #9482

Merged
merged 14 commits into main from yuchen/export-tokio-epoll-uring-slots-queue-depth
Oct 25, 2024

Conversation

yliang412
Contributor

@yliang412 yliang412 commented Oct 22, 2024

Complements neondatabase/tokio-epoll-uring#56.

Problem

We want to make tokio-epoll-uring slots waiters queue depth observable via Prometheus.

Summary of changes

  • Add the pageserver_tokio_epoll_uring_slots_submission_queue_depth metric as a Histogram.
  • Give each thread-local tokio-epoll-uring system a LocalHistogram to record its observations.
  • Keep a list of Arc<ThreadLocalMetrics> so that the per-thread data can be flushed to the shared histogram on demand.
  • Extend Collector::collect to report pageserver_tokio_epoll_uring_slots_submission_queue_depth.
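
The local-observe, flush-on-collect pattern described in the list above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `Buckets`, `Registry`, `register`, and `collect` are hypothetical names, a hand-rolled bucket array stands in for the prometheus `Histogram`/`LocalHistogram` types, and the power-of-two bounds are an assumption.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Assumed bucket upper bounds (powers of two); the last slot in `counts` is +Inf.
const BOUNDS: [u64; 11] = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024];

#[derive(Default, Clone)]
struct Buckets {
    counts: [u64; 12], // non-cumulative per-bucket counts
    sum: u64,
    count: u64,
}

impl Buckets {
    fn observe(&mut self, v: u64) {
        // First bucket whose bound is >= v, else the +Inf slot.
        let idx = BOUNDS.iter().position(|&b| v <= b).unwrap_or(BOUNDS.len());
        self.counts[idx] += 1;
        self.sum += v;
        self.count += 1;
    }

    // Move everything recorded so far into `dst` and reset `self`.
    fn drain_into(&mut self, dst: &mut Buckets) {
        for (d, s) in dst.counts.iter_mut().zip(self.counts.iter()) {
            *d += *s;
        }
        dst.sum += self.sum;
        dst.count += self.count;
        *self = Buckets::default();
    }
}

// Hypothetical per-thread metrics: locked only by the owning thread and the collector.
struct ThreadLocalMetrics {
    local: Mutex<Buckets>,
}

struct Registry {
    shared: Mutex<Buckets>,
    locals: Mutex<Vec<Arc<ThreadLocalMetrics>>>,
}

impl Registry {
    fn new() -> Self {
        Registry {
            shared: Mutex::new(Buckets::default()),
            locals: Mutex::new(Vec::new()),
        }
    }

    // Called once per thread; the registry keeps a second handle for flushing.
    fn register(&self) -> Arc<ThreadLocalMetrics> {
        let m = Arc::new(ThreadLocalMetrics { local: Mutex::new(Buckets::default()) });
        self.locals.lock().unwrap().push(Arc::clone(&m));
        m
    }

    // Flush every per-thread accumulator into the shared one, then snapshot it.
    fn collect(&self) -> Buckets {
        let mut shared = self.shared.lock().unwrap();
        for m in self.locals.lock().unwrap().iter() {
            m.local.lock().unwrap().drain_into(&mut shared);
        }
        (*shared).clone()
    }
}

fn main() {
    let registry = Arc::new(Registry::new());
    let handles: Vec<_> = (0..4u64)
        .map(|i| {
            let registry = Arc::clone(&registry);
            thread::spawn(move || {
                let metrics = registry.register();
                for depth in 0..=i {
                    metrics.local.lock().unwrap().observe(depth);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let snapshot = registry.collect();
    println!("count={} sum={}", snapshot.count, snapshot.sum);
}
```

In this sketch the hot path only touches an uncontended per-thread lock, and the shared state is updated only when a scrape calls `collect`. The sketch never removes entries for exited threads; how the real implementation manages that list is not shown here.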

Alternative Design Considered

Since the overall idea is to let each thread observe the metrics separately and aggregate them on demand, the thread_local crate may seem like a good fit, because its iterator lets us collect the thread-local metrics (the Rust standard library does not provide one).

However, the thread_local crate does not free a thread's storage until the ThreadLocal<T> object itself is dropped. Even if thread IDs are aggressively reused, we would hold at least n thread-local slots, where n is the maximum number of active threads over the lifetime of the ThreadLocal<T>. This is not acceptable in our use case, because n could reach num_runtime (4) times the spawn_blocking pool size (512), i.e. 2048.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@yliang412 yliang412 self-assigned this Oct 22, 2024
@yliang412 yliang412 added the c/storage/pageserver Component: storage: pageserver label Oct 22, 2024
Signed-off-by: Yuchen Liang <[email protected]>

github-actions bot commented Oct 22, 2024

5256 tests run: 5041 passed, 0 failed, 215 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 31.3% (7689 of 24543 functions)
  • lines: 48.8% (60444 of 123907 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
143cc19 at 2024-10-25T20:12:55.440Z :recycle:

Review comment on Cargo.toml (outdated, resolved)
@yliang412 yliang412 requested a review from problame October 25, 2024 15:09
Signed-off-by: Yuchen Liang <[email protected]>
Review comment on pageserver/src/metrics.rs (outdated, resolved)
@yliang412
Contributor Author

Verified metrics output locally:

Output from localhost:9898/metrics

...
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="1"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="2"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="4"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="8"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="16"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="32"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="64"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="128"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="256"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="512"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="1024"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket{le="+Inf"} 87261
pageserver_tokio_epoll_uring_slots_submission_queue_depth_sum 37
pageserver_tokio_epoll_uring_slots_submission_queue_depth_count 87261
...
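
To read the output above: Prometheus `le` buckets are cumulative, so identical counts in every bucket mean all observations fell into the first one. A small sketch of that arithmetic follows; `per_bucket` and `mean` are hypothetical helper names and the numbers are copied from the output.

```rust
/// Converts cumulative `le` bucket counts into per-bucket counts.
fn per_bucket(cumulative: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    cumulative
        .iter()
        .map(|&c| {
            // Cumulative counts should be non-decreasing; saturate just in case.
            let delta = c.saturating_sub(prev);
            prev = c;
            delta
        })
        .collect()
}

/// Mean observed value, or None if nothing was observed.
fn mean(sum: f64, count: u64) -> Option<f64> {
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    // Twelve buckets (le="1" through le="1024", then le="+Inf"), all at 87261.
    let cumulative = [87261u64; 12];
    let counts = per_bucket(&cumulative);
    println!("per-bucket counts: {:?}", counts);
    if let Some(m) = mean(37.0, 87261) {
        println!("mean queue depth: {:.6}", m);
    }
}
```

With a sum of 37 over 87261 observations, the mean depth is about 0.0004, which suggests that in this local run the waiters queue was almost always empty.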

Local grafana:

sum(increase(pageserver_tokio_epoll_uring_slots_submission_queue_depth_bucket[$__rate_interval])) by (le)
[Screenshot 2024-10-25 at 1:25:49 PM: Grafana panel for the query above]

Review comment on Cargo.toml (outdated, resolved)
yliang412 added a commit to neondatabase/tokio-epoll-uring that referenced this pull request Oct 25, 2024
## Problem

We want to make the slots waiters queue depth observable, so it is easier
to tune the submission queue size (which also equals the number of inflight
operations, as implemented in the system).

## Summary of changes

- Introduced the `PerSystemMetrics` trait, which records the slots submission
queue depth. In the future, more per-system metrics can be added by extending
this trait.
- `tokio_epoll_uring::System` now takes in the per-system metrics
observer `Arc<M> where M : PerSystemMetrics`
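
A rough sketch of what a trait-based observer like this could look like. Only the names `PerSystemMetrics` and `Arc<M>` come from the description above. The method name and signature, the no-op `()` implementation, `MaxDepth`, and `ToySystem` are all assumptions made for illustration and are not the tokio-epoll-uring API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Assumed shape of the trait; the method name and signature are hypothetical.
trait PerSystemMetrics {
    fn observe_slots_submission_queue_depth(&self, queue_depth: u64);
}

/// Hypothetical no-op observer for callers that do not need metrics.
impl PerSystemMetrics for () {
    fn observe_slots_submission_queue_depth(&self, _queue_depth: u64) {}
}

/// Hypothetical observer that tracks the maximum depth and the observation count.
#[derive(Default)]
struct MaxDepth {
    max: AtomicU64,
    observations: AtomicU64,
}

impl PerSystemMetrics for MaxDepth {
    fn observe_slots_submission_queue_depth(&self, queue_depth: u64) {
        self.max.fetch_max(queue_depth, Ordering::Relaxed);
        self.observations.fetch_add(1, Ordering::Relaxed);
    }
}

/// Toy stand-in for a system that is generic over its metrics observer.
struct ToySystem<M: PerSystemMetrics> {
    metrics: Arc<M>,
    waiters: u64,
}

impl<M: PerSystemMetrics> ToySystem<M> {
    fn new(metrics: Arc<M>) -> Self {
        ToySystem { metrics, waiters: 0 }
    }

    /// Report the depth seen by a new waiter, then enqueue it.
    fn enqueue_waiter(&mut self) {
        self.metrics.observe_slots_submission_queue_depth(self.waiters);
        self.waiters += 1;
    }

    /// A slot became free, so one waiter leaves the queue.
    fn wake_waiter(&mut self) {
        self.waiters = self.waiters.saturating_sub(1);
    }
}

fn main() {
    let metrics = Arc::new(MaxDepth::default());
    let mut system = ToySystem::new(Arc::clone(&metrics));
    for _ in 0..3 {
        system.enqueue_waiter();
    }
    system.wake_waiter();
    system.enqueue_waiter();
    println!(
        "max depth {} over {} observations",
        metrics.max.load(Ordering::Relaxed),
        metrics.observations.load(Ordering::Relaxed)
    );

    // The no-op observer compiles to nothing interesting but keeps the API uniform.
    let mut quiet = ToySystem::new(Arc::new(()));
    quiet.enqueue_waiter();
}
```

Making the system generic over `M` keeps the metrics backend out of the library: the caller decides whether observations feed a Prometheus histogram, a simple counter, or nothing at all.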

## Alternative design

See neondatabase/neon#9482 for details.

---------

Signed-off-by: Yuchen Liang <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
@yliang412 yliang412 enabled auto-merge (squash) October 25, 2024 19:20
@yliang412 yliang412 merged commit 85b954f into main Oct 25, 2024
80 checks passed
@yliang412 yliang412 deleted the yuchen/export-tokio-epoll-uring-slots-queue-depth branch October 25, 2024 20:30
Labels
c/storage/pageserver Component: storage: pageserver
2 participants