page_service: metric pageserver_smgr_query_started_count incremented after waiting for effective lsn #9925
Labels
a/observability (Area: related to observability)
c/storage/pageserver (Component: storage: pageserver)
t/bug (Issue Type: Bug)
Comments
problame added the a/observability (Area: related to observability), c/storage/pageserver (Component: storage: pageserver), and t/bug (Issue Type: Bug) labels on Nov 28, 2024
This was referenced Nov 28, 2024
problame added a commit that referenced this issue on Nov 29, 2024
…t the expense of exclusion of throttling from metrics (too complicated lifetimes)
github-merge-queue bot pushed a commit that referenced this issue on Dec 3, 2024
… metrics (#9870)

This PR
- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant
- adds a metric exposing the current batching config

The issue described in #9925 is that before this PR, request latency was only observed *after* batching. This means that smgr latency metrics (most importantly getpage latency) don't account for
- `wait_lsn` time
- time spent waiting for the batch to fill up / the executor stage to pick up the batch.

The fix is to use a per-request batching timer, like we did before the initial batching PR. We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire. This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the `pgb.flush()` call.

I was **unable** to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics. The reason is that we're using a *single* counter in `RequestContext` to track micros spent in throttle, but there are *N* metrics timers in the batch, one per request. As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works, because all but the first timer will encounter the error `close() called on closed state`. A failed attempt to maintain the current behavior can be found in #9951. So, this PR removes the deduction behavior from all metrics. I started a discussion on Slack about the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs
- fixes #9925
- sub-issue #9377
- epic: #9376
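To make the timer pattern concrete, here is a minimal Rust sketch under hypothetical names (`SmgrOpTimer`, `GetPageRequest`, `receive_request`, `execute_and_flush_batch` are illustrative, not the actual pageserver types). It only shows the shape of the fix described in #9870: start a per-request timer at receive time, carry it with the request through batching, and drop it only after the response has been flushed to the wire.

```rust
use std::time::Instant;

/// Hypothetical per-request latency timer; records elapsed time on drop.
struct SmgrOpTimer {
    started_at: Instant,
}

impl Drop for SmgrOpTimer {
    fn drop(&mut self) {
        // The real code would observe this into a latency histogram.
        println!("smgr getpage latency: {:?}", self.started_at.elapsed());
    }
}

/// Hypothetical batched request; the timer travels with the request.
struct GetPageRequest {
    rel: u32,
    blkno: u32,
    #[allow(dead_code)]
    timer: SmgrOpTimer,
}

/// Start the timer as soon as the request is received, before wait_lsn
/// and before the request sits in a batch, so both are included.
fn receive_request(rel: u32, blkno: u32) -> GetPageRequest {
    GetPageRequest {
        rel,
        blkno,
        timer: SmgrOpTimer {
            started_at: Instant::now(),
        },
    }
}

/// Execute the batch and flush responses; the timers are dropped only
/// afterwards, so time spent writing to the wire is counted too.
fn execute_and_flush_batch(batch: Vec<GetPageRequest>) {
    for req in &batch {
        // ... wait_lsn, read page (req.rel, req.blkno), write response ...
        let _ = (req.rel, req.blkno);
    }
    // pgb.flush() would happen here in the real handler.
    drop(batch); // dropping the requests drops their timers
}

fn main() {
    let batch = vec![receive_request(1, 42), receive_request(1, 43)];
    execute_and_flush_batch(batch);
}
```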
awarus pushed a commit that referenced this issue on Dec 5, 2024
… metrics (#9870)
Likely introduced as part of the batching changes.
https://neondb.slack.com/archives/C033RQ5SPDH/p1732802655376009?thread_ts=1732785911.264089&cid=C033RQ5SPDH
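For reference, a minimal sketch of the ordering problem this issue describes, using hypothetical names (`QUERY_STARTED`, `wait_lsn`, `handle_get_page`) rather than the real pageserver code: the started-count metric has to be incremented before the effective-LSN wait, otherwise the time spent in `wait_lsn` falls outside what the metric observes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the real counter (hypothetical, illustration only).
static QUERY_STARTED: AtomicU64 = AtomicU64::new(0);

// Stand-in for waiting until the requested LSN is available.
fn wait_lsn() {}

// Buggy ordering reported in this issue: the wait happens before the
// "query started" accounting, so wait_lsn time escapes the metric window.
fn handle_get_page_buggy() {
    wait_lsn();
    QUERY_STARTED.fetch_add(1, Ordering::Relaxed); // incremented too late
    // ... read page, send response ...
}

// Fixed ordering: account for the query as soon as the request arrives,
// so the effective-LSN wait is inside the observed window.
fn handle_get_page_fixed() {
    QUERY_STARTED.fetch_add(1, Ordering::Relaxed);
    wait_lsn();
    // ... read page, send response ...
}

fn main() {
    handle_get_page_buggy();
    handle_get_page_fixed();
    assert_eq!(QUERY_STARTED.load(Ordering::Relaxed), 2);
}
```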