
page_service: batching observability & include throttled time in smgr metrics #9870

Merged 105 commits into main from problame/batching-metrics-improvements on Dec 3, 2024

Conversation

problame commented Nov 25, 2024

This PR

- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant
- adds a metric exposing the current batching config

The issue described in #9925 is that before this PR, request latency was only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency) don't account for

- `wait_lsn` time
- time spent waiting for the batch to fill up / for the executor stage to pick up the batch.

The fix is to use a per-request batching timer, like we did before the initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire.
This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the pgb.flush() call.
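To make the mechanism concrete, here is a minimal sketch of the per-request timer idea, using hypothetical names and types (an illustration, not this PR's actual code):

```rust
use std::time::Instant;

/// Hypothetical per-request timer: started when the request is read off the
/// wire; its elapsed time is observed into a latency histogram on drop.
struct RequestTimer {
    started_at: Instant,
}

impl Drop for RequestTimer {
    fn drop(&mut self) {
        // In the real code this would observe into the smgr/getpage latency histogram.
        let _elapsed = self.started_at.elapsed();
    }
}

/// A request travels through batching and execution together with its timer.
struct BatchedRequest<T> {
    payload: T,
    _timer: RequestTimer,
}

/// Stand-in for the connection; `flush` plays the role of `pgb.flush()`.
struct Connection;
impl Connection {
    fn flush(&mut self) { /* write buffered responses to the socket */ }
}

fn process_batch<T>(batch: Vec<BatchedRequest<T>>, conn: &mut Connection) {
    for req in &batch {
        let _ = &req.payload; // execute the request and buffer the response
    }
    conn.flush();
    // The batch -- and with it every RequestTimer -- is dropped only here,
    // after the flush, so the observed latency includes wait_lsn, batching
    // wait, execution, and wire-flush time.
    drop(batch);
}
```

The key point is ownership: because each timer lives inside its batched request, its lifetime (and therefore the observed duration) automatically spans batching, execution, and the final flush.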

I was **unable** to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works, because all but the first timer will encounter the error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in #9951.
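For illustration, here is a simplified stand-in showing why the single-counter scheme breaks with more than one timer per batch; the types below are made up, and the real failure surfaces as the histogram-timer error quoted above rather than the `Option` used here:

```rust
use std::cell::Cell;
use std::time::Duration;

/// Simplified stand-in for RequestContext: a single accumulator for
/// time spent in the throttle.
struct Ctx {
    throttled: Cell<Option<Duration>>,
}

/// Simplified stand-in for a per-request smgr timer.
struct Timer<'a> {
    ctx: &'a Ctx,
}

impl Drop for Timer<'_> {
    fn drop(&mut self) {
        // Each timer tries to consume the one counter on drop...
        match self.ctx.throttled.take() {
            Some(_t) => { /* deduct `_t` from this timer's observed latency */ }
            None => {
                // ...but with N > 1 timers in a batch, every timer after the
                // first finds the counter already consumed -- the analogue of
                // hitting "close() called on closed state" in the real code.
            }
        }
    }
}

fn main() {
    let ctx = Ctx { throttled: Cell::new(Some(Duration::from_micros(500))) };
    let timers: Vec<Timer> = (0..3).map(|_| Timer { ctx: &ctx }).collect();
    drop(timers); // only the first drop sees Some(500us); the rest see None
}
```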

So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

Refs

- fixes #9925
- sub-issue #9377
- epic: #9376

The steps in the test work in neon_local + psql
but for some reason they don't work in the test.

Asked compute team on Slack for help:
https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789
This PR adds a benchmark to demonstrate the effect of server-side
getpage request batching added in #9321.

Refs:

- Epic: #9376
- Extracted from #9792
With this, the 10us batching timeout works, but it has some other wrinkles:

- it uses the signal-based timer APIs instead of going through epoll (=> timerfd)
- it needs to make a syscall for each batch, which costs around 1-2us, so probably significant CPU time is wasted on this.

Batching at 10us doesn't work well enough; probably the timer future becomes ready too soon. The batching factor is just 1.5.

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120
Performs just as well as the `async-timer::Timer` features=tokio1 impl.
Makes sense, because it's the same thing that's happening under the hood.

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780ea9decc82281f6b8d1
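For context, here is a minimal sketch of a batching-timeout loop built directly on tokio's timer (which is what `async-timer` with `features=tokio1` delegates to under the hood); it illustrates the idea only, and the channel and types are assumptions, not the pageserver's actual implementation:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

/// Collect up to `max_batch` requests; once the first request of a batch has
/// arrived, wait at most `batch_timeout` (e.g. 10us) for more to show up.
async fn fill_batch<T>(
    rx: &mut mpsc::Receiver<T>,
    max_batch: usize,
    batch_timeout: Duration,
) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);
    // The first request defines the batch deadline.
    let Some(first) = rx.recv().await else {
        return batch; // channel closed
    };
    batch.push(first);
    let deadline = Instant::now() + batch_timeout;
    while batch.len() < max_batch {
        tokio::select! {
            _ = sleep_until(deadline) => break, // timeout: ship what we have
            maybe_req = rx.recv() => match maybe_req {
                Some(req) => batch.push(req),
                None => break, // sender gone
            },
        }
    }
    batch
}
```

In this shape, `fill_batch` would be driven in a loop by the batching stage, with the executor stage consuming the returned batches.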
problame removed the `run-no-ci` (Don't run any CI for this PR) label Nov 29, 2024
problame changed the base branch from problame/batching-sidecar-task to main on November 30, 2024 00:19
problame force-pushed the problame/batching-metrics-improvements branch 2 times, most recently from 9193a4f to d04f05d on November 30, 2024 00:30
problame force-pushed the problame/batching-metrics-improvements branch from d04f05d to 69b878f on November 30, 2024 00:35
problame marked this pull request as ready for review on November 30, 2024 00:35
problame requested a review from a team as a code owner on November 30, 2024 00:35
problame requested a review from skyzh on November 30, 2024 00:35
problame commented Dec 2, 2024

Potential solution: #9962

problame enabled auto-merge December 3, 2024 10:56
problame disabled auto-merge December 3, 2024 10:59
problame enabled auto-merge December 3, 2024 10:59
problame added this pull request to the merge queue Dec 3, 2024
Merged via the queue into main with commit cb10be7, Dec 3, 2024
80 checks passed
problame deleted the problame/batching-metrics-improvements branch December 3, 2024 11:05
github-merge-queue bot pushed a commit that referenced this pull request Dec 3, 2024
… deduction for smgr latency metrics (#9962)

## Problem

In the batching PR 
- #9870

I stopped deducting the time-spent-in-throttle from latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).

The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.

However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request (see the sketch below).
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.

To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.
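As a rough sketch of "throttle after batching, before handling, sensitive to cancellation": the fixed-rate throttle, request type, and `handle` function below are hypothetical stand-ins, not the code from this PR (the real throttle is not a simple fixed-rate ticker):

```rust
use std::time::Duration;
use tokio::time::{interval, Interval};
use tokio_util::sync::CancellationToken;

struct Cancelled;

/// Hypothetical fixed-rate stand-in for the pagestream request throttle.
struct RequestThrottle {
    tick: Interval,
}

impl RequestThrottle {
    /// Assumes 1 <= requests_per_second <= 1_000_000 so the period is non-zero.
    fn new(requests_per_second: u64) -> Self {
        Self {
            tick: interval(Duration::from_micros(1_000_000 / requests_per_second)),
        }
    }

    /// Wait for permission to handle one request, or bail out on cancellation.
    async fn acquire(&mut self, cancel: &CancellationToken) -> Result<(), Cancelled> {
        tokio::select! {
            _ = cancel.cancelled() => Err(Cancelled),
            _ = self.tick.tick() => Ok(()),
        }
    }
}

struct PagestreamRequest; // payload elided

async fn handle(_req: PagestreamRequest) { /* execute the read */ }

/// The throttle is applied after the batch has been assembled (and thus after
/// shard routing), per request, right before handling.
async fn handle_batch(
    batch: Vec<PagestreamRequest>,
    throttle: &mut RequestThrottle,
    cancel: &CancellationToken,
) -> Result<(), Cancelled> {
    for req in batch {
        throttle.acquire(cancel).await?;
        handle(req).await;
    }
    Ok(())
}
```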

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other
tasks, e.g. image layer creation.
  However, in practice, we never used that capability anyway.
- We no longer throttle basebackup.
awarus pushed a commit that referenced this pull request Dec 5, 2024
… metrics (#9870)

This PR

- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant (see the sketch below)
- adds a metric exposing the current batching config
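A sketch of how such batch-size histograms could be registered using the `prometheus` crate; the metric names, labels, and buckets below are made up for illustration and need not match what `pageserver/src/metrics.rs` actually defines:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, register_histogram_vec, Histogram, HistogramVec};

// Hypothetical global batch-size histogram.
static PAGE_SERVICE_BATCH_SIZE_GLOBAL: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "pageserver_page_service_batch_size_global",
        "Number of pagestream requests in a batch",
        vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    )
    .expect("failed to register metric")
});

// Hypothetical per-tenant variant, labelled by tenant/shard/timeline.
static PAGE_SERVICE_BATCH_SIZE_PER_TENANT: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_page_service_batch_size_per_tenant",
        "Number of pagestream requests in a batch",
        &["tenant_id", "shard_id", "timeline_id"],
        vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    )
    .expect("failed to register metric")
});

/// Observe one batch into both the global and the per-tenant histogram.
fn observe_batch_size(n: usize, tenant_id: &str, shard_id: &str, timeline_id: &str) {
    PAGE_SERVICE_BATCH_SIZE_GLOBAL.observe(n as f64);
    PAGE_SERVICE_BATCH_SIZE_PER_TENANT
        .with_label_values(&[tenant_id, shard_id, timeline_id])
        .observe(n as f64);
}
```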

The issue described in #9925 is that before this PR, request latency was
only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency)
don't account for
- `wait_lsn` time 
- time spent waiting for batch to fill up / the executor stage to pick
up the batch.

The fix is to use a per-request batching timer, like we did before the
initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't
accounting for the time spent writing & flushing the response to the
wire.
This PR drive-by fixes that deficiency by dropping the timers at the
very end of processing the batch, i.e., after the `pgb.flush()` call.

I was **unable** to maintain the behavior that we deduct
time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to
track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop
handler of each timer no longer works, because all but the first timer
will encounter the error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in
#9951.

So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for
our internal SLO calculation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs

- fixes #9925
- sub-issue #9377
- epic: #9376
awarus pushed a commit that referenced this pull request Dec 5, 2024
… deduction for smgr latency metrics (#9962)

## Problem

In the batching PR 
- #9870

I stopped deducting the time-spent-in-throttle fro latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).

The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.

However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request.
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.

To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other other
tasks, e.g. image layer creation.
  However, in practice, we never used that capability anyways.
- We no longer throttle basebackup.
Successfully merging this pull request may close these issues.

page_service: metric pageserver_smgr_query_started_count incremented after waiting for effective lsn