
page_service: batching observability & include throttled time in smgr metrics #9870

Merged 105 commits into main from problame/batching-metrics-improvements on Dec 3, 2024

Conversation

problame commented Nov 25, 2024

This PR

- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant
- adds a metric exposing the current batching config

The issue described in #9925 is that before this PR, request latency was only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency) don't account for

- `wait_lsn` time
- time spent waiting for the batch to fill up / for the executor stage to pick up the batch.

The fix is to use a per-request batching timer, like we did before the initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire.
This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the pgb.flush() call.
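To make the mechanism concrete, here is a minimal sketch of the per-request timer idea, using hypothetical names and types (an illustration, not this PR's actual code):

```rust
use std::time::Instant;

/// Hypothetical per-request timer: started when the request is read off the
/// wire; its elapsed time is observed into a latency histogram on drop.
struct RequestTimer {
    started_at: Instant,
}

impl Drop for RequestTimer {
    fn drop(&mut self) {
        // In the real code this would observe into the smgr/getpage latency histogram.
        let _elapsed = self.started_at.elapsed();
    }
}

/// A request travels through batching and execution together with its timer.
struct BatchedRequest<T> {
    payload: T,
    _timer: RequestTimer,
}

/// Stand-in for the connection; `flush` plays the role of `pgb.flush()`.
struct Connection;
impl Connection {
    fn flush(&mut self) { /* write buffered responses to the socket */ }
}

fn process_batch<T>(batch: Vec<BatchedRequest<T>>, conn: &mut Connection) {
    for req in &batch {
        let _ = &req.payload; // execute the request and buffer the response
    }
    conn.flush();
    // The batch -- and with it every RequestTimer -- is dropped only here,
    // after the flush, so the observed latency includes wait_lsn, batching
    // wait, execution, and wire-flush time.
    drop(batch);
}
```

The key point is ownership: because each timer lives inside its batched request, its lifetime (and therefore the observed duration) automatically spans batching, execution, and the final flush.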

I was **unable** to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works, because all but the first timer will encounter the error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in #9951.
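For illustration, here is a simplified stand-in showing why the single-counter scheme breaks with more than one timer per batch; the types below are made up, and the real failure surfaces as the histogram-timer error quoted above rather than the `Option` used here:

```rust
use std::cell::Cell;
use std::time::Duration;

/// Simplified stand-in for RequestContext: a single accumulator for
/// time spent in the throttle.
struct Ctx {
    throttled: Cell<Option<Duration>>,
}

/// Simplified stand-in for a per-request smgr timer.
struct Timer<'a> {
    ctx: &'a Ctx,
}

impl Drop for Timer<'_> {
    fn drop(&mut self) {
        // Each timer tries to consume the one counter on drop...
        match self.ctx.throttled.take() {
            Some(_t) => { /* deduct `_t` from this timer's observed latency */ }
            None => {
                // ...but with N > 1 timers in a batch, every timer after the
                // first finds the counter already consumed -- the analogue of
                // hitting "close() called on closed state" in the real code.
            }
        }
    }
}

fn main() {
    let ctx = Ctx { throttled: Cell::new(Some(Duration::from_micros(500))) };
    let timers: Vec<Timer> = (0..3).map(|_| Timer { ctx: &ctx }).collect();
    drop(timers); // only the first drop sees Some(500us); the rest see None
}
```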

So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

Refs

- fixes #9925
- sub-issue #9377
- epic: #9376

The steps in the test work in neon_local + psql
but for some reason they don't work in the test.

Asked compute team on Slack for help:
https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789
This PR adds a benchmark to demonstrate the effect of server-side
getpage request batching added in #9321.

Refs:

- Epic: #9376
- Extracted from #9792
With this, the 10us batching timeout works, but it has some other wrinkles:

- it uses the signal-based timer APIs instead of going through epoll (=> timerfd)
- it needs to make a syscall for each batch, which costs around 1-2us, so probably significant CPU time is wasted on this.

Batching at 10us doesn't work well enough; probably the timer future becomes ready too soon. The batching factor is just 1.5.

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120
Performs just as well as the `async-timer::Timer` features=tokio1 impl.
Makes sense, because it's the same thing that's happening under the hood.

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780ea9decc82281f6b8d1
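For context, here is a minimal sketch of a batching-timeout loop built directly on tokio's timer (which is what `async-timer` with `features=tokio1` delegates to under the hood); it illustrates the idea only, and the channel and types are assumptions, not the pageserver's actual implementation:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

/// Collect up to `max_batch` requests; once the first request of a batch has
/// arrived, wait at most `batch_timeout` (e.g. 10us) for more to show up.
async fn fill_batch<T>(
    rx: &mut mpsc::Receiver<T>,
    max_batch: usize,
    batch_timeout: Duration,
) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch);
    // The first request defines the batch deadline.
    let Some(first) = rx.recv().await else {
        return batch; // channel closed
    };
    batch.push(first);
    let deadline = Instant::now() + batch_timeout;
    while batch.len() < max_batch {
        tokio::select! {
            _ = sleep_until(deadline) => break, // timeout: ship what we have
            maybe_req = rx.recv() => match maybe_req {
                Some(req) => batch.push(req),
                None => break, // sender gone
            },
        }
    }
    batch
}
```

In this shape, `fill_batch` would be driven in a loop by the batching stage, with the executor stage consuming the returned batches.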
problame removed the `run-no-ci` (Don't run any CI for this PR) label Nov 29, 2024
problame changed the base branch from problame/batching-sidecar-task to main on November 30, 2024 00:19
problame force-pushed the problame/batching-metrics-improvements branch 2 times, most recently from 9193a4f to d04f05d on November 30, 2024 00:30
problame force-pushed the problame/batching-metrics-improvements branch from d04f05d to 69b878f on November 30, 2024 00:35
problame marked this pull request as ready for review on November 30, 2024 00:35
problame requested a review from a team as a code owner on November 30, 2024 00:35
problame requested a review from skyzh on November 30, 2024 00:35
problame commented Dec 2, 2024

Potential solution: #9962

problame enabled auto-merge December 3, 2024 10:56
problame disabled auto-merge December 3, 2024 10:59
problame enabled auto-merge December 3, 2024 10:59
problame added this pull request to the merge queue Dec 3, 2024
Merged via the queue into main with commit cb10be7, Dec 3, 2024
80 checks passed
problame deleted the problame/batching-metrics-improvements branch December 3, 2024 11:05
github-merge-queue bot pushed a commit that referenced this pull request Dec 3, 2024
… deduction for smgr latency metrics (#9962)

## Problem

In the batching PR 
- #9870

I stopped deducting the time-spent-in-throttle from latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).

The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.

However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request (see the sketch below).
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.

To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.
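As a rough sketch of "throttle after batching, before handling, sensitive to cancellation": the fixed-rate throttle, request type, and `handle` function below are hypothetical stand-ins, not the code from this PR (the real throttle is not a simple fixed-rate ticker):

```rust
use std::time::Duration;
use tokio::time::{interval, Interval};
use tokio_util::sync::CancellationToken;

struct Cancelled;

/// Hypothetical fixed-rate stand-in for the pagestream request throttle.
struct RequestThrottle {
    tick: Interval,
}

impl RequestThrottle {
    /// Assumes 1 <= requests_per_second <= 1_000_000 so the period is non-zero.
    fn new(requests_per_second: u64) -> Self {
        Self {
            tick: interval(Duration::from_micros(1_000_000 / requests_per_second)),
        }
    }

    /// Wait for permission to handle one request, or bail out on cancellation.
    async fn acquire(&mut self, cancel: &CancellationToken) -> Result<(), Cancelled> {
        tokio::select! {
            _ = cancel.cancelled() => Err(Cancelled),
            _ = self.tick.tick() => Ok(()),
        }
    }
}

struct PagestreamRequest; // payload elided

async fn handle(_req: PagestreamRequest) { /* execute the read */ }

/// The throttle is applied after the batch has been assembled (and thus after
/// shard routing), per request, right before handling.
async fn handle_batch(
    batch: Vec<PagestreamRequest>,
    throttle: &mut RequestThrottle,
    cancel: &CancellationToken,
) -> Result<(), Cancelled> {
    for req in batch {
        throttle.acquire(cancel).await?;
        handle(req).await;
    }
    Ok(())
}
```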

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other
tasks, e.g. image layer creation.
  However, in practice, we never used that capability anyway.
- We no longer throttle basebackup.
awarus pushed a commit that referenced this pull request Dec 5, 2024
… metrics (#9870)

This PR

- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant (see the sketch below)
- adds a metric exposing the current batching config
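A sketch of how such batch-size histograms could be registered using the `prometheus` crate; the metric names, labels, and buckets below are made up for illustration and need not match what `pageserver/src/metrics.rs` actually defines:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, register_histogram_vec, Histogram, HistogramVec};

// Hypothetical global batch-size histogram.
static PAGE_SERVICE_BATCH_SIZE_GLOBAL: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "pageserver_page_service_batch_size_global",
        "Number of pagestream requests in a batch",
        vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    )
    .expect("failed to register metric")
});

// Hypothetical per-tenant variant, labelled by tenant/shard/timeline.
static PAGE_SERVICE_BATCH_SIZE_PER_TENANT: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_page_service_batch_size_per_tenant",
        "Number of pagestream requests in a batch",
        &["tenant_id", "shard_id", "timeline_id"],
        vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    )
    .expect("failed to register metric")
});

/// Observe one batch into both the global and the per-tenant histogram.
fn observe_batch_size(n: usize, tenant_id: &str, shard_id: &str, timeline_id: &str) {
    PAGE_SERVICE_BATCH_SIZE_GLOBAL.observe(n as f64);
    PAGE_SERVICE_BATCH_SIZE_PER_TENANT
        .with_label_values(&[tenant_id, shard_id, timeline_id])
        .observe(n as f64);
}
```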

The issue described in #9925 is that before this PR, request latency was
only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency)
don't account for
- `wait_lsn` time 
- time spent waiting for batch to fill up / the executor stage to pick
up the batch.

The fix is to use a per-request batching timer, like we did before the
initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't
accounting for the time spent writing & flushing the response to the
wire.
This PR drive-by fixes that deficiency by dropping the timers at the
very end of processing the batch, i.e., after the `pgb.flush()` call.

I was **unable** to maintain the behavior that we deduct
time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to
track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop
handler of each timer no longer works, because all but the first timer
will encounter the error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in
#9951.

So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for
our internal SLO calculation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs

- fixes #9925
- sub-issue #9377
- epic: #9376
awarus pushed a commit that referenced this pull request Dec 5, 2024
… deduction for smgr latency metrics (#9962)

## Problem

In the batching PR 
- #9870

I stopped deducting the time-spent-in-throttle fro latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).

The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.

However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request.
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.

To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other other
tasks, e.g. image layer creation.
  However, in practice, we never used that capability anyways.
- We no longer throttle basebackup.
Successfully merging this pull request may close these issues.

page_service: metric pageserver_smgr_query_started_count incremented after waiting for effective lsn