per-TenantShard read throttling #6706

problame · 2024-02-09T17:13:21Z

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1707499237721679

Problem

part of epic #5899
Specifically, RFC #5648

tl;dr: we currently serve GetPage@LSN requests as fast as possible.
This sets the wrong incentives and poses operational and economic problems.

Summary of changes

Mechanism:

Add a throttling mechanism based on the leaky-bucket
Make throttle reconfigurable via tenant conf (=> durable & runtime-changeable)
Throttle is implemented at the top of Timeline::get
The throttle applies to a configurable set of TaskKinds;
- To limit only Timeline::get requests from clients, specify only PageRequestHandler
- This will include Timeline::get requests from basebackup, so, it might slow down basebackup.

The rationale for placing the throttle at the top of Timeline::get is, and for the TaskKind selector, is that it enables experimentation. E.g., we can also use it to rate limit compaction.

Observability:

Global metrics that counts accounted & throttled Timeline::get calls, as well as sum of wait time.
Per-tenant reporting through the logs if a tenant was throttled within last compaction_period seconds. Would like to have a better-defined reporting period, but, that would require a separate background task. If we add that in the future, we should move the walredo quiescing there, too.

Example

Live reconfigure a tenant like so:

curl -X PUT --data @override.json localhost:9898/v1/tenant/config
# override.json
{
    "tenant_id": "f76997305d46b20964d2a5636aa302c3",
    "timeline_get_throttle": {
            "fair": true,
            "initial": 0,
            "refill_interval": '1s',
            "refill_amount": 1000,
            "max": 1000,
            "task_kinds": [ "PageRequestHandler" ]
    }
}

Testing & Performance

I made changes to pagebench so it issues requests at a defined rate.
If a request takes too long, it waits client-side until the next request at regular frequency would have been issued before sending the next request.
It prints the number of requests that weren't sent as MISSED in the live stats.

I then used pagebench to benchmark the throttling code as follows

empty neon_local release build, on i3en.3xlarge instance store
create 50 empty tenants
configure page cache size & fd limits so everything stays within PS process
configure rate limit with 5000 IOPS:

page_cache_size = 1638400
max_file_descriptors = 500000

[tenant_config]

[tenant_config.timeline_get_throttle]
fair = true
initial = 0
refill_interval = '100ms'
refill_amount = 500
max = 5000
task_kinds = [ "PageRequestHandler" ]

Measure CPU contribution of throttling using process-scoped flamegraph, searching for throttling's contribution by searching the flamegraph for throttle:

sudo perf record -p "$(pgrep pageserver)" -F666  --call-graph=dwarf,65528 -a sleep 4  && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg

Workload 0: 500 connections against one tenant, throttling disabled. 0.3% CPU time

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500
2024-02-16T15:19:24.703789Z  INFO RPS: 49944   MISSED: 0

Flame Graph

![2024-02-16 no throttle configured](https://github.com/neondatabase/neon/assets/956573/d7fa463d-f04d-48ba-82aa-619f0fa6c140)

Workload 1: 500 connections against one tenant, target rate 10x above throttle: 1.3% cpu time
TODO: debate this workload giving the skipping behavior of pagebench. Should the client instead go as fast as possible to determine CPU overhead?

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500
2024-02-14T13:30:12.505501Z  INFO RPS: 4995   MISSED: 45453

Flame Graph

![10x](https://github.com/neondatabase/neon/assets/956573/695148fc-7cbd-489d-9a30-0c5c053baa23)

Workload 2: single client per tenant, request rate below limit, run in parallel against each tenant to increase sampling coverage (all way below 100% usage, not CPU bound). 1.1% cpu time.

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 50 --per-client-rate 400  --num-clients 1
2024-02-16T15:23:23.263295Z  INFO RPS: 19516   MISSED: 503

Flame Graph

![en par](https://github.com/neondatabase/neon/assets/956573/b42f22f0-34f6-4783-a3a4-8c39140cb03c)

TODOs

figure out what the page_service.rs metrics should mean on a rate limited tenant
- currently, they will observe the rate-limited performance; that means our 4 Golden Signals dashboard & SLO will be affected by the rate limit
- ideas:
  - A: duplicate these metrics, one rate-limited version, one not-rate-limited version
  - B: move the rate limit upwards into page_service.rs
    - if we don't want the experimentation capability that we gain by placing at top of Timeline::get

Future Work

Regression test using pagebench: assert that the rate limit is enforced by driving at limit rate with one client, then adding clients increases latency ~linearly.

Non-Goals

As pointed out in the RFC, this is a mechanism to prevent noisy-neighbor problems in pageserver. It's not a product feature / something we'll sell. The mechanism will be used to put an artificial ceiling on how many operations a single TenantShard can serve.

…ving nice things

…ottling/wip

github-actions · 2024-02-09T17:28:39Z

2442 tests run: 2325 passed, 0 failed, 117 skipped (full report)

Flaky tests (3)

Postgres 16

test_create_snapshot: debug
test_empty_branch_remote_storage_upload: debug

Postgres 14

test_ts_of_lsn_api: release

Code coverage (full report)

functions: 55.8% (12921 of 23162 functions)
lines: 82.5% (69917 of 84722 lines)

_{The comment gets automatically updated with the latest test results
23baada at 2024-02-16T19:29:12.761Z :recycle:}

…ottling/wip

…spent on it

…usage in 2x RPS)

…tle now costs 2.7% of CPU time

empty neon_local release build, on i3en.3xlarge instance store create 50 empty tenants configure page cache size & fd limits so everything stays within PS process and configure rate limit with 5000 IOPS: page_cache_size = 1638400 max_file_descriptors = 500000 [tenant_config.timeline_get_rate_limit] fair = true initial = 0 interval_millis = 100 interval_refill = 500 max = 5000 task_kinds = [ "PageRequestHandler" ] Overhead with throttle kicking in: --------------------------------- ./target/release/pagebench get-page-latest-lsn --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500 shows as expected that 90% of requests get dropped because above throttle; 2024-02-14T13:30:12.505501Z INFO RPS: 4995 MISSED: 45453 CPU overhead: sudo perf record -p "$(pgrep pageserver)" -F666 --call-graph=dwarf,65528 -a sleep 4 && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg => search result for "throttle"; 1.6% of CPU time. Overhead while below throttle: ----------------------------- ./target/release/pagebench get-page-latest-lsn --limit-to-first-n-targets 50 --per-client-rate 400 --num-clients 1 2024-02-14T13:41:34.811237Z INFO RPS: 19588 MISSED: 434 CPU overhead: sudo perf record -p "$(pgrep pageserver)" -F666 --call-graph=dwarf,65528 -a sleep 4 && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg => search result for "throttle"; 0.9% of CPU time

libs/utils/src/lib.rs

pageserver/src/tenant/throttle.rs

pageserver/src/tenant/timeline.rs

control_plane/src/pageserver.rs

pageserver/pagebench/src/cmd/getpage_latest_lsn.rs

pageserver/src/tenant.rs

pageserver/src/tenant/throttle.rs

koivunej

Looks like it should work. The vast configuration options feels a bit like extra, but we can hopefully get to throttle those down in future. As far as I understood the perf is looking good.

Few nits for your consideration, maybe it's better to document on the Throttle that it will be shared between timelines instead everywhere else. At least that would had clarified for me why the arc-swap.

the Throttle has nice testable design, sadly I don't think it should get any tests because it just forwards to leaky_bucket.

libs/pageserver_api/src/models.rs

pageserver/src/tenant/throttle.rs

pageserver/src/tenant.rs

pageserver/pagebench/src/cmd/getpage_latest_lsn.rs

… metrics + regression test (#6953) part of #5899 Problem ------- Before this PR, the time spent waiting on the throttle was charged towards the higher-level page_service metrics, i.e., `pageserver_smgr_query_seconds`. The metrics are the foundation of internal SLIs / SLOs. A throttled tenant would cause the SLI to degrade / SLO alerts to fire. Changes ------- - don't charge time spent in throttle towards the page_service metrics - record time spent in throttle in RequestContext and subtract it from the elapsed time - this works because the page_service path doesn't create child context, so, all the throttle time is recorded in the parent - it's quite brittle and will break if we ever decide to spawn child tasks that need child RequestContexts, which would have separate instances of the `micros_spent_throttled` counter. - however, let's punt that to a more general refactoring of RequestContext - add a test case that ensures that - throttling happens for getpage requests; this aspect of the test passed before this PR - throttling delays aren't charged towards the page_service metrics; this aspect of the test only passes with this PR - drive-by: make the throttle log message `info!`, it's an expected condition Performance ----------- I took the same measurements as in #6706 , no meaningful change in CPU overhead. Future Work ----------- This PR enables us to experiment with the throttle for select tenants without affecting the SLI metrics / triggering SLO alerts. Before declaring this feature done, we need more work to happen, specifically: - decide on whether we want to retain the flexibility of throttling any `Timeline::get` call, filtered by TaskKind - versus: separate throttles for each page_service endpoint, potentially with separate config options - the trouble here is that this decision implies changes to the TenantConfig, so, if we start using the current config style now, then decide to switch to a different config, it'll be a breaking change Nice-to-haves but probably not worth the time right now: - Equivalent tests to ensure the throttle applies to all other page_service handlers.

problame added 10 commits February 5, 2024 15:25

WIP

06a17a1

WIP

c9f1bab

few todo's left

26ea5f3

finish impl & fmt

4ee5489

TenantConfOpt <=> models::TenantConfig conversion prevents us from ha…

1e7b9ce

…ving nice things

pagebench: support num_clients

f3e000e

make it work

a516735

make tests compile

c9eb808

Merge remote-tracking branch 'origin/main' into problame/get-page-thr…

44060f4

…ottling/wip

Merge branch 'main' into problame/get-page-throttling/wip

57a8ec3

problame mentioned this pull request Feb 9, 2024

Epic: per-tenant read path throttling #5899

Closed

problame changed the title ~~getpage@lsn throttling~~ per-TenantShard read throttling Feb 9, 2024

problame requested a review from jcsp February 9, 2024 17:20

problame added 11 commits February 13, 2024 11:29

metric

3abbd69

Merge remote-tracking branch 'origin/main' into problame/get-page-thr…

de1db01

…ottling/wip

pagebench: rate limiter fixes part 1

d7515c8

pagebench: working limit

de3722a

finish cleaning up pagebench; cargo fmt

57a5640

refine typing & names of per-tenant metric

0b4df46

global metric; at 2x rate (60k RPS total), 1.6% of CPU time of PS is …

9f46fa8

…spent on it

WIP: store last_throttled_at behind Mutex; too expensive (2.7% total …

3ccad2b

…usage in 2x RPS)

report throttling of individual tenants in the log; at 2x rate, throt…

f44d681

…tle now costs 2.7% of CPU time

drop the per-tenant metric, add additional global metrics

451eefe

This comment was marked as outdated.

Sign in to view

fix test

aa8dcaa

problame marked this pull request as ready for review February 14, 2024 17:56

problame requested review from a team as code owners February 14, 2024 17:56

clippy

d60a661

problame requested review from koivunej and removed request for knizhnik February 14, 2024 18:04