Epic: get page throughput improvements #9376

VladLazar · 2024-10-14T10:45:12Z

Slack Channel: #proj-pageserver-superscalar-page_service

Background

There is some fairly low-hanging fruit for improving get page throughput on the pageserver:

- batch requests on the pageserver side
- IO parallelism on the read path
- configure computes to generate queue depth

@problame and @VladLazar worked on this during the Lisbon hackathon and demonstrated a get page throughput of 60k per second. This epic is for productionizing and shipping that code (or some evolution of it).

Big Rocks

- [HACKATHON]: superscalar page_service / get_vectored #9002
- pageserver: batch get page requests and serve them with one vectored get #9377 (10 of 15; labels: a/performance, c/storage, c/storage/pageserver)
- pageserver: concurrent IO on the get page read path #9378 (labels: a/performance, c/storage, c/storage/pageserver)
- Figure out correct compute config for generating queue depth and ensure deployment
Prod Readiness

- pageserver: LSN wait location follow-up #9379 (labels: c/storage, c/storage/pageserver)
- pageserver: batching observability #9380 (labels: c/storage, c/storage/pageserver)
- pageserver: revisit vectored get error handling #9382 (labels: c/storage, c/storage/pageserver)
## Problem

We don't take advantage of queue depth generated by the compute on the pageserver. We can process getpage requests more efficiently by batching them.

## Summary of changes

Batch up incoming getpage requests that arrive within a configurable time window (`server_side_batch_timeout`). Then process the entire batch via one `get_vectored` timeline operation. By default, no merging takes place.

## Testing

* **Functional**: #9792
* **Performance**: will be done in staging/pre-prod

# Refs

* #9377
* #9376

Co-authored-by: Christian Schwarz <[email protected]>
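The time-window batching described in the summary above can be sketched roughly as follows. This is an illustrative Python model, not the pageserver's Rust code: `collect_batch`, the `asyncio.Queue` of integer keys, and the dictionary-returning `get_vectored` stand-in are all assumptions made for the example; only the idea of a batching time window and a single vectored read per batch comes from the description.

```python
import asyncio
from typing import Dict, List


async def collect_batch(queue: asyncio.Queue, batch_timeout: float) -> List[int]:
    """Wait for one request, then keep collecting until the time window closes."""
    batch = [await queue.get()]  # block until at least one request arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_timeout
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window closed without another request
    return batch


def get_vectored(keys: List[int]) -> Dict[int, str]:
    """Hypothetical stand-in for one vectored read serving the whole batch."""
    return {key: f"page-{key}" for key in keys}


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    for key in (1, 2, 3):  # three requests already queued by the client
        queue.put_nowait(key)
    batch = await collect_batch(queue, batch_timeout=0.01)
    return batch, get_vectored(batch)


batch, pages = asyncio.run(demo())
```

Note that a lone request still waits out the whole window before it is served, which is the latency cost for unbatchable workloads that the later PRs discuss.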

This PR adds a benchmark to demonstrate the effect of server-side getpage request batching added in #9321.

Refs:

- Epic: #9376
- Extracted from #9792

This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in #9321.

For the CPU usage, I found that the `prometheus` crate's built-in CPU usage accounts the seconds at integer granularity. That's not enough if you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that.

The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox.

They will be refined and enabled once we find a fix. Candidates at time of writing are:

- #9822
- #9851

Refs:

- Epic: #9376
- Extracted from #9792
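To illustrate why integer-second CPU accounting is too coarse for short benchmark runs, here is a small Python sketch. It is not the `libmetrics` metric the PR adds; `cpu_seconds_per_request` and `busy` are hypothetical helpers, and `time.process_time()` is used only as a readily available fractional CPU clock.

```python
import time


def cpu_seconds_per_request(workload, n_requests: int):
    """Return (fractional CPU seconds per request, CPU seconds truncated to an integer)."""
    start = time.process_time()  # user + system CPU time of this process, as a float
    workload(n_requests)
    used = time.process_time() - start
    return used / n_requests, int(used)


def busy(n: int) -> int:
    """Hypothetical CPU-bound stand-in for serving n requests."""
    total = 0
    for i in range(n * 200):
        total += i * i
    return total


per_request, truncated = cpu_seconds_per_request(busy, 1000)
```

For a run that uses well under one CPU second, the truncated value carries no usable signal, while the fractional value can still be divided by the request count to get a CPU-per-throughput figure.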

# Problem

The timeout-based batching adds latency to unbatchable workloads.

We can choose a short batching timeout (e.g. 10us), but that requires high-resolution timers, which tokio doesn't have. I thoroughly explored options to use OS timers (see [this](#9822) abandoned PR). In short, it's not an attractive option because any timer implementation adds non-trivial overheads.

# Solution

The insight is that, in the steady state of a batchable workload, the time we spend in `get_vectored` will be hundreds of microseconds anyway.

If we prepare the next batch concurrently to `get_vectored`, we will have a sizeable batch ready once `get_vectored` of the current batch is done, and we do not need an explicit timeout.

This can be reasonably described as **pipelining of the protocol handler**.

# Implementation

We model the sub-protocol handler for pagestream requests (`handle_pagerequests`) as two futures that form a pipeline:

1. Batching: read requests from the connection and fill the current batch
2. Execution: `take` the current batch, execute it using `get_vectored`, and send the response.

The Batching and Execution stages are connected through a new type of channel called `spsc_fold`.

See the long comment in `handle_pagerequests_pipelined` for details.

# Changes

- Refactor `handle_pagerequests`
  - separate functions for
    - reading one protocol message; produces a `BatchedFeMessage` with just one page request in it
    - batching; tries to merge an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; returns `None` on success and returns back the incoming message in case merging isn't possible
    - execution of a batched message
  - unify the timeline handle acquisition & request span construction; it now happens in the function that reads the protocol message
- Implement serial and pipelined model
  - serial: what we had before any of the batching changes
    - read one protocol message
    - execute protocol messages
  - pipelined: the design described above
  - optionality for execution of the pipeline: either via concurrent futures or tokio tasks
- Pageserver config
  - remove batching timeout field
  - add ability to configure pipelining mode
  - add ability to limit max batch size for pipelined configurations (required for the rollout, cf neondatabase/cloud#20620)
  - ability to configure execution mode
- Tests
  - remove `batch_timeout` parametrization
  - rename `test_getpage_merge_smoke` to `test_throughput`
  - add parametrization to test different max batch sizes and execution modes
  - rename `test_timer_precision` to `test_latency`
  - rename the test case file to `test_page_service_batching.py`
  - better descriptions of what the tests actually do

## On holding the `TimelineHandle` in the pending batch

While batching, we hold the `TimelineHandle` in the pending batch. Therefore, the timeline will not finish shutting down while we're batching.

This is not a problem in practice because the concurrently ongoing `get_vectored` call will fail quickly with an error indicating that the timeline is shutting down. This results in the Execution stage returning a `QueryError::Shutdown`, which causes the pipeline / entire page service connection to shut down. This drops all references to the `Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the contained `TimelineHandle`s.

- => fixes #9850

# Performance

Local run of the benchmarks, results in [this empty commit](1cf5b14) in the PR branch.

Key take-aways:

* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf #9837
* `concurrent-futures` has higher throughput than `tasks` in all workloads (=lower `time` metric)
* In unbatchable workloads, `concurrent-futures` has 5% higher `CPU-per-throughput` than that of `tasks`, and 15% higher than that of `serial`.
* In batchable-32 workload, `concurrent-futures` has 8% lower `CPU-per-throughput` than that of `tasks` (comparison to tput of `serial` is irrelevant)
* In unbatchable workloads, mean and tail latencies of `concurrent-futures` are practically identical to `serial`, whereas `tasks` adds 20-30us of overhead

Overall, `concurrent-futures` seems like a slightly more attractive choice.

# Rollout

This change is disabled-by-default.

Rollout plan:

- neondatabase/cloud#20620

# Refs

- epic: #9376
- this sub-task: #9377
- the abandoned attempt to improve batching timeout resolution: #9822
- closes #9850
- fixes #9835
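The pipelined design can be modelled in a few lines of Python `asyncio`. This is a sketch under stated assumptions rather than the actual `spsc_fold` implementation: `FoldSlot`, `try_merge`, `MAX_BATCH_SIZE`, the integer keys, and the `asyncio.sleep` stand-in for `get_vectored` are all hypothetical. What it borrows from the description is the shape: a Batching stage folds incoming single-request messages into one pending batch while an Execution stage is busy, and the merge function returns `None` on success or hands the incoming message back when merging isn't possible.

```python
import asyncio
from typing import List, Optional

MAX_BATCH_SIZE = 4  # hypothetical limit, standing in for the max batch size setting


def try_merge(batch: List[int], incoming: List[int]) -> Optional[List[int]]:
    """Merge `incoming` into `batch`; return None on success, else hand `incoming` back."""
    if len(batch) + len(incoming) > MAX_BATCH_SIZE:
        return incoming
    batch.extend(incoming)
    return None


class FoldSlot:
    """Single-slot channel: the sender folds into the pending batch until it is taken."""

    def __init__(self) -> None:
        self._pending: Optional[List[int]] = None
        self._closed = False
        self._cond = asyncio.Condition()

    async def send(self, incoming: List[int]) -> None:
        async with self._cond:
            while True:
                if self._pending is None:
                    self._pending = incoming
                    break
                if try_merge(self._pending, incoming) is None:
                    break
                await self._cond.wait()  # pending batch is full: wait for the receiver
            self._cond.notify_all()

    async def close(self) -> None:
        async with self._cond:
            self._closed = True
            self._cond.notify_all()

    async def recv(self) -> Optional[List[int]]:
        async with self._cond:
            while self._pending is None:
                if self._closed:
                    return None
                await self._cond.wait()
            batch, self._pending = self._pending, None
            self._cond.notify_all()  # wake a sender blocked on a full batch
            return batch


async def batching_stage(requests, slot: FoldSlot) -> None:
    for key in requests:
        await slot.send([key])  # each message starts with just one request in it
        await asyncio.sleep(0)  # yield, as a real connection read would
    await slot.close()


async def execution_stage(slot: FoldSlot, executed: List[List[int]]) -> None:
    while (batch := await slot.recv()) is not None:
        await asyncio.sleep(0.001)  # stand-in for the time spent in get_vectored
        executed.append(batch)


async def main() -> List[List[int]]:
    slot = FoldSlot()
    executed: List[List[int]] = []
    await asyncio.gather(batching_stage(range(10), slot), execution_stage(slot, executed))
    return executed


executed = asyncio.run(main())
```

No timer appears anywhere in the sketch: the batch size is determined by how many requests arrive while the previous batch is executing, so an isolated request is picked up immediately instead of waiting for a window to close.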

… metrics (#9870)

This PR

- fixes smgr metrics #9925
- adds an additional startup log line logging the current batching config
- adds a histogram of batch sizes, global and per-tenant
- adds a metric exposing the current batching config

The issue described in #9925 is that before this PR, request latency was only observed *after* batching.

This means that smgr latency metrics (most importantly getpage latency) don't account for

- `wait_lsn` time
- time spent waiting for the batch to fill up / the executor stage to pick up the batch.

The fix is to use a per-request batching timer, like we did before the initial batching PR. We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire. This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the `pgb.flush()` call.

I was **unable** to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics. The reason is that we're using a *single* counter in `RequestContext` to track micros spent in throttle. But there are *N* metrics timers in the batch, one per request. As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works because all but the first timer will encounter error `close() called on closed state`. A failed attempt to maintain the current behavior can be found in #9951.

So, this PR removes the deduction behavior from all metrics. I started a discussion on Slack about the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs

- fixes #9925
- sub-issue #9377
- epic: #9376
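The per-request timer idea can be sketched as follows. This is an illustrative Python model, not the pageserver's metrics code: `RequestTimer`, `Request`, `serve_batch`, and the list used as a histogram are hypothetical names. The point it tries to show is that each request carries its own timer from the moment it is read, and that the timers are only observed after the response has been written and flushed, so time spent waiting in the batch is included.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestTimer:
    """Started when the request is read; observed once, at the end of its lifecycle."""
    started_at: float = field(default_factory=time.monotonic)

    def observe(self, histogram: List[float]) -> None:
        histogram.append(time.monotonic() - self.started_at)


@dataclass
class Request:
    key: int
    timer: RequestTimer = field(default_factory=RequestTimer)


def serve_batch(batch: List[Request], latencies: List[float], wire: List[str]) -> None:
    pages = [f"page-{request.key}" for request in batch]  # stand-in for wait_lsn + get_vectored
    wire.extend(pages)  # stand-in for writing and flushing the response
    for request in batch:
        request.timer.observe(latencies)  # timers end only after the flush


latencies: List[float] = []
wire: List[str] = []
pending = [Request(key) for key in (1, 2, 3)]  # timers start as requests arrive
time.sleep(0.002)  # requests sit in the pending batch while the executor is busy
serve_batch(pending, latencies, wire)
```

Because there is one timer per request but, as described above, only a single throttle counter per `RequestContext`, a per-timer deduction of throttled time has no obvious place in this model either.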

This is the first step towards batching rollout.

Refs

- rollout neondatabase/cloud#20620
- task #9377
- uber-epic: #9376

…rks (#9993)

This is the first step towards batching rollout.

Refs

- rollout plan: neondatabase/cloud#20620
- task #9377
- uber-epic: #9376

VladLazar added labels a/performance (Area: relates to performance of the system), c/storage (Component: storage), c/storage/pageserver (Component: storage: pageserver), t/Epic (Issue type: Epic) Oct 14, 2024

VladLazar self-assigned this Oct 14, 2024

problame self-assigned this Nov 17, 2024

problame added a commit that referenced this issue Nov 20, 2024

page_service: add benchmark for batching

b695907

This was referenced Nov 20, 2024

page_service: add benchmark for batching #9820

Merged

page_service: rewrite batching to work without a timeout #9851

Merged

problame mentioned this issue Nov 29, 2024

page_service: batching observability & include throttled time in smgr metrics #9870

Merged

problame added a commit that referenced this issue Dec 3, 2024

page_service: enable batching in Rust & Python Tests + Python benchmarks

e2a3ae0

problame mentioned this issue Dec 3, 2024

page_service: enable batching in Rust & Python Tests + Python benchmarks #9993

Merged

awarus pushed a commit that referenced this issue Dec 5, 2024

page_service: enable batching in Rust & Python Tests + Python benchma…

16163fb

VladLazar commented Oct 14, 2024 • edited by problame
