Make DiskBtreeReader::dump async #4838

arpad-m · 2023-07-28T22:53:04Z

Problem

DiskBtreeReader::dump calls read_blk internally, which we want to make async in the future. dump currently relies on recursion, and a recursive async fn in Rust cannot call itself directly without boxing the returned future, so we want an alternative: traverse the tree using a loop and a manual stack.

Summary of changes

Make DiskBtreeReader::dump and all the places calling it async
Make DiskBtreeReader::dump non-recursive internally and use a stack instead. It now deparses the node in each iteration, which isn't optimal, but storing the parsed node is hard because it references the buffer it was parsed from, and self-referential data is hard in Rust. For a dumping function, speed isn't a priority, so we accept deparsing each node multiple times (up to as many times as the branching factor).
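
A minimal sketch of the loop-plus-stack shape described above. It is not the code from this PR: `Node`, `read_and_parse` and the in-memory `blocks` slice are hypothetical stand-ins for the on-disk node type and for `read_blk`, and the sketch is synchronous and returns the dump as lines so that it stays self-contained.

```rust
/// Hypothetical in-memory stand-in for an on-disk node (not the real type).
#[derive(Clone)]
enum Node {
    /// Separator keys paired with child block numbers.
    Inner(Vec<(u32, usize)>),
    /// Keys paired with values.
    Leaf(Vec<(u32, u64)>),
}

/// Stand-in for "read the block, then parse it"; the real reader does I/O here.
fn read_and_parse(blocks: &[Node], blknum: usize) -> Node {
    blocks[blknum].clone()
}

/// Dump the tree rooted at `root` without recursion.
/// Each stack entry is (block number, depth, index of the next child to descend into).
fn dump(blocks: &[Node], root: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut stack = vec![(root, 0usize, 0usize)];
    while let Some((blknum, depth, child_idx)) = stack.pop() {
        // Only plain integers live on the stack, never a borrowed node,
        // so the node is parsed again every time we come back to it.
        let node = read_and_parse(blocks, blknum);
        let indent = "  ".repeat(depth);
        match node {
            Node::Leaf(entries) => {
                for (key, value) in entries {
                    out.push(format!("{indent}leaf blk {blknum}: {key} -> {value}"));
                }
            }
            Node::Inner(children) => {
                if child_idx == 0 {
                    // Print the node header on the first visit only.
                    out.push(format!("{indent}inner blk {blknum}: {} children", children.len()));
                }
                if let Some(&(_, child_blk)) = children.get(child_idx) {
                    // Come back to this node later for the next child...
                    stack.push((blknum, depth, child_idx + 1));
                    // ...but descend into the current child first.
                    stack.push((child_blk, depth + 1, 0));
                }
            }
        }
    }
    out
}
```

In this sketch an inner node with `k` children is parsed `k + 1` times, which is the repeated-parsing trade-off the description accepts for a debugging function.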

Part of #4743

I have verified that output is unchanged by comparing the output of this command both before and after this patch:

cargo test -p pageserver -- particular_data --nocapture

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2023-07-28T23:18:39Z

1240 tests run: 1187 passed, 0 failed, 53 skipped (full report)

Flaky tests (1)

Postgres 15

test_remote_timeline_client_calls_started_metric[local_fs]: release

Make DiskBtreeReader::{visit, get} async #4863

## Problem

`DiskBtreeReader::get` and `DiskBtreeReader::visit` both call `read_blk` internally, which we would like to make async in the future. This PR focuses on making the interface of these two functions `async`. There is further work to be done in making `visit` non-recursive, similar to #4838. For that, see #4884.

Builds on top of #4839, part of #4743

## Summary of changes

Make `DiskBtreeReader::get` and `DiskBtreeReader::visit` async functions and `await` in the places that call these functions.

Make the implementation of DiskBtreeReader::visit non-recursive #4884

## Problem

The `DiskBtreeReader::visit` function calls `read_blk` internally, and while #4863 converted the API of `visit` to async, the internal function is still recursive. So, analogously to #4838, we turn the recursive function into an iterative one.

## Summary of changes

First, we prepare the change by moving the for loop outside of the case switch, so that there is only one loop that recurses. Then, we switch from recursion to an approach where we store the search path inside the tree on a stack on the heap.

The caller of the `visit` function can control when the search over the B-tree ends by returning `false` from the closure. This is often used to find only one specific entry (by always returning `false`), but it is also used to iterate over all entries of the B-tree (by always returning `true`), or to look for ranges (mostly in tests, but `get_value_reconstruct_data` also has such a use).

Each stack entry contains two things: the block number (aka the block's offset), and a children iterator. The children iterator is constructed depending on the search direction, and from the result of a binary search over the node's children list. It is the only thing that survives a spill/push to the stack; everything else is reconstructed. In other words, each stack spill will, if the search is still ongoing, cause the node to be re-parsed entirely. Theoretically, this would be a linear overhead in the number of leaves the search visits. However, one needs to note:

* the workloads that look for a specific entry only ever visit one leaf, so this is mostly about workloads that visit larger ranges, including ones that visit the entire B-tree.
* the requests first hit the page cache, so often the cost is just in terms of node deserialization.
* for nodes that only have leaf nodes as children, no spilling to the stack-on-heap happens (outside of the initial request where the iterator is `None`).
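
The stack entries described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `Node`, `read_and_parse` and the in-memory `blocks` slice are hypothetical, only the forward search direction is shown, the "no spilling for nodes with leaf children" optimization is left out, and the function returns the number of node parses purely to make the re-parsing cost visible.

```rust
use std::ops::Range;

/// Hypothetical in-memory stand-in for an on-disk node (not the real type).
#[derive(Clone)]
enum Node {
    /// `keys[i]` is the smallest key reachable through `children[i]`.
    Inner { keys: Vec<u32>, children: Vec<usize> },
    Leaf(Vec<(u32, u64)>),
}

/// Stand-in for "read the block, then parse it"; the real reader does I/O here.
fn read_and_parse(blocks: &[Node], blknum: usize) -> Node {
    blocks[blknum].clone()
}

/// Visit entries with key >= `search_key` in ascending order until `visitor`
/// returns `false`. Returns how many times a node was parsed.
fn visit<F>(blocks: &[Node], root: usize, search_key: u32, mut visitor: F) -> usize
where
    F: FnMut(u32, u64) -> bool,
{
    let mut parses = 0;
    // Each entry: (block number, children iterator). `None` means "first visit".
    let mut stack: Vec<(usize, Option<Range<usize>>)> = vec![(root, None)];
    while let Some((blknum, saved_iter)) = stack.pop() {
        // Only the iterator survived the spill; the node itself is rebuilt.
        let node = read_and_parse(blocks, blknum);
        parses += 1;
        match node {
            Node::Leaf(entries) => {
                for (key, value) in entries {
                    if key >= search_key && !visitor(key, value) {
                        return parses; // the caller asked to stop
                    }
                }
            }
            Node::Inner { keys, children } => {
                let mut iter = saved_iter.unwrap_or_else(|| {
                    // First visit: binary search for the child that may hold `search_key`.
                    let start = keys.partition_point(|k| *k <= search_key).saturating_sub(1);
                    start..children.len()
                });
                if let Some(idx) = iter.next() {
                    if !iter.is_empty() {
                        // Spill: remember which children are still to be visited.
                        stack.push((blknum, Some(iter)));
                    }
                    stack.push((children[idx], None));
                }
            }
        }
    }
    parses
}
```

With a closure that always returns `false`, the spilled entries are pushed but never popped, which matches the "find one specific entry" case; with a closure that always returns `true`, every spilled entry leads to one more parse of its inner node.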
Given these points, for balanced trees the spilling overhead is $\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor and `n` is the number of nodes in the tree. The B-trees in the current implementation have a branching factor of roughly `PAGE_SZ/L`, where `PAGE_SZ` is 8192 and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18` in production code, so this gives us an estimate that we'd be re-loading an inner node for every 99000 leaves in the B-tree in the worst case.

Due to the points above, I'd say that not fully caching the inner nodes with inner children is reasonable, especially as we also want to be fast for the "find one specific entry" workloads, where the stack content is never accessed: any action to make the spilling computationally more complex would contribute to wasted cycles here, even if these workloads "only" spill one node for each depth level of the B-tree (which is practically always a low single-digit number; Kleppmann points out on page 81 that for branching factor 500, a four-level B-tree with 4 KB pages can store 250 TB of data).

Disclaimer: this is all reasoning I did in my head; I have not confirmed it with any benchmarks or data.

Builds on top of #4863, part of #4743
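
The 99000 figure can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that takes the "roughly `PAGE_SZ/L`" approximation from the text at face value and ignores node headers and value bytes; the function names are made up for illustration.

```rust
const PAGE_SZ: usize = 8192;
const DELTA_KEY_SIZE: usize = 26;
const KEY_SIZE: usize = 18;

/// Approximate branching factor b = PAGE_SZ / L (integer division, headers ignored).
fn approx_branching_factor(key_len: usize) -> usize {
    PAGE_SZ / key_len
}

/// Number of leaves covered per re-loaded inner node, i.e. b squared.
fn leaves_per_inner_reload(key_len: usize) -> usize {
    let b = approx_branching_factor(key_len);
    b * b
}
```

Under these assumptions, `DELTA_KEY_SIZE = 26` gives `b = 315` and `b * b = 99225`, which is the "every 99000 leaves" worst case; `KEY_SIZE = 18` gives `b = 455` and `b * b = 207025`.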

arpad-m added 3 commits July 29, 2023 00:46

Make DiskBtreeReader::dump async

36fd6df

Rename dump_recurse to dump_inner

e4d7c78

Make dump non-recursive and use a stack instead

6840706

arpad-m requested review from a team as code owners July 28, 2023 22:53

arpad-m requested review from fprasx and koivunej and removed request for a team July 28, 2023 22:53

hlinnaka approved these changes Jul 29, 2023

koivunej approved these changes Jul 31, 2023

arpad-m merged commit e5183f8 into main Jul 31, 2023

arpad-m deleted the arpad/pageserver_io_async_btree branch July 31, 2023 10:52

This was referenced Aug 1, 2023

Make DiskBtreeReader::{visit, get} async #4863

Merged

Make the implementation of DiskBtreeReader::visit non-recursive #4884

Merged

Epic: convert remaining IO stack for Timeline::get to async fn #4743

Closed
