
ephemeral file: refactor write_blob impl to concentrate mutable state #5004

Merged: 13 commits, Aug 17, 2023

Conversation

problame (Contributor)


Before this patch, `off` and `blknum` were function-wide
mutable state. Now they are contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for #4994
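The shape of the refactor can be sketched as follows. This is a minimal, hypothetical illustration, not the actual pageserver code: `PAGE_SZ`, the `Writer` fields, and the plain in-memory buffer are simplified stand-ins for the real page-cache-backed implementation.

```rust
const PAGE_SZ: usize = 8192;

struct Writer {
    // All mutable positioning state lives here instead of in the function body.
    blknum: u32,  // block that the current buffer belongs to
    off: usize,   // write offset within that block
    buf: Vec<u8>, // stand-in for the page-cache-backed write buffer
}

impl Writer {
    fn new() -> Self {
        Writer { blknum: 0, off: 0, buf: vec![0; PAGE_SZ] }
    }

    // Append-only writes: callers never do index arithmetic themselves.
    fn push_bytes(&mut self, mut src: &[u8]) {
        while !src.is_empty() {
            if self.off == PAGE_SZ {
                // Buffer full; in the real code this page would be handed
                // back to the page cache and a fresh one obtained.
                self.blknum += 1;
                self.off = 0;
            }
            let n = (PAGE_SZ - self.off).min(src.len());
            self.buf[self.off..self.off + n].copy_from_slice(&src[..n]);
            self.off += n;
            src = &src[n..];
        }
    }

    fn pos(&self) -> u64 {
        self.blknum as u64 * PAGE_SZ as u64 + self.off as u64
    }
}

fn main() {
    let mut w = Writer::new();
    w.push_bytes(&[1; 100]);
    w.push_bytes(&[2; PAGE_SZ]); // crosses a page boundary
    println!("pos = {}", w.pos()); // prints "pos = 8292"
}
```

With the positioning state concentrated in the struct, `push_bytes` call sites never touch `off` or `blknum` directly, which is what makes the control flow easier to reason about.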
@problame problame requested review from a team as code owners August 16, 2023 13:36
@problame problame requested review from awestover, koivunej, hlinnaka and arpad-m and removed request for a team and awestover August 16, 2023 13:36
github-actions bot commented Aug 16, 2023

1596 tests run: 1522 passed, 0 failed, 74 skipped (full report)


The comment gets automatically updated with the latest test results (2985cc6 at 2023-08-17T09:51:27.406Z).

Results:

```
cs@devvm:[~/src/neon-work-2]: rm -r target/criterion/
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
   Compiling utils v0.1.0 (/home/cs/src/neon-work-2/libs/utils)
    Finished bench [optimized + debuginfo] target(s) in 2.50s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [18.001 ns 18.229 ns 18.487 ns]
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
write_blob/pr-5004-writer
                        time:   [38.320 ns 38.798 ns 39.291 ns]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2369 ns 6.3093 ns 6.3926 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
```
Before, each `push_bytes` call would call `get_buf_for_write`.
By memoizing the buffer, we bring the number of `get_buf_for_write` calls
down to the same number as before this PR.

Benchmark results show the improvement (`write_blob/pr-5004-writer`), but
there's still some overhead over the pre-PR baseline (`write_blob/old`).

I don't know where it comes from, but I'm ok with that overhead (17.3 ns vs 25.6 ns).
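The memoization idea can be sketched like this. Everything here is a hypothetical stand-in (the real `get_buf_for_write` returns a page-cache write guard, not a `Vec`); the only point is that the buffer is re-acquired once per block, not once per `push_bytes` call.

```rust
use std::cell::Cell;

const PAGE_SZ: usize = 8192;

// Stand-in for the page cache: counts how often a write buffer is handed out.
struct PageCache {
    gets: Cell<usize>,
}

impl PageCache {
    fn get_buf_for_write(&self) -> Vec<u8> {
        self.gets.set(self.gets.get() + 1);
        vec![0u8; PAGE_SZ]
    }
}

struct Writer<'a> {
    cache: &'a PageCache,
    buf: Option<(u32, Vec<u8>)>, // memoized (blknum, buffer) pair
    blknum: u32,
    off: usize,
}

impl<'a> Writer<'a> {
    fn push_bytes(&mut self, mut src: &[u8]) {
        while !src.is_empty() {
            if self.off == PAGE_SZ {
                self.blknum += 1;
                self.off = 0;
                self.buf = None; // drop the stale lease
            }
            // Re-acquire only when we moved to a new block, not per call.
            if self.buf.as_ref().map(|(b, _)| *b) != Some(self.blknum) {
                self.buf = Some((self.blknum, self.cache.get_buf_for_write()));
            }
            let (_, buf) = self.buf.as_mut().unwrap();
            let n = (PAGE_SZ - self.off).min(src.len());
            buf[self.off..self.off + n].copy_from_slice(&src[..n]);
            self.off += n;
            src = &src[n..];
        }
    }
}

fn main() {
    let cache = PageCache { gets: Cell::new(0) };
    let mut w = Writer { cache: &cache, buf: None, blknum: 0, off: 0 };
    for _ in 0..100 {
        w.push_bytes(&[0xAB; 16]); // 100 small writes, all within page 0
    }
    // Without memoization this would be 100 acquisitions; with it, 1.
    println!("get_buf_for_write calls: {}", cache.gets.get()); // prints 1
}
```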

```
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
   Compiling utils v0.1.0 (/home/cs/src/neon-work-2/libs/utils)
    Finished bench [optimized + debuginfo] target(s) in 2.43s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.335 ns 17.584 ns 17.853 ns]
                        change: [-4.3206% -1.9089% +0.7827%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  5 (5.00%) high mild
  5 (5.00%) high severe
write_blob/pr-5004-writer
                        time:   [25.637 ns 25.959 ns 26.361 ns]
                        change: [-34.513% -32.589% -30.577%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2263 ns 6.2959 ns 6.3760 ns]
                        change: [-2.1835% -0.3094% +1.6064%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
```
Shaves off a few nanoseconds in #5004, so let's have it.

```
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.23s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.930 ns 18.222 ns 18.518 ns]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
write_blob/pr-5004-writer
                        time:   [22.153 ns 22.538 ns 22.942 ns]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2702 ns 6.3597 ns 6.4649 ns]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```
With the previous commit, we no longer need the `Option`.

Shaves off another 2 ns.

```
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.22s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [18.051 ns 18.301 ns 18.552 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
write_blob/pr-5004-writer
                        time:   [19.766 ns 20.019 ns 20.314 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.3191 ns 6.3981 ns 6.4814 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.22s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.859 ns 18.089 ns 18.279 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low severe
write_blob/pr-5004-writer
                        time:   [17.886 ns 18.040 ns 18.221 ns]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
write_blob/pr-4994-nopagecache
                        time:   [6.5882 ns 6.6204 ns 6.6529 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
```
@koivunej (Member) left a comment


I would have wished this impl looked more like the std impls of `std::io::Write`, because that's the trait equivalent of this impl. Though I'm unsure whether `std::io::Write` or "semi-equivalent" writers are the way to go in an effort to asyncify.

@problame (Contributor, Author)

Yeah, didn't care for the std traits due to the asyncification efforts.

@problame problame enabled auto-merge (squash) August 17, 2023 10:05
@problame problame merged commit 957af04 into main Aug 17, 2023
@problame problame deleted the problame/ephemeral-write-blob-refactor branch August 17, 2023 10:07
problame added a commit that referenced this pull request Aug 18, 2023
(This PR is the successor of #4984.)

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Further, many great open-source libraries exist that solve the problem of a
read-only cache much better than our `page_cache.rs` does
(e.g., with a better replacement policy and less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on data, but generally random. Why?
  * In classical LSM terms, this function is what writes the memory-resident `C0` tree into the disk-resident `C1` tree.
  * In our system, though, the values of `InMemoryLayer` are stored in an `EphemeralFile`, and hence they are not guaranteed to be memory-resident.
  * The function reads `Value`s in `Key, LSN` order, which is `!=` insert order.

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
  * The common case is that `Value` is a WAL record, and if it isn't a full-page-image WAL record, then it's smaller than `PAGE_SIZE`.
  * So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE` sized writes.
  * If there's no page cache eviction between subsequent `InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident, so the page cache avoids doing a `write` system call.
  * In practice, a busy pageserver will have page cache evictions because we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the other page types.
* **reads for `create_delta_layer`**:
  * The `Value` reads happen through a `BlockCursor`, which optimizes the case of repeated reads from the same page.
  * The best case is that subsequent values are located on the same page; then the `BlockCursor`'s buffer is maximally effective.
  * The worst case is that each `Value` is on a different page; then the `BlockCursor`'s 1-page-sized buffer is ineffective.
  * The best case translates into `256MB/PAGE_SIZE` page cache accesses, one per page; the worst case translates into `#Values` page cache accesses.
  * Again, the page cache accesses must be assumed to be random, because the `Value`s aren't accessed in insertion order but in `Key, LSN` order.
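The `BlockCursor` buffering behavior described above can be sketched as follows. This is an illustrative model, not the real `BlockCursor` (which reads through the pageserver's block I/O layer); it only demonstrates why same-page access patterns are cheap and page-hopping ones are not.

```rust
const PAGE_SZ: usize = 8192;

// Hypothetical 1-page read buffer in the style of BlockCursor.
struct BlockCursor<'a> {
    file: &'a [u8],                   // stand-in for the underlying layer file
    cached: Option<(usize, Vec<u8>)>, // (blknum, page) of the last read
    reads: usize,                     // counts underlying page fetches
}

impl<'a> BlockCursor<'a> {
    fn read_blk(&mut self, blknum: usize) -> &[u8] {
        if self.cached.as_ref().map(|(b, _)| *b) != Some(blknum) {
            // Page miss: fetch from the underlying file.
            self.reads += 1;
            let page = self.file[blknum * PAGE_SZ..(blknum + 1) * PAGE_SZ].to_vec();
            self.cached = Some((blknum, page));
        }
        &self.cached.as_ref().unwrap().1
    }
}

fn main() {
    let file = vec![0u8; 4 * PAGE_SZ];
    let mut cur = BlockCursor { file: &file, cached: None, reads: 0 };
    // Best case: 100 values on the same page -> 1 underlying read.
    for _ in 0..100 {
        cur.read_blk(0);
    }
    assert_eq!(cur.reads, 1);
    // Worst case: every value on a different page -> 1 read per access.
    for b in [1, 2, 3, 2, 1] {
        cur.read_blk(b);
    }
    assert_eq!(cur.reads, 6);
    println!("reads = {}", cur.reads); // prints "reads = 6"
}
```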

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`.
* Move `block_io::FileId` to `page_cache::FileId`.
* Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct. It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering instead of a page cache page.
  * If `mutable_tail` is full, write it out to disk, zero it out, and re-use it.
  * There is explicitly no double-buffering, so that memory allocation per `EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending on `blknum`:
  * For the `blknum` that corresponds to the `mutable_tail`, return a ref to it. Rust borrowing rules prevent `write_blob` calls while such refs are outstanding.
  * For all non-tail blocks, return a page-cached `ImmutablePage`. It is safe to page-cache these as `ImmutablePage` because `EphemeralFile` is append-only.
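The `mutable_tail` scheme can be sketched like this. It is a simplified model under stated assumptions: the real `EphemeralFile` writes the full tail page to a file on disk and serves non-tail blocks via the page cache, whereas here the "file" is just a `Vec` and reads are direct.

```rust
const PAGE_SZ: usize = 8192;

// Hypothetical sketch of the append-only buffering scheme described above.
struct EphemeralFile {
    file: Vec<u8>,               // stand-in for the on-disk file contents
    mutable_tail: [u8; PAGE_SZ], // fixed-size write buffer, no double-buffering
    off: usize,                  // fill level of mutable_tail
}

impl EphemeralFile {
    fn new() -> Self {
        EphemeralFile { file: Vec::new(), mutable_tail: [0; PAGE_SZ], off: 0 }
    }

    fn push_bytes(&mut self, mut src: &[u8]) {
        while !src.is_empty() {
            if self.off == PAGE_SZ {
                // Tail full: write it out, zero it, and re-use it, so
                // memory allocation per file stays fixed.
                self.file.extend_from_slice(&self.mutable_tail);
                self.mutable_tail = [0; PAGE_SZ];
                self.off = 0;
            }
            let n = (PAGE_SZ - self.off).min(src.len());
            self.mutable_tail[self.off..self.off + n].copy_from_slice(&src[..n]);
            self.off += n;
            src = &src[n..];
        }
    }

    // read_blob analogue: the tail block is served from mutable_tail,
    // everything before it from the (append-only, hence immutable) file.
    fn read_block(&self, blknum: usize) -> &[u8] {
        let tail_blknum = self.file.len() / PAGE_SZ;
        if blknum == tail_blknum {
            &self.mutable_tail
        } else {
            &self.file[blknum * PAGE_SZ..(blknum + 1) * PAGE_SZ]
        }
    }
}

fn main() {
    let mut f = EphemeralFile::new();
    f.push_bytes(&[7u8; PAGE_SZ + 10]); // fills block 0, spills into the tail
    assert_eq!(f.read_block(0)[0], 7);  // served from the "file"
    assert_eq!(f.read_block(1)[9], 7);  // served from mutable_tail
    assert_eq!(f.read_block(1)[10], 0); // tail is zeroed past the fill level
    println!("ok");
}
```

In the real code, `read_block` returning a shared borrow of `mutable_tail` is exactly what lets the borrow checker reject concurrent `write_blob` calls while a tail ref is outstanding.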

## Performance 

How do the changes above affect performance?
My claim is: not significantly.

* **write path**:
  * Before this PR, `EphemeralFile::write_blob` didn't issue its own `write` system calls.
    * If there were enough free pages, it didn't issue *any* `write` system calls.
    * If it had to evict other `EphemeralPage`s to get a page for its writes (`get_buf_for_write`), the page cache code would implicitly issue the writeback of victim pages as needed.
  * With this PR, `EphemeralFile::write_blob` *always* issues *all* of its *own* `write` system calls.
    * Also, the writes are explicit instead of implicit through page cache writeback, which will help #4743.
  * The perf impact of always doing the writes is the CPU overhead and syscall latency.
    * Before this PR, we might never have issued them if there were enough free pages.
    * We don't issue `fsync`, so we can expect the writes to only hit the kernel page cache.
  * There is also an advantage in issuing the writes directly: the perf impact is paid by the tenant that causes the writes, instead of by whatever tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
  * The `write_blob` function pre-warms the page cache when it writes the `mutable_tail` to disk.
  * So, the behavior is the same as with the `EphemeralPage`s before this PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
    * Going through the page cache likely causes read amplification here. Why? Due to the `Key,Lsn`-ordered access pattern, we don't read all the values in a page before moving to the next page. In the worst case, we might read the same page multiple times to read different `Value`s from it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
      * bypass PS page cache + use `copy_file_range` to copy from the ephemeral file into the L0 delta file, without going through user space