pageserver: make `BufferedWriter` do double-buffering #9693

yliang412 · 2024-11-08T15:20:52Z

Closes #9387.

Problem

BufferedWriter cannot proceed while the owned buffer is flushing to disk. We want to implement double buffering so that the flush can happen in the background. See #9387.

Summary of changes

Maintain two owned buffers in BufferedWriter.
The writer copies data into an owned, aligned buffer and, once the buffer is full, submits it to the flush task.
The background flush task flushes the owned buffer to disk and returns it to the writer for reuse.
The writer and the background flush task communicate through a bi-directional channel.
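
The hand-off described above can be sketched roughly as follows. This is a minimal illustration and not the pageserver implementation: it uses std threads and `std::sync::mpsc` channels where the real code uses async tasks, and `DoubleBufferedWriter`, `submit`, and `finish` are hypothetical names. A plain `Vec<u8>` stands in for both the aligned buffer and the disk.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a double-buffered writer.
struct DoubleBufferedWriter {
    current: Vec<u8>,            // buffer currently being filled
    cap: usize,                  // fixed size of each buffer
    to_flush: Sender<Vec<u8>>,   // writer -> flush task (full buffers)
    recycled: Receiver<Vec<u8>>, // flush task -> writer (emptied buffers)
    task: JoinHandle<Vec<u8>>,   // flush task; returns the simulated disk contents
}

impl DoubleBufferedWriter {
    fn new(cap: usize) -> Self {
        assert!(cap > 0, "buffer capacity must be non-zero");
        let (to_flush, flush_rx) = channel::<Vec<u8>>();
        let (recycle_tx, recycled) = channel::<Vec<u8>>();
        // The second buffer starts out as "already returned" to the writer.
        recycle_tx.send(Vec::with_capacity(cap)).unwrap();
        let task = thread::spawn(move || {
            let mut disk = Vec::new(); // assumption: a Vec simulates the file
            // The loop ends once the writer drops its sender.
            for mut buf in flush_rx {
                disk.extend_from_slice(&buf);
                buf.clear();
                // The writer may already be shutting down; ignore send errors.
                let _ = recycle_tx.send(buf);
            }
            disk
        });
        Self { current: Vec::with_capacity(cap), cap, to_flush, recycled, task }
    }

    /// Copy `src` into the current buffer, submitting it whenever it fills up.
    fn write(&mut self, mut src: &[u8]) {
        while !src.is_empty() {
            let n = (self.cap - self.current.len()).min(src.len());
            self.current.extend_from_slice(&src[..n]);
            src = &src[n..];
            if self.current.len() == self.cap {
                self.submit();
            }
        }
    }

    /// Swap in the spare buffer, then hand the full one to the flush task.
    /// This blocks only if the previous flush has not finished yet.
    fn submit(&mut self) {
        let spare = self.recycled.recv().unwrap();
        let full = std::mem::replace(&mut self.current, spare);
        self.to_flush.send(full).unwrap();
    }

    /// Flush the remainder, close the channel, and wait for the flush task.
    fn finish(mut self) -> Vec<u8> {
        if !self.current.is_empty() {
            self.submit();
        }
        let Self { to_flush, task, .. } = self;
        drop(to_flush);
        task.join().unwrap()
    }
}
```

Because exactly two buffers circulate, the writer can keep filling one buffer while the other is being flushed, and it stalls only when both are full.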

For the in-memory layer, we also need to be able to read from the buffered writer in get_values_reconstruct_data. To handle this case, we did the following:

Replace VirtualFile::write_all with VirtualFile::write_all_at, and use Arc to share the file between the writer and the background task.
Leverage IoBufferMut::freeze to get a cheaply clonable IoBuffer. One clone is submitted to the channel; the other is kept in the writer to serve reads. When we want to reuse the buffer, we invoke IoBuffer::into_mut, which gives us back the mutable aligned buffer.
InMemoryLayer reads are now aware of the maybe_flushed part of the buffer.
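
To illustrate the freeze / into_mut idea and the three regions a read may touch, here is a rough sketch. It assumes `Arc<Vec<u8>>` as a stand-in for `IoBuffer`; the free functions `freeze` and `into_mut`, the `Writer` struct, and `read_at` are hypothetical and do not mirror the real signatures. In particular, how the real `IoBuffer::into_mut` behaves while other clones exist is not stated in this PR, so the `Result` return here is an assumption.

```rust
use std::sync::Arc;

/// Assumption: an `Arc<Vec<u8>>` plays the role of a cheaply clonable frozen buffer.
type Frozen = Arc<Vec<u8>>;

/// Freeze a mutable buffer so it can be shared between reader and flush task.
fn freeze(buf: Vec<u8>) -> Frozen {
    Arc::new(buf)
}

/// Give the mutable buffer back, but only if no other clone is still alive.
fn into_mut(buf: Frozen) -> Result<Vec<u8>, Frozen> {
    Arc::try_unwrap(buf)
}

/// Hypothetical writer state as seen by the read path.
struct Writer {
    flushed: Vec<u8>,              // bytes known to be on disk (simulated)
    maybe_flushed: Option<Frozen>, // last submitted buffer; flush may be pending
    mutable: Vec<u8>,              // buffer currently being filled
}

impl Writer {
    /// Read up to `len` bytes at logical offset `start`, stitching together
    /// the flushed, maybe_flushed, and mutable regions in that order.
    fn read_at(&self, start: usize, len: usize) -> Vec<u8> {
        let maybe: &[u8] = match &self.maybe_flushed {
            Some(b) => b.as_slice(),
            None => &[],
        };
        self.flushed
            .iter()
            .chain(maybe)
            .chain(&self.mutable)
            .skip(start)
            .take(len)
            .copied()
            .collect()
    }
}
```

Serving reads from the retained clone means the reader never has to know whether the background flush has completed.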

Caveat

We removed the owned version of write because that interface does not work well with buffer alignment. As a result, without direct IO enabled, download_object does one more memcpy than before this PR, due to the switch to the _borrowed version of the write.
"Bypass aligned part of write" could be implemented later to avoid a large amount of memcpy.
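
One way such a bypass could split an incoming write is sketched below. This is only an illustration of the idea, not a proposal for the actual implementation: `split_for_bypass` and its parameters are hypothetical, and the sketch reasons about lengths only. It ignores the memory-address alignment of the source slice, which direct IO also requires.

```rust
/// Split `src` into (head, middle, tail), given that `buffered` bytes are
/// already sitting in a buffer of size `buf_cap`:
/// - head tops up the partially filled buffer,
/// - middle is a whole number of `buf_cap`-sized chunks that could be
///   written directly without copying,
/// - tail is the remainder that must be copied into the buffer.
fn split_for_bypass(src: &[u8], buffered: usize, buf_cap: usize) -> (&[u8], &[u8], &[u8]) {
    assert!(buf_cap > 0 && buffered < buf_cap);
    // Only top up when the buffer is partially filled.
    let head_len = if buffered == 0 {
        0
    } else {
        (buf_cap - buffered).min(src.len())
    };
    let (head, rest) = src.split_at(head_len);
    // Largest prefix of the rest that is a multiple of the buffer size.
    let middle_len = rest.len() / buf_cap * buf_cap;
    let (middle, tail) = rest.split_at(middle_len);
    (head, middle, tail)
}
```

With this split, only the head and tail are copied, so the memcpy cost stays below two buffer sizes per write regardless of how large the write is.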

Testing

Use a oneshot-channel-based control mechanism to make flush behavior deterministic in tests.
Test reading from EphemeralFile when the last submitted buffer is not yet flushed, is being flushed, and is done flushing to disk.
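
A rough sketch of how such a control mechanism can pin the flush to each of the three states is shown here. It is an assumption-laden illustration: std `mpsc` channels, each used exactly once, stand in for oneshot channels, a thread stands in for the flush task, and `FlushControl`, `FlushState`, and `spawn_controlled_flush` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

#[derive(Debug, Clone, Copy, PartialEq)]
enum FlushState {
    NotStarted,
    InProgress,
    Done,
}

/// Test-side handle: each channel is used once, like a oneshot.
struct FlushControl {
    state: Arc<Mutex<FlushState>>,
    start: Sender<()>,      // test permits the flush to begin
    started: Receiver<()>,  // task confirms it is in progress
    finish: Sender<()>,     // test permits the flush to complete
    finished: Receiver<()>, // task confirms it is done
}

impl FlushControl {
    fn state(&self) -> FlushState {
        *self.state.lock().unwrap()
    }
}

/// Spawn a simulated flush that only advances when the test allows it.
fn spawn_controlled_flush() -> (FlushControl, JoinHandle<()>) {
    let state = Arc::new(Mutex::new(FlushState::NotStarted));
    let (start, start_rx) = channel();
    let (started_tx, started) = channel();
    let (finish, finish_rx) = channel();
    let (finished_tx, finished) = channel();
    let task_state = Arc::clone(&state);
    let handle = thread::spawn(move || {
        start_rx.recv().unwrap(); // wait until the test releases the flush
        *task_state.lock().unwrap() = FlushState::InProgress;
        started_tx.send(()).unwrap();
        finish_rx.recv().unwrap(); // the actual disk write would happen here
        *task_state.lock().unwrap() = FlushState::Done;
        finished_tx.send(()).unwrap();
    });
    (FlushControl { state, start, started, finish, finished }, handle)
}
```

A test can then assert on reads between each pair of signals, knowing that the flush cannot move to the next state on its own.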

Performance

We see a performance improvement for small values and a regression for big values, likely due to being CPU bound plus disk write latency.

Results

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

Signed-off-by: Yuchen Liang <[email protected]>

github-actions · 2024-11-08T16:23:45Z

7018 tests run: 6710 passed, 0 failed, 308 skipped (full report)

Flaky tests (1)

Postgres 14

test_pull_timeline[True]: release-x86-64

Code coverage* (full report)

functions: 30.8% (8306 of 26946 functions)
lines: 47.8% (65399 of 136789 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 1da4028 at 2024-12-03T15:31:43.605Z

problame · 2024-11-13T12:45:20Z

write_buffered vs write_buffered_borrowed: my gut feeling is that in practice on-demand downloads did benefit from the old behavior where we were able to bypass the buffer (lower CPU usage).

We have that pagebench sub-benchmark for on-demand downloads, you could compare CPU usage before and after this change.

But, might be faster to "just" address this TODO.

Maybe you can be generic over constraints on the buffer type by making the buffer type an associated type of the writer?

problame

Ok, again on write_buffered / write_buffered_borrowed.
Let's call it "buffer bypass for aligned parts of the write".

I remembered that with O_DIRECT, we save one memcpy so we can spend one and come out net 0 wrt CPU efficiency.

It would be nice to have a CPU efficiency WIN, though I'm ok with net 0.

The only remaining CPU efficiency difference that I can think of right now is that write_buffered issues one giant write for the entire middle of the buffer, whereas write_buffered_borrowed issues TAIL_SZ'd writes.

Left a couple of comments that need addressing. Let's discuss major unclarities on Slack.

hold timeline open in background task using gate guard #9825

Problem

The newly added flush task in #9693 should hold the timeline gate open, to avoid doing local IO after timeline shutdown completes.

Solution

Pass the timeline gate guard to the flush background task. The flush task does not need a cancellation token because it shuts down automatically when the front writer task drops the channel.
Refactor relevant paths to pass down `&Gate` instead of `GateGuard`.

Signed-off-by: Yuchen Liang <[email protected]>
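
The shutdown behavior described here can be sketched as follows. This is a simplified illustration under stated assumptions: `Gate` and `GateGuard` below are hypothetical reference-counted stand-ins and not the pageserver's real types, a thread stands in for the flush task, and `spawn_flush_task` is an invented name.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical gate: counts how many guards are currently held.
#[derive(Default)]
struct Gate(Arc<()>);

/// Hypothetical guard: keeps the gate "open" until it is dropped.
struct GateGuard {
    _token: Arc<()>,
}

impl Gate {
    fn enter(&self) -> GateGuard {
        GateGuard { _token: Arc::clone(&self.0) }
    }

    fn guards_held(&self) -> usize {
        Arc::strong_count(&self.0) - 1
    }
}

/// Spawn a flush task that holds a gate guard for its whole lifetime and
/// exits on its own once every sender for the channel has been dropped.
fn spawn_flush_task(gate: &Gate) -> (Sender<Vec<u8>>, JoinHandle<usize>) {
    let guard = gate.enter(); // taken before spawning, so it is held immediately
    let (tx, rx) = channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        let _guard = guard; // released only when the task ends
        let mut flushed = 0;
        // No cancellation token: the loop ends when the writer drops `tx`.
        for buf in rx {
            flushed += buf.len(); // the disk write would happen here
        }
        flushed
    });
    (tx, handle)
}
```

Tying the task's lifetime to the channel means shutdown only has to drop the writer; the guard then keeps the gate open until the last pending flush has drained.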

problame

Let's see how this works in staging. Preprod deployment later this week.

pageserver: fix buffered-writer on macos build #10019

Problem

In #9693, we forgot to check the macOS build. The [CI run](https://github.com/neondatabase/neon/actions/runs/12164541897/job/33926455468) on main showed that the macOS build failed with unused variables and dead code.

Summary of changes

Add `allow(dead_code)` and `allow(unused_variables)` to the relevant code that is not used on macOS.

Signed-off-by: Yuchen Liang <[email protected]>
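
For readers unfamiliar with this kind of fix, one common shape is a platform-conditional lint allowance. The sketch below is a guess at the general pattern, not the code from that PR: `round_up_to_alignment` and `padded_len` are hypothetical, and the PR may have applied the attributes differently.

```rust
/// Hypothetical helper that is only called on Linux. On macOS the lint is
/// silenced so that a warnings-as-errors build does not fail on dead code.
#[cfg_attr(target_os = "macos", allow(dead_code))]
fn round_up_to_alignment(len: usize, align: usize) -> usize {
    (len + align - 1) / align * align
}

/// Pads the length on Linux only; other platforms return it unchanged.
fn padded_len(len: usize) -> usize {
    #[cfg(target_os = "linux")]
    return round_up_to_alignment(len, 512);
    #[cfg(not(target_os = "linux"))]
    return len;
}
```

Scoping the allowance with `cfg_attr` keeps the lint active on the platform where the code is actually used.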

yliang412 added 3 commits November 7, 2024 20:44

eliminate size_tracking_writer

dd1c45e

Signed-off-by: Yuchen Liang <[email protected]>

change OwnedAsyncWriter trait to use write_all_at

224cbb4

Signed-off-by: Yuchen Liang <[email protected]>

use Arc around W: OwnedAsyncWriter

f0efc90

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 added 4 commits November 9, 2024 18:41

implement non-generic flush handle & bg task

26c8b50

Signed-off-by: Yuchen Liang <[email protected]>

make flush handle & task generic

4599804

Signed-off-by: Yuchen Liang <[email protected]>

use background flush for write path; read path broken

bdffc35

Signed-off-by: Yuchen Liang <[email protected]>

make InMemory read aware of mutable & maybe_flushed

e0848c2

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 mentioned this pull request Nov 11, 2024

pageserver: direct I/O #8130

Open

yliang412 self-assigned this Nov 11, 2024

yliang412 and others added 7 commits November 11, 2024 21:33

fix clippy

e5bb85d

Signed-off-by: Yuchen Liang <[email protected]>

fix tests

7b34e73

Signed-off-by: Yuchen Liang <[email protected]>

fix IoBufferMut::extend_from_slice

b0d7fc7

Signed-off-by: Yuchen Liang <[email protected]>

add IoBufAligned marker

ce7cd36

Signed-off-by: Yuchen Liang <[email protected]>

use open_with_options_v2 (O_DIRECT) for ephemeral file

20e6a0c

Signed-off-by: Yuchen Liang <[email protected]>

Merge branch 'main' into yuchen/double-buffered-writer

d6d8a16

fix clippy

ffd88ed

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 commented Nov 12, 2024

pageserver/src/tenant/remote_timeline_client/download.rs (outdated, resolved)

yliang412 commented Nov 12, 2024

pageserver/src/virtual_file/owned_buffers_io/write/flush.rs (outdated, resolved)

yliang412 commented Nov 12, 2024

pageserver/src/virtual_file/owned_buffers_io/write.rs (resolved)

add comments; make read buffering works with write_buffered (owned ve…

6844b5f

…rsion) Signed-off-by: Yuchen Liang <[email protected]>

yliang412 changed the title ~~[WIP] double buffered writer~~ pageserver: make BufferedWriter do double-buffering Nov 12, 2024

yliang412 requested a review from problame November 12, 2024 17:31

yliang412 marked this pull request as ready for review November 12, 2024 17:32

yliang412 requested a review from a team as a code owner November 12, 2024 17:32

problame requested changes Nov 13, 2024

yliang412 added 3 commits November 15, 2024 15:42

review: #9693 (comment)

990bc65

Signed-off-by: Yuchen Liang <[email protected]>

move duplex to utils; make flush behavior controllable in test

5acc61b

Signed-off-by: Yuchen Liang <[email protected]>

fix clippy

9db6b1e

Signed-off-by: Yuchen Liang <[email protected]>

yliang412 requested a review from problame November 25, 2024 05:45

yliang412 mentioned this pull request Nov 25, 2024

pageserver: do aligned writes for delta and image layers #9868

Open

2 tasks

yliang412 and others added 2 commits November 25, 2024 15:25

fix docs clippy

4284fcd

Signed-off-by: Yuchen Liang <[email protected]>

Merge branch 'main' into yuchen/double-buffered-writer

c3302ad

yliang412 mentioned this pull request Nov 27, 2024

hold timeline open in background task using gate guard #9825

Merged

yliang412 added 2 commits November 27, 2024 10:10

Merge branch 'main' into yuchen/double-buffered-writer

b6a2516

problame reviewed Dec 2, 2024

yliang412 and others added 8 commits December 2, 2024 15:58

review: remove unused impl Buffer for BytesMut

9f384a8

Signed-off-by: Yuchen Liang <[email protected]>

review: follow Buffer::extend_from_slice trait definition

bf9a6d0

Panics if IoBufferMut does not have enough capacity left to accommodate the source buffer. Signed-off-by: Yuchen Liang <[email protected]>

review: fix CheapCloneForRead for FullSlice

fac4269

consider cases where offset != 0 Signed-off-by: Yuchen Liang <[email protected]>

review: cleanup comments + expect_err

a439d57

Co-authored-by: Christian Schwarz <[email protected]>

review: move FlushHandle::handle_error right after ::flush

6a1aa52

Signed-off-by: Yuchen Liang <[email protected]>

review: set channel buffer size to 1

21ca0c4

Signed-off-by: Yuchen Liang <[email protected]>

Merge branch 'main' into yuchen/double-buffered-writer

9d1821a

fix clippy

1da4028

Signed-off-by: Yuchen Liang <[email protected]>

problame approved these changes Dec 4, 2024

yliang412 added this pull request to the merge queue Dec 4, 2024

Merged via the queue into main with commit e6cd505 Dec 4, 2024
82 checks passed

yliang412 deleted the yuchen/double-buffered-writer branch December 4, 2024 16:55

yliang412 mentioned this pull request Dec 4, 2024

pageserver: fix buffered-writer on macos build #10019

Merged

This was referenced Dec 9, 2024

pageserver: use direct io for delta + image layer writes #10063

Open

[WIP] use marker trait + refactor inmemory layer #9283

Closed

pageserver: implement buffer bypass for aligned parts of the write #10101

Open

jcsp assigned problame and unassigned yliang412 Dec 19, 2024
