-
Notifications
You must be signed in to change notification settings - Fork 458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: direct IO for Pageserver #8240
base: main
Are you sure you want to change the base?
Conversation
3042 tests run: 2925 passed, 2 failed, 115 skipped (full report)Failures on Postgres 15
Failures on Postgres 14
Test coverage report is not availableThe comment gets automatically updated with the latest test results
02ce39f at 2024-07-09T09:45:25.659Z :recycle: |
c05178d
to
b7b9be1
Compare
2. all indirect blocks (=disk btree blocks) are cached in the PS `PageCache`. | ||
The norm will be very low baseline replacement rates in PS `PageCache`. | ||
High baseline replacement rates will be treated as a signal of resource exhaustion (page cache insufficient to host working set of the PS). | ||
It will be remediated by the storage controller, migrating tenants away to relieve pressure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth caveating this with a note that this RFC/project doesn't cover such migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bulk of the design & coding work is to ensure adherence to the alignment requirements. | ||
|
||
Our automed benchmarks are insufficient to rule out performance regressions. | ||
Manual benchmarking / new automated benchmarks will be required for the last two items (new PS PageCache size, avoiding regressions). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's specify what these benchmarks should be, to avoid open-ended benchmarking work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this open because I can imagine this won't be sufficient for your taste, but ENOTIME at this point.
|
||
The **buffer pool** mentioned to above will be a load-bearing component. | ||
Its basic function is to provide callers with a memory buffer of adequate alignment and size (statx `Dio_mem_align` / `Dio_offset_align`). | ||
Callers `get()` a buffer from the pool. Size is specified at `get` time and is fixed (not growable). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the buffers are non-growable, but each buffer can have different size? I was thinking the buffer pool could give out buffers of the same size. Would this be easier to solve the alignment problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing some details, but are the buffer pool memory buffers allocated using the regular memory allocator or does it require some special mechanism?
For example, a 10 byte sized read or write to offset 5000 in a file will load the file contents | ||
at offset `[4096,8192)` into a free page in the kernel page cache. If necessary, it will evict | ||
other pages to make room (cf eviction). Then, the kernel performs a memory-to-memory copy of 10 bytes | ||
from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps | |
from/to the offset `904` (`5000 = 4096 + 904`) within the cached page. If it's a write, the kernel keeps |
each buffer with all thread-local executors. However, above API requirements for the buffer pool implicitly require the buffer | ||
handle that's returned by `get()` to be a custom smart pointer type. We will be able to extend it in the future to include the | ||
io_uring registered buffer index without having to touch the entire code base. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the call I was thinking that even with current day apis alloc::alloc::alloc
or GlobalAlloc::alloc
we are able to construct at runtime alloc::alloc::Layout
for properly aligned and sized buffers. With the likely move towards jemalloc, we might be able to get away with buffer pool == tokio::sync::Semaphore + jemalloc
(initially?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this idea, jemalloc is probably not a requirement. Pretty sure all allocators have thread local pools.
Together with the work stealing executor it might be there this has outweighed risk of one thread ending up doing all of the allocations, growing it's thread local pool, but then again, could be that on average it actually works out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarifications / discussions:
-
Virtual File will not be used for buffered IO anymore
- => tenant conf and similar should just use tokio::fs
- Arpad: concerns about not using VirtualFile fd cache
-
RFC is unclear about whether buffer pool buffers are all the same size or not
- Christian thinks we can get away with one single buffer / io size (8K)
- No Plan B for if we cannot (except using jemalloc, see "option C" in "Discussion" section below)
Discussion:
- Vlad: prefetch: do we currently rely on the kernel prefetch for perf?
- compaction, esp the upcomping k-merge compaction, most likely does
- Will it still be there with direct IO?
- fixed buffer pool size? What if we run out of memory?
- option A: "wait for free" mechanism has synchronization overhead and can deadlock
- option B: "fail with error, ask client to retry with back-off"
- this would probably be the correct approach for responding to overload
- page_service protocol doesn't provide this option, though
- option C: use jemalloc as the buffer pool impl, do not limit buffer pool size
- use jemalloc for both regular allocations and buffer pool
- PS DRAM consumption would then be: fixed PageCache + jemalloc
- => this seems attractive because it punts on drawbacks of (A) and (B) and de-risks OOMs
- Vlad: plans for new metrics such as queue depth?
part of #8184 # Problem We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader:read_blk` which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236). # Solution This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks. Other notable changes: * change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`. * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`. * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix. # Testing / Validation / Performance `MergeIterator` has not yet been used in production; it's being developed as part of * #8002 Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value. Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod: * with validation: * increased CPU usage * ~doubled VirtualFile read bytes/second metric * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read * without validation: * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCaceh as a shared evicting buffer) * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver) * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads) # Rollout The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in - Rust unit tests - Python tests - Nightly pagebench (shouldn't really matter) - Staging Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior: * neondatabase/infra#1663 # Interactions With Other Features This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240). # Future Work The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
part of #8184 # Problem We want to bypass PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader:read_blk` which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236). # Solution This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks. Other notable changes: * change the `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`. * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`. * Generally, `PageReadGuard`s aren't supposed to be held across await points, so, this is a general bugfix. # Testing / Validation / Performance `MergeIterator` has not yet been used in production; it's being developed as part of * #8002 Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item. If they're not identical, we log a warning / fail the unit/regression test. To avoid flooding the logs, we apply a global rate limit of once per 10 seconds. In any case, we use the existing approach's value. Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod: * with validation: * increased CPU usage * ~doubled VirtualFile read bytes/second metric * no change in disk IO usage because the kernel page cache will likely have the pages buffered on the second read * without validation: * slightly higher DRAM usage because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCaceh as a shared evicting buffer) * less disk IO if previously there were repeat PageCache misses (likely case on a busy production Pageserver) * lower CPU usage: PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads) # Rollout The new code is used with validation mode enabled-by-default. This gets us validation everywhere by default, specifically in - Rust unit tests - Python tests - Nightly pagebench (shouldn't really matter) - Staging Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior: * neondatabase/infra#1663 # Interactions With Other Features This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240). # Future Work The streaming k-merge's memory usage is proportional to the amount of memory per participating layer. But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`. Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction. Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
Rendered
refs #8130