compaction_level0_phase1: bypass PS PageCache for data blocks #8543
Conversation
3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results.
296a694 at 2024-07-31T09:54:42.125Z :recycle:
So, interesting test failures: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8543/10147986168/index.html
Looking at the test code (neon/test_runner/regress/test_compaction.py, lines 39 to 43 at c96e801):
I think we run out of PageCache slots because each delta layer's iterator holds one `PageReadGuard` (neon/pageserver/src/tenant/disk_btree.rs, lines 301 to 303 at a4434cf).
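A minimal sketch of the pattern at issue and of the fix this PR adopts (buffering the node), using simplified, hypothetical stand-in types rather than the real `disk_btree.rs` / `block_io.rs` code: if the node stays borrowed from the page-cache read guard, the guard keeps one cache slot pinned for as long as the iterator lives, including across await points; copying the node into an owned buffer lets the guard drop immediately.

```rust
// Hypothetical, simplified stand-ins; not the actual pageserver types.
struct PageReadGuard<'a>(&'a [u8; 8192]); // pins one PageCache slot while alive

/// Before: the node is a borrow out of the guard, so each delta layer's
/// iterator keeps its guard (and its cache slot) alive for its whole
/// lifetime, including across `.await`s inside the stream.
fn node_borrowed<'a>(guard: &'a PageReadGuard<'a>) -> &'a [u8] {
    &guard.0[..]
}

/// After: copy the node bytes into an owned buffer and drop the guard right
/// away; the cache slot is released before the stream ever awaits.
fn node_buffered(guard: PageReadGuard<'_>) -> Box<[u8; 8192]> {
    let owned = Box::new(*guard.0); // one 8 KiB copy per visited node
    drop(guard); // slot released here
    owned
}

fn main() {
    static PAGE: [u8; 8192] = [0u8; 8192];
    let node = node_buffered(PageReadGuard(&PAGE));
    println!("buffered {} bytes; no slot stays pinned", node.len());
    let guard = PageReadGuard(&PAGE);
    let _borrowed = node_borrowed(&guard); // guard must outlive every use of this
}
```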
This is a trade-off, but it's probably the right trade-off at this time. See comments added for details.
Using the k-merge iterator, the same-key handling looks correct under the original logic. Also, as a runtime sanity check, we can assert that the key-LSNs from `all_values_iter` come out in exactly the same order as what we had before (`all_keys`).
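A minimal sketch of that sanity check, with hypothetical types and names (`KeyLsn` and `assert_same_order` are illustrative; the real iterators yield pageserver `Key`/`Lsn` values):

```rust
/// Hypothetical key/LSN pair; stands in for pageserver's Key and Lsn types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct KeyLsn {
    key: u64,
    lsn: u64,
}

/// Assert that the streaming k-merge yields key-LSNs in exactly the same
/// order as the previously materialized `all_keys` listing.
fn assert_same_order(all_keys: &[KeyLsn], all_values_iter: impl Iterator<Item = KeyLsn>) {
    let mut expected = all_keys.iter().copied();
    for (i, got) in all_values_iter.enumerate() {
        let want = expected
            .next()
            .unwrap_or_else(|| panic!("k-merge yielded more items than all_keys at index {i}"));
        assert_eq!(got, want, "order mismatch at index {i}");
    }
    assert!(expected.next().is_none(), "k-merge yielded fewer items than all_keys");
}

fn main() {
    let all_keys = vec![
        KeyLsn { key: 1, lsn: 10 },
        KeyLsn { key: 1, lsn: 20 },
        KeyLsn { key: 2, lsn: 5 },
    ];
    assert_same_order(&all_keys, all_keys.clone().into_iter());
    println!("orders match");
}
```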
Testing plan LGTM.
Blocked on e2e failures: https://neondb.slack.com/archives/C059ZC138NR/p1722420565719369
Post-merge testing / validation update:
**CI pipeline duration**

No impact beyond noise on regress tests duration:
Before: https://github.com/neondatabase/neon/actions/runs/10168623495/job/28124265193
After: https://github.com/neondatabase/neon/actions/runs/10180033285/job/28157886382

**Staging basic testing**

**Before**

Setup: pump a bunch of data into a fresh staging project with 2 vCPUs. Pumping a lot of data creates compaction work for the pageserver. Watch the pageserver's Page Cache dashboard before and after deploying this PR.

To create a bunch of compaction work, I ran:
It ran from 11:58 UTC to +850s = 12:12 UTC. However, the above pgbench ingests data faster than compaction can process it, and we don't have compaction backpressure, so the pageserver keeps compacting until 2024-07-31 13:42:32.909. The Grafana logs from that compaction run are here. PageCache dashboard during this period:

**After**

After the PR merged (as part of a release tag), I deleted the above project to stop the image layer creations. Then I created a new staging project and repeated the above pgbench against it.
Started at ~14:02, ended after 825s at ~14:15. Staging has validation mode enabled, so we'd expect the PageCache footprint to still show up; we laid out the expectations in the PR description. Same pageserver (pageserver-23). The compactions do take a bit longer, about 3.5 min before vs. about 4.5 min after. Can't identify the expected changes in the PageCache dashboard.
That could be due to the fact that there was concurrently running load on pageserver-23 during the "before" run, though.
part of #8184
# Problem

We want to bypass the PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader::read_blk`, which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).
# Solution

This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks (see the sketch after the list below).

Other notable changes:
* Change `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`.
  * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await points, so this is a general bugfix.
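A minimal sketch of the streaming k-merge idea, with hypothetical types (the real `MergeIterator` reads delta layer files via vectored blob_io; none of the names below are the actual API): each participating layer contributes an already-sorted iterator with its own buffered input, and the merge yields (key, lsn, value) in global order without going through a shared page cache.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical item shape; the real MergeIterator yields pageserver
/// Key/Lsn/Value types read via vectored blob_io.
type Item = (u64, u64, &'static str); // (key, lsn, value)

/// Streaming k-merge over per-layer sorted iterators. State is one buffered
/// item per layer (plus each layer's own read buffer), not a shared evicting
/// page cache.
struct KMerge<I: Iterator<Item = Item>> {
    // Min-heap of (next item, layer index), smallest (key, lsn) first.
    heap: BinaryHeap<Reverse<(Item, usize)>>,
    layers: Vec<I>,
}

impl<I: Iterator<Item = Item>> KMerge<I> {
    fn new(mut layers: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        for (idx, layer) in layers.iter_mut().enumerate() {
            if let Some(item) = layer.next() {
                heap.push(Reverse((item, idx)));
            }
        }
        Self { heap, layers }
    }
}

impl<I: Iterator<Item = Item>> Iterator for KMerge<I> {
    type Item = Item;
    fn next(&mut self) -> Option<Item> {
        let Reverse((item, idx)) = self.heap.pop()?;
        // Refill from the layer we just consumed from.
        if let Some(next) = self.layers[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(item)
    }
}

fn main() {
    let layer_a = vec![(1, 10, "a@10"), (2, 5, "b@5")].into_iter();
    let layer_b = vec![(1, 20, "a@20"), (3, 7, "c@7")].into_iter();
    let merged: Vec<Item> = KMerge::new(vec![layer_a, layer_b]).collect();
    assert_eq!(
        merged,
        vec![(1, 10, "a@10"), (1, 20, "a@20"), (2, 5, "b@5"), (3, 7, "c@7")]
    );
}
```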
# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item (sketched below).
If they're not identical, we log a warning / fail the unit/regression test.
To avoid flooding the logs, we apply a global rate limit of once per 10 seconds.
In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod:
* With validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage, because the kernel page cache will likely have the pages buffered on the second read
* Without validation:
  * slightly higher DRAM usage, because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (the likely case on a busy production pageserver)
  * lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)
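A minimal sketch of the validation mode's shape, with hypothetical names throughout (`Item`, `RateLimit`, and `validated` are illustrative; the real code compares `all_values_iter` against the `MergeIterator` stream and uses the pageserver's own rate limiter): compare item by item, warn at most once per 10 seconds, and always hand the existing approach's value onward.

```rust
use std::time::{Duration, Instant};

/// Hypothetical item type; the real iterators yield (Key, Lsn, Value).
type Item = (u64, u64, Vec<u8>);

/// Warn at most once per `period`; a stand-in for the pageserver's rate limiter.
struct RateLimit {
    period: Duration,
    last: Option<Instant>,
}

impl RateLimit {
    fn new(period: Duration) -> Self {
        Self { period, last: None }
    }
    fn call(&mut self, f: impl FnOnce()) {
        let now = Instant::now();
        if self.last.map_or(true, |t| now.duration_since(t) >= self.period) {
            self.last = Some(now);
            f();
        }
    }
}

/// Yield the existing approach's items, comparing each against the new
/// stream and emitting a rate-limited warning (failing debug/test builds)
/// on any mismatch.
fn validated<'a>(
    existing: impl Iterator<Item = Item> + 'a,
    mut new_stream: impl Iterator<Item = Item> + 'a,
    rate_limit: &'a mut RateLimit,
) -> impl Iterator<Item = Item> + 'a {
    existing.map(move |old| {
        match new_stream.next() {
            Some(ref new) if *new == old => {}
            other => rate_limit.call(|| {
                eprintln!("validation mismatch: old={old:?} new={other:?}");
                debug_assert!(false, "compaction validation failed");
            }),
        }
        old // in any case, use the existing approach's value
    })
}

fn main() {
    let items = vec![(1, 10, vec![0xde]), (2, 20, vec![0xad])];
    let mut rl = RateLimit::new(Duration::from_secs(10));
    let out: Vec<Item> = validated(items.clone().into_iter(), items.into_iter(), &mut rl).collect();
    assert_eq!(out.len(), 2);
}
```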
# Rollout

The new code is used with validation mode enabled by default.
This gets us validation everywhere by default, specifically in:
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior:
* neondatabase/infra#1663
# Interactions With Other Features
This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).
# Future Work

The streaming k-merge's memory usage scales with the number of participating layers, since each participating iterator holds a dedicated buffer.
But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`.
Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft of that in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
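A rough sketch of that direction, with hypothetical types (keys are plain `u64` here; the real code uses pageserver key types): the point is only the change in memory shape, from materializing every key up front to streaming them with per-layer state. A real implementation would merge the already-sorted per-layer iterators in key order, like the `MergeIterator` sketch above, rather than just chaining them.

```rust
/// Today: every key from every participating layer is materialized up front,
/// so peak memory grows with the number of keys in the compaction.
fn all_keys_collected(per_layer: Vec<Vec<u64>>) -> Vec<u64> {
    let mut keys: Vec<u64> = per_layer.into_iter().flatten().collect();
    keys.sort_unstable();
    keys
}

/// Future work: hand the planning code a stream instead, so memory is
/// bounded by per-layer iterator state rather than the total key count.
/// (`flatten` only concatenates; a real version would k-merge by key.)
fn all_keys_streamed(per_layer: Vec<std::vec::IntoIter<u64>>) -> impl Iterator<Item = u64> {
    per_layer.into_iter().flatten()
}

fn main() {
    let layers = vec![vec![1u64, 3, 5], vec![2, 4, 6]];
    let collected = all_keys_collected(layers.clone());
    let streamed: Vec<u64> =
        all_keys_streamed(layers.into_iter().map(|l| l.into_iter()).collect()).collect();
    assert_eq!(collected.len(), streamed.len());
}
```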