compaction_level0_phase1: bypass PS PageCache for data blocks #8543
Conversation
3150 tests run: 3029 passed, 0 failed, 121 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results.
296a694 at 2024-07-31T09:54:42.125Z :recycle:
So, interesting test failures: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8543/10147986168/index.html
Looking at the test code (neon/test_runner/regress/test_compaction.py, lines 39 to 43 at c96e801):
I think we run out of PageCache slots because each delta layer's iterator holds one `PageReadGuard` (neon/pageserver/src/tenant/disk_btree.rs, lines 301 to 303 at a4434cf).
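A minimal sketch of the pattern at issue and of the fix this PR adopts (buffering the node), using simplified, hypothetical stand-in types rather than the real `disk_btree.rs` / `block_io.rs` code: if the node stays borrowed from the page-cache read guard, the guard keeps one cache slot pinned for as long as the iterator lives, including across await points; copying the node into an owned buffer lets the guard drop immediately.

```rust
// Hypothetical, simplified stand-ins; not the actual pageserver types.
struct PageReadGuard<'a>(&'a [u8; 8192]); // pins one PageCache slot while alive

/// Before: the node is a borrow out of the guard, so each delta layer's
/// iterator keeps its guard (and its cache slot) alive for its whole
/// lifetime, including across `.await`s inside the stream.
fn node_borrowed<'a>(guard: &'a PageReadGuard<'a>) -> &'a [u8] {
    &guard.0[..]
}

/// After: copy the node bytes into an owned buffer and drop the guard right
/// away; the cache slot is released before the stream ever awaits.
fn node_buffered(guard: PageReadGuard<'_>) -> Box<[u8; 8192]> {
    let owned = Box::new(*guard.0); // one 8 KiB copy per visited node
    drop(guard); // slot released here
    owned
}

fn main() {
    static PAGE: [u8; 8192] = [0u8; 8192];
    let node = node_buffered(PageReadGuard(&PAGE));
    println!("buffered {} bytes; no slot stays pinned", node.len());
    let guard = PageReadGuard(&PAGE);
    let _borrowed = node_borrowed(&guard); // guard must outlive every use of this
}
```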
This is a trade-off, but it's probably the right trade-off at this time. See comments added for details.
Using the k-merge iterator, the same-key handling looks correct under the original logic. Also, as a runtime sanity check, we can assert that the key-LSNs from `all_values_iter` come out in exactly the same order as what we had before (`all_keys`).
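A minimal sketch of that sanity check, with hypothetical types and names (`KeyLsn` and `assert_same_order` are illustrative; the real iterators yield pageserver `Key`/`Lsn` values):

```rust
/// Hypothetical key/LSN pair; stands in for pageserver's Key and Lsn types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct KeyLsn {
    key: u64,
    lsn: u64,
}

/// Assert that the streaming k-merge yields key-LSNs in exactly the same
/// order as the previously materialized `all_keys` listing.
fn assert_same_order(all_keys: &[KeyLsn], all_values_iter: impl Iterator<Item = KeyLsn>) {
    let mut expected = all_keys.iter().copied();
    for (i, got) in all_values_iter.enumerate() {
        let want = expected
            .next()
            .unwrap_or_else(|| panic!("k-merge yielded more items than all_keys at index {i}"));
        assert_eq!(got, want, "order mismatch at index {i}");
    }
    assert!(expected.next().is_none(), "k-merge yielded fewer items than all_keys");
}

fn main() {
    let all_keys = vec![
        KeyLsn { key: 1, lsn: 10 },
        KeyLsn { key: 1, lsn: 20 },
        KeyLsn { key: 2, lsn: 5 },
    ];
    assert_same_order(&all_keys, all_keys.clone().into_iter());
    println!("orders match");
}
```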
Testing plan LGTM.
Blocked on e2e failures: https://neondb.slack.com/archives/C059ZC138NR/p1722420565719369
Post-merge testing / validation update:
**CI pipeline duration**

No impact beyond noise on regress tests duration:
Before: https://github.com/neondatabase/neon/actions/runs/10168623495/job/28124265193
After: https://github.com/neondatabase/neon/actions/runs/10180033285/job/28157886382

**Staging basic testing**

**Before**

Setup: pump a bunch of data into a fresh staging project with 2 vCPUs. Pumping a lot of data creates compaction work for the pageserver. Watch the pageserver's Page Cache dashboard before and after deploying this PR.

To create a bunch of compaction work, I ran:
It ran from 11:58 UTC to +850s = 12:12 UTC. However, the above pgbench ingests data faster than compaction can process it, and we don't have compaction backpressure, so the pageserver keeps compacting until 2024-07-31 13:42:32.909. The Grafana logs from that compaction run are here. PageCache dashboard during this period:

**After**

After the PR merged (as part of a release tag), I deleted the above project to stop the image layer creations. Then I created a new staging project and repeated the above pgbench against it.
Started at ~14:02, ended after 825s at ~14:15. Staging has validation mode enabled, so we'd expect the PageCache footprint to still show up; we laid out the expectations in the PR description. Same pageserver (pageserver-23). The compactions do take a bit longer, about 3.5 min before vs. about 4.5 min after. Can't identify the expected changes in the PageCache dashboard.
That could be due to the fact that there was concurrently running load on pageserver-23 during the "before" run, though.
part of #8184
# Problem

We want to bypass the PS PageCache for all data block reads, but `compact_level0_phase1` currently uses `ValueRef::load` to load the WAL records from delta layers. Internally, that maps to `FileBlockReader::read_blk`, which hits the PageCache [here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).
# Solution

This PR adds a mode for `compact_level0_phase1` that uses the `MergeIterator` for reading the `Value`s from the delta layer files. `MergeIterator` is a streaming k-merge that uses vectored blob_io under the hood, which bypasses the PS PageCache for data blocks (see the sketch after the list below).

Other notable changes:
* Change `DiskBtreeReader::into_stream` to buffer the node, instead of holding a `PageCache` `PageReadGuard`.
  * Without this, we run out of page cache slots in `test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await points, so this is a general bugfix.
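A minimal sketch of the streaming k-merge idea, with hypothetical types (the real `MergeIterator` reads delta layer files via vectored blob_io; none of the names below are the actual API): each participating layer contributes an already-sorted iterator with its own buffered input, and the merge yields (key, lsn, value) in global order without going through a shared page cache.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical item shape; the real MergeIterator yields pageserver
/// Key/Lsn/Value types read via vectored blob_io.
type Item = (u64, u64, &'static str); // (key, lsn, value)

/// Streaming k-merge over per-layer sorted iterators. State is one buffered
/// item per layer (plus each layer's own read buffer), not a shared evicting
/// page cache.
struct KMerge<I: Iterator<Item = Item>> {
    // Min-heap of (next item, layer index), smallest (key, lsn) first.
    heap: BinaryHeap<Reverse<(Item, usize)>>,
    layers: Vec<I>,
}

impl<I: Iterator<Item = Item>> KMerge<I> {
    fn new(mut layers: Vec<I>) -> Self {
        let mut heap = BinaryHeap::new();
        for (idx, layer) in layers.iter_mut().enumerate() {
            if let Some(item) = layer.next() {
                heap.push(Reverse((item, idx)));
            }
        }
        Self { heap, layers }
    }
}

impl<I: Iterator<Item = Item>> Iterator for KMerge<I> {
    type Item = Item;
    fn next(&mut self) -> Option<Item> {
        let Reverse((item, idx)) = self.heap.pop()?;
        // Refill from the layer we just consumed from.
        if let Some(next) = self.layers[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(item)
    }
}

fn main() {
    let layer_a = vec![(1, 10, "a@10"), (2, 5, "b@5")].into_iter();
    let layer_b = vec![(1, 20, "a@20"), (3, 7, "c@7")].into_iter();
    let merged: Vec<Item> = KMerge::new(vec![layer_a, layer_b]).collect();
    assert_eq!(
        merged,
        vec![(1, 10, "a@10"), (1, 20, "a@20"), (2, 5, "b@5"), (3, 7, "c@7")]
    );
}
```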
# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing approach's value iterator with the new approach's stream output, item by item (sketched below).
If they're not identical, we log a warning / fail the unit/regression test.
To avoid flooding the logs, we apply a global rate limit of once per 10 seconds.
In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly benchmarks / eventually pre-prod:
* With validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage, because the kernel page cache will likely have the pages buffered on the second read
* Without validation:
  * slightly higher DRAM usage, because each iterator participating in the k-merge has a dedicated buffer (as opposed to before, where compactions would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (the likely case on a busy production pageserver)
  * lower CPU usage: with the PageCache out of the picture, fewer syscalls are made (vectored blob io batches reads)
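A minimal sketch of the validation mode's shape, with hypothetical names throughout (`Item`, `RateLimit`, and `validated` are illustrative; the real code compares `all_values_iter` against the `MergeIterator` stream and uses the pageserver's own rate limiter): compare item by item, warn at most once per 10 seconds, and always hand the existing approach's value onward.

```rust
use std::time::{Duration, Instant};

/// Hypothetical item type; the real iterators yield (Key, Lsn, Value).
type Item = (u64, u64, Vec<u8>);

/// Warn at most once per `period`; a stand-in for the pageserver's rate limiter.
struct RateLimit {
    period: Duration,
    last: Option<Instant>,
}

impl RateLimit {
    fn new(period: Duration) -> Self {
        Self { period, last: None }
    }
    fn call(&mut self, f: impl FnOnce()) {
        let now = Instant::now();
        if self.last.map_or(true, |t| now.duration_since(t) >= self.period) {
            self.last = Some(now);
            f();
        }
    }
}

/// Yield the existing approach's items, comparing each against the new
/// stream and emitting a rate-limited warning (failing debug/test builds)
/// on any mismatch.
fn validated<'a>(
    existing: impl Iterator<Item = Item> + 'a,
    mut new_stream: impl Iterator<Item = Item> + 'a,
    rate_limit: &'a mut RateLimit,
) -> impl Iterator<Item = Item> + 'a {
    existing.map(move |old| {
        match new_stream.next() {
            Some(ref new) if *new == old => {}
            other => rate_limit.call(|| {
                eprintln!("validation mismatch: old={old:?} new={other:?}");
                debug_assert!(false, "compaction validation failed");
            }),
        }
        old // in any case, use the existing approach's value
    })
}

fn main() {
    let items = vec![(1, 10, vec![0xde]), (2, 20, vec![0xad])];
    let mut rl = RateLimit::new(Duration::from_secs(10));
    let out: Vec<Item> = validated(items.clone().into_iter(), items.into_iter(), &mut rl).collect();
    assert_eq!(out.len(), 2);
}
```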
# Rollout

The new code is used with validation mode enabled by default.
This gets us validation everywhere by default, specifically in:
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that configures prod to continue using the existing behavior:
* neondatabase/infra#1663
# Interactions With Other Features
This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).
# Future Work

The streaming k-merge's memory usage scales with the number of participating layers, since each participating iterator holds a dedicated buffer.
But `compact_level0_phase1` still loads all keys into memory for `all_keys_iter`.
Thus, it continues to have active memory usage proportional to the number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys iterator. This PR has a draft of that in its first commit, which I later reverted because it's not necessary to achieve the goal of this PR / issue #8184.
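A rough sketch of that direction, with hypothetical types (keys are plain `u64` here; the real code uses pageserver key types): the point is only the change in memory shape, from materializing every key up front to streaming them with per-layer state. A real implementation would merge the already-sorted per-layer iterators in key order, like the `MergeIterator` sketch above, rather than just chaining them.

```rust
/// Today: every key from every participating layer is materialized up front,
/// so peak memory grows with the number of keys in the compaction.
fn all_keys_collected(per_layer: Vec<Vec<u64>>) -> Vec<u64> {
    let mut keys: Vec<u64> = per_layer.into_iter().flatten().collect();
    keys.sort_unstable();
    keys
}

/// Future work: hand the planning code a stream instead, so memory is
/// bounded by per-layer iterator state rather than the total key count.
/// (`flatten` only concatenates; a real version would k-merge by key.)
fn all_keys_streamed(per_layer: Vec<std::vec::IntoIter<u64>>) -> impl Iterator<Item = u64> {
    per_layer.into_iter().flatten()
}

fn main() {
    let layers = vec![vec![1u64, 3, 5], vec![2, 4, 6]];
    let collected = all_keys_collected(layers.clone());
    let streamed: Vec<u64> =
        all_keys_streamed(layers.into_iter().map(|l| l.into_iter()).collect()).collect();
    assert_eq!(collected.len(), streamed.len());
}
```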