Epic: Bypass PageCache for user data blocks #7386

jcsp · 2024-04-15T13:44:24Z

From experiments, we believe that use of the userspace PageCache for user data blocks is a net-negative:

Imposes significant synchronization overhead on getpage requests using the shared cache
Displaces higher-value data (layer indices) from the cache
Very low hit rate at scale when repeated accesses are usually absorbed by the postgres buffer cache.

Ideas / Initial Plan

Mini-cache (~2 pages) on each request to enable common I/O patterns that read several sequential deltas
Enable readers to optionally skip the page cache
Pageserver config property to use the new I/O path
Add/modify any metrics as needed to demonstrate the before/after behavior: we should see higher peak RPS and a higher cache hit rate for index pages.
Remove caching of userspace pages.

Execution

High-Level

Give feedback

bypass PageCache for data blocks of delta and image layers => done by Vlad's work for get_impl=vectored
remove materialized page cache #8105
bypass PageCache for L0 flush #7418

18 of 18

a/tech_debt c/storage/pageserver
bypass PageCache for compact_level0_phase1 #8184

21 of 21
bypass PageCache for InMemoryLayer::get_values_reconstruct_data #8183

3 of 4
Options

The text was updated successfully, but these errors were encountered:

jcsp · 2024-04-29T13:55:09Z

Next step:

try removing materialized page cache & establish the impact -- what else needs to change?

refs #6107 refs #7386

…methods (#7566) This PR introduces no functional changes. The read path will be done separately. refs #6107 refs #7386

problame · 2024-05-06T14:07:55Z

this week: plumb through RequestContext on read path

Also, Vlad informed me that the switch to vectored get for all Timeline::get means that we'll stop using the PageCache for user data blocks. It's starting to roll out into prod this week.

…methods (#7566) This PR introduces no functional changes. The read path will be done separately. refs #6107 refs #7386

…ethods This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

…ethods This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

…ethods (#7725) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386 Co-authored-by: Joonas Koivunen <[email protected]>

problame · 2024-05-13T13:31:20Z

This week:

conclude VirtualFile RequestContext'ification
- slab-allocated RequestContext
- include Tenant/Shard/Timeline in RequestContext
- use that for VirtualFile metrics
review whether/where PS-PageCache-ing of data blocks is still relevant
revise plan

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

…ethods (#7725) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386 Co-authored-by: Joonas Koivunen <[email protected]>

part of Epic #7386 # Motivation The materialized page cache adds complexity to the code base, which increases the maintenance burden and risk for subtle and hard to reproduce bugs such as #8050. Further, the best hit rate that we currently achieve in production is ca 1% of materialized page cache lookups for `task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%. Last, caching page images in Pageserver rewards under-sized caches in Computes because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver. Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459 # Changes This PR removes the materialized page cache code & metrics. The infrastructure for different key kinds in `PageCache` is left in place, even though the "Immutable" key kind is the only remaining one. This can be further simplified in a future commit. Some tests started failing because their total runtime was dependent on high materialized page cache hit rates. This test makes them fixed-runtime or raises pytest timeouts: * test_local_file_cache_unlink * test_physical_replication * test_pg_regress # Performance I focussed on ensuring that this PR will not result in a performance regression in prod. * **getpage** requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images, it should happen in compute. * **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics show 0 percent hit rate, so, removing will not be a regression. * **get_lsn_by_timestamp**: important API for branch creation, used by control pane. The clog pages that this code uses are not materialize-page-cached because they're not 8k. No risk of introducing a regression here. We will watch the various nightly benchmarks closely for more results before shipping to prod.

problame · 2024-06-21T12:41:41Z

Update:

we removed the materialized page cache remove materialized page cache #8105
the remaining data pages that are still in the PageCache is the InMemoryLayer pages
- starting point for investigation: this panel here as well as the entire "Drill down into Content Kind" section in that dashboard
- NB: above link compares the nightly pgbench run in staging the day before materialized page cache was removed and the day after. Prod data looks likely different.

problame · 2024-06-21T12:58:42Z

This time range here is instructive: in this time range, there is a high access rate coming from LayerFlushTask for InMemoryLayers which is displacing DeltaLayerBtreeNode layers which causes cache misses for WalReceiverConnectionHandler and to a lesser extent Compaction.

https://neonprod.grafana.net/d/fe7f056c-3ee1-49ef-a08d-e66055099396/pageserver-page-cache?orgId=1&from=1718937864602&to=1718957534079&var-datasource=HUNg6jvVk&var-hit_rate_drill_task_kind=WalReceiverConnectionHandler&var-hit_rate_drill_content_kind=InMemoryLayer&var-adhoc=neon_region%7C%3D%7Cus-east-2&var-adhoc=instance%7C%3D%7Cpageserver-4.us-east-2.aws.neon.build

problame · 2024-06-21T13:02:17Z

If it's just the LayerFlushTask perhaps this is the time to implement

bypass PageCache for L0 flush #7418

part of Epic #7386 # Motivation The materialized page cache adds complexity to the code base, which increases the maintenance burden and risk for subtle and hard to reproduce bugs such as #8050. Further, the best hit rate that we currently achieve in production is ca 1% of materialized page cache lookups for `task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%. Last, caching page images in Pageserver rewards under-sized caches in Computes because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver. Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459 # Changes This PR removes the materialized page cache code & metrics. The infrastructure for different key kinds in `PageCache` is left in place, even though the "Immutable" key kind is the only remaining one. This can be further simplified in a future commit. Some tests started failing because their total runtime was dependent on high materialized page cache hit rates. This test makes them fixed-runtime or raises pytest timeouts: * test_local_file_cache_unlink * test_physical_replication * test_pg_regress # Performance I focussed on ensuring that this PR will not result in a performance regression in prod. * **getpage** requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images, it should happen in compute. * **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics show 0 percent hit rate, so, removing will not be a regression. * **get_lsn_by_timestamp**: important API for branch creation, used by control pane. The clog pages that this code uses are not materialize-page-cached because they're not 8k. No risk of introducing a regression here. We will watch the various nightly benchmarks closely for more results before shipping to prod.

problame · 2024-07-01T13:37:13Z

This week:

merge & benchmark bypass PageCache for L0 flush #7418
start working on bypass PageCache for InMemoryLayer::get_values_reconstruct_data #8183

problame · 2024-07-31T15:17:17Z

Status update:

gradually rolling into prod: bypass PageCache for L0 flush #7418
scouted bypass PageCache for InMemoryLayer::get_values_reconstruct_data #8183
- seems like it's better done after bypass PageCache for compact_level0_phase1 #8184
bypass PageCache for compact_level0_phase1 #8184
- coding done
- validation in staging, will build confidence over the next 1.5 weeks

## Problem We have been maintaining two read paths (legacy and vectored) for a while now. The legacy read-path was only used for cross validation in some tests. ## Summary of changes * Tweak all tests that were using the legacy read path to use the vectored read path instead * Remove the read path dispatching based on the pageserver configs * Remove the legacy read path code We will be able to remove the single blob io code in `pageserver/src/tenant/blob_io.rs` when #7386 is complete. Closes #8005

…lush (#8537) Part of [Epic: Bypass PageCache for user data blocks](#7386). # Problem `InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`. # Background Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`. The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offset, we find a varint length followed by the serialized `Value`. Vectored reads (`get_values_reconstruct_data`) are not in fact vectored - each `Value` that needs to be read is read sequentially. The `will_init` bit of information which we use to early-exit the `get_values_reconstruct_data` for a given key is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`. The L0 flushing **also** needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush. # Changes 1. Store the value length and `will_init` information in the `InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the values. 2. For `get_values_reconstruct_data`: - Use the in-memory `index` figures out which values need to be read. Having the `will_init` stored in the index enables us to do that. - View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO. - Figure out which chunks need to be read to retrieve the serialized bytes for thes values we need to read. - Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk. - Merge adjacent chunk reads into larger `EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable constant). 3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer. 4. The L0 flush code is changed to use the `index` directly, `blob_io` 5. We can remove the `ephemeral_file::page_caching` construct now. The `get_values_reconstruct_data` changes seem like a bit overkill but they are necessary so we issue the equivalent amount of read system calls compared to before this PR where it was highly likely that even if the first PageCache access was a miss, remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit. The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on [direct IO](#8130) and #8719 specifically, we need to do _something_ like this anyways in the near future. # Alternative Design The original plan was to use the `vectored_blob_io` code it relies on the invariant of Delta&Image layers that `index order == values order`. Further, `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done, specifically, if multiple reads map to the same "DIO chunk" (=alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer. => #8719 # Testing / Performance Correctness of the IO merging code is ensured by unit tests. Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`. Performance testing results are presented below. All pref testing done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`. We see definitive improvement in ingest performance microbenchmark and an ad-hoc microbenchmark for getpage against InMemoryLayer. ``` baseline: commit 7c74112 origin/main HEAD: ef1c55c ``` <details> ``` cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta' baseline ingest-small-values/ingest 128MB/100b seq, no delta time: [483.50 ms 498.73 ms 522.53 ms] thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s] HEAD ingest-small-values/ingest 128MB/100b seq, no delta time: [479.22 ms 482.92 ms 487.35 ms] thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s] ``` </details> We don't have a micro-benchmark for InMemoryLayer and it's quite cumbersome to add one. So, I did manual testing in `neon_local`. <details> ``` ./target/release/neon_local stop rm -rf .neon ./target/release/neon_local init ./target/release/neon_local start ./target/release/neon_local tenant create --set-default ./target/release/neon_local endpoint create foo ./target/release/neon_local endpoint start foo psql 'postgresql://[email protected]:55432/postgres' psql (13.16 (Debian 13.16-0+deb11u1), server 15.7) CREATE TABLE wal_test ( id SERIAL PRIMARY KEY, data TEXT ); DO $$ DECLARE i INTEGER := 1; BEGIN WHILE i <= 500000 LOOP INSERT INTO wal_test (data) VALUES ('data'); i := i + 1; END LOOP; END $$; -- => result is one L0 from initdb and one 137M-sized ephemeral-2 DO $$ DECLARE i INTEGER := 1; random_id INTEGER; random_record wal_test%ROWTYPE; start_time TIMESTAMP := clock_timestamp(); selects_completed INTEGER := 0; min_id INTEGER := 1; -- Minimum ID value max_id INTEGER := 100000; -- Maximum ID value, based on your insert range iters INTEGER := 100000000; -- Number of iterations to run BEGIN WHILE i <= iters LOOP -- Generate a random ID within the known range random_id := min_id + floor(random() * (max_id - min_id + 1))::int; -- Select the row with the generated random ID SELECT * INTO random_record FROM wal_test WHERE id = random_id; -- Increment the select counter selects_completed := selects_completed + 1; -- Check if a second has passed IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN -- Print the number of selects completed in the last second RAISE NOTICE 'Selects completed in last second: %', selects_completed; -- Reset counters for the next second selects_completed := 0; start_time := clock_timestamp(); END IF; -- Increment the loop counter i := i + 1; END LOOP; END $$; ./target/release/neon_local stop baseline: commit 7c74112 origin/main NOTICE: Selects completed in last second: 1864 NOTICE: Selects completed in last second: 1850 NOTICE: Selects completed in last second: 1851 NOTICE: Selects completed in last second: 1918 NOTICE: Selects completed in last second: 1911 NOTICE: Selects completed in last second: 1879 NOTICE: Selects completed in last second: 1858 NOTICE: Selects completed in last second: 1827 NOTICE: Selects completed in last second: 1933 ours NOTICE: Selects completed in last second: 1915 NOTICE: Selects completed in last second: 1928 NOTICE: Selects completed in last second: 1913 NOTICE: Selects completed in last second: 1932 NOTICE: Selects completed in last second: 1846 NOTICE: Selects completed in last second: 1955 NOTICE: Selects completed in last second: 1991 NOTICE: Selects completed in last second: 1973 ``` NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller. </details> # Rollout This PR changes the code in-place and is not gated by a feature flag.

jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Apr 15, 2024

jcsp assigned problame Apr 15, 2024

problame added a commit that referenced this issue Apr 30, 2024

chore: plumb through RequestContext to VirtualFile write methods

35e41b1

refs #6107 refs #7386

problame mentioned this issue Apr 30, 2024

chore(pageserver): plumb through RequestContext to VirtualFile write methods #7566

Merged

problame added a commit that referenced this issue May 2, 2024

chore(pageserver): plumb through RequestContext to VirtualFile write …

45ec868

…methods (#7566) This PR introduces no functional changes. The read path will be done separately. refs #6107 refs #7386

conradludgate pushed a commit that referenced this issue May 8, 2024

chore(pageserver): plumb through RequestContext to VirtualFile write …

65c0641

…methods (#7566) This PR introduces no functional changes. The read path will be done separately. refs #6107 refs #7386

problame added a commit that referenced this issue May 13, 2024

chore(pageserver): plumb through RequestContext to VirtualFile read m…

802f51f

…ethods This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

problame mentioned this issue May 13, 2024

chore(pageserver): plumb through RequestContext to VirtualFile read methods #7720

Merged

problame added a commit that referenced this issue May 13, 2024

chore(pageserver): plumb through RequestContext to VirtualFile read m…

b58a615

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

problame added a commit that referenced this issue May 13, 2024

chore(pageserver): plumb through RequestContext to VirtualFile open m…

2cf8642

…ethods This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

problame mentioned this issue May 13, 2024

chore(pageserver): plumb through RequestContext to VirtualFile open methods #7725

Merged

a-masterov pushed a commit that referenced this issue May 20, 2024

chore(pageserver): plumb through RequestContext to VirtualFile read m…

24cef57

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

problame mentioned this issue Jun 18, 2024

remove materialized page cache #8105

Merged

problame mentioned this issue Jun 27, 2024

bypass PageCache for InMemoryLayer::get_values_reconstruct_data #8183

Closed

VladLazar mentioned this issue Aug 5, 2024

pageserver: remove legacy read path #8601

Merged

5 tasks

problame mentioned this issue Aug 15, 2024

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: Bypass PageCache for user data blocks #7386

Epic: Bypass PageCache for user data blocks #7386

jcsp commented Apr 15, 2024 •

edited by problame

Loading

High-Level

jcsp commented Apr 29, 2024

problame commented May 6, 2024 •

edited

Loading

problame commented May 13, 2024

problame commented Jun 21, 2024

problame commented Jun 21, 2024

problame commented Jun 21, 2024 •

edited

Loading

problame commented Jul 1, 2024

problame commented Jul 31, 2024

Epic: Bypass PageCache for user data blocks #7386

Epic: Bypass PageCache for user data blocks #7386

Comments

jcsp commented Apr 15, 2024 • edited by problame Loading

Ideas / Initial Plan

Execution

High-Level

jcsp commented Apr 29, 2024

problame commented May 6, 2024 • edited Loading

problame commented May 13, 2024

problame commented Jun 21, 2024

problame commented Jun 21, 2024

problame commented Jun 21, 2024 • edited Loading

problame commented Jul 1, 2024

problame commented Jul 31, 2024

jcsp commented Apr 15, 2024 •

edited by problame

Loading

problame commented May 6, 2024 •

edited

Loading

problame commented Jun 21, 2024 •

edited

Loading