Epic: Bypass PageCache for user data blocks #7386

Open
4 of 5 tasks
jcsp opened this issue Apr 15, 2024 · 8 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)

Comments

@jcsp
Collaborator

jcsp commented Apr 15, 2024

From experiments, we believe that use of the userspace PageCache for user data blocks is a net-negative:

  • Imposes significant synchronization overhead on getpage requests using the shared cache
  • Displaces higher-value data (layer indices) from the cache
  • Very low hit rate at scale, because repeated accesses are usually absorbed by the Postgres buffer cache.

Ideas / Initial Plan

  • Mini-cache (~2 pages) on each request to enable common I/O patterns that read several sequential deltas (see the sketch after this list)
  • Enable readers to optionally skip the page cache
  • Pageserver config property to use the new I/O path
  • Add/modify any metrics as needed to demonstrate the before/after behavior: we should see higher peak RPS and a higher cache hit rate for index pages.
  • Remove caching of userspace pages.
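A minimal sketch of the per-request mini-cache idea from the list above, assuming hypothetical names (`PageKey`, `RequestMiniCache`) rather than the actual pageserver types:

```
// Hypothetical sketch: a tiny, request-local cache of the last two pages read,
// so I/O patterns that read several sequential deltas from the same layer page
// do not re-issue reads, while the shared PageCache is bypassed entirely.

const PAGE_SZ: usize = 8192;

#[derive(Clone, Copy, PartialEq, Eq)]
struct PageKey {
    file_id: u64,
    block_no: u32,
}

struct RequestMiniCache {
    // Most-recently-used first; at most two slots.
    slots: Vec<(PageKey, Box<[u8; PAGE_SZ]>)>,
}

impl RequestMiniCache {
    fn new() -> Self {
        Self { slots: Vec::with_capacity(2) }
    }

    fn get(&mut self, key: PageKey) -> Option<&[u8; PAGE_SZ]> {
        let idx = self.slots.iter().position(|(k, _)| *k == key)?;
        // Move the hit to the front so the other slot becomes the eviction candidate.
        let entry = self.slots.remove(idx);
        self.slots.insert(0, entry);
        Some(&*self.slots[0].1)
    }

    fn put(&mut self, key: PageKey, page: Box<[u8; PAGE_SZ]>) {
        if self.slots.len() == 2 {
            self.slots.pop(); // evict the least-recently-used slot
        }
        self.slots.insert(0, (key, page));
    }
}
```

The reader opt-out could then be a flag on the read path, backed by the pageserver config property above, that routes layer reads straight to `VirtualFile` plus this mini-cache instead of the shared PageCache.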

Execution

High-Level

Tracked sub-tasks, all assigned to problame; recorded progress: 18 of 18, 21 of 21, and 3 of 4 items complete (labels include a/tech_debt and c/storage/pageserver).
jcsp added the t/feature and c/storage/pageserver labels on Apr 15, 2024
@jcsp
Collaborator Author

jcsp commented Apr 29, 2024

Next step:

  • try removing materialized page cache & establish the impact -- what else needs to change?

problame added a commit that referenced this issue May 2, 2024
…methods (#7566)

This PR introduces no functional changes.

The read path will be done separately.

refs #6107
refs #7386
@problame
Contributor

problame commented May 6, 2024

This week: plumb through RequestContext on the read path

Also, Vlad informed me that the switch to vectored get for all Timeline::get means that we'll stop using the PageCache for user data blocks. It's starting to roll out into prod this week.

problame added a commit that referenced this issue May 13, 2024
…ethods (#7720)

This PR introduces no functional changes.

The `open()` path will be done separately.

refs #6107
refs #7386
problame added a commit that referenced this issue May 13, 2024
…ethods (#7725)

This PR introduces no functional changes.

The `open()` path will be done separately.

refs #6107
refs #7386

Co-authored-by: Joonas Koivunen <[email protected]>
@problame
Contributor

This week:

  • conclude VirtualFile RequestContext'ification
    • slab-allocated RequestContext
    • include Tenant/Shard/Timeline in RequestContext
    • use that for VirtualFile metrics (see the sketch below)
  • review whether/where PS-PageCache-ing of data blocks is still relevant
  • revise plan
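A rough sketch of the RequestContext direction from the list above, assuming hypothetical field and type names rather than the actual `RequestContext` API:

```
// Hypothetical sketch (not the actual API): a RequestContext that carries
// tenant/shard/timeline identity so VirtualFile can label its metrics per
// timeline instead of deriving labels from file paths.

#[derive(Clone, Copy)]
struct TenantShardTimelineId {
    tenant_id: [u8; 16],
    shard_number: u8,
    timeline_id: [u8; 16],
}

struct RequestContext {
    task_kind: &'static str,
    // Set for requests scoped to a particular timeline; None for global tasks.
    scope: Option<TenantShardTimelineId>,
}

impl RequestContext {
    fn new(task_kind: &'static str, scope: Option<TenantShardTimelineId>) -> Self {
        // Contexts are created once per request/task; slab-allocating them
        // would avoid a fresh heap allocation on every getpage request.
        RequestContext { task_kind, scope }
    }
}

// Read-path methods then accept the context, e.g. (illustrative signature only):
// async fn read_exact_at(&self, buf: &mut [u8], offset: u64, ctx: &RequestContext) -> std::io::Result<()>
```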

problame added a commit that referenced this issue Jun 20, 2024
part of Epic #7386

# Motivation

The materialized page cache adds complexity to the code base, which
increases the maintenance burden and risk for subtle and hard to
reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is about
1% of materialized page cache lookups, for
`task_kind=PageRequestHandler`. Other task kinds have hit rates below 0.2%.

Last, caching page images in Pageserver rewards under-sized caches in
Computes because reading from Pageserver's materialized page cache over
the network is often sufficiently fast (low hundreds of microseconds).
Such Computes should upscale their local caches to fit their working
set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread
https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in `PageCache` is left in
place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.
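For illustration, what remains is roughly the following shape (a simplified sketch; names and field types in the actual code differ):

```
// Simplified sketch: after this PR, the PageCache only keys immutable file
// pages, i.e. blocks of on-disk layer files addressed by (file, block number).
enum CacheKey {
    ImmutableFilePage { file_id: u64, blkno: u32 },
    // The materialized-page key kind, previously keyed by (timeline, key, lsn),
    // is the one this PR removes.
}
```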

Some tests started failing because their total runtime depended on
high materialized page cache hit rates. This PR makes them
fixed-runtime or raises their pytest timeouts:
* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focussed on ensuring that this PR will not result in a performance
regression in prod.

* **getpage** requests: our production metrics have shown the
materialized page cache to be irrelevant (low hit rate). Also,
Pageserver is the wrong place to cache page images; that should happen in
compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics
show a 0 percent hit rate, so removing it will not be a regression.
* **get_lsn_by_timestamp**: an important API for branch creation, used by
the control plane. The clog pages that this code uses are not
materialized-page-cached because they're not 8k. No risk of introducing a
regression here.

We will watch the various nightly benchmarks closely for more results
before shipping to prod.
@problame
Contributor

Update:

  • we removed the materialized page cache (remove materialized page cache #8105)
  • the remaining data pages that are still in the PageCache are the InMemoryLayer pages
    • starting point for investigation: this panel as well as the entire "Drill down into Content Kind" section in that dashboard
    • NB: the link above compares the nightly pgbench run in staging the day before the materialized page cache was removed and the day after. Prod data likely looks different.

@problame
Contributor

This time range is instructive: there is a high access rate coming from LayerFlushTask for InMemoryLayers, which displaces DeltaLayerBtreeNode pages and causes cache misses for WalReceiverConnectionHandler and, to a lesser extent, Compaction.

https://neonprod.grafana.net/d/fe7f056c-3ee1-49ef-a08d-e66055099396/pageserver-page-cache?orgId=1&from=1718937864602&to=1718957534079&var-datasource=HUNg6jvVk&var-hit_rate_drill_task_kind=WalReceiverConnectionHandler&var-hit_rate_drill_content_kind=InMemoryLayer&var-adhoc=neon_region%7C%3D%7Cus-east-2&var-adhoc=instance%7C%3D%7Cpageserver-4.us-east-2.aws.neon.build


@problame
Contributor

problame commented Jun 21, 2024

If it's just the LayerFlushTask perhaps this is the time to implement

@problame
Contributor

problame commented Jul 1, 2024

@problame
Contributor

Status update:

VladLazar added a commit that referenced this issue Aug 6, 2024
## Problem

We have been maintaining two read paths (legacy and vectored) for a
while now. The legacy read-path was only used for cross validation in some tests.

## Summary of changes
* Tweak all tests that were using the legacy read path to use the
vectored read path instead
* Remove the read path dispatching based on the pageserver configs
* Remove the legacy read path code

We will be able to remove the single blob io code in
`pageserver/src/tenant/blob_io.rs` when #7386 is complete.

Closes #8005
problame added a commit that referenced this issue Aug 28, 2024
…lush (#8537)

Part of [Epic: Bypass PageCache for user data
blocks](#7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy (and code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offsets, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` to figure out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for the values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk (see the sketch after this list).
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` calls of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly instead of going through `blob_io`.
5. We can remove the `ephemeral_file::page_caching` construct now.
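A minimal sketch of the chunk coalescing referenced in the list above (the 512-byte and 128k constants match the description; the function and its signature are illustrative, not the PR's actual code):

```
// Illustrative sketch: given the byte ranges of values to read from the
// EphemeralFile (taken from the in-memory index), round them out to 512-byte
// DIO chunks and merge overlapping/adjacent chunks into reads of at most
// 128 KiB, so that each DIO chunk is read at most once.

const DIO_CHUNK_SIZE: u64 = 512;
const MAX_READ_SIZE: u64 = 128 * 1024;

/// Input: half-open byte ranges (start, end), one per value to read.
/// Output: coalesced, chunk-aligned read ranges.
fn plan_reads(mut value_ranges: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    // Round each value range out to DIO chunk boundaries.
    for (start, end) in value_ranges.iter_mut() {
        *start -= *start % DIO_CHUNK_SIZE;
        *end = (*end + DIO_CHUNK_SIZE - 1) / DIO_CHUNK_SIZE * DIO_CHUNK_SIZE;
    }
    value_ranges.sort_unstable();

    let mut reads: Vec<(u64, u64)> = Vec::new();
    for (start, end) in value_ranges {
        if let Some(last) = reads.last_mut() {
            // Merge if the ranges overlap or touch and the merged read stays small enough.
            if start <= last.1 && end.saturating_sub(last.0) <= MAX_READ_SIZE {
                last.1 = last.1.max(end);
                continue;
            }
        }
        reads.push((start, end));
    }
    reads
}
```

Each resulting range then corresponds to one `EphemeralFile::read_exact_at_eof_ok` call that serves every value read falling inside it.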

The `get_values_reconstruct_data` changes may seem like overkill, but they
are necessary so that we issue the equivalent number of read system calls
compared to before this PR, where it was highly likely that, even if the
first PageCache access was a miss, the remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page were
hits.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](#8130) and
#8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code, but it relies on
the invariant of Delta & Image layers that `index order == values order`.

Further, the `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO there is another level of
merging that should be done: if multiple reads map to the same
"DIO chunk" (an alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer once and
serve both reads from that buffer.
=> #8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.
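For illustration, the bit-packing could look roughly like this (the field widths here are an assumption, not necessarily the layout the PR uses):

```
// Illustrative sketch: pack (will_init, len, pos) into a single u64 so the
// in-memory index entry stays one word wide. Field widths are assumptions.
#[derive(Clone, Copy)]
struct IndexValue(u64);

impl IndexValue {
    const POS_BITS: u32 = 32; // byte offset of the value in the EphemeralFile
    const LEN_BITS: u32 = 31; // serialized value length; 1 bit left for will_init

    fn new(pos: u32, len: u32, will_init: bool) -> Self {
        assert!(len < (1 << Self::LEN_BITS));
        IndexValue(
            ((will_init as u64) << (Self::POS_BITS + Self::LEN_BITS))
                | ((len as u64) << Self::POS_BITS)
                | pos as u64,
        )
    }

    fn pos(self) -> u32 {
        (self.0 & ((1u64 << Self::POS_BITS) - 1)) as u32
    }
    fn len(self) -> u32 {
        ((self.0 >> Self::POS_BITS) & ((1u64 << Self::LEN_BITS) - 1)) as u32
    }
    fn will_init(self) -> bool {
        ((self.0 >> (Self::POS_BITS + Self::LEN_BITS)) & 1) == 1
    }
}
```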

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see a definitive improvement in the ingest performance microbenchmark and
in an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112 origin/main
HEAD: ef1c55c
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://[email protected]:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112 origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.

</details>

# Rollout

This PR changes the code in place and is not gated by a feature flag.