
pageserver: direct I/O #8130

Open
14 of 29 tasks
jcsp opened this issue Jun 21, 2024 · 18 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)

Comments

@jcsp (Collaborator)

jcsp commented Jun 21, 2024

Project Slack: #proj-pageserver-direct-io


Currently, we do buffered reads of data pages. Direct reads would be a better fit, because:

  • Pageserver data pages have extremely low temporal locality on reads, because any repeatedly accessed pages are cached inside postgres. Caching them in the kernel page cache is therefore largely a waste of memory that we could be using for other things.
  • The kernel page cache gives deceptively fast read performance on lightly loaded pageservers, making performance less consistent as pageservers are packed with larger numbers of tenants.

Tasks


Backpointer to the Slack DMs between John and Christian about this: https://neondb.slack.com/archives/D05KTCVS40H/p1718977335312439

Phase 1 Impl

(7 tasks; titles not rendered in this export. Assignees: yliang412, problame. Labels: c/storage/pageserver, t/feature.)

Phase 2 Impl

(4 tasks; titles not rendered in this export. Assignees: yliang412, problame. Label: c/storage/pageserver.)

Follow-ups

(4 tasks; titles not rendered in this export. Assignee: problame. Labels: c/storage/pageserver, m/good_first_issue.)
@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Jun 21, 2024
@problame problame changed the title pageserver: direct I/O for reads pageserver: direct I/O Jul 9, 2024
@problame
Contributor

problame commented Jul 9, 2024

The direct IO RFC covers both the read and write paths. Renamed this epic to include reads and writes. (I don't think it's significantly more work.)

@yliang412 yliang412 self-assigned this Jul 29, 2024
yliang412 added a commit that referenced this issue Aug 7, 2024
Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)

## Description

Add pageserver config for evaluating/enabling direct I/O.

- Disabled: current default; uses buffered IO as-is.
- Evaluate: still uses buffered IO, but can do alignment checking and
perf simulation (pad latency by doing direct-IO reads/writes to a fake file).
- Enabled: uses direct IO; behavior on alignment errors is configurable.


Signed-off-by: Yuchen Liang <[email protected]>
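
For illustration, the three modes described above could be modeled roughly like this. This is a hedged sketch with hypothetical names, not the actual pageserver config types:

```rust
/// Sketch of the three-mode direct-IO config described above.
/// Names are illustrative, not the actual pageserver types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DirectIoMode {
    /// Current default: plain buffered IO, unchanged behavior.
    Disabled,
    /// Still buffered IO, but optionally check buffer/offset alignment
    /// and simulate direct-IO latency by mirroring IOs to a fake file.
    Evaluate {
        alignment_check: bool,
        latency_simulation: bool,
    },
    /// Real direct IO; what to do on an unaligned IO is configurable.
    Enabled {
        on_alignment_error: OnAlignmentError,
    },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnAlignmentError {
    /// Fail the IO with an error.
    Error,
    /// Log and fall back to buffered IO for this operation.
    FallbackToBuffered,
}
```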
jcsp pushed a commit that referenced this issue Aug 12, 2024
(same change as above)
@yliang412
Contributor

Last week:

  • Drafted PS Config for Direct I/O
  • Set up managed testbench on EC2

This week:

  • Finish Simulation Mode + alignment check
  • Use aligned buffers on read path

@problame
Contributor

problame commented Aug 13, 2024

Sync meeting:

@yliang412
Contributor

@yliang412
Contributor

yliang412 commented Aug 26, 2024

Last week:

problame added a commit that referenced this issue Aug 26, 2024
refs #6989

Problem
-------

After unclean shutdown, we get restarted, start reading the local filesystem,
and make decisions based on those reads. However, some of the data might not
yet have been fsynced when the unclean shutdown completed.

Durability matters even though Pageservers are conceptually just a cache
of state in S3. For example:
- the cloud control plane is not a control loop => pageserver responses
  to tenant attachment, etc., need to be durable.
  - the storage controller does not rely on this (as much?)
- we don't have layer file checksumming, so, downloaded+renamed but not
  fsynced layer files are technically not to be trusted
  - #2683

Solution
--------

`syncfs` the tenants directory during startup, before we start reading from it.

This is a bit of overkill, because we do remove some temp files (InMemoryLayer!)
later during startup. Further, those temp files are particularly likely to
be dirty in the kernel page cache. However, we don't want to refactor that
cleanup code right now, and the amount of dirty data on pageservers is generally
not that high. Lastly, with [direct IO](#8130) we're going to have near-zero
kernel page cache usage quite soon anyway.
problame added two more commits that referenced this issue Aug 26, 2024 (same change as above)
yliang412 added a commit that referenced this issue Aug 28, 2024
Part of #8130, closes #8719.

## Problem

Currently, vectored blob IO only coalesces blocks if they are immediately
adjacent to each other. When we switch to direct IO, we need a way to
coalesce blobs that are within the dio-aligned boundary but have gaps
between them.

## Summary of changes

- Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and
`StreamingVectoredReadPlanner`, with two modes:
  - `AdjacentOnly` (current implementation)
  - `Chunked(<alignment requirement>)`
- New `ChunkedVectorBuilder` that batches `dio-align`-sized reads; the
start and end of the vectored read respect `stx_dio_offset_align` /
`stx_dio_mem_align`, so `vectored_read.start` and
`vectored_read.blobs_at.first().start_offset` will be two different
values. (A simplified sketch of this coalescing follows the commit
message.)
- Since we break the assumption that blobs within a single `VectoredRead`
are next to each other (implicit end offset), we now store blob end
offsets in the `VectoredRead`.
- Adapted existing tests to run in both `VectoredReadCoalesceMode`s.
- The IO alignment can also be live-configured at runtime.

Signed-off-by: Yuchen Liang <[email protected]>
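
To make the chunked coalescing concrete, here is a simplified sketch of the idea (not the actual `ChunkedVectorBuilder` code): round each blob's byte range out to alignment-sized chunk boundaries and merge blobs whose aligned ranges overlap or touch. Note how the read start differs from the first blob's start offset, as described above.

```rust
/// Plan aligned reads for `blobs`, given as (start, end) byte ranges
/// sorted by start offset. Returns (read_start, read_end, member_blobs)
/// triples where read_start/read_end are multiples of `align`.
fn plan_reads(blobs: &[(u64, u64)], align: u64) -> Vec<(u64, u64, Vec<(u64, u64)>)> {
    let mut reads: Vec<(u64, u64, Vec<(u64, u64)>)> = Vec::new();
    for &(start, end) in blobs {
        let read_start = start - start % align; // round down to a chunk boundary
        let read_end = (end + align - 1) / align * align; // round up
        match reads.last_mut() {
            // Aligned ranges overlap or touch: serve both blobs from one read.
            Some((_, cur_end, members)) if read_start <= *cur_end => {
                *cur_end = (*cur_end).max(read_end);
                members.push((start, end));
            }
            _ => reads.push((read_start, read_end, vec![(start, end)])),
        }
    }
    reads
}

fn main() {
    // Two blobs with a gap between them, but within the same 512-byte
    // chunk: coalesced into a single read starting at offset 0, even
    // though the first blob starts at offset 10.
    let reads = plan_reads(&[(10, 100), (200, 300), (4096, 4200)], 512);
    assert_eq!(reads[0], (0, 512, vec![(10, 100), (200, 300)]));
    assert_eq!(reads[1], (4096, 4608, vec![(4096, 4200)]));
}
```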
yliang412 added a commit that referenced this issue Aug 28, 2024 (same change as above)
problame added a commit that referenced this issue Aug 28, 2024
…lush (#8537)

Part of [Epic: Bypass PageCache for user data
blocks](#7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy (and code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At each offset, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` to figure out which values need to be read.
Having the `will_init` bit stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for the values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` calls of up to 128k (adjustable
constant). (A sketch of this chunk planning follows the Changes list.)
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io` 
5. We can remove the `ephemeral_file::page_caching` construct now.
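
A minimal sketch of the chunk planning in step 2, under assumed constants and names (the real code lives in the `EphemeralFile` read path):

```rust
const CHUNK: u64 = 512; // "DIO chunk" size, minimal direct-IO read unit
const MAX_READ: u64 = 128 * 1024; // merge adjacent chunks up to this size

/// Map each value's (offset, len) to DIO chunk indices, dedupe so every
/// chunk is read at most once, then merge adjacent chunks into larger
/// reads. Assumes len > 0 for all values.
fn plan_chunk_reads(values: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut chunks: Vec<u64> = values
        .iter()
        .flat_map(|&(off, len)| (off / CHUNK)..=((off + len - 1) / CHUNK))
        .collect();
    chunks.sort_unstable();
    chunks.dedup();

    let mut reads: Vec<(u64, u64)> = Vec::new();
    for c in chunks {
        let (start, end) = (c * CHUNK, (c + 1) * CHUNK);
        match reads.last_mut() {
            // Chunk is adjacent to the previous read and stays under the cap.
            Some((s, e)) if *e == start && end - *s <= MAX_READ => *e = end,
            _ => reads.push((start, end)),
        }
    }
    reads
}

fn main() {
    // Values at offsets 0 and 300 share chunk 0; the value at 4096 is chunk 8.
    assert_eq!(
        plan_chunk_reads(&[(0, 100), (300, 100), (4096, 100)]),
        vec![(0, 512), (4096, 4608)]
    );
}
```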

The `get_values_reconstruct_data` changes may seem like overkill, but
they are necessary so that we issue an equivalent number of read system
calls compared to before this PR, where it was highly likely that, even
if the first PageCache access was a miss, the remaining reads within the
same `get_values_reconstruct_data` call from the same `EphemeralFile`
page were hits.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](#8130) and
#8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code, but it relies
on the invariant of Delta & Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> #8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see a definitive improvement in the ingest performance microbenchmark
and in an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112 origin/main
HEAD: ef1c55c
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://[email protected]:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112 origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca. 1 MiB, ours being 1 MiB smaller.

</details>

# Rollout

This PR changes the code in-place and is not gated by a feature flag.
@yliang412
Contributor

yliang412 commented Sep 2, 2024

Last week:

This week:

@problame
Contributor

problame commented Sep 4, 2024

Sync meeting:

  • ship 4c0a61d as a preliminary PR, to land in prod as part of next week's release
    • retain .freeze() and .slice() for uncompressed buffers
    • additional allocation for compressed buffers is ok
    • review from @problame and @arpad-m
  • continue feat(pageserver): newtype for dio-aligned buffer allocation #8730
    • actually do O_DIRECT (quick and dirty)
    • see if it works functionally (local test bench, regression test suite run)
    • some basic pagebench getpage-at-latest-lsn benchmarking
      • 100% PS pagecache hit rate, resize like prod config, or resize further and take a note
      • 100% virtualfile fd cache hit rate, resize like prod config
    • fix until no failures
    • => this demonstrates we can do O_DIRECT on the read path without errors, and we have a performance ballpark

Next steps after that:

@problame problame self-assigned this Sep 9, 2024
@yliang412
Contributor

Last week:

yliang412 added a commit that referenced this issue Sep 24, 2024
Part of #8130.

## Problem

Currently, decompression is performed within the `read_blobs`
implementation, and the decompressed blob is appended to the end of
the `BytesMut` buffer. We will lose this flexibility of extending the
buffer when we switch to using our own dio-aligned buffer (WIP in
#8730). To facilitate the
adoption of aligned buffers, we need to refactor the code to perform
decompression outside `read_blobs`.

## Summary of changes

- `VectoredBlobReader::read_blobs` now returns `VectoredBlob`s without
performing decompression and appending the decompressed blob; it becomes
the caller's responsibility to decompress the buffer.
- Added a new `BufView` type that functions like a `Cow<Bytes, &[u8]>`
(sketched after this commit message).
- Perform decompression within `VectoredBlob::read` so that callers don't
have to think explicitly about compression when using the reader
interface.

Signed-off-by: Yuchen Liang <[email protected]>
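
A sketch of the `BufView` idea (illustrative; the real type and methods may differ): a view that is either a borrowed slice of the read buffer for uncompressed blobs, or an owned buffer produced by decompression.

```rust
use bytes::Bytes;

/// Cow-like view over blob bytes: `VectoredBlob::read` can return
/// `Borrowed` for uncompressed blobs (zero-copy out of the IO buffer)
/// and `Owned` for blobs it had to decompress into a fresh allocation.
pub enum BufView<'a> {
    Borrowed(&'a [u8]),
    Owned(Bytes),
}

impl<'a> BufView<'a> {
    pub fn as_slice(&self) -> &[u8] {
        match self {
            BufView::Borrowed(slice) => slice,
            BufView::Owned(bytes) => bytes,
        }
    }
}
```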
@yliang412
Contributor

Sync Meeting:

yliang412 added a commit that referenced this issue Sep 27, 2024
…buffer_alignment to 512 (#9175)

Part of #8130

## Problem

After deploying https://github.com/neondatabase/infra/pull/1927, we
shipped `io_buffer_alignment=512` to all prod regions. The
`AdjacentVectoredReadBuilder` code path is no longer taken, yet we are
still running the pageserver unit tests 6 times in CI. Removing it would
reduce the test duration by 30-60s.

## Summary of changes

- Remove `AdjacentVectoredReadBuilder` code.
- Bump the minimum `io_buffer_alignment` requirement to at least 512
bytes.
- Use default `io_buffer_alignment` for Rust unit tests.

Signed-off-by: Yuchen Liang <[email protected]>
bayandin pushed a commit that referenced this issue Sep 29, 2024
…buffer_alignment to 512 (#9175) (same change as above)
@yliang412
Contributor

yliang412 commented Sep 30, 2024

Last week:

This week:

  • Doubling pagecache size, to be deployed as part of this week's release: https://github.com/neondatabase/infra/pull/1961
  • Running pagebench on the ec2-testbench with a prod-like pageserver config; compare that to buffered IO.
  • Plan and, if needed, implement pagecache working-set estimation
    • estimate what its last-N-minutes memory footprint would need to be for a 100% page cache hit rate.
    • Slack Discussion

@yliang412
Contributor

yliang412 commented Oct 4, 2024

Sync Meeting:

Read Path Direct IO Deployment Plan

  • Ship the VirtualFile changes that allow us to use direct IO for reads while still using buffered IO for writes (pageserver: add direct io config to virtual file #9214). See the O_DIRECT sketch after this list.
  • Get rid of io_buffer_alignment config, always use 512 for alignment and chunk size.
  • Finish switching in-memory layer to use direct IO. Ship the direct IO read path.
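
For reference, opening a file for O_DIRECT reads on Linux looks roughly like the following sketch (not the actual `VirtualFile` code; requires the `libc` crate):

```rust
use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;

/// Open a file for direct-IO reads (Linux only). With O_DIRECT, the
/// buffer address, file offset, and length must all honor the alignment
/// reported by statx (`stx_dio_mem_align` / `stx_dio_offset_align`),
/// which is why a fixed 512-byte alignment/chunk size is used everywhere.
fn open_direct(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```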

Page cache

  • Double page cache size again in next week's release.
  • Implement Rolling HyperLogLog next week (see the sketch after this list).
    • think through compaction
  • Watch page cache hit rate and relationship with getpage@lsn latency
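
For context, a rolling working-set estimate can be built from per-minute HyperLogLog sketches: each getpage access inserts its page ID into the current minute's sketch, and the working set over the last N minutes is the estimate of their union. A toy sketch (standard HLL without small-range correction; parameters and names are assumptions, not the planned implementation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Hll {
    registers: Vec<u8>, // 2^b registers
    b: u32,
}

impl Hll {
    fn new(b: u32) -> Self {
        Self { registers: vec![0; 1 << b], b }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        let x = h.finish();
        let idx = (x >> (64 - self.b)) as usize; // first b bits pick a register
        let rank = (x << self.b).leading_zeros() as u8 + 1; // position of first 1-bit
        self.registers[idx] = self.registers[idx].max(rank);
    }

    /// Union of two sketches = register-wise max. This is what makes the
    /// "rolling" part cheap: merge the last N per-minute sketches.
    fn merge(&mut self, other: &Hll) {
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        0.7213 / (1.0 + 1.079 / m) * m * m / sum
    }
}

fn main() {
    let mut minute = Hll::new(12); // 4096 registers, ~1.6% standard error
    for page in 0u64..100_000 {
        minute.insert(&page);
    }
    println!("distinct pages this minute: ~{:.0}", minute.estimate());
}
```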

@yliang412
Contributor

yliang412 commented Oct 14, 2024

Last week:

This week:

@yliang412
Contributor

yliang412 commented Oct 21, 2024

Last week:

This week:

yliang412 added a commit that referenced this issue Oct 21, 2024
Part of #8130 

## Problem

The pageserver previously went through the kernel page cache for all
IOs. The kernel page cache makes a lightly loaded pageserver appear
deceptively fast. Using direct IO offers predictable latencies for our
virtual file IO operations.

In particular for reads, the data pages have extremely low temporal
locality, because the most frequently accessed pages are cached on the
compute side.

## Summary of changes

This PR enables the pageserver to use direct IO for delta-layer and
image-layer reads. We can ship this separately because these layers are
write-once, read-many, so we will not be mixing buffered IO with direct
IO.

- implement `IoBufferMut`, a buffer type with aligned allocation
(currently set to 512); a sketch of the aligned-allocation idea follows
below.
- use `IoBufferMut` at all places where we are doing reads on image +
delta layers.
- leverage the Rust type system and use an `IoBufAlignedMut` marker
trait to guarantee that the input buffers for the IO operations are
aligned.
- page cache allocation is also made aligned.

_* in-memory layer reads and the write path will be shipped separately._
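
The core idea behind an aligned buffer type like `IoBufferMut` is an allocation whose address is a multiple of the IO alignment. A minimal sketch (illustrative, not the actual implementation):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

const ALIGN: usize = 512; // O_DIRECT memory-alignment requirement

/// Owned, 512-byte-aligned, zero-initialized buffer. A real `IoBufferMut`
/// would also track a filled length, implement the IO-buffer traits, etc.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        assert!(len % ALIGN == 0, "direct-IO lengths must be aligned too");
        let layout = Layout::from_size_align(len, ALIGN).unwrap();
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null());
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```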

## Testing

Integration test suite run with O_DIRECT enabled:
#9350

## Performance

We evaluated performance with the `get-page-at-latest-lsn`
benchmark. The results demonstrate a decrease in the number of IOPS, no
significant change in the mean latency, and a slight improvement in the
p99.9 and p99.99 latencies.


[Benchmark](https://www.notion.so/neondatabase/Benchmark-O_DIRECT-for-image-and-delta-layers-2024-10-01-112f189e00478092a195ea5a0137e706?pvs=4)

## Rollout

We will add `virtual_file_io_mode=direct` region by region to enable
direct IO on image + delta layers.

Signed-off-by: Yuchen Liang <[email protected]>
@yliang412
Contributor

Last week:

This week:

@yliang412
Contributor

yliang412 commented Nov 25, 2024

Last week:

  • test and benchmark the double-buffered writer (sketched after this list)
    • performance improved for small values.
    • performance regressed for large values, likely due to being CPU-bound + disk write latency.
      • not as simple as turning on fsync for benchmarks; we don't fsync in-memory layers in prod.
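
For context, the double-buffered writer idea is roughly the following: while one buffer is being flushed, the other keeps absorbing writes, so ingest doesn't stall on write latency. This is a simplified, thread-based sketch with assumed names, not the actual (async) implementation:

```rust
use std::io::Write;
use std::thread;

struct DoubleBufferedWriter<W: Write + Send + 'static> {
    filling: Vec<u8>,
    capacity: usize,
    writer: Option<W>,                                         // idle writer, if any
    inflight: Option<thread::JoinHandle<std::io::Result<W>>>,  // flush in progress
}

impl<W: Write + Send + 'static> DoubleBufferedWriter<W> {
    fn new(writer: W, capacity: usize) -> Self {
        Self { filling: Vec::with_capacity(capacity), capacity, writer: Some(writer), inflight: None }
    }

    fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.filling.extend_from_slice(data);
        if self.filling.len() >= self.capacity {
            self.flush_background()?;
        }
        Ok(())
    }

    /// Swap buffers: hand the full one to a background flush while the
    /// caller keeps writing into a fresh one. Waits for the previous
    /// flush first, so at most two buffers exist at a time.
    fn flush_background(&mut self) -> std::io::Result<()> {
        let mut w = match self.inflight.take() {
            Some(handle) => handle.join().unwrap()?,
            None => self.writer.take().unwrap(),
        };
        let full = std::mem::replace(&mut self.filling, Vec::with_capacity(self.capacity));
        self.inflight = Some(thread::spawn(move || {
            w.write_all(&full)?; // a real impl may also fsync here
            Ok(w)
        }));
        Ok(())
    }
}
```

One plausible reading of the large-value regression above, given this structure: once individual values approach the buffer capacity, every write triggers a swap that must first wait for the previous flush, so the writer effectively serializes on CPU copies plus disk write latency.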

This week:

@yliang412
Contributor

yliang412 commented Dec 2, 2024

Last week:

  • Ran more benchmarks using Erik's bulk-insertion benchmark (test_ingest_insert_bulk) with a focus on pageserver ingestion performance (the "recovery" time): Result
    • Absolute throughput is higher on the Hetzner box (I feel this is expected).
    • Double buffering with buffered IO always improves performance.
    • Double buffering with direct IO shows a performance improvement on the Hetzner box, but a regression on im4gn.2xlarge.

This week:

@yliang412
Contributor

yliang412 commented Dec 9, 2024

Last week:

This week:

