bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537
Conversation
3807 tests run: 3700 passed, 1 failed, 106 skipped (full report)

* Failures on Postgres 14
* Flaky tests (10): Postgres 16, Postgres 15, Postgres 14

Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results: b184f77 at 2024-08-28T18:40:06.504Z
force-pushed from 2b98201 to e0b40fa
force-pushed from 59a0df0 to e408cba
…ces (#8717)

The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird. The new `FullSlice` newtype is better. See the doc comment for details.

The naming is not ideal, but we'll clean that up in a future refactoring where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do the following:

* `tokio_epoll_uring::Slice` is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current `tokio_epoll_uring::Slice<IoBufMut>`

Context
-------

I did this work in preparation for #8537. There, I'm changing the type that `inmemory_layer.rs` passes to `DeltaLayerWriter::put_value_bytes`, so it seemed like a good opportunity to make this cleanup first.
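To illustrate the newtype idea, here is a hedged sketch; it is not the actual `FullSlice` code (which wraps the `tokio_epoll_uring` slice types), and the names and bounds below are assumptions. The point is that "this slice covers the whole buffer" becomes a type-level fact that is established once, at construction.

```rust
/// Illustrative sketch only; not the real `FullSlice` API.
pub struct FullSlice<B> {
    buf: B,
}

impl<B: AsRef<[u8]>> FullSlice<B> {
    /// Wrap an owned buffer; the slice is defined to be the entire buffer.
    pub fn new(buf: B) -> Self {
        FullSlice { buf }
    }

    /// Borrow the full contents.
    pub fn as_bytes(&self) -> &[u8] {
        self.buf.as_ref()
    }

    /// Give the owned buffer back, e.g. after an IO completes.
    pub fn into_inner(self) -> B {
        self.buf
    }
}

fn main() {
    let slice = FullSlice::new(vec![0u8; 8192]);
    assert_eq!(slice.as_bytes().len(), 8192);
    let _buf: Vec<u8> = slice.into_inner();
}
```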
force-pushed from e013f76 to b580a44
force-pushed from 6b27371 to c7e2846
force-pushed from c7e2846 to fb78185
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline: commit d9a57ae (origin/main)

    ingest-small-values/ingest 128MB/100b seq, no delta
        time:   [527.44 ms 543.79 ms 562.97 ms]
        thrpt:  [227.36 MiB/s 235.39 MiB/s 242.68 MiB/s]

HEAD~1

    ingest-small-values/ingest 128MB/100b seq, no delta
        time:   [491.37 ms 494.69 ms 498.30 ms]
        thrpt:  [256.87 MiB/s 258.75 MiB/s 260.49 MiB/s]
on my M2 MacBook Pro, running a Linux VM

    ./target/release/neon_local stop
    rm -rf .neon
    ./target/release/neon_local init
    ./target/release/neon_local start
    ./target/release/neon_local tenant create --set-default
    ./target/release/neon_local endpoint create foo
    ./target/release/neon_local endpoint start foo

    psql 'postgresql://[email protected]:55432/postgres'
    psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

    CREATE TABLE wal_test (
        id SERIAL PRIMARY KEY,
        data TEXT
    );

    DO $$
    DECLARE
        i INTEGER := 1;
    BEGIN
        WHILE i <= 500000 LOOP
            INSERT INTO wal_test (data) VALUES ('data');
            i := i + 1;
        END LOOP;
    END $$;

    -- => result is one L0 from initdb and one 137M-sized ephemeral-2

    DO $$
    DECLARE
        i INTEGER := 1;
        random_id INTEGER;
        random_record wal_test%ROWTYPE;
        start_time TIMESTAMP := clock_timestamp();
        selects_completed INTEGER := 0;
        min_id INTEGER := 1;          -- Minimum ID value
        max_id INTEGER := 100000;     -- Maximum ID value, based on your insert range
        iters INTEGER := 100000000;   -- Number of iterations to run
    BEGIN
        WHILE i <= iters LOOP
            -- Generate a random ID within the known range
            random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

            -- Select the row with the generated random ID
            SELECT * INTO random_record FROM wal_test WHERE id = random_id;

            -- Increment the select counter
            selects_completed := selects_completed + 1;

            -- Check if a second has passed
            IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
                -- Print the number of selects completed in the last second
                RAISE NOTICE 'Selects completed in last second: %', selects_completed;

                -- Reset counters for the next second
                selects_completed := 0;
                start_time := clock_timestamp();
            END IF;

            -- Increment the loop counter
            i := i + 1;
        END LOOP;
    END $$;

    ./target/release/neon_local stop

baseline: commit d9a57ae (origin/main)

    NOTICE: Selects completed in last second: 1286
    NOTICE: Selects completed in last second: 1352
    NOTICE: Selects completed in last second: 1365
    NOTICE: Selects completed in last second: 1399
    NOTICE: Selects completed in last second: 1410
    NOTICE: Selects completed in last second: 1393
    NOTICE: Selects completed in last second: 1316

ours

    NOTICE: Selects completed in last second: 1541
    NOTICE: Selects completed in last second: 1536
    NOTICE: Selects completed in last second: 1493
    NOTICE: Selects completed in last second: 1379
    NOTICE: Selects completed in last second: 1519
    NOTICE: Selects completed in last second: 1546
    NOTICE: Selects completed in last second: 1489
    NOTICE: Selects completed in last second: 1578
    NOTICE: Selects completed in last second: 1508
@jcsp @VladLazar I think this is ready for a first review. Wanted to get @jcsp's eyes on it because of sharded ingest, and @VladLazar's eyes because of vectored get / direct IO. @yliang412, you might also want to take another look at
I focused on the write path and the new implementation of `get_values_reconstruct_data`. I need another pass for the rest.
…_ok for load_to_vec
Reran benchmarks, still a slight improvement (see "edits" on the PR description for prior benchmark results).
Went through all of it again minus vectored dio. Looks good; nice job!
OK in principle; a couple of suggestions on naming, and I would like to confirm whether the TODO about strictly enforcing checkpoint/layer sizes is a risk to deploying this.
…ter top-down reading flow
Part of Epic: Bypass PageCache for user data blocks.
Problem
`InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`.

Background
Before this PR, `EphemeralFile` is a fancy (and code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offsets, we find a varint length followed by the serialized `Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored: each `Value` that needs to be read is read sequentially.

The `will_init` bit of information, which we use to early-exit `get_values_reconstruct_data` for a given key, is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`.

The L0 flushing also needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush.
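To make that cost concrete, here is a minimal sketch of what the pre-PR layout forces the read path to do per value. The varint decoding and the `Value` stand-in are illustrative assumptions, not the actual `blob_io`/`Value` code; the point is that one bit of information (`will_init`) can only be obtained by fetching and deserializing the whole value.

```rust
/// Stand-in for the serialized Value; in reality it is much richer.
/// We only care that `will_init` lives *inside* it.
struct Value {
    will_init: bool,
    payload: Vec<u8>,
}

/// Decode an illustrative LEB128-style varint length prefix.
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let (mut value, mut shift, mut consumed) = (0u64, 0u32, 0usize);
    for &byte in buf {
        value |= u64::from(byte & 0x7f) << shift;
        consumed += 1;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (value, consumed)
}

/// Old scheme: the index only has an offset, so learning `will_init` means
/// reading the length prefix, reading the whole value, and deserializing it.
fn will_init_at(file: &[u8], offset: usize) -> bool {
    let (len, hdr_len) = read_varint(&file[offset..]);
    let body = &file[offset + hdr_len..offset + hdr_len + len as usize];
    // pretend-deserialize: the point is that all the bytes must be fetched
    let value = Value { will_init: body[0] == 1, payload: body[1..].to_vec() };
    let _ = value.payload;
    value.will_init
}

fn main() {
    // One value at offset 0: varint length 3, then payload [1, 0xaa, 0xbb].
    let file = [0x03u8, 0x01, 0xaa, 0xbb];
    assert!(will_init_at(&file, 0));
}
```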
Changes

* `will_init` information is stored in the `InMemoryLayer::index` (see the sketch below). The `EphemeralFile` thus only needs to store the values.
* `get_values_reconstruct_data`:
  * The `index` figures out which values need to be read. Having the `will_init` stored in the index enables us to do that.
  * Reads are issued as `EphemeralFile::read_exact_at_eof_ok` calls of up to 128k (adjustable constant).
  * `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer.
* L0 flush reads the `will_init` from the `index` directly; `blob_io` is no longer used for the `EphemeralFile`, and neither is the `ephemeral_file::page_caching` construct now.

The `get_values_reconstruct_data` changes may seem a bit like overkill, but they are necessary so that we issue an equivalent amount of read system calls compared to before this PR, where it was highly likely that even if the first PageCache access was a miss, the remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit.

The "DIO chunk" stuff is, strictly speaking, unnecessary for page cache bypass, but, since we're working on direct IO and #8719 specifically, we need to do something like this anyway in the near future.
Alternative Design
The original plan was to use the `vectored_blob_io` code, but it relies on the invariant of Delta & Image layers that `index order == values order`.

Further, the `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done: specifically, if multiple reads map to the same "DIO chunk" (= alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer.

=> #8719
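A hedged sketch of that chunk-level merging idea (the chunk size, names, and data structures are made up for illustration; the real design is what #8719 tracks): reads are keyed by the DIO chunk(s) they fall into, so several reads landing in the same chunk cost only one chunk-sized, chunk-aligned IO.

```rust
use std::collections::BTreeMap;

/// Illustrative alignment requirement; the real value depends on the device/FS.
const DIO_CHUNK_SIZE: u64 = 512;

/// A logical read request: `len` bytes at `offset` in the file.
#[derive(Debug, Clone, Copy)]
struct Read {
    offset: u64,
    len: u64,
}

/// Group reads by the chunks they touch. Each chunk index that appears here
/// is read from disk exactly once; all reads mapping to it are then served
/// from that one chunk-sized, chunk-aligned buffer.
fn chunks_to_fetch(reads: &[Read]) -> BTreeMap<u64, Vec<Read>> {
    let mut by_chunk: BTreeMap<u64, Vec<Read>> = BTreeMap::new();
    for &r in reads {
        let first_chunk = r.offset / DIO_CHUNK_SIZE;
        let last_chunk = (r.offset + r.len.max(1) - 1) / DIO_CHUNK_SIZE;
        for chunk in first_chunk..=last_chunk {
            by_chunk.entry(chunk).or_default().push(r);
        }
    }
    by_chunk
}

fn main() {
    // Two small reads inside the same 512-byte chunk => one chunk IO.
    let reads = [Read { offset: 10, len: 20 }, Read { offset: 100, len: 8 }];
    let plan = chunks_to_fetch(&reads);
    assert_eq!(plan.len(), 1);
    println!("{} chunk IO(s) for {} reads", plan.len(), reads.len());
}
```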
Testing / Performance
Correctness of the IO merging code is ensured by unit tests.
Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`.

We see a definitive improvement in the ingest performance microbenchmark and in an ad-hoc microbenchmark for getpage against InMemoryLayer.
We don't have a micro-benchmark for InMemoryLayer and it's quite cumbersome to add one. So, I did manual testing in `neon_local`.

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.
Rollout
This PR changes the code in-place and is not gated by a feature flag.