GC parts of layers that are no longer needed (Legacy-Enhanced Compaction) #8002

Closed
24 of 33 tasks
Tracked by #8001
problame opened this issue Jun 10, 2024 · 9 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments


problame commented Jun 10, 2024

Child of 2024Q2 compaction work: #8001

This epic tracks the efforts to work around legacy compaction's deficiencies in the area of GC.

The problem: legacy compaction can produce a layer pattern where we can't GC layers beyond the PITR window (the "staircase" issue).

Solution (high-level): algorithm sketched in #7948

RFC: #8425

Step 1: make the algorithm work + some necessary tests

Step 2: make the algorithm able to handle large jobs (i.e., 10-20 GB per shard; multiplied by the shard count, that's ~100 GB total DB size)

Step 3: integrate with the rest of the system

Misc

Testing

What's next: latest image layer compaction

@problame problame added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic labels Jun 10, 2024
@problame problame changed the title Epic: GC parts of layers that are no longer needed GC parts of layers that are no longer needed Jun 10, 2024
@skyzh skyzh changed the title GC parts of layers that are no longer needed GC parts of layers that are no longer needed (Legacy-Enhanced Compaction) Jun 10, 2024
@problame

This week (and part of next week likely):

  • Coding work: items in Step 1
  • Design meeting to figure out how branching will work (@skyzh, @problame, @arpad-m, John, or a product person)

skyzh added a commit that referenced this issue Jun 13, 2024
#8002

We need a mock WAL record to make it easier to write unit tests. This pull
request adds such a record. It has a `clear` flag and an `append` field. The
tests for legacy-enhanced compaction are not modified yet and will be
part of the next pull request.
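
For intuition, here is a minimal, self-contained sketch of what such a mock record and its replay could look like; the type, field semantics, and `reconstruct` helper are illustrative assumptions, not the exact test code:

```rust
/// Hypothetical mock WAL record along the lines described above: `clear`
/// wipes the current value, then `append` is appended to it.
#[derive(Debug, Clone, PartialEq, Eq)]
struct MockWalRecord {
    clear: bool,
    append: String,
}

/// Replay records on top of an optional base image to materialize the value.
fn reconstruct(base_image: Option<String>, records: &[MockWalRecord]) -> String {
    let mut value = base_image.unwrap_or_default();
    for rec in records {
        if rec.clear {
            value.clear();
        }
        value.push_str(&rec.append);
    }
    value
}
```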

---------

Signed-off-by: Alex Chi Z <[email protected]>

skyzh commented Jun 17, 2024

Last week: iterator APIs almost done, mock WAL record.
This week: test with test_gc_feedback, resolve PR comments.


skyzh commented Jun 17, 2024

Updated plan this week: improve the streaming read interface, test with test_gc_feedback.

skyzh added a commit that referenced this issue Jun 18, 2024
The new image iterator and delta iterator use an iterator-based API.
#8006 / part of
#8002

This requires the underlying structure (the B-tree) to have an iterator API,
and the iterator should have a nameable type so that it can be stored
somewhere:

```rust
pub struct DeltaLayerIterator {
  index_iterator: BTreeIterator
}
```

versus

```rust
pub struct DeltaLayerIterator {
  index_iterator: impl Stream<....>
}
```

(the latter requires a nightly flag and is still buggy in the Rust compiler)


There are multiple ways to achieve this:

1. Write a `BTreeIterator` from scratch that provides an async `next`. This is
the most efficient option.
2. Wrap the current `get_stream` API, which is the approach taken in this pull
request.

In the future, we should do (1), and the `get_stream` API should be
refactored to use the iterator API. With (2), we have to wrap the
`get_stream` API in a `Pin<Box<dyn Stream>>`, which has the overhead of
dynamic dispatch. However, (1) needs a rewrite of the `visit` function, which
would take some time to write and review. I'd like to define this iterator
API first and work on a real iterator API later.
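
As a rough illustration of approach (2), the sketch below hides a `Stream` behind an iterator-style `next()` using `Pin<Box<dyn Stream>>`; it assumes the `futures` and `anyhow` crates, and the type and method names are illustrative rather than the actual pageserver API:

```rust
use std::pin::Pin;

use futures::{Stream, StreamExt};

pub struct DeltaLayerIteratorSketch<K, V> {
    // One vtable-dispatched poll per item: the dynamic-dispatch overhead
    // mentioned above.
    inner: Pin<Box<dyn Stream<Item = anyhow::Result<(K, V)>> + Send>>,
}

impl<K, V> DeltaLayerIteratorSketch<K, V> {
    pub fn new(stream: impl Stream<Item = anyhow::Result<(K, V)>> + Send + 'static) -> Self {
        Self { inner: Box::pin(stream) }
    }

    /// Pull the next key-value pair, or `None` at the end of the layer.
    pub async fn next(&mut self) -> Option<anyhow::Result<(K, V)>> {
        self.inner.next().await
    }
}
```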

## Summary of changes

Add `DiskBtreeIterator` and related tests.

Signed-off-by: Alex Chi Z <[email protected]>

skyzh commented Jun 24, 2024

Last week: B-tree iterator (but inefficient), fixed some bugs, passed test_gc_feedback.
This week: continue with the plan and merge iterator-related pull requests.

skyzh added a commit that referenced this issue Jun 24, 2024
Part of #8002

This pull request adds tests for bottom-most gc-compaction with delta
records. It also fixes a bug in the compaction process that created
overlapping delta layers, by force-splitting at the original delta layer
boundary.

---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Jun 25, 2024
part of #8002

## Summary of changes

This pull request adds the image layer iterator. It buffers a fixed
number of key-value pairs in memory and gives the developer an iterator
abstraction over the image layer. Once the buffer is exhausted, it
issues one I/O to fetch the next batch.

Due to Rust lifetime mysteries, the `get_stream_from` API has been
refactored into `into_stream`, which consumes `self`.

The delta layer iterator implementation will be similar, so I'll add
it after this pull request is merged.
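
The buffering idea, in a minimal self-contained sketch with stand-in types (a `Vec`-backed mock reader instead of the real on-disk image layer; names are not the actual `ImageLayerIterator` API):

```rust
use std::collections::VecDeque;

/// Stand-in for an on-disk image layer: `read_batch` models the single I/O
/// that fetches the next `max` key-value pairs after the current position.
struct MockImageLayerReader {
    data: Vec<(u64, Vec<u8>)>,
    pos: usize,
}

impl MockImageLayerReader {
    async fn read_batch(&mut self, max: usize) -> Vec<(u64, Vec<u8>)> {
        let end = (self.pos + max).min(self.data.len());
        let batch = self.data[self.pos..end].to_vec();
        self.pos = end;
        batch
    }
}

/// Buffered iterator: holds one batch in memory and issues one "I/O" only
/// when the buffer runs dry.
struct BufferedImageIterator {
    reader: MockImageLayerReader,
    buf: VecDeque<(u64, Vec<u8>)>,
    batch_size: usize,
    done: bool,
}

impl BufferedImageIterator {
    async fn next(&mut self) -> Option<(u64, Vec<u8>)> {
        if self.buf.is_empty() && !self.done {
            let batch = self.reader.read_batch(self.batch_size).await;
            if batch.len() < self.batch_size {
                // A short (or empty) batch means we reached the end of the layer.
                self.done = true;
            }
            self.buf.extend(batch);
        }
        self.buf.pop_front()
    }
}
```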

---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Jun 25, 2024
Adds a manual compaction trigger; adds gc-compaction to test_gc_feedback

Part of #8002

```
test_gc_feedback[debug-pg15].logical_size: 50 Mb
test_gc_feedback[debug-pg15].physical_size: 2269 Mb
test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302 
test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2 
test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb
test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb
test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312 
test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1
```

## Summary of changes

* Add the manual compaction trigger
* Use it in test_gc_feedback
* Add a guard to avoid running it with retain_lsns
* Fix: Do `schedule_compaction_update` after compaction
* Fix: Supply deltas in the correct order to reconstruct values

---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Jun 27, 2024
part of #8002

## Summary of changes

Add delta layer iterator and tests.

---------

Signed-off-by: Alex Chi Z <[email protected]>

skyzh commented Jul 1, 2024

Last week: merged iterator-related pull requests; brainstorm + design doc on branch compaction: https://www.notion.so/neondatabase/Bottom-Most-Compaction-and-Garbage-Collection-9474c713ddee4b6586c264a6df1e044b
This week: implement a retain_lsn PoC, merge the k-merge iterator pull request.

skyzh added a commit that referenced this issue Jul 10, 2024
Part of #8002. This pull
request adds a k-merge iterator for bottom-most compaction.

## Summary of changes

* Added back lsn_range / key_range in the delta layer inner. These were
removed due to #8050, but are added
back because the iterators need that information for lazy loading.
* Added a lazy-loading k-merge iterator.
* Added an iterator wrapper as a unified iterator type over image and delta
iterators.

The current state and tests should cover the L0 compaction use case,
so that the L0 compaction process can bypass the page cache and use a fixed
amount of memory. The next step is to integrate this with the new
bottom-most compaction.
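
For reference, a simplified, synchronous sketch of the k-merge ordering (the real `MergeIterator` is async and loads each layer lazily; types and layout here are stand-ins):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type Key = u64;
type Lsn = u64;

/// Merge per-layer inputs (modeled as Vecs sorted by (key, lsn)) into a
/// single output ordered by (key, lsn), so all deltas of a key come out in
/// the order needed for value reconstruction.
fn k_merge(layers: Vec<Vec<(Key, Lsn, Vec<u8>)>>) -> Vec<(Key, Lsn, Vec<u8>)> {
    // Min-heap over (key, lsn, layer index, position within that layer).
    let mut heap = BinaryHeap::new();
    for (layer_idx, layer) in layers.iter().enumerate() {
        if let Some((key, lsn, _)) = layer.first() {
            heap.push(Reverse((*key, *lsn, layer_idx, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((_key, _lsn, layer_idx, pos))) = heap.pop() {
        out.push(layers[layer_idx][pos].clone());
        // Advance only the layer we just consumed from (this is what keeps
        // memory bounded to one buffered entry per layer in this sketch).
        if let Some((key, lsn, _)) = layers[layer_idx].get(pos + 1) {
            heap.push(Reverse((*key, *lsn, layer_idx, pos + 1)));
        }
    }
    out
}
```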

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
problame added a commit that referenced this issue Jul 31, 2024
part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader:read_blk` which hits the
PageCache
[here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:
* Change `DiskBtreeReader::into_stream` to buffer the node instead of holding
a `PageCache` `PageReadGuard`.
  * Without this, we run out of page cache slots in
`test_pageserver_compaction_smoke`.
  * Generally, `PageReadGuard`s aren't supposed to be held across await
points, so this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.
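
A toy sketch of that validation idea (not the pageserver code; the names and the rate limiter are assumptions):

```rust
use std::time::{Duration, Instant};

/// Allow at most one log line per `period`.
struct RateLimiter {
    period: Duration,
    last_emitted: Option<Instant>,
}

impl RateLimiter {
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        match self.last_emitted {
            Some(t) if now.duration_since(t) < self.period => false,
            _ => {
                self.last_emitted = Some(now);
                true
            }
        }
    }
}

/// Compare the two value sequences item by item; on mismatch, warn (rate
/// limited) and report failure. The caller still uses the existing
/// approach's values either way.
fn validate_values<T: PartialEq + std::fmt::Debug>(
    existing: &[T],
    new_approach: &[T],
    limiter: &mut RateLimiter,
) -> bool {
    let mut ok = existing.len() == new_approach.len();
    for (i, (a, b)) in existing.iter().zip(new_approach).enumerate() {
        if a != b {
            ok = false;
            if limiter.allow() {
                eprintln!("compaction validation mismatch at item {i}: {a:?} != {b:?}");
            }
        }
    }
    ok
}
```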

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
  * no change in disk IO usage, because the kernel page cache will likely
have the pages buffered on the second read
* without validation:
  * slightly higher DRAM usage, because each iterator participating in the
k-merge has a dedicated buffer (as opposed to before, where compactions
would rely on the PS PageCache as a shared evicting buffer)
  * less disk IO if previously there were repeat PageCache misses (the likely
case on a busy production Pageserver)
  * lower CPU usage: with the PageCache out of the picture, fewer syscalls are
made (vectored blob io batches reads)

# Rollout

The new code is used, with validation mode enabled by default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the number of
participating layers, each of which contributes a fixed-size buffer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
skyzh added a commit that referenced this issue Jul 31, 2024
part of #8002

For child branches, we pull the image of the modified keys from the
parent into the child branch, which creates a full history for
generating key retention. If there are not enough delta keys, the image
won't be written in the end, and we will only keep the deltas inside the
child branch. In the future, we could avoid the wasteful work of pulling the
image from the parent if we knew the number of deltas in advance
(currently we always pull images for all modified keys in the child
branch).
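
Conceptually (a sketch under assumed types, not the actual compaction code), pulling the parent image gives each modified key a complete history on the child branch:

```rust
type Lsn = u64;

#[derive(Debug, Clone)]
enum Version {
    Image(Vec<u8>),
    Delta(Vec<u8>),
}

/// Full history for one key on a child branch: the parent's image at the
/// branch point, followed by the child's own deltas, sorted by LSN so that
/// key retention can be computed over a complete history.
fn full_history_for_key(
    parent_image_at_branch_point: Option<(Lsn, Version)>,
    child_deltas: Vec<(Lsn, Version)>,
) -> Vec<(Lsn, Version)> {
    let mut history = Vec::new();
    history.extend(parent_image_at_branch_point);
    history.extend(child_deltas);
    history.sort_by_key(|(lsn, _)| *lsn);
    history
}
```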


---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Jul 31, 2024
Should be working after #8328
gets merged. Part of #8002.

Adds a new perf benchmark case that ensures garbage can be collected
with branches.

---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Aug 1, 2024
part of #8002

Due to a limitation of the current layer map implementation, we cannot
directly replace a layer. Replacement is interpreted as an insert and a
deletion, and there will be a file-exists error when renaming the
newly-created layer to replace the old one. We work around that by changing
the end key of the image layer. A long-term fix would involve a refactor of
the layer file naming. For delta layers, we simply skip producing layers with
the same key range, though adding an extra key is a possible alternative
solution.

* The image layer range for layers generated from gc-compaction will
be Key::MIN..(Key::MAX-1), to avoid being recognized as an L0 delta
layer.
* Skip existing layers if it turns out that we need to generate a layer
with the same persistent key in the same generation.

Note that it is possible that the newly-generated layer has different
content from the existing layer. For example, when the user drops a
retain_lsn, the compaction could have combined or dropped some records,
therefore creating a smaller layer than the existing one. We discard the
"optimized" layer for now because we cannot deal with such rewrites
within the same generation.


---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
skyzh added a commit that referenced this issue Aug 5, 2024
part of #8002

## Summary of changes

Add a `SplitImageWriter` that automatically splits image layers based on
the estimated target image layer size. This does not consider compression,
and we might need a better metric.
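
A rough sketch of the size-based splitting with stand-in types (not the real split writer API); it rolls over to a new layer once a byte estimate would exceed the target size:

```rust
struct SplitImageWriterSketch {
    target_layer_size: u64,
    current_estimate: u64,
    finished_layers: Vec<Vec<(u64, Vec<u8>)>>, // each inner Vec models one layer file
    current_layer: Vec<(u64, Vec<u8>)>,
}

impl SplitImageWriterSketch {
    fn new(target_layer_size: u64) -> Self {
        Self {
            target_layer_size,
            current_estimate: 0,
            finished_layers: Vec::new(),
            current_layer: Vec::new(),
        }
    }

    fn put_image(&mut self, key: u64, value: Vec<u8>) {
        // Rough estimate: fixed key width + value bytes + per-entry index
        // overhead; real code would account for the on-disk format (and,
        // ideally, compression).
        let entry_size = 8 + value.len() as u64 + 16;
        if !self.current_layer.is_empty()
            && self.current_estimate + entry_size > self.target_layer_size
        {
            // Split point: close the current layer and start a new one.
            self.finished_layers.push(std::mem::take(&mut self.current_layer));
            self.current_estimate = 0;
        }
        self.current_estimate += entry_size;
        self.current_layer.push((key, value));
    }

    fn finish(mut self) -> Vec<Vec<(u64, Vec<u8>)>> {
        if !self.current_layer.is_empty() {
            self.finished_layers.push(self.current_layer);
        }
        self.finished_layers
    }
}
```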

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>
skyzh added a commit that referenced this issue Aug 5, 2024
part of #8002

Similar to #8574, we add
auto-split support for delta layers. Tests are reused from image layer
split writers.


---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Aug 6, 2024
…8557)

Add a dry-run mode that does not produce any image or delta layers. I
will use this code to do some experiments and see how much space we can
reclaim for tenants on staging. Part of
#8002

* Add a dry-run mode that runs the full compaction process without
updating the layer map. (We never call `finish` on the writers, and the
files will be removed before exiting the function.)
* Add compaction statistics and print them at the end of compaction.

---------

Signed-off-by: Alex Chi Z <[email protected]>
skyzh added a commit that referenced this issue Aug 23, 2024
)

A small PR to make it possible to run force compaction in staging for
btm-gc compaction testing.

Part of #8002

Signed-off-by: Alex Chi Z <[email protected]>

skyzh commented Aug 26, 2024

Going to resolve the comments, merge #8608, and run dry-run mode testing in staging this week.

skyzh added a commit that referenced this issue Aug 26, 2024
Part of #8002, the final big PR in the batch.

## Summary of changes

This pull request uses the new split layer writer in gc-compaction.

* It changes how layers are split. Previously, we split layers at
the original split points, but this creates too many layers
(test_gc_feedback ends up with one key per layer).
* Therefore, we first verify that the layer map can be processed by the
current algorithm (see #8191;
it's basically the same check).
* Based on that, we proceed with the compaction. This way, it creates layers
large enough to be close to the target layer size.
* Added a new set of `with_discard` functions to the split layer writer.
This helps us skip layers if we are going to produce the same persistent
key (see the sketch below).
* The delta writer keeps the updates of the same key in a single
file. This might create a very large layer, but we can optimize that
later.
* The split layer writer is used in the gc-compaction algorithm, and it
splits layers based on size.
* Fix: the image layer summary block encoded the wrong key range.
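
The `with_discard` idea, sketched with assumed names (a predicate decides per produced layer whether to keep it, e.g. because a layer with the same persistent key already exists in this generation):

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PersistentKey {
    key_start: u64,
    key_end: u64,
    lsn: u64,
}

/// Keep only the produced layers the predicate does not discard; discarded
/// ones would be deleted instead of being installed in the layer map.
fn finish_with_discard<F>(produced: Vec<PersistentKey>, mut discard: F) -> Vec<PersistentKey>
where
    F: FnMut(&PersistentKey) -> bool,
{
    produced.into_iter().filter(|layer| !discard(layer)).collect()
}

/// Example: discard any layer whose persistent key already exists.
fn keep_only_new(existing: &HashSet<PersistentKey>, produced: Vec<PersistentKey>) -> Vec<PersistentKey> {
    finish_with_discard(produced, |layer| existing.contains(layer))
}
```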

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>

skyzh commented Sep 2, 2024

Running in staging; will try larger tenants. Goal for September: run on production tenants. Secondary goal: collect aux v1 garbage.

skyzh added a commit that referenced this issue Sep 23, 2024
Part of #8002

Close #8920

Legacy compaction (as well as gc-compaction) relies on the GC process to
remove unused layer files, but this depends on many factors (e.g., key
partitioning) to ensure that data in a dropped table is eventually removed.

In gc-compaction, we consider the keyspace information during the
compaction process. If a key is not in the keyspace, we skip that
key and do not include it in the final output.

However, this is not easy to implement because gc-compaction considers
branch points (i.e., retain_lsns), and the retained keyspaces could
change across different LSNs. Therefore, for now, we only remove aux v1
keys in the compaction process.

## Summary of changes

* Add `FilterIterator` to filter out keys.
* Integrate `FilterIterator` with gc-compaction.
* Add `collect_gc_compaction_keyspace` for a spec of keyspaces that can
be retained during the gc-compaction process.
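
A minimal sketch of the filtering idea with stand-in types (sorted key ranges model the retained keyspace; this is not the real `FilterIterator` API):

```rust
use std::ops::Range;

struct KeyspaceFilter<I: Iterator<Item = (u64, Vec<u8>)>> {
    inner: I,
    retained: Vec<Range<u64>>, // sorted, non-overlapping key ranges to keep
}

impl<I: Iterator<Item = (u64, Vec<u8>)>> Iterator for KeyspaceFilter<I> {
    type Item = (u64, Vec<u8>);

    fn next(&mut self) -> Option<Self::Item> {
        // Skip keys (e.g. dropped aux v1 keys) that no retained range covers.
        let retained = &self.retained;
        self.inner
            .find(|(key, _)| retained.iter().any(|range| range.contains(key)))
    }
}
```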

---------

Signed-off-by: Alex Chi Z <[email protected]>

skyzh commented Sep 23, 2024

The implementation is mostly complete, and I'm closing this in favor of #9114.

@skyzh skyzh closed this as completed Sep 23, 2024