Epic: productionize gc-compaction #9114

skyzh · 2024-09-23T19:44:05Z

In #8002, we finished the gc-compaction functionality and ran it successfully on some small tenants in staging. The next step is to run it on larger tenants and on production tenants. This involves work to improve the gc-compaction process and to add new features.

Functionality

gc-compaction: filter keys not in shard / not in keyspace #8920 (see feat(pageserver): drop disposable keys during gc-compaction #9765)
gc-compaction: split compaction jobs across keyspace #8921
feat(pageserver): support key range for manual compaction trigger #9723
split of the compaction job
support cancelling a compaction job via the API (part of feat(pageserver): support schedule gc-compaction #9809)
support automatic trigger (feat(pageserver): add automatic trigger for gc-compaction #10221)

This ensures that the compaction process can clean up data that legacy compaction could not clean up, and that it can yield to more important compaction jobs if it would take a long time.

Misc

Testing

Run on large staging tenants
Run on inactive production tenants
Run on active production tenants
Enable auto trigger

part of #9114

## Summary of changes

gc-compaction may take a lot of disk space, and if it does, the caller should do a partial gc-compaction. This patch adds a space check for the compaction job.

---------

Signed-off-by: Alex Chi Z <[email protected]>
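The space check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the pageserver's actual code: `LayerInfo`, `has_enough_space`, and the `safety_factor` parameter are hypothetical names, and the idea that the output may be as large as all inputs combined times a factor is only an illustrative policy.

```rust
/// Hypothetical description of a layer selected for compaction.
struct LayerInfo {
    file_size: u64,
}

/// Returns true when the free disk space covers the assumed worst-case
/// output size of the compaction job.
///
/// Assumption: the rewritten layers may be as large as all input layers
/// combined, scaled by `safety_factor`. The real policy may differ.
fn has_enough_space(layers: &[LayerInfo], available_bytes: u64, safety_factor: u64) -> bool {
    let total: u64 = layers.iter().map(|l| l.file_size).sum();
    // saturating_mul avoids overflow for very large inputs.
    total.saturating_mul(safety_factor) <= available_bytes
}
```

If the check fails, the caller would fall back to a partial gc-compaction over a smaller key range instead of running the full job.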

skyzh · 2024-10-21T16:55:43Z

This week: hopefully get #9048 + #9134 merged

…#9493)

part of #9114, #8836, #8362

The split layer writer code can be used in a more general way: the caller puts unfinished writers into the batch layer writer and lets the batch layer writer ensure the atomicity of the layers it produces.

## Summary of changes

* Add batch layer writer, which atomically finishes the layers. `BatchLayerWriter::finish` is simply a copy-paste from the previous split layer writers.
* Refactor split writers to use the batch layer writer.
* The current split writer tests cover all code paths of the batch layer writer.

---------

Signed-off-by: Alex Chi Z <[email protected]>
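The all-or-nothing behaviour of a batch writer can be sketched as follows. This is an illustration only: `PendingLayer`, `BatchWriter`, and the `fail_on_finish` test hook are hypothetical and do not mirror the real `BatchLayerWriter` API, and layers are reduced to plain names.

```rust
/// Hypothetical stand-in for an unfinished layer writer.
struct PendingLayer {
    name: String,
    /// Test hook that simulates an I/O error while finishing.
    fail_on_finish: bool,
}

impl PendingLayer {
    fn finish(self) -> Result<String, String> {
        if self.fail_on_finish {
            Err(format!("failed to finish {}", self.name))
        } else {
            Ok(self.name)
        }
    }
}

/// Collects unfinished writers and finishes them as one batch.
#[derive(Default)]
struct BatchWriter {
    pending: Vec<PendingLayer>,
}

impl BatchWriter {
    fn add_unfinished(&mut self, layer: PendingLayer) {
        self.pending.push(layer);
    }

    /// Finishes every pending layer, or returns an error and produces none.
    fn finish(self) -> Result<Vec<String>, String> {
        let mut finished = Vec::with_capacity(self.pending.len());
        for layer in self.pending {
            match layer.finish() {
                Ok(name) => finished.push(name),
                Err(e) => {
                    // Roll back: discard the layers finished so far. A real
                    // implementation would also delete their files here.
                    finished.clear();
                    return Err(e);
                }
            }
        }
        Ok(finished)
    }
}
```

The caller sees either the complete set of produced layers or an error, never a partial set.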

…#9134)

part of #8921, #9114

## Summary of changes

We start the partial compaction implementation with the image layer partial generation. The partial compaction API now takes a key range. We will only generate images for that key range for now, and remove layers fully included in the key range after compaction.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>

The final patch for partial compaction, part of #9114, close #8921

(Note that we didn't implement parallel compaction or a compaction scheduler for partial compaction -- currently this needs to be scheduled by using a Python script to split the keyspace. In the future, the keyspace will be split automatically based on the key partitioning when the pageserver wants to trigger a gc-compaction.)

## Summary of changes

* Update the layer selection algorithm to use the same selection as full compaction (everything that intersects or is below the gc horizon).
* Update the layer selection algorithm to also generate a list of delta layers that need to be rewritten.
* Add the logic to rewrite delta layers and add them back to the layer map.
* Update the test case to do partial compaction on deltas.

---------

Signed-off-by: Alex Chi Z <[email protected]>
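The selection rule can be illustrated with a small classifier. This is a sketch under simplifying assumptions, not the actual algorithm: keys and LSNs are plain `u64` values, `Layer`, `Action`, and `classify` are hypothetical names, and "intersects or is below the gc horizon" is approximated as "the layer's start LSN is at or below the horizon".

```rust
use std::ops::Range;

/// Simplified layer: a rectangle in (key, LSN) space.
struct Layer {
    key_range: Range<u64>,
    lsn_range: Range<u64>,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Not selected by this compaction job.
    Skip,
    /// Fully inside the job's key range: removed after compaction.
    Replace,
    /// Only partly inside the key range: must be rewritten.
    Rewrite,
}

fn classify(layer: &Layer, job_keys: &Range<u64>, gc_horizon: u64) -> Action {
    // Half-open ranges overlap when each one starts before the other ends.
    let overlaps =
        layer.key_range.start < job_keys.end && job_keys.start < layer.key_range.end;
    // Assumed reading of "intersects or is below the gc horizon".
    let at_or_below_horizon = layer.lsn_range.start <= gc_horizon;
    if !overlaps || !at_or_below_horizon {
        return Action::Skip;
    }
    let fully_covered =
        job_keys.start <= layer.key_range.start && layer.key_range.end <= job_keys.end;
    if fully_covered {
        Action::Replace
    } else {
        Action::Rewrite
    }
}
```

Layers classified as `Rewrite` correspond to the delta layers that straddle the job boundary and have to be written back to the layer map.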

I had the impression that gc-compaction didn't test the case where the first record of the key history is will_init, because some code paths would panic in this case. Luckily this got fixed in #9026, so we can now implement such tests.

Part of #9114

## Summary of changes

* Randomly changed some images into will_init neon wal records.
* Split `test_simple_bottom_most_compaction_deltas` into two test cases: one has the bottom layer as a delta layer with will_init flags, while the other is the original one with image layers.

---------

Signed-off-by: Alex Chi Z <[email protected]>

close #9552, close #8920, part of #9114

## Summary of changes

* Drop keys not belonging to this shard during gc-compaction to avoid constructing history that might have been truncated during shard compaction.
* Run gc-compaction at the end of the shard compaction test.

---------

Signed-off-by: Alex Chi Z <[email protected]>

)

part of #9114

We want to be able to run partial gc-compaction in tests. In the future, we can also expand this functionality to legacy compaction, so that we can trigger compaction for a specific key range.

## Summary of changes

* Support passing the compaction key range through pageserver routes.
* Refactor the input parameters of compaction-related functions to take the new `CompactOptions`.
* Add tests for partial compaction. Note that the test may or may not trigger compaction, depending on the GC horizon. We need to improve the test case to ensure things always get below the gc_horizon so that gc-compaction can be triggered.

---------

Signed-off-by: Alex Chi Z <[email protected]>

## Problem

part of #9114

gc-compaction can take a long time. This patch adds support for scheduling a gc-compaction job. The compaction loop will first handle L0->L1 compaction, and then gc-compaction. The scheduled jobs are stored in a non-persistent queue within the tenant structure.

This will be the building block for the partial compaction trigger -- if the system determines that we need to do a gc-compaction, it will partition the keyspace and schedule several jobs. Each of these jobs will run for a short amount of time (i.e., 1 min). L0 compaction will be prioritized over gc-compaction.

## Summary of changes

* Add a compaction scheduler in the tenant.
* Run scheduled compaction in integration tests.
* Change the manual compaction API to allow scheduling a compaction instead of running it immediately.
* Add an LSN upper bound as a gc-compaction parameter. If we schedule partial compactions, gc_cutoff might move across different runs. Therefore, we need to pass a pre-determined gc_cutoff beforehand. (TODO: support an LSN lower bound so that we can compact an arbitrary "rectangle" in the layer map)
* Refactor the gc_compaction internal interface.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
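The prioritisation described above (L0 compaction first, then queued gc-compaction jobs with a pre-determined cutoff) can be sketched like this. All names here (`Work`, `GcJob`, `CompactionScheduler`, `next_work`) are hypothetical and the queue is a plain in-memory `VecDeque`, matching only the "non-persistent queue" idea, not the real tenant structure.

```rust
use std::collections::VecDeque;

/// Hypothetical gc-compaction job with a key range and a fixed LSN upper bound.
#[derive(Debug, PartialEq, Clone)]
struct GcJob {
    key_start: u64,
    key_end: u64,
    /// Pre-determined cutoff, so that all partial jobs agree on it even if
    /// the live gc_cutoff moves between runs.
    gc_cutoff_lsn: u64,
}

/// What one iteration of the compaction loop should do next.
#[derive(Debug, PartialEq)]
enum Work {
    L0Compaction,
    GcCompaction(GcJob),
    Idle,
}

/// Non-persistent, in-memory job queue.
#[derive(Default)]
struct CompactionScheduler {
    queue: VecDeque<GcJob>,
}

impl CompactionScheduler {
    fn schedule(&mut self, job: GcJob) {
        self.queue.push_back(job);
    }

    /// L0 compaction always wins; queued gc-compaction jobs run only when
    /// no L0 work is pending, and they stay queued otherwise.
    fn next_work(&mut self, l0_pending: bool) -> Work {
        if l0_pending {
            return Work::L0Compaction;
        }
        match self.queue.pop_front() {
            Some(job) => Work::GcCompaction(job),
            None => Work::Idle,
        }
    }
}
```

Because the queue lives only in memory, scheduled jobs in this sketch are lost on restart, which is consistent with the queue being described as non-persistent.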

## Problem

part of #9114, stacked PR over #9809

The compaction scheduler now schedules partial compaction jobs.

## Summary of changes

* Add the compaction job splitter based on size.
* Schedule subcompactions using the compaction scheduler.
* Test the subcompaction scheduler in the smoke regress test.
* Temporarily disable layer map checks.

---------

Signed-off-by: Alex Chi Z <[email protected]>
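A size-based job splitter could work along these lines. This is a minimal sketch, not the actual splitter: `split_jobs` and `max_job_bytes` are hypothetical names, keys are plain `u64` values, and the input is assumed to be a sorted, contiguous list of key partitions with an estimated size for each.

```rust
use std::ops::Range;

/// Greedily merges consecutive key partitions into jobs whose estimated
/// size stays within `max_job_bytes`.
///
/// Assumptions: `partitions` is sorted and contiguous, and each entry is
/// (key range, estimated bytes). A single partition larger than the limit
/// becomes its own job.
fn split_jobs(partitions: &[(Range<u64>, u64)], max_job_bytes: u64) -> Vec<Range<u64>> {
    let mut jobs = Vec::new();
    // (start key, end key, accumulated bytes) of the job being built.
    let mut current: Option<(u64, u64, u64)> = None;
    for (range, size) in partitions {
        match current {
            // The partition still fits: extend the current job.
            Some((start, _end, bytes)) if bytes + size <= max_job_bytes => {
                current = Some((start, range.end, bytes + size));
            }
            // It does not fit: close the current job and start a new one.
            Some((start, end, _bytes)) => {
                jobs.push(start..end);
                current = Some((range.start, range.end, *size));
            }
            None => current = Some((range.start, range.end, *size)),
        }
    }
    if let Some((start, end, _bytes)) = current {
        jobs.push(start..end);
    }
    jobs
}
```

Each resulting key range would then be handed to the compaction scheduler as one subcompaction.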

problame · 2024-12-09T14:56:38Z

This week:

testing on staging
figure out design for automatic trigger

## Problem

close #10049, close #10030, close #8861

part of #9114

The legacy gc process calls `get_latest_gc_cutoff`, which uses an Rcu different from the gc_info struct. In the gc_compaction_smoke test case, the "latest" cutoff could be lower than the one in the gc_info struct, causing gc-compaction to collect data that could still be accessed via `latest_gc_cutoff`. Technically speaking, there's nothing wrong with gc-compaction using gc_info without considering latest_gc_cutoff, because gc_info is the source of truth. But anyways, let's fix it.

## Summary of changes

* gc-compaction uses `latest_gc_cutoff` instead of gc_info to determine the gc horizon.
* If a gc-compaction is scheduled via the tenant compaction iteration, it will take the gc_block lock to avoid racing with functionalities like detach ancestor. (If it's triggered via the manual compaction API without scheduling, it won't take the lock.)

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>

## Problem

part of #9114, stacked PR over #9897, partially refactored to help with #10031

## Summary of changes

* gc-compaction takes an `above_lsn` parameter. We only compact the layers above this LSN, and all data below the LSN are treated as if they are on the ancestor branch.
* Refactored gc-compaction to take a `GcCompactJob` that describes the rectangular range to be compacted.
* Added a unit test for this case.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>

skyzh · 2024-12-16T14:30:44Z

This week:

full run in staging after resolving all bugs
automatic trigger implementation

…tests (#10164)

## Problem

part of #9114

In #10127 we fixed the race, but we didn't add the errors to the allowlist.

## Summary of changes

* Allow repartition errors in the gc-compaction smoke test.

I think it might be worth refactoring the code in the future to allow multiple threads to get a copy of the repartition status (i.e., using Rcu).

Signed-off-by: Alex Chi Z <[email protected]>

## Problem

We cannot get the size of the compaction queue and access the info.

Part of #9114

## Summary of changes

* Add an API endpoint to get the compaction queue.
* The gc_compaction test case now waits until the compaction finishes.

---------

Signed-off-by: Alex Chi Z <[email protected]>

## Problem

In #8103 we changed the test case to have more test coverage of gc_compaction. Now that we have `test_gc_compaction_smoke`, we can revert this test case to serve its original purpose and revert the parameter changes.

part of #9114

## Summary of changes

* Revert pitr_interval from 60s to 10s.
* Assert the physical/logical size ratio in the benchmark.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>

…#10044)

## Problem

In #9897 we temporarily disabled the layer valid check because the current one only considers the end result of all compaction algorithms, but partial gc-compaction would temporarily produce an "invalid" layer map.

part of #9114

## Summary of changes

Allow LSN splits to overlap in the slow path check. Currently, the valid check is only used in the storage scrubber (a background job) and during gc-compaction (without taking the layer lock). Therefore, it's fine for such checks to be a little less efficient in exchange for being more accurate.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>

…10209)

## Problem

close #10208

part of #9114

## Summary of changes

* Ensure remote `latest_gc_cutoff` is up-to-date before removing any files for gc-compaction.

Signed-off-by: Alex Chi Z <[email protected]>

skyzh · 2024-12-20T21:34:27Z

This week: did a staging run across all root timelines >= 1GB. I spent some time fixing bugs, so I didn't get to all pageservers and only ran it on pageserver-27.

After the holiday: finish up the automatic trigger work. Find a way to run experiments in the background without keeping an HTTP connection open. (May need to add new APIs)

skyzh added the c/storage/pageserver (Component: storage: pageserver) and t/Epic (Issue type: Epic) labels Sep 23, 2024

skyzh self-assigned this Sep 23, 2024

This was referenced Sep 23, 2024

GC parts of layers that are no longer needed (Legacy-Enhanced Compaction) #8002

Closed

feat(pageserver): support partial gc-compaction for lowest retain lsn #9134

Merged

skyzh mentioned this issue Oct 2, 2024

feat(pageserver): do space check before gc-compaction #9250

Merged


skyzh mentioned this issue Oct 20, 2024

Epic: 2024Q2/Q3 compaction work #8001

Closed

skyzh mentioned this issue Oct 23, 2024

refactor(pageserver): refactor split writers using batch layer writer #9493

Merged


skyzh mentioned this issue Oct 28, 2024

pageserver: failed gc-compaction runs in staging #9552

Closed

skyzh mentioned this issue Nov 1, 2024

feat(pageserver): support partial gc-compaction for delta layers #9611

Merged


This was referenced Nov 11, 2024

feat(pageserver): support key range for manual compaction trigger #9723

Merged

test(pageserver): add gc-compaction tests with delta will_init #9724

Merged

skyzh mentioned this issue Nov 14, 2024

feat(pageserver): drop disposable keys during gc-compaction #9765

Merged

This was referenced Nov 19, 2024

feat(pageserver): support schedule gc-compaction #9809

Merged

feat(pageserver): gc-compaction split job and partial scheduler #9897

Merged

feat(pageserver): gc-compaction split over LSN #9900

Merged

This was referenced Dec 6, 2024

fix(pageserver): consider partial compaction layer map in layer check #10044

Merged

fix(pageserver): fix gc-compaction racing with legacy gc #10052

Merged

This was referenced Dec 10, 2024

pageserver: gc-compaction prepare phase #10077

Open

refactor(test): tighten up test_gc_feedback #10126

Merged

skyzh mentioned this issue Dec 12, 2024

feat(pageserver): validate data integrity during gc-compaction #10131

Open

This was referenced Dec 16, 2024

fix(pageserver): allow repartition errors during gc-compaction smoke tests #10164

Merged

feat(pageserver): add compact queue http endpoint #10173

Merged

skyzh mentioned this issue Dec 19, 2024

fix(pageserver): update remote latest_gc_cutoff after gc-compaction #10209

Merged

skyzh mentioned this issue Dec 20, 2024

feat(pageserver): add automatic trigger for gc-compaction #10221

Draft
