
pageserver: add L0 compute backpressure (replace flush backpressure) #10095

Closed
Tracked by #9624 ...
erikgrinaker opened this issue Dec 11, 2024 · 10 comments

erikgrinaker commented Dec 11, 2024

In #8550, we began backpressuring layer file flushes by waiting for S3 uploads after each flush. However, this has a few issues:

  • It prevents upload parallelism.
  • It prevents flush and upload pipelining.
  • It slows down ingestion even when there is no need to backpressure.
  • It doesn't directly backpressure WAL ingestion, and will build up in-memory layers.

Recall the existing compute backpressure knobs:

  • max_replication_write_lag: 500 MB (based on last_received_lsn).
  • max_replication_flush_lag: 10 GB (based on disk_consistent_lsn).
  • max_replication_apply_lag: disabled (based on remote_consistent_lsn).

The flush backpressure, as it's currently implemented, delays disk_consistent_lsn. This means that we can have a pile-up of 10 GB of in-memory layers.
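
For concreteness, here is a minimal sketch of the throttling decision these knobs drive. It is not the actual compute-side code; the struct and field names are illustrative, and only the LSN and knob names above are taken from the system.

```rust
/// Sketch only: not the compute-side implementation, just the shape of the check.
struct PageserverFeedback {
    last_received_lsn: u64,     // WAL received by the pageserver
    disk_consistent_lsn: u64,   // WAL flushed to layer files on local disk
    remote_consistent_lsn: u64, // WAL uploaded to remote storage (S3)
}

struct BackpressureSettings {
    max_replication_write_lag: u64,         // e.g. 500 MB
    max_replication_flush_lag: u64,         // e.g. 10 GB
    max_replication_apply_lag: Option<u64>, // currently disabled
}

/// Returns true if the compute should throttle WAL writes.
fn should_throttle(insert_lsn: u64, fb: &PageserverFeedback, cfg: &BackpressureSettings) -> bool {
    let write_lag = insert_lsn.saturating_sub(fb.last_received_lsn);
    let flush_lag = insert_lsn.saturating_sub(fb.disk_consistent_lsn);
    let apply_lag = insert_lsn.saturating_sub(fb.remote_consistent_lsn);

    write_lag > cfg.max_replication_write_lag
        || flush_lag > cfg.max_replication_flush_lag
        || cfg.max_replication_apply_lag.map_or(false, |max| apply_lag > max)
}
```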

This backpressure was motivated by avoiding buildup of L0 files and the associated read-amplification, but it only does so indirectly. Instead, we should backpressure directly based on L0 size and the compaction backlog. Specifically, we should add:

  • max_replication_compact_lag: X (based on remote_compacted_lsn).

We should also find appropriate values for flush_lag and/or compact_lag -- 10 GB seems way too high. But it must be high enough that the pageserver will actually run compaction before we backpressure.
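
As a rough sketch of the proposal (assuming the pageserver feedback gains a remote_compacted_lsn field; names here are hypothetical), the check above would grow a fourth term:

```rust
/// Hypothetical extension of the sketch above: throttle when the WAL that has
/// not yet been compacted into L1 (i.e. the compaction debt) exceeds the knob.
fn compact_lag_exceeded(
    insert_lsn: u64,
    remote_compacted_lsn: u64,        // assumed new feedback field
    max_replication_compact_lag: u64, // proposed knob; value still to be tuned
) -> bool {
    insert_lsn.saturating_sub(remote_compacted_lsn) > max_replication_compact_lag
}
```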

Related to #5897.
Related to #5415.
Related to #8390.

erikgrinaker added the a/performance and c/storage/pageserver labels Dec 11, 2024
erikgrinaker self-assigned this Dec 11, 2024

skyzh commented Dec 11, 2024

maybe related: #8390

VladLazar commented:

Is the suggestion here to add a new backpressure knob on the compute? Why not use the existing disabled one, max_replication_apply_lag?

erikgrinaker commented Dec 11, 2024

> Why not use the existing disabled one, max_replication_apply_lag?

Because that only tracks LSNs uploaded to S3, including L0 layers. So we could keep piling up L0 layers without it affecting max_replication_apply_lag. We need to backpressure based on compaction, not uploads.

Of course, we could get rid of max_replication_apply_lag if we want to. We could also repurpose it, but the naming doesn't seem accurate.

VladLazar commented:

> We could also repurpose it

That's what I meant. Repurposing is a quicker turnaround since it doesn't involve a compute release. Some computes are really long-running anyway.

erikgrinaker commented Dec 11, 2024

Yeah, worth considering. Will still require a compute release though, since the compute does the mapping of remote_consistent_lsn to max_replication_apply_lag. Unless we want to repurpose remote_consistent_lsn too -- I'll have to check what else we use it for.

erikgrinaker commented Dec 12, 2024

I have a prototype of disk_compacted_lsn in #10113. This contains the last LSN that is known to be compacted to L1 on local disk on each shard.
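
The prototype in #10113 is authoritative; as a rough mental model only (types and field names invented here), a shard's compacted LSN could be derived from its layer map along these lines:

```rust
/// Sketch only, not the #10113 implementation: approximate the LSN up to which
/// all delta layers on this shard have been compacted into L1.
struct L0Delta {
    lsn_start: u64, // low end of the LSN range covered by this L0 delta layer
}

fn disk_compacted_lsn(disk_consistent_lsn: u64, l0_deltas: &[L0Delta]) -> u64 {
    // Everything below the oldest remaining L0 delta has been compacted away;
    // with no L0 deltas left, we're compacted up to the flush point.
    l0_deltas
        .iter()
        .map(|d| d.lsn_start)
        .min()
        .unwrap_or(disk_consistent_lsn)
}
```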

I think we need both disk_compacted_lsn and remote_consistent_lsn -- the former tracks compaction debt (and thus approximates read amplification), while the latter tracks upload debt, and I think it's reasonable to track and handle these separately, since we want to backpressure on both.

The compaction threshold is based on the number of L0 files on the local shard (default 10). I think it makes sense to use the compacted LSN rather than the number of L0 files to backpressure writes, because the former tracks how much work we have to do to catch up. However, it can be difficult to tune these values together, particularly with sharding.

Consider e.g. a shard that sees few writes: it may only have 2 L0 files, and thus won't trigger compaction, but as the LSN head keeps advancing on other shards, the unloaded shard's compacted LSN falls further behind and may trigger backpressure even though it never chose to compact. It's possible that we get away with this by simply setting a high enough compaction backpressure threshold and relying on writes being distributed evenly across shards. I think we have the same problem with remote_consistent_lsn. I'll give it some thought.
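
To make that concern concrete with invented numbers (not from the issue): assume WAL is interleaved across shards, so every shard's view of the current LSN advances with total writes, while a cold shard's compacted LSN stays put because it never crosses the 10-file compaction trigger.

```rust
fn main() {
    // Hypothetical figures for illustration only.
    let current_lsn: u64 = 30 << 30; // 30 GB of WAL written across the tenant
    let cold_shard_compacted_lsn: u64 = 5 << 30; // cold shard: 2 L0s, never compacts further
    let max_replication_compact_lag: u64 = 20 << 30; // some chosen threshold

    let lag = current_lsn - cold_shard_compacted_lsn; // 25 GB
    // The cold shard trips the LSN-based backpressure even though it has
    // almost no compaction work outstanding.
    assert!(lag > max_replication_compact_lag);
    println!("compact lag on cold shard: {} GB", lag >> 30);
}
```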

erikgrinaker commented:

Given sharding, I think it's better to look at the actual compaction debt of each shard (L0 delta sizes and possibly in-memory and frozen layers), and backpressure based on the maximum shard debt. I think we should do the same for upload debt.

These computations probably get involved enough that we should try to move them to the Safekeeper/Pageserver side rather than do them on the compute. I wrote up an issue for a single, variable backpressure signal: #10116.
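
A sketch of that alternative, with invented types (this is not from the issue or any PR): compute a debt figure per shard and backpressure on the worst one.

```rust
/// Sketch only: approximate each shard's debt as L0 delta bytes awaiting
/// compaction plus in-memory/frozen layer bytes awaiting flush, then take the
/// maximum across shards as the backpressure signal.
struct ShardDebt {
    l0_delta_bytes: u64,  // L0 delta layers not yet compacted into L1
    in_memory_bytes: u64, // open + frozen in-memory layers not yet flushed
}

fn max_shard_debt(shards: &[ShardDebt]) -> u64 {
    shards
        .iter()
        .map(|s| s.l0_delta_bytes + s.in_memory_bytes)
        .max()
        .unwrap_or(0)
}
```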

erikgrinaker commented Dec 12, 2024

For now, we'll just remove the flush backpressure, and revisit improved Pageserver backpressure later. I've written up a proposal in #8390.

github-merge-queue bot pushed a commit that referenced this issue Dec 15, 2024
## Problem

In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via
`disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read
amplification.

An alternative solution to these problems is proposed in #8390.

In the meantime, we revert the change to reduce the impact on ingest
throughput. This does reintroduce some risk of unbounded upload/compaction
buildup. Until #8390 is addressed, this can be mitigated in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which
will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more
Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.
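
For illustration only, here is a schematic of the change with stand-in types (the real code is the pageserver's flush loop and remote client; these names and signatures are not the actual API):

```rust
struct Layer;
struct RemoteClient;

impl RemoteClient {
    fn schedule_layer_upload(&self, _layer: Layer) {
        // Enqueue the upload; returns immediately.
    }
    async fn wait_completion(&self) {
        // Block until the upload queue has drained.
    }
}

async fn flush_one_layer() -> Layer {
    Layer
}

// Before (#8550): each flushed layer waited for its upload, serializing flush
// and upload and slowing ingest even when no backpressure was needed.
async fn flush_loop_old(remote: &RemoteClient) {
    loop {
        let layer = flush_one_layer().await;
        remote.schedule_layer_upload(layer);
        remote.wait_completion().await; // removed by this change
    }
}

// After: uploads are only scheduled, so flushing and uploading can pipeline
// and uploads can run in parallel.
async fn flush_loop_new(remote: &RemoteClient) {
    loop {
        let layer = flush_one_layer().await;
        remote.schedule_layer_upload(layer);
    }
}
```
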
erikgrinaker commented:

The flush backpressure was removed in #10135. Compaction backpressure is tracked by #8390.

erikgrinaker closed this as not planned Dec 15, 2024