
pageserver: add L0 compute backpressure (replace flush backpressure) #10095

Closed
Tracked by #9624 ...
erikgrinaker opened this issue Dec 11, 2024 · 10 comments

erikgrinaker commented Dec 11, 2024

In #8550, we began backpressuring layer file flushes by waiting for S3 uploads after each flush. However, this has a few issues:

  • It prevents upload parallelism.
  • It prevents flush and upload pipelining.
  • It slows down ingestion even when there is no need to backpressure.
  • It doesn't directly backpressure WAL ingestion, and will build up in-memory layers.

Recall the existing compute backpressure knobs:

  • max_replication_write_lag: 500 MB (based on last_received_lsn).
  • max_replication_flush_lag: 10 GB (based on disk_consistent_lsn).
  • max_replication_apply_lag: disabled (based on remote_consistent_lsn).

The flush backpressure, as it's currently implemented, delays disk_consistent_lsn. This means that we can have a pile-up of 10 GB of in-memory layers.
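
For concreteness, here is a minimal sketch of the throttling decision these knobs drive. It is not the actual compute-side code; the struct and field names are illustrative, and only the LSN and knob names above are taken from the system.

```rust
/// Sketch only: not the compute-side implementation, just the shape of the check.
struct PageserverFeedback {
    last_received_lsn: u64,     // WAL received by the pageserver
    disk_consistent_lsn: u64,   // WAL flushed to layer files on local disk
    remote_consistent_lsn: u64, // WAL uploaded to remote storage (S3)
}

struct BackpressureSettings {
    max_replication_write_lag: u64,         // e.g. 500 MB
    max_replication_flush_lag: u64,         // e.g. 10 GB
    max_replication_apply_lag: Option<u64>, // currently disabled
}

/// Returns true if the compute should throttle WAL writes.
fn should_throttle(insert_lsn: u64, fb: &PageserverFeedback, cfg: &BackpressureSettings) -> bool {
    let write_lag = insert_lsn.saturating_sub(fb.last_received_lsn);
    let flush_lag = insert_lsn.saturating_sub(fb.disk_consistent_lsn);
    let apply_lag = insert_lsn.saturating_sub(fb.remote_consistent_lsn);

    write_lag > cfg.max_replication_write_lag
        || flush_lag > cfg.max_replication_flush_lag
        || cfg.max_replication_apply_lag.map_or(false, |max| apply_lag > max)
}
```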

This backpressure was motivated by avoiding buildup of L0 files and the associated read-amplification, but it only does so indirectly. Instead, we should backpressure directly based on L0 size and the compaction backlog. Specifically, we should add:

  • max_replication_compact_lag: X (based on remote_compacted_lsn).

We should also find appropriate values for flush_lag and/or compact_lag -- 10 GB seems way too high. But it must be high enough that the pageserver will actually run compaction before we backpressure.
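
As a rough sketch of the proposal (assuming the pageserver feedback gains a remote_compacted_lsn field; names here are hypothetical), the check above would grow a fourth term:

```rust
/// Hypothetical extension of the sketch above: throttle when the WAL that has
/// not yet been compacted into L1 (i.e. the compaction debt) exceeds the knob.
fn compact_lag_exceeded(
    insert_lsn: u64,
    remote_compacted_lsn: u64,        // assumed new feedback field
    max_replication_compact_lag: u64, // proposed knob; value still to be tuned
) -> bool {
    insert_lsn.saturating_sub(remote_compacted_lsn) > max_replication_compact_lag
}
```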

Related to #5897.
Related to #5415.
Related to #8390.

erikgrinaker added the a/performance and c/storage/pageserver labels Dec 11, 2024
erikgrinaker self-assigned this Dec 11, 2024

skyzh commented Dec 11, 2024

maybe related: #8390

VladLazar commented:

Is the suggestion here to add a new backpressure knob on the compute? Why not use the existing disabled one, max_replication_apply_lag?

erikgrinaker commented Dec 11, 2024

> Why not use the existing disabled one, max_replication_apply_lag?

Because that only tracks LSNs uploaded to S3, including L0 layers. So we could keep piling up L0 layers without it affecting max_replication_apply_lag. We need to backpressure based on compaction, not uploads.

Of course, we could get rid of max_replication_apply_lag if we want to. We could also repurpose it, but the naming doesn't seem accurate.

VladLazar commented:

> We could also repurpose it

That's what I meant. Repurposing is a quicker turnaround since it doesn't involve a compute release. Some computes are really long-running anyway.

erikgrinaker commented Dec 11, 2024

Yeah, worth considering. Will still require a compute release though, since the compute does the mapping of remote_consistent_lsn to max_replication_apply_lag. Unless we want to repurpose remote_consistent_lsn too -- I'll have to check what else we use it for.

erikgrinaker commented Dec 12, 2024

I have a prototype of disk_compacted_lsn in #10113. This contains the last LSN that is known to be compacted to L1 on local disk on each shard.
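
The prototype in #10113 is authoritative; as a rough mental model only (types and field names invented here), a shard's compacted LSN could be derived from its layer map along these lines:

```rust
/// Sketch only, not the #10113 implementation: approximate the LSN up to which
/// all delta layers on this shard have been compacted into L1.
struct L0Delta {
    lsn_start: u64, // low end of the LSN range covered by this L0 delta layer
}

fn disk_compacted_lsn(disk_consistent_lsn: u64, l0_deltas: &[L0Delta]) -> u64 {
    // Everything below the oldest remaining L0 delta has been compacted away;
    // with no L0 deltas left, we're compacted up to the flush point.
    l0_deltas
        .iter()
        .map(|d| d.lsn_start)
        .min()
        .unwrap_or(disk_consistent_lsn)
}
```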

I think we need both disk_compacted_lsn and remote_consistent_lsn -- the former tracks compaction debt (and thus approximates read amplification), while the latter tracks upload debt, and I think it's reasonable to track and handle these separately, since we want to backpressure on both.

The compaction threshold is based on the number of L0 files on the local shard (default 10). I think it makes sense to use the compacted LSN rather than the number of L0 files to backpressure writes, because the former tracks how much work we have to do to catch up. However, it can be difficult to tune these values together, particularly with sharding.

Consider e.g. a shard that sees few writes: it may only have 2 L0 files, and thus won't trigger compaction, but as the LSN head keeps advancing on other shards, the unloaded shard's compacted LSN falls further behind and may trigger backpressure even though it never chose to compact. It's possible that we get away with this by simply setting a high enough compaction backpressure threshold and relying on writes being distributed evenly across shards. I think we have the same problem with remote_consistent_lsn. I'll give it some thought.
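
To make that concern concrete with invented numbers (not from the issue): assume WAL is interleaved across shards, so every shard's view of the current LSN advances with total writes, while a cold shard's compacted LSN stays put because it never crosses the 10-file compaction trigger.

```rust
fn main() {
    // Hypothetical figures for illustration only.
    let current_lsn: u64 = 30 << 30; // 30 GB of WAL written across the tenant
    let cold_shard_compacted_lsn: u64 = 5 << 30; // cold shard: 2 L0s, never compacts further
    let max_replication_compact_lag: u64 = 20 << 30; // some chosen threshold

    let lag = current_lsn - cold_shard_compacted_lsn; // 25 GB
    // The cold shard trips the LSN-based backpressure even though it has
    // almost no compaction work outstanding.
    assert!(lag > max_replication_compact_lag);
    println!("compact lag on cold shard: {} GB", lag >> 30);
}
```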

erikgrinaker commented:

Given sharding, I think it's better to look at the actual compaction debt of each shard (L0 delta sizes and possibly in-memory and frozen layers), and backpressure based on the maximum shard debt. I think we should do the same for upload debt.

These computations probably get involved enough that we should try to move them to the Safekeeper/Pageserver side rather than do them on the compute. I wrote up an issue for a single, variable backpressure signal: #10116.
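
A sketch of that alternative, with invented types (this is not from the issue or any PR): compute a debt figure per shard and backpressure on the worst one.

```rust
/// Sketch only: approximate each shard's debt as L0 delta bytes awaiting
/// compaction plus in-memory/frozen layer bytes awaiting flush, then take the
/// maximum across shards as the backpressure signal.
struct ShardDebt {
    l0_delta_bytes: u64,  // L0 delta layers not yet compacted into L1
    in_memory_bytes: u64, // open + frozen in-memory layers not yet flushed
}

fn max_shard_debt(shards: &[ShardDebt]) -> u64 {
    shards
        .iter()
        .map(|s| s.l0_delta_bytes + s.in_memory_bytes)
        .max()
        .unwrap_or(0)
}
```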

erikgrinaker commented Dec 12, 2024

For now, we'll just remove the flush backpressure, and revisit improved Pageserver backpressure later. I've written up a proposal in #8390.

github-merge-queue bot pushed a commit that referenced this issue Dec 15, 2024
## Problem

In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via
`disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read
amplification.

An alternative solution to these problems is proposed in #8390.

In the meantime, we revert the change to reduce the impact on ingest
throughput. This does reintroduce some risk of unbounded upload/compaction
buildup. Until #8390 is addressed, this can be mitigated in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which
will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more
Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.
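
For illustration only, here is a schematic of the change with stand-in types (the real code is the pageserver's flush loop and remote client; these names and signatures are not the actual API):

```rust
struct Layer;
struct RemoteClient;

impl RemoteClient {
    fn schedule_layer_upload(&self, _layer: Layer) {
        // Enqueue the upload; returns immediately.
    }
    async fn wait_completion(&self) {
        // Block until the upload queue has drained.
    }
}

async fn flush_one_layer() -> Layer {
    Layer
}

// Before (#8550): each flushed layer waited for its upload, serializing flush
// and upload and slowing ingest even when no backpressure was needed.
async fn flush_loop_old(remote: &RemoteClient) {
    loop {
        let layer = flush_one_layer().await;
        remote.schedule_layer_upload(layer);
        remote.wait_completion().await; // removed by this change
    }
}

// After: uploads are only scheduled, so flushing and uploading can pipeline
// and uploads can run in parallel.
async fn flush_loop_new(remote: &RemoteClient) {
    loop {
        let layer = flush_one_layer().await;
        remote.schedule_layer_upload(layer);
    }
}
```
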
erikgrinaker commented:

The flush backpressure was removed in #10135. Compaction backpressure is tracked by #8390.

erikgrinaker closed this as not planned Dec 15, 2024