pageserver: add L0 compute backpressure (replace flush backpressure) #10095
Comments
maybe related: #8390
Is the suggestion here to add a new backpressure knob on the compute? Why not use the existing disabled one, `max_replication_apply_lag`?
Because that only tracks LSNs uploaded to S3, including L0 layers. So we could keep piling up L0 layers without it affecting `remote_consistent_lsn`. Of course, we could repurpose the existing knob instead of adding a new one.
That's what I meant. Re-purposing is a quicker turnaround since it doesn't involve a compute release. Some computes are really long-running anyway.
Yeah, worth considering. Will still require a compute release though, since the compute does the mapping of `remote_consistent_lsn` to `max_replication_apply_lag`. Unless we want to repurpose `remote_consistent_lsn` too -- I'll have to check what else we use it for.
I have a prototype of this. I think we need both the upload-based and the compaction-based backpressure.

The compaction threshold is based on the number of L0 files on the local shard (default 10). I think it makes sense to use the compacted LSN rather than the number of L0 files to backpressure writes, because the former tracks how much work we have to do to catch up. However, it can be difficult to tune these values together, particularly with sharding. Consider e.g. a shard that sees few writes: it may only have 2 L0 files, and thus won't trigger compaction, but as the LSN head keeps advancing on other shards this may cause backpressure on the unloaded shard without it choosing to compact. It's possible that we get away with this by simply setting a high enough compaction backpressure setting and relying on writes getting distributed across shards. I think we have the same problem with the upload-based backpressure today.
Given sharding, I think it's better to look at the actual compaction debt of each shard (L0 delta sizes and possibly in-memory and frozen layers), and backpressure based on the maximum shard debt. I think we should do the same for upload debt. These computations probably get involved enough that we should try to move them to the Safekeeper/Pageserver side rather than do them on the compute. I wrote up an issue for a single, variable backpressure signal: #10116.
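To illustrate the idea, here is a minimal Rust sketch, assuming hypothetical types and field names (none of these are the actual Pageserver APIs): each shard's debt is the sum of its L0 delta sizes plus in-memory and frozen layer sizes, and the single backpressure signal is the maximum debt across shards.

```rust
/// Hypothetical per-shard layer accounting; illustrative names only, not the
/// real Pageserver types.
struct ShardLayerStats {
    l0_delta_bytes: u64,   // total size of uncompacted L0 delta layers
    in_memory_bytes: u64,  // open in-memory layer
    frozen_bytes: u64,     // frozen (not yet flushed) layers
}

impl ShardLayerStats {
    /// Compaction/flush debt for a single shard.
    fn debt_bytes(&self) -> u64 {
        self.l0_delta_bytes + self.in_memory_bytes + self.frozen_bytes
    }
}

/// A single backpressure signal for the tenant: the maximum debt across shards,
/// since the slowest shard is the one the compute must not outrun.
fn tenant_backpressure_debt(shards: &[ShardLayerStats]) -> u64 {
    shards.iter().map(|s| s.debt_bytes()).max().unwrap_or(0)
}

/// Throttle ingest when the worst shard's debt exceeds a configured limit.
fn should_throttle(shards: &[ShardLayerStats], max_debt_bytes: u64) -> bool {
    tenant_backpressure_debt(shards) > max_debt_bytes
}
```

One advantage of taking the maximum rather than the sum is that a single overloaded shard cannot hide behind many idle ones.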
For now, we'll just remove the flush backpressure, and revisit improved Pageserver backpressure later. I've written up a proposal in #8390.
## Problem

In #8550, we made the flush loop wait for uploads after every layer. This was to avoid unbounded buildup of uploads, and to reduce compaction debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via `disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read amplification.

An alternative solution to these problems is proposed in #8390. In the meanwhile, we revert the change to reduce the impact on ingest throughput. This does reintroduce some risk of unbounded upload/compaction buildup. Until #8390, this can be addressed in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.
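For illustration only, a rough Rust sketch of the shape of this change; every type and function here is a stand-in, not the real flush-loop code. The point is that the per-layer wait on the upload queue is what serialized flush and upload, and removing it lets them pipeline.

```rust
// Illustrative only: stand-in types, not the actual Pageserver APIs.
struct FrozenLayer;
struct Layer;
struct Timeline;

impl Timeline {
    async fn next_frozen_layer(&self) -> Option<FrozenLayer> { None }
    async fn flush_to_disk(&self, _frozen: FrozenLayer) -> Layer { Layer }
    fn schedule_upload(&self, _layer: Layer) {}
    #[allow(dead_code)]
    async fn wait_upload_queue(&self) {}
}

async fn flush_loop(timeline: &Timeline) {
    while let Some(frozen) = timeline.next_frozen_layer().await {
        let layer = timeline.flush_to_disk(frozen).await; // write the L0 layer file
        timeline.schedule_upload(layer);                  // enqueue the S3 upload (async)

        // Removed by this change: blocking on the upload queue after every layer.
        // That wait prevented upload parallelism and flush/upload pipelining, and
        // slowed ingest even when there was no upload backlog to worry about.
        // timeline.wait_upload_queue().await;
    }
}
```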
In #8550, we began backpressuring layer file flushes by waiting for S3 uploads after each flush. However, this has a few issues (see the problem list in the PR description above).
Recall the existing compute backpressure knobs:

* `max_replication_write_lag`: 500 MB (based on `last_received_lsn`).
* `max_replication_flush_lag`: 10 GB (based on `disk_consistent_lsn`).
* `max_replication_apply_lag`: disabled (based on `remote_consistent_lsn`).

The flush backpressure, as it's currently implemented, delays `disk_consistent_lsn`. This means that we can have a pile-up of 10 GB of in-memory layers.

This backpressure was motivated by avoiding buildup of L0 files and the associated read amplification, but it only does so indirectly. Instead, we should backpressure directly based on L0 size and the compaction backlog. Specifically, we should add:
* `max_replication_compact_lag`: X (based on `remote_compacted_lsn`).

We should also find appropriate values for `flush_lag` and/or `compact_lag` -- 10 GB seems way too high. But it must be high enough that the pageserver will actually run compaction before we backpressure.

Related to #5897.
Related to #5415.
Related to #8390.
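To make the proposal concrete, here is a hedged sketch (in Rust, with made-up names, rather than the compute's actual implementation) of how the proposed knob could sit alongside the existing ones: each lag is the distance from the current write position to the corresponding LSN reported back to the compute, and ingest is throttled when any enabled limit is exceeded.

```rust
/// Illustrative sketch only. LSNs are modeled as plain byte offsets, and all
/// type and field names are invented for this example.
type Lsn = u64;

/// Feedback the compute receives about Pageserver progress (illustrative fields).
struct PageserverFeedback {
    last_received_lsn: Lsn,
    disk_consistent_lsn: Lsn,
    remote_consistent_lsn: Lsn,
    /// Proposed: highest LSN fully covered by compacted (non-L0) layers.
    remote_compacted_lsn: Lsn,
}

/// Backpressure limits in bytes; 0 means disabled, mirroring the existing defaults.
struct BackpressureConfig {
    max_replication_write_lag: u64,   // 500 MB today
    max_replication_flush_lag: u64,   // 10 GB today
    max_replication_apply_lag: u64,   // disabled today
    max_replication_compact_lag: u64, // proposed, value "X" to be determined
}

/// Throttle WAL ingest when any enabled lag limit is exceeded.
fn should_throttle(current_lsn: Lsn, fb: &PageserverFeedback, cfg: &BackpressureConfig) -> bool {
    let exceeds = |lsn: Lsn, limit: u64| limit > 0 && current_lsn.saturating_sub(lsn) > limit;
    exceeds(fb.last_received_lsn, cfg.max_replication_write_lag)
        || exceeds(fb.disk_consistent_lsn, cfg.max_replication_flush_lag)
        || exceeds(fb.remote_consistent_lsn, cfg.max_replication_apply_lag)
        || exceeds(fb.remote_compacted_lsn, cfg.max_replication_compact_lag)
}
```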