-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(pageserver): add metrics for aux file size #7623
Conversation
3048 tests run: 2915 passed, 0 failed, 133 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
24ae9b2 at 2024-05-13T15:41:49.881Z :recycle: |
ce644e9
to
24d5d8e
Compare
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens on a pageserver restart? I don't see any persistence of this value, so it's lost whenever we restart the pageserver, until the next basebackup happens.
We should add persistence somehow.
We already have a non-persisted value via #6736 but it's only of limited use as we regularly restart pageservers.
I don’t plan to persist it because it’s an estimator. If we configure an alert for aux file size and some tenant exceeds the limit, it will fire before the page server restart so it does not matter whether it’s persisted or not? |
Signed-off-by: Alex Chi Z <[email protected]>
... or we can have logical size calculation to recompute aux file size in later patches |
Signed-off-by: Alex Chi Z <[email protected]>
The huge aux file counts we have seen in the past were not always from a few days but often the results of buildups over weeks, or maybe even months. So accurate values are kinda important. Again, as I've said, we already have a metric that only runs during the lifetime of the process. It's better than nothing but still not what we want. If we computed it during logical size calculation, it would be equivalent to preparing the basebackup, so kinda expensive. Ideally, reading the prior size would be an O(1) operation. In other words, it should be materialized somewhere. |
Then I'd like to wait on #7663 and persist it in timeline metadata / index_part.json |
Per discussion, we can merge this pull request first and we will have at least some metrics, and add this to initial logical size calculation in the future. |
ref #7443 ## Summary of changes This pull request adds a size estimator for aux files. Each timeline stores a cached `isize` for the estimated total size of aux files. It gets reset on basebackup, and gets updated for each aux file modification. TODO: print a warning when it exceeds the size. The size metrics is not accurate. Race between `on_basebackup` and other functions could create a negative basebackup size, but the chance is rare. Anyways, this does not impose any extra I/Os to the storage as everything is computed in-memory. The aux files are only stored on shard 0. As basebackups are only generated on shard 0, only shard 0 will report this metrics. --------- Signed-off-by: Alex Chi Z <[email protected]>
Problem
ref #7443
Summary of changes
This pull request adds a size estimator for aux files. Each timeline stores a cached
isize
for the estimated total size of aux files. It gets reset on basebackup, and gets updated for each aux file modification. TODO: print a warning when it exceeds the size.The size metrics is not accurate. Race between
on_basebackup
and other functions could create a negative basebackup size, but the chance is rare. Anyways, this does not impose any extra I/Os to the storage as everything is computed in-memory.The aux files are only stored on shard 0. As basebackups are only generated on shard 0, only shard 0 will report this metrics.
Checklist before requesting a review
Checklist before merging