feat(pageserver): add metrics for aux file size #7623

Merged: skyzh merged 5 commits into main from skyzh/aux-metrics on May 13, 2024

@skyzh skyzh commented May 6, 2024

Problem

ref #7443

Summary of changes

This pull request adds a size estimator for aux files. Each timeline stores a cached isize holding the estimated total size of its aux files. The value is reset on basebackup and updated on every aux file modification. TODO: print a warning when the estimate exceeds a size limit.

The size metric is not exact. A race between on_basebackup and the other update functions could produce a negative estimated size, but this is unlikely. In any case, the estimator adds no extra I/O to storage because everything is computed in memory.

Aux files are stored only on shard 0. Because basebackups are generated only on shard 0, only shard 0 reports this metric.
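
For readers who want a concrete picture of the estimator described above, here is a minimal sketch, not the code from this pull request. Only `on_basebackup` and the cached `isize` come from the description; the type name `AuxSizeEstimator`, the methods `on_add`, `on_remove`, `on_update` and `estimate`, and the choice to ignore updates until the first basebackup has run are all assumptions made for illustration.

```rust
use std::sync::Mutex;

/// Hypothetical in-memory estimator for the total aux file size of one timeline.
/// The names here are illustrative and are not taken from the pageserver code.
#[derive(Debug, Default)]
pub struct AuxSizeEstimator {
    // Assumption: `None` until the first basebackup supplies a total.
    // Signed, because racing updates can push the estimate below zero.
    size: Mutex<Option<isize>>,
}

impl AuxSizeEstimator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Reset the estimate to the total computed while building a basebackup.
    pub fn on_basebackup(&self, total: usize) {
        *self.size.lock().unwrap() = Some(total as isize);
    }

    /// A new aux file of `file_size` bytes was written.
    pub fn on_add(&self, file_size: usize) {
        self.apply_delta(file_size as isize);
    }

    /// An aux file of `file_size` bytes was removed.
    pub fn on_remove(&self, file_size: usize) {
        self.apply_delta(-(file_size as isize));
    }

    /// An aux file changed from `old_size` to `new_size` bytes.
    pub fn on_update(&self, old_size: usize, new_size: usize) {
        self.apply_delta(new_size as isize - old_size as isize);
    }

    /// Current estimate, or `None` if no basebackup has been seen yet.
    pub fn estimate(&self) -> Option<isize> {
        *self.size.lock().unwrap()
    }

    // Purely in-memory bookkeeping: no storage I/O happens here.
    fn apply_delta(&self, delta: isize) {
        if let Some(size) = self.size.lock().unwrap().as_mut() {
            *size += delta;
        }
    }
}
```

In this sketch, a removal that is applied after a basebackup has already reset the total (for example `on_basebackup(0)` followed by `on_remove(10)`) leaves the estimate at -10, which is the kind of negative value the description warns about and the reason a signed integer is used.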

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@skyzh skyzh requested a review from a team as a code owner May 6, 2024 16:07
@skyzh skyzh requested review from jcsp and removed request for a team May 6, 2024 16:07
@skyzh skyzh marked this pull request as draft May 6, 2024 16:07
@skyzh skyzh changed the title feat(pageserver): add metrics for aux file size [WIP] feat(pageserver): add metrics for aux file size May 6, 2024
@skyzh skyzh removed the request for review from jcsp May 6, 2024 16:07
@skyzh skyzh mentioned this pull request May 6, 2024
24 tasks

github-actions bot commented May 6, 2024

3048 tests run: 2915 passed, 0 failed, 133 skipped (full report)


Flaky tests (3)

Postgres 16

Postgres 15

  • test_partial_evict_tenant[relative_equal]: release
  • test_download_remote_layers_api: release

Code coverage* (full report)

  • functions: 31.4% (6321 of 20137 functions)
  • lines: 47.3% (47634 of 100778 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
24ae9b2 at 2024-05-13T15:41:49.881Z

@skyzh skyzh force-pushed the skyzh/aux-metrics branch 2 times, most recently from ce644e9 to 24d5d8e Compare May 6, 2024 17:37
Base automatically changed from skyzh/aux-file-v2 to main May 7, 2024 16:30
@skyzh skyzh force-pushed the skyzh/aux-metrics branch from 24d5d8e to 3eed992 Compare May 7, 2024 20:07
@skyzh skyzh changed the title [WIP] feat(pageserver): add metrics for aux file size feat(pageserver): add metrics for aux file size May 7, 2024
@skyzh skyzh marked this pull request as ready for review May 7, 2024 20:08
@skyzh skyzh requested a review from arpad-m May 7, 2024 20:08
@skyzh skyzh force-pushed the skyzh/aux-metrics branch from 3eed992 to eb35310 Compare May 7, 2024 20:47
Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh force-pushed the skyzh/aux-metrics branch from eb35310 to 87dbd04 Compare May 7, 2024 20:48
@skyzh skyzh requested a review from VladLazar May 8, 2024 14:31

@arpad-m arpad-m left a comment

What happens on a pageserver restart? I don't see any persistence of this value, so it's lost whenever we restart the pageserver, until the next basebackup happens.

We should add persistence somehow.

We already have a non-persisted value via #6736 but it's only of limited use as we regularly restart pageservers.


skyzh commented May 9, 2024

I don’t plan to persist it because it’s an estimator. If we configure an alert on aux file size and some tenant exceeds the limit, the alert will fire before the pageserver restarts, so does it matter whether the value is persisted?

Signed-off-by: Alex Chi Z <[email protected]>
@skyzh skyzh requested a review from arpad-m May 9, 2024 14:30

skyzh commented May 9, 2024

... or we can have logical size calculation to recompute aux file size in later patches

arpad-m commented May 9, 2024

The huge aux file counts we have seen in the past did not always build up over a few days; often they were the result of buildups over weeks, or maybe even months. So accurate values are kinda important.

Again, as I've said, we already have a metric that only runs during the lifetime of the process. It's better than nothing but still not what we want.

If we computed it during logical size calculation, it would be equivalent to preparing the basebackup, so kinda expensive. Ideally, reading the prior size would be an O(1) operation. In other words, it should be materialized somewhere.
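
To illustrate the trade-off raised here, the following sketch contrasts recomputing the size with a full scan against keeping a materialized total that can be read back in O(1) after a restart. It is hypothetical throughout: `TimelineMeta`, `recompute`, `restore` and `put` are made-up names, the `BTreeMap` stands in for the real aux file storage, and nothing in this pull request persists the value.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for per-timeline metadata that would survive a restart.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TimelineMeta {
    /// Materialized total aux file size, if one has been recorded.
    pub aux_file_size: Option<isize>,
}

/// O(n): walk every aux file, comparable in cost to the scan a basebackup performs.
pub fn recompute(aux_files: &BTreeMap<String, Vec<u8>>) -> isize {
    aux_files.values().map(|content| content.len() as isize).sum()
}

/// O(1): after a restart, read the materialized total instead of scanning.
pub fn restore(meta: &TimelineMeta) -> Option<isize> {
    meta.aux_file_size
}

/// Write an aux file and keep the materialized total in step with the change.
pub fn put(
    aux_files: &mut BTreeMap<String, Vec<u8>>,
    meta: &mut TimelineMeta,
    path: &str,
    content: Vec<u8>,
) {
    let new_len = content.len() as isize;
    // `insert` returns the previous content, if any, so the delta covers overwrites.
    let old_len = aux_files
        .insert(path.to_string(), content)
        .map_or(0, |old| old.len() as isize);
    if let Some(total) = meta.aux_file_size.as_mut() {
        *total += new_len - old_len;
    }
}
```

As long as every modification goes through `put`, `restore` and `recompute` agree, but only `recompute` has to touch every file.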

skyzh commented May 9, 2024

Then I'd like to wait on #7663 and persist it in timeline metadata / index_part.json

skyzh commented May 13, 2024

Per discussion, we can merge this pull request first and we will have at least some metrics, and add this to initial logical size calculation in the future.

@skyzh skyzh enabled auto-merge (squash) May 13, 2024 15:21
@skyzh skyzh merged commit 7f51764 into main May 13, 2024
53 checks passed
@skyzh skyzh deleted the skyzh/aux-metrics branch May 13, 2024 15:33
a-masterov pushed a commit that referenced this pull request May 20, 2024
@arpad-m arpad-m mentioned this pull request May 22, 2024
2 tasks
2 participants