feat(pageserver): add metrics for aux file size #7623

skyzh · 2024-05-06T16:07:01Z

Problem

Summary of changes

This pull request adds a size estimator for aux files. Each timeline stores a cached `isize` holding the estimated total size of its aux files. The value is reset on basebackup and updated on each aux file modification. TODO: print a warning when the estimate exceeds a size threshold.

The size metric is not accurate. A race between `on_basebackup` and the other update functions could produce a negative estimated size, but this should be rare. In any case, the estimator adds no extra I/O to the storage, since everything is computed in memory.

Aux files are stored only on shard 0. Since basebackups are generated only on shard 0, only shard 0 reports this metric.
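
To make the description above concrete, here is a minimal sketch of such an estimator. It is an illustration under stated assumptions, not the code in this PR: the struct name, the `on_add` / `on_remove` / `on_update` / `estimate` methods, and the choice of a `Mutex<Option<isize>>` are all hypothetical; only `on_basebackup` and the cached `isize` come from the text above.

```rust
use std::sync::Mutex;

/// Hypothetical in-memory estimator for the total aux file size of one timeline.
/// All names except `on_basebackup` are assumptions made for illustration.
#[derive(Debug, Default)]
pub struct AuxFileSizeEstimator {
    // `None` until the first basebackup provides a baseline.
    // Signed, because racing updates can push the estimate below zero.
    size: Mutex<Option<isize>>,
}

impl AuxFileSizeEstimator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Reset the estimate to the total computed while building a basebackup.
    pub fn on_basebackup(&self, new_size: usize) {
        *self.size.lock().unwrap() = Some(new_size as isize);
    }

    /// A new aux file of `file_size` bytes was written.
    pub fn on_add(&self, file_size: usize) {
        self.apply_delta(file_size as isize);
    }

    /// An aux file of `file_size` bytes was removed.
    pub fn on_remove(&self, file_size: usize) {
        self.apply_delta(-(file_size as isize));
    }

    /// An aux file changed from `old_size` to `new_size` bytes.
    pub fn on_update(&self, old_size: usize, new_size: usize) {
        self.apply_delta(new_size as isize - old_size as isize);
    }

    /// Current estimate, or `None` if no basebackup has happened yet.
    pub fn estimate(&self) -> Option<isize> {
        *self.size.lock().unwrap()
    }

    // Deltas are dropped until a baseline exists; no storage I/O is involved.
    fn apply_delta(&self, delta: isize) {
        let mut guard = self.size.lock().unwrap();
        if let Some(size) = guard.as_mut() {
            *size += delta;
        }
    }
}
```

In this sketch, a removal that is applied after a basebackup reset but refers to a file the reset did not count would drive the value negative, which is the kind of inaccuracy the description accepts in exchange for avoiding extra I/O.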

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2024-05-06T16:43:15Z

3048 tests run: 2915 passed, 0 failed, 133 skipped (full report)

Flaky tests (3)

Postgres 16

test_lock_time_tracing: release

Postgres 15

test_partial_evict_tenant[relative_equal]: release
test_download_remote_layers_api: release

Code coverage* (full report)

functions: 31.4% (6321 of 20137 functions)
lines: 47.3% (47634 of 100778 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 24ae9b2 at 2024-05-13T15:41:49.881Z

Signed-off-by: Alex Chi Z <[email protected]>

pageserver/src/aux_file.rs

arpad-m

What happens on a pageserver restart? I don't see any persistence of this value, so it's lost whenever we restart the pageserver, until the next basebackup happens.

We should add persistence somehow.

We already have a non-persisted value via #6736 but it's only of limited use as we regularly restart pageservers.

pageserver/src/context.rs

skyzh · 2024-05-09T13:38:01Z

I don't plan to persist it because it's an estimator. If we configure an alert for aux file size and some tenant exceeds the limit, the alert will fire before the pageserver restarts, so does it matter whether the value is persisted or not?

skyzh · 2024-05-09T14:30:47Z

... or we can have logical size calculation to recompute aux file size in later patches

arpad-m · 2024-05-09T15:26:15Z

The huge aux file counts we have seen in the past did not always build up within a few days; often they were the result of buildups over weeks, or maybe even months. So accurate values are kinda important.

Again, as I've said, we already have a metric that only runs during the lifetime of the process. It's better than nothing but still not what we want.

If we computed it during logical size calculation, it would be equivalent to preparing the basebackup, so kinda expensive. Ideally, reading the prior size would be an O(1) operation. In other words, it should be materialized somewhere.

skyzh · 2024-05-09T15:32:30Z

Then I'd like to wait on #7663 and persist it in timeline metadata / index_part.json

skyzh · 2024-05-13T14:40:17Z

Per discussion, we can merge this pull request first and we will have at least some metrics, and add this to initial logical size calculation in the future.

ref #7443

## Summary of changes

This pull request adds a size estimator for aux files. Each timeline stores a cached `isize` holding the estimated total size of its aux files. The value is reset on basebackup and updated on each aux file modification. TODO: print a warning when the estimate exceeds a size threshold.

The size metric is not accurate. A race between `on_basebackup` and the other update functions could produce a negative estimated size, but this should be rare. In any case, the estimator adds no extra I/O to the storage, since everything is computed in memory.

Aux files are stored only on shard 0. Since basebackups are generated only on shard 0, only shard 0 reports this metric.

---------

Signed-off-by: Alex Chi Z <[email protected]>

skyzh requested a review from a team as a code owner May 6, 2024 16:07

skyzh requested review from jcsp and removed request for a team May 6, 2024 16:07

skyzh marked this pull request as draft May 6, 2024 16:07

skyzh changed the title ~~feat(pageserver): add metrics for aux file size~~ [WIP] feat(pageserver): add metrics for aux file size May 6, 2024

skyzh removed the request for review from jcsp May 6, 2024 16:07

skyzh mentioned this pull request May 6, 2024

Epic: Aux file store v2 #7462

Closed

24 tasks

skyzh force-pushed the skyzh/aux-metrics branch 2 times, most recently from ce644e9 to 24d5d8e Compare May 6, 2024 17:37

Base automatically changed from skyzh/aux-file-v2 to main May 7, 2024 16:30

feat(pageserver): add metrics for aux file size

7b6636b

Signed-off-by: Alex Chi Z <[email protected]>

skyzh force-pushed the skyzh/aux-metrics branch from 24d5d8e to 3eed992 Compare May 7, 2024 20:07

skyzh changed the title ~~[WIP] feat(pageserver): add metrics for aux file size~~ feat(pageserver): add metrics for aux file size May 7, 2024

skyzh marked this pull request as ready for review May 7, 2024 20:08

skyzh requested a review from arpad-m May 7, 2024 20:08

skyzh force-pushed the skyzh/aux-metrics branch from 3eed992 to eb35310 Compare May 7, 2024 20:47

report to prometheus

87dbd04

Signed-off-by: Alex Chi Z <[email protected]>

skyzh force-pushed the skyzh/aux-metrics branch from eb35310 to 87dbd04 Compare May 7, 2024 20:48

skyzh requested a review from VladLazar May 8, 2024 14:31

arpad-m reviewed May 8, 2024

pageserver/src/aux_file.rs

arpad-m requested changes May 9, 2024

pageserver/src/context.rs (outdated)

remove unused fields

13977ea

Signed-off-by: Alex Chi Z <[email protected]>

skyzh requested a review from arpad-m May 9, 2024 14:30

skyzh added 2 commits May 9, 2024 10:31

Merge branch 'main' of https://github.com/neondatabase/neon into skyz…

ea5014a

…h/aux-metrics

fix clippy

24ae9b2

Signed-off-by: Alex Chi Z <[email protected]>

arpad-m approved these changes May 13, 2024

skyzh enabled auto-merge (squash) May 13, 2024 15:21

skyzh merged commit 7f51764 into main May 13, 2024
53 checks passed

skyzh deleted the skyzh/aux-metrics branch May 13, 2024 15:33

skyzh mentioned this pull request May 14, 2024

feat(pageserver): persist aux file policy in index part #7668

Merged

5 tasks

arpad-m mentioned this pull request May 22, 2024

metrics for new aux file storage #7443

Closed

2 tasks
