Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metric: add started and killed walredo processes counter #5809

Merged
merged 7 commits into from
Nov 10, 2023

Conversation

rmodpur
Copy link
Contributor

@rmodpur rmodpur commented Nov 7, 2023

In OOM situations, knowing exactly how many walredo processes there were at a time would help afterwards to understand why was pageserver OOM killed. Add pageserver_wal_redo_process_total metric to keep track of total wal redo process started, shutdown and killed since pageserver start.

Closes #5722

@rmodpur rmodpur requested a review from a team as a code owner November 7, 2023 07:15
@rmodpur rmodpur requested review from koivunej and removed request for a team November 7, 2023 07:15
@rmodpur
Copy link
Contributor Author

rmodpur commented Nov 7, 2023

@koivunej could you please review

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
@rmodpur rmodpur force-pushed the walredo-process-metric branch from 3b1b8e5 to ce2c6c7 Compare November 7, 2023 07:25
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good. Could I ask you to instead of a single counter_vec create a type hosting the 3 different counters and make that type the pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> = ...?

Rationale: walredo is very critical for us, and even though the contention from these counter accesses should always be low, it would allow us moving the strings ("started", "killed", "shutdown") to one place -- the WalRedoProcessCounters::default where they are created once and can be accessed by field after that.

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
pageserver/src/metrics.rs Outdated Show resolved Hide resolved
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks, could you look at the suggestion before I approve the CI run?

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 7, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 7, 2023
@vipvap vipvap mentioned this pull request Nov 7, 2023
@koivunej koivunej enabled auto-merge (squash) November 7, 2023 10:43
Copy link

github-actions bot commented Nov 7, 2023

2376 tests run: 2258 passed, 0 failed, 118 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: release
  • test_pageserver_restart[True]: release

Code coverage (full report)

  • functions: 54.5% (8927 of 16375 functions)
  • lines: 81.5% (51336 of 62988 lines)

The comment gets automatically updated with the latest test results
aafde60 at 2023-11-10T11:04:59.915Z :recycle:

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
auto-merge was automatically disabled November 7, 2023 13:29

Head branch was pushed to by a user without write access

@rmodpur
Copy link
Contributor Author

rmodpur commented Nov 7, 2023

@koivunej sorry but had to fix a clippy lint

@koivunej
Copy link
Member

koivunej commented Nov 7, 2023

Zero worries, thanks for handling it!

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah it was the || ...()) suggestion, I should had caught that. It always bites me as well!

@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 7, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 7, 2023
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of shoving the counters under the same metric.

Labels are supposed to add context / enricht the event.

But these labels are really different event types.

I'd argue for two separate counters:

  • ..._starts (no label)
  • ..._stops with label cause and values: problem_during_launch or drop

@koivunej
Copy link
Member

koivunej commented Nov 8, 2023

I'm not a fan of shoving the counters under the same metric.

Labels are supposed to add context / enricht the event.

But these labels are really different event types.

@rmodpur please wait before making changes. The PR is now very close to as it was asked in the issue. The overall design of the metrics names and how to use them should be handled separatedly.

The PR is great work anyways, thanks, and we'll take it from here.

per [review] the different events (started, stopped) should had been
different metric names, with the shutdown and killed reasons explaining
the stopping.

[review]: neondatabase#5809 (review)
@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 9, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 9, 2023
@koivunej koivunej enabled auto-merge (squash) November 9, 2023 18:06
@koivunej koivunej requested a review from problame November 9, 2023 18:21
@koivunej koivunej disabled auto-merge November 9, 2023 20:21
@koivunej koivunej added the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 10, 2023
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label Nov 10, 2023
@koivunej koivunej merged commit a6f892e into neondatabase:main Nov 10, 2023
60 of 61 checks passed
jcsp pushed a commit that referenced this pull request Nov 14, 2023
In OOM situations, knowing exactly how many walredo processes there were
at a time would help afterwards to understand why was pageserver OOM
killed. Add `pageserver_wal_redo_process_total` metric to keep track of
total wal redo process started, shutdown and killed since pageserver
start.

Closes #5722

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
@rmodpur rmodpur deleted the walredo-process-metric branch December 4, 2023 12:52
koivunej added a commit that referenced this pull request Dec 7, 2023
Per [feedback], split the Layer metrics, also finally account for lost
and [re-submitted feedback] on `layer_gc` by renaming it to
`layer_delete`, `Layer::garbage_collect_on_drop` renamed to
`Layer::delete_on_drop`. References to "gc" dropped from metric names
and elsewhere.

Also fixes how the cancellations were tracked: there was one rare
counter. Now there is a top level metric for cancelled inits, and the
rare "download failed but failed to communicate" counter is kept.

Fixes: #6027

[feedback]: #5809 (review)
[re-submitted feedback]: #5108 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metric: launched and killed walredo processes
3 participants