metric: add started and killed walredo processes counter #5809

rmodpur · 2023-11-07T07:15:52Z

In OOM situations, knowing exactly how many walredo processes existed at a given time would help us understand afterwards why the pageserver was OOM killed. Add the `pageserver_wal_redo_process_total` metric to track the total number of walredo processes started, shut down, and killed since pageserver start.

Closes #5722

rmodpur · 2023-11-07T07:16:13Z

@koivunej could you please review

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

koivunej

I think this is looking good. Instead of a single counter_vec, could I ask you to create a type hosting the 3 different counters and make that type the pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> = ...?

Rationale: walredo is very critical for us, and even though the contention from these counter accesses should always be low, it would allow us to move the strings ("started", "killed", "shutdown") to one place -- the WalRedoProcessCounters::default, where they are created once and can be accessed by field after that.
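A minimal sketch of the suggested shape, for illustration only. It assumes a std-only stand-in (`Counter`, built on `AtomicU64`) in place of the real Prometheus `IntCounter`, and `std::sync::OnceLock` behind a hypothetical accessor `wal_redo_process_counters()` in place of `Lazy`. The `labeled()` helper is also hypothetical; it only shows that the label strings appear in exactly one place.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Stand-in for an integer counter (assumption: the real code would use
/// the metrics crate's IntCounter instead of a bare atomic).
#[derive(Debug, Default)]
pub struct Counter(AtomicU64);

impl Counter {
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// One field per event, so call sites access a field instead of
/// looking a counter up by its label string.
#[derive(Debug, Default)]
pub struct WalRedoProcessCounters {
    pub started: Counter,
    pub killed: Counter,
    pub shutdown: Counter,
}

impl WalRedoProcessCounters {
    /// Hypothetical helper: the only place where label strings are spelled out.
    pub fn labeled(&self) -> [(&'static str, u64); 3] {
        [
            ("started", self.started.get()),
            ("killed", self.killed.get()),
            ("shutdown", self.shutdown.get()),
        ]
    }
}

/// Process-wide instance, created once on first access.
pub fn wal_redo_process_counters() -> &'static WalRedoProcessCounters {
    static COUNTERS: OnceLock<WalRedoProcessCounters> = OnceLock::new();
    COUNTERS.get_or_init(WalRedoProcessCounters::default)
}
```

With this layout a call site would read `wal_redo_process_counters().started.inc()`, with no string lookup on the walredo path.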

pageserver/src/metrics.rs

koivunej

This is great! Thanks, could you look at the suggestion before I approve the CI run?

github-actions · 2023-11-07T11:21:56Z

2376 tests run: 2258 passed, 0 failed, 118 skipped (full report)

Flaky tests (2)

Postgres 16

test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: release
test_pageserver_restart[True]: release

Code coverage (full report)

functions: 54.5% (8927 of 16375 functions)
lines: 81.5% (51336 of 62988 lines)

The comment gets automatically updated with the latest test results.
aafde60 at 2023-11-10T11:04:59.915Z

rmodpur · 2023-11-07T13:32:38Z

@koivunej sorry, but I had to fix a clippy lint

koivunej · 2023-11-07T15:38:21Z

Zero worries, thanks for handling it!

koivunej

Aah, it was the || ...()) suggestion; I should have caught that. It always bites me as well!

problame

I'm not a fan of shoving the counters under the same metric.

Labels are supposed to add context to / enrich the event.

But these labels are really different event types.

I'd argue for two separate counters:

..._starts (no label)
..._stops with label cause and values: problem_during_launch or drop
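The split proposed above could look roughly like the following sketch. It is not the code that was merged: the type and method names are hypothetical, plain `AtomicU64` values stand in for Prometheus counters, and only the two `cause` values named in the comment are modelled. The `live_estimate()` helper is an added illustration of why separate start and stop counters help with the OOM question from the PR description.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Values of the `cause` label on the stops counter, as named in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopCause {
    ProblemDuringLaunch,
    Drop,
}

impl StopCause {
    /// Label value as it would appear in the exported metric.
    pub fn as_label(self) -> &'static str {
        match self {
            StopCause::ProblemDuringLaunch => "problem_during_launch",
            StopCause::Drop => "drop",
        }
    }
}

/// Two separate metrics: unlabeled starts, and stops broken down by cause.
#[derive(Debug, Default)]
pub struct WalRedoProcessMetrics {
    starts: AtomicU64,
    /// Indexed by `StopCause as usize`.
    stops: [AtomicU64; 2],
}

impl WalRedoProcessMetrics {
    pub fn record_start(&self) {
        self.starts.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_stop(&self, cause: StopCause) {
        self.stops[cause as usize].fetch_add(1, Ordering::Relaxed);
    }

    pub fn starts(&self) -> u64 {
        self.starts.load(Ordering::Relaxed)
    }

    pub fn stops(&self, cause: StopCause) -> u64 {
        self.stops[cause as usize].load(Ordering::Relaxed)
    }

    /// Starts minus all stops: an estimate of processes currently alive.
    pub fn live_estimate(&self) -> u64 {
        let stopped: u64 = self.stops.iter().map(|c| c.load(Ordering::Relaxed)).sum();
        self.starts().saturating_sub(stopped)
    }
}
```

Keeping the cause as an enum means an unknown label value cannot be recorded by mistake, and the difference between the two metrics approximates the number of walredo processes alive at a given time.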

koivunej · 2023-11-08T11:07:42Z

> I'm not a fan of shoving the counters under the same metric.
>
> Labels are supposed to add context to / enrich the event.
>
> But these labels are really different event types.

@rmodpur please wait before making changes. The PR is now very close to what was asked for in the issue. The overall design of the metric names and how to use them should be handled separately.

The PR is great work anyways, thanks, and we'll take it from here.

Per [review], the different events (started, stopped) should have been different metric names, with the shutdown and killed reasons explaining the stopping.

[review]: neondatabase#5809 (review)

In OOM situations, knowing exactly how many walredo processes existed at a given time would help us understand afterwards why the pageserver was OOM killed. Add the `pageserver_wal_redo_process_total` metric to track the total number of walredo processes started, shut down, and killed since pageserver start.

Closes #5722

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>

Per [feedback], split the Layer metrics, also finally account for lost and [re-submitted feedback] on `layer_gc` by renaming it to `layer_delete`, `Layer::garbage_collect_on_drop` renamed to `Layer::delete_on_drop`. References to "gc" dropped from metric names and elsewhere.

Also fixes how the cancellations were tracked: there was one rare counter. Now there is a top level metric for cancelled inits, and the rare "download failed but failed to communicate" counter is kept.

Fixes: #6027

[feedback]: #5809 (review)
[re-submitted feedback]: #5108 (comment)

rmodpur requested a review from a team as a code owner November 7, 2023 07:15

rmodpur requested review from koivunej and removed request for a team November 7, 2023 07:15

metric: add started and killed walredo processes counter

ce2c6c7

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

rmodpur force-pushed the walredo-process-metric branch from 3b1b8e5 to ce2c6c7 Compare November 7, 2023 07:25

koivunej requested changes Nov 7, 2023

metrics: precreate wal_redo_process_counters with labels

f2c2556

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

koivunej reviewed Nov 7, 2023

pageserver/src/metrics.rs (outdated)

koivunej approved these changes Nov 7, 2023

metrics: use IntCounter type alias

8cf3091

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

koivunej added the approved-for-ci-run label Nov 7, 2023

github-actions bot removed the approved-for-ci-run label Nov 7, 2023

vipvap mentioned this pull request Nov 7, 2023

CI run for PR #5809 #5810

Closed

koivunej enabled auto-merge (squash) November 7, 2023 10:43

fix clippy warning

cccc05d

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

auto-merge was automatically disabled November 7, 2023 13:29
Head branch was pushed to by a user without write access

koivunej approved these changes Nov 7, 2023

koivunej added the approved-for-ci-run label Nov 7, 2023

github-actions bot removed the approved-for-ci-run label Nov 7, 2023

problame requested changes Nov 8, 2023

koivunej added 2 commits November 9, 2023 16:36

fix: align to proper metric usage

86226c7

Per [review], the different events (started, stopped) should have been different metric names, with the shutdown and killed reasons explaining the stopping.

[review]: neondatabase#5809 (review)

Merge branch 'main' into walredo-process-metric

b6cd66f

koivunej added the approved-for-ci-run label Nov 9, 2023

github-actions bot removed the approved-for-ci-run label Nov 9, 2023

koivunej enabled auto-merge (squash) November 9, 2023 18:06

koivunej requested a review from problame November 9, 2023 18:21

differentiate causes through enum, and more variety

aafde60

koivunej disabled auto-merge November 9, 2023 20:21

problame approved these changes Nov 9, 2023

koivunej added the approved-for-ci-run label Nov 10, 2023

github-actions bot removed the approved-for-ci-run label Nov 10, 2023

koivunej merged commit a6f892e into neondatabase:main Nov 10, 2023
60 of 61 checks passed

koivunej mentioned this pull request Nov 22, 2023

fix(layer): metric splitting, span rename #5902

Merged

rmodpur deleted the walredo-process-metric branch December 4, 2023 12:52
