-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metric: add started and killed walredo processes counter #5809
Conversation
@koivunej could you please review |
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
3b1b8e5
to
ce2c6c7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good. Could I ask you to instead of a single counter_vec create a type hosting the 3 different counters and make that type the pub(crate) static WAL_REDO_PROCESS_COUNTERS: Lazy<WalRedoProcessCounters> = ...
?
Rationale: walredo is very critical for us, and even though the contention from these counter accesses should always be low, it would allow us moving the strings ("started", "killed", "shutdown") to one place -- the WalRedoProcessCounters::default
where they are created once and can be accessed by field after that.
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks, could you look at the suggestion before I approve the CI run?
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2376 tests run: 2258 passed, 0 failed, 118 skipped (full report)Flaky tests (2)Postgres 16Code coverage (full report)
The comment gets automatically updated with the latest test results
aafde60 at 2023-11-10T11:04:59.915Z :recycle: |
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Head branch was pushed to by a user without write access
@koivunej sorry but had to fix a clippy lint |
Zero worries, thanks for handling it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah it was the || ...())
suggestion, I should had caught that. It always bites me as well!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of shoving the counters under the same metric.
Labels are supposed to add context / enricht the event.
But these labels are really different event types.
I'd argue for two separate counters:
..._starts
(no label)..._stops
with labelcause
and values:problem_during_launch
ordrop
@rmodpur please wait before making changes. The PR is now very close to as it was asked in the issue. The overall design of the metrics names and how to use them should be handled separatedly. The PR is great work anyways, thanks, and we'll take it from here. |
per [review] the different events (started, stopped) should had been different metric names, with the shutdown and killed reasons explaining the stopping. [review]: neondatabase#5809 (review)
In OOM situations, knowing exactly how many walredo processes there were at a time would help afterwards to understand why was pageserver OOM killed. Add `pageserver_wal_redo_process_total` metric to keep track of total wal redo process started, shutdown and killed since pageserver start. Closes #5722 --------- Signed-off-by: Rahul Modpur <rmodpur2@gmail.com> Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <me@cschwarz.com>
Per [feedback], split the Layer metrics, also finally account for lost and [re-submitted feedback] on `layer_gc` by renaming it to `layer_delete`, `Layer::garbage_collect_on_drop` renamed to `Layer::delete_on_drop`. References to "gc" dropped from metric names and elsewhere. Also fixes how the cancellations were tracked: there was one rare counter. Now there is a top level metric for cancelled inits, and the rare "download failed but failed to communicate" counter is kept. Fixes: #6027 [feedback]: #5809 (review) [re-submitted feedback]: #5108 (comment)
In OOM situations, knowing exactly how many walredo processes there were at a time would help afterwards to understand why was pageserver OOM killed. Add
pageserver_wal_redo_process_total
metric to keep track of total wal redo process started, shutdown and killed since pageserver start.Closes #5722