metric: launched and killed walredo processes #5722

Closed
koivunej opened this issue Oct 30, 2023 · 2 comments · Fixed by #5809
Labels
c/storage/pageserver Component: storage: pageserver m/good_first_issue Moment: when doing your first Neon contributions t/on_call_followup

Comments

@koivunej
Member

In OOM situations, knowing exactly how many walredo processes existed at a given time would help us understand afterwards why the pageserver was OOM killed.

We need two counters:

  • walredo processes started
  • walredo processes shutdown

Good first issue:

  • pageserver/src/walredo.rs is the walredo implementation
  • We could increment the started counter near launching the process
  • We could increment the killed counter in the process waiting newtype wrapper
  • metric could be pageserver_walredo_processes, an IntCounterVec with "operation" key
    • the two concrete counters could be operation="started" and operation="shutdown"
    • keep "shutdown" separate from "killed" in case we ever want to distinguish the processes we had to kill from the ones we found had already exited
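
A rough sketch of the idea in the list above, for anyone picking this up. It is not the actual code in pageserver/src/walredo.rs: plain atomics stand in for the prometheus IntCounterVec so that the snippet has no dependencies, and the names `WalRedoProcessCounters`, `ProcessHandle`, `launch` and `live` are hypothetical. The started counter is bumped where the process would be launched, and the shutdown counter is bumped in the `Drop` of the wrapper that would wait for the child.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Stand-in for an `IntCounterVec` with an "operation" key:
/// one monotonically increasing counter per operation value.
#[derive(Default)]
pub struct WalRedoProcessCounters {
    started: AtomicU64,  // operation="started"
    shutdown: AtomicU64, // operation="shutdown"
}

impl WalRedoProcessCounters {
    pub fn started(&self) -> u64 {
        self.started.load(Ordering::Relaxed)
    }

    pub fn shutdown(&self) -> u64 {
        self.shutdown.load(Ordering::Relaxed)
    }

    /// Processes believed to be alive right now: started minus shutdown.
    /// The two loads are not atomic together, so saturate instead of underflowing.
    pub fn live(&self) -> u64 {
        self.started().saturating_sub(self.shutdown())
    }
}

/// Hypothetical stand-in for the process-waiting newtype wrapper.
pub struct ProcessHandle {
    counters: Arc<WalRedoProcessCounters>,
}

impl ProcessHandle {
    /// A real implementation would spawn the walredo child process here.
    pub fn launch(counters: Arc<WalRedoProcessCounters>) -> Self {
        counters.started.fetch_add(1, Ordering::Relaxed);
        ProcessHandle { counters }
    }
}

impl Drop for ProcessHandle {
    fn drop(&mut self) {
        // A real implementation would kill and/or wait for the child here.
        self.counters.shutdown.fetch_add(1, Ordering::Relaxed);
    }
}
```

Counting in `Drop` means every handle that was counted as started is eventually counted as shutdown, whichever code path releases it. With real counters exported, the number of live processes at a given time can then be derived as started minus shutdown.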
@koivunej koivunej added c/storage/pageserver Component: storage: pageserver m/good_first_issue Moment: when doing your first Neon contributions t/on_call_followup labels Oct 30, 2023
@rmodpur
Contributor

rmodpur commented Oct 31, 2023

@koivunej I would like to work on this.

@koivunej koivunej assigned koivunej and unassigned koivunej Nov 6, 2023
@koivunej
Member Author

koivunej commented Nov 6, 2023

@rmodpur feel free to do so and ping me once you have a draft! This should be more straightforward than #5310, which I will eventually get to :)

koivunej added a commit that referenced this issue Nov 10, 2023
In OOM situations, knowing exactly how many walredo processes there were
at a time would help afterwards to understand why was pageserver OOM
killed. Add `pageserver_wal_redo_process_total` metric to keep track of
total wal redo process started, shutdown and killed since pageserver
start.

Closes #5722

---------

Signed-off-by: Rahul Modpur <[email protected]>
Co-authored-by: Joonas Koivunen <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
jcsp pushed a commit that referenced this issue Nov 14, 2023