Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: add "critical events" counter for highly impactful error paths #10094

Open
jcsp opened this issue Dec 11, 2024 · 5 comments
Open
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver

Comments

@jcsp
Copy link
Collaborator

jcsp commented Dec 11, 2024

There are certain errors that are intrinsically "scary" things that we always want to know about right away:

  • walredo failures
  • getpage requests that can't find a key
  • WAL records we can't ingest
  • tenants we can't load (broken tenants).
  • 404 loading a layer file that the index thinks should exist (this can occur legitimately if an isolated PS is still attached to something attached elsewhere, but that is super rare)

These are mostly detectable some other way, but we have a plethora of different benchmarks, tests and deployed environments, and we need a simple single-metric thing that anyone can query to detect an unambiguous "You have hit a serious storage (maybe compute but probably storage) bug".

@jcsp jcsp added a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver labels Dec 11, 2024
@skyzh
Copy link
Member

skyzh commented Dec 11, 2024

getpage requests that can't find a key

the same applies to GC/compaction, we should match on all could not find key, not only on the read path

@erikgrinaker
Copy link
Contributor

Why is there no tracing::critical!? 😞 Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).

@jcsp
Copy link
Collaborator Author

jcsp commented Dec 11, 2024

Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).

Yeah, I also wish tracing had built in counters for messages of each severity.

I expect to add some global critical_event(&str) function that logs at ERROR and increments the counter. Alerting should be driven by the metric so that we don't have to go add log-driven metrics for each case.

@Bodobolero
Copy link
Contributor

Bodobolero commented Dec 11, 2024

once we have this task implemented pls create additional issues
a) create a GitHub action that a workflow/testcase can easily add to the testcase/job to validate the metric has the expected value (best with an example how to use the action)
b) create separate issues for each benchmark/testcase owner that you think should add instrumentation for checking the metric

@jcsp
Copy link
Collaborator Author

jcsp commented Dec 11, 2024

@bayandin can devprod take ownership of the parts about benchmarks + github actions that Peter mentions above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

No branches or pull requests

4 participants