pageserver: add "critical events" counter for highly impactful error paths #10094

jcsp · 2024-12-11T16:32:16Z

There are certain errors that are intrinsically "scary" things that we always want to know about right away:

walredo failures
getpage requests that can't find a key
WAL records we can't ingest
tenants we can't load (broken tenants)
a 404 when loading a layer file that the index thinks should exist (this can occur legitimately if an isolated PS is still attached to something that is attached elsewhere, but that is super rare)

These are mostly detectable some other way, but we have a plethora of different benchmarks, tests and deployed environments, and we need a simple single-metric thing that anyone can query to detect an unambiguous "You have hit a serious storage (maybe compute but probably storage) bug".

skyzh · 2024-12-11T16:37:09Z

> getpage requests that can't find a key

The same applies to GC/compaction: we should match on all "could not find key" errors, not only those on the read path.

erikgrinaker · 2024-12-11T16:40:05Z

Why is there no tracing::critical!? 😞 Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).

jcsp · 2024-12-11T16:44:31Z

> Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).

Yeah, I also wish tracing had built in counters for messages of each severity.

I expect to add some global critical_event(&str) function that logs at ERROR and increments the counter. Alerting should be driven by the metric so that we don't have to go add log-driven metrics for each case.
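A minimal sketch of what such a helper might look like, using only the standard library. Everything here is an assumption for illustration: the counter name `CRITICAL_EVENTS`, the accessor `critical_event_count`, and the use of `eprintln!` as a stand-in for an ERROR-level log call. A real version would presumably go through the pageserver's existing logging and metrics plumbing instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide count of critical events.
/// Hypothetical stand-in for a real exported metrics counter.
static CRITICAL_EVENTS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper: bump the counter and log at ERROR severity in one call,
/// so that no call site can do one without the other.
pub fn critical_event(msg: &str) {
    CRITICAL_EVENTS.fetch_add(1, Ordering::Relaxed);
    // Stand-in for an ERROR-level log line from the real logging framework.
    eprintln!("ERROR critical event: {msg}");
}

/// Current value of the counter, as a metrics scrape would read it.
pub fn critical_event_count() -> u64 {
    CRITICAL_EVENTS.load(Ordering::Relaxed)
}
```

A call site would then be a single line such as `critical_event("walredo failure")`. Keeping the log and the increment in one function is what lets alerting rely on the metric alone.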

Bodobolero · 2024-12-11T17:16:44Z

Once this task is implemented, please create additional issues:
a) Create a GitHub action that a workflow or test case can easily add to its job to validate that the metric has the expected value (ideally with an example of how to use the action).
b) Create a separate issue for each benchmark or test case owner who you think should add instrumentation for checking the metric.
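The check in (a) could be sketched roughly as below, assuming the counter is exposed in the Prometheus text exposition format and that the job has already fetched that text. The metric name `pageserver_critical_events_total` and both function names are hypothetical; the thread does not settle on a name.

```rust
/// Sum every sample of `metric` in Prometheus text-format output.
/// Returns `None` if the metric does not appear at all.
pub fn sum_metric(exposition: &str, metric: &str) -> Option<f64> {
    let mut total: Option<f64> = None;
    for line in exposition.lines() {
        let line = line.trim();
        // Skip blank lines and `# HELP` / `# TYPE` comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The sample name ends at the label set `{` or at the first space.
        let name_end = line
            .find(|c: char| c == '{' || c == ' ')
            .unwrap_or(line.len());
        if &line[..name_end] != metric {
            continue;
        }
        // Skip the optional `{...}` label set; label values may contain spaces.
        let rest = &line[name_end..];
        let after_labels = match rest.rfind('}') {
            Some(i) if rest.starts_with('{') => &rest[i + 1..],
            _ => rest,
        };
        // The first remaining token is the value; a timestamp may follow it.
        if let Some(value) = after_labels
            .split_whitespace()
            .next()
            .and_then(|v| v.parse::<f64>().ok())
        {
            total = Some(total.unwrap_or(0.0) + value);
        }
    }
    total
}

/// Hypothetical check for a test job: passes when the counter is absent or zero.
pub fn no_critical_events(exposition: &str) -> bool {
    sum_metric(exposition, "pageserver_critical_events_total").unwrap_or(0.0) == 0.0
}
```

Summing across label sets means the check keeps working if the counter later gains labels (for example, one per error kind), so individual tests would not need updating.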

jcsp · 2024-12-11T18:08:38Z

@bayandin can devprod take ownership of the parts about benchmarks and GitHub Actions that Peter mentions above?

jcsp added the a/tech_debt (Area: related to tech debt) and c/storage/pageserver (Component: storage: pageserver) labels on Dec 11, 2024
