Certain errors are intrinsically "scary", and we always want to know about them right away:
walredo failures
getpage requests that can't find a key
WAL records we can't ingest
tenants we can't load (broken tenants).
a 404 when loading a layer file that the index thinks should exist (this can occur legitimately if an isolated pageserver (PS) is still attached to something that is attached elsewhere, but that is super rare)
These are mostly detectable some other way, but we have a plethora of different benchmarks, tests and deployed environments, so we need a single, simple metric that anyone can query to detect an unambiguous "You have hit a serious storage (maybe compute but probably storage) bug".
Why is there no tracing::critical!? 😞 Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).
> Seems like a log level would be more appropriate than matching on random error messages (of course, we'd still have to tag the relevant errors as such).
Yeah, I also wish tracing had built-in counters for messages of each severity.
I expect to add some global critical_event(&str) function that logs at ERROR and increments the counter. Alerting should be driven by the metric so that we don't have to go add log-driven metrics for each case.
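A minimal sketch of what such a helper could look like, assuming a bare process-wide atomic as the counter and `eprintln!` as a stand-in for `tracing::error!` so the example stays std-only. The names `CRITICAL_EVENTS` and `critical_event_count` are hypothetical; a real implementation would presumably register the counter with the existing metrics registry instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-wide counter (assumption: a real service would use
// a registered metrics counter rather than a bare atomic).
static CRITICAL_EVENTS: AtomicU64 = AtomicU64::new(0);

/// Record a critical event: bump the counter and log at ERROR level.
/// `eprintln!` stands in for `tracing::error!` in this sketch.
///
/// Example call site: `critical_event("walredo failure");`
pub fn critical_event(msg: &str) {
    CRITICAL_EVENTS.fetch_add(1, Ordering::Relaxed);
    eprintln!("ERROR critical event: {msg}");
}

/// Number of critical events recorded since process start.
pub fn critical_event_count() -> u64 {
    CRITICAL_EVENTS.load(Ordering::Relaxed)
}
```

Funnelling every case through one function means the log line and the metric cannot drift apart, and alerting only needs to watch a single counter.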
Once this task is implemented, please create additional issues:
a) Create a GitHub Action that a workflow or test case can easily add to its job to validate that the metric has the expected value (ideally with an example of how to use the action).
b) Create a separate issue for each benchmark/test case owner who you think should add instrumentation for checking the metric.
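The check in (a) could boil down to something like the following sketch, which assumes the counter is exposed in the Prometheus text exposition format and that the metrics text has already been fetched. The function name `counter_total` and the metric name in the usage note are hypothetical.

```rust
/// Sum all samples of the metric `name` in Prometheus text exposition
/// output. Returns `None` if the metric does not appear at all.
pub fn counter_total(metrics_text: &str, name: &str) -> Option<f64> {
    let mut total = None;
    for line in metrics_text.lines() {
        let line = line.trim();
        // Skip blank lines and `# HELP` / `# TYPE` comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // A sample line is `name value` or `name{labels} value`.
        let Some(rest) = line.strip_prefix(name) else { continue };
        // Reject metrics that merely share a prefix with `name`.
        if !rest.starts_with(|c: char| c == '{' || c.is_whitespace()) {
            continue;
        }
        // The value is the first token after the optional label set.
        let after_labels = match rest.rfind('}') {
            Some(i) => &rest[i + 1..],
            None => rest,
        };
        let value = after_labels
            .split_whitespace()
            .next()
            .and_then(|s| s.parse::<f64>().ok());
        if let Some(v) = value {
            *total.get_or_insert(0.0) += v;
        }
    }
    total
}
```

A test or benchmark job would then fail if, for example, `counter_total(&text, "critical_events_total")` returned anything other than `Some(0.0)`. Treating a missing metric (`None`) as a failure too guards against the check silently passing when the counter is not exported.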