
fix(charts/istio-alerts): use federated rolled up metrics #231

Merged
diranged merged 1 commit into main from fix_istio_alerts on Sep 1, 2023

Conversation

diranged
Contributor

We switched to using federated metric collection, so the `istio_requests_total` metric no longer exists. We need to use `istio_requests:increase5m` to count the actual 5xxs.

Additionally, this chart needed more flexibility so that it can be installed multiple times in a single namespace, targeting different `destination_service_name` values with different thresholds.

Finally, the alerts need to be split out across `reporter=source` and `reporter=destination` so that application owners are alerted both when their own `istio-proxy` sidecar reports errors (definitely happening in their own app) and when the `istio-ingressgateway` pods report errors (which could be network problems, no-healthy-upstream issues, etc.).

Proof: `istio_requests:total5m`

Here is an updated image showing the query working. We also happen to have a service throwing errors at a high rate, which helps illustrate the labels the alert would be sent out with:

[Screenshot, 2023-08-31: example query output with labels]

Proof: Percentile-based Latency Alarm

Rather than alerting on the plain average, I've refactored the alarm to alert on a given percentile, which lets the operator decide how sensitive they want to be. The alert is also now broken out by `destination_service_name`, `reporter`, and `source_canonical_service` to help identify the source of the latency and traffic.

[Screenshot, 2023-08-31: example percentile-based latency query]

@diranged diranged requested a review from a team as a code owner August 31, 2023 23:30
@diranged diranged merged commit c830808 into main Sep 1, 2023
2 checks passed
@diranged diranged deleted the fix_istio_alerts branch September 1, 2023 00:22