fix(charts/istio-alerts): use federated rolled up metrics #231
We switched to using federated metric collection, so the `istio_requests_total` metric no longer exists. We need to be using `istio_requests:increase5m` to count the actual 5xx's.

Additionally, this chart needed more flexibility so it can be launched multiple times in a single namespace, targeting different `destination_service_name`s with different thresholds.
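As a rough sketch of that flexibility, two releases in the same namespace might carry values along these lines (the key names, service names, and numbers are purely illustrative, not necessarily the chart's actual interface):

```yaml
# Hypothetical values for a release like "istio-alerts-payments"
# (key names are illustrative; check the chart's values.yaml for the real ones)
destination_service_name: payments.prod.svc.cluster.local
five_xx_threshold: 5            # example threshold for the 5xx alert
latency_percentile: 0.99        # example percentile for the latency alert
latency_threshold_ms: 500       # example latency threshold
---
# Hypothetical values for a second release in the same namespace,
# e.g. "istio-alerts-search", with looser thresholds
destination_service_name: search.prod.svc.cluster.local
five_xx_threshold: 25
latency_percentile: 0.95
latency_threshold_ms: 2000
```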
Finally, the alerts need to be split out across `reporter=source` and `reporter=destination`, so that application owners are alerted both if their own `istio-proxy` sidecar is reporting errors (thus definitely happening in their own app), and also if the `istio-ingressgateway` pods are reporting errors (which could be network problems, no-healthy-upstream issues, etc).
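A rough sketch of what the reporter-split 5xx rules could look like; the `istio_requests:increase5m` rule name comes from this PR, while the alert names, the `response_code` and `source_canonical_service` labels, the example service, and the threshold of 5 are assumptions for illustration:

```yaml
groups:
  - name: istio-alerts.5xx.example
    rules:
      - alert: HTTP5xxRateHighDestination
        # Errors reported by the workload's own istio-proxy sidecar
        expr: |
          sum by (destination_service_name, source_canonical_service) (
            istio_requests:increase5m{
              reporter="destination",
              destination_service_name="payments.prod.svc.cluster.local",
              response_code=~"5.."
            }
          ) > 5
        for: 5m
      - alert: HTTP5xxRateHighSource
        # Errors reported by the client-side proxy (e.g. the istio-ingressgateway
        # pods), which can surface network problems, no-healthy-upstream, etc.
        expr: |
          sum by (destination_service_name, source_canonical_service) (
            istio_requests:increase5m{
              reporter="source",
              destination_service_name="payments.prod.svc.cluster.local",
              response_code=~"5.."
            }
          ) > 5
        for: 5m
```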
Proof: `istio_requests:total5m`
Here is an updated image with an example of the query working. We also happen to have a service throwing errors here at a high rate, which helps illustrate the labels that the alert would be sent out with:
Proof: Percentile-based Latency Alarm
Rather than alerting on the plain average, I've refactored the alarm to alert on a given percentile, which allows the operator to decide how sensitive they want to be. The alert is also now broken out by `destination_service_name`, `reporter`, and `source_canonical_service` to help them identify the source of the latency and traffic.
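For illustration, a percentile-based rule might look roughly like the sketch below; the rolled-up histogram rule name, the 0.95 quantile, the example service, and the 1000 ms threshold are placeholders rather than the chart's actual settings:

```yaml
groups:
  - name: istio-alerts.latency.example
    rules:
      - alert: RequestLatencyP95High
        # p95 latency per destination service, broken out by reporter and caller.
        # istio_request_duration_milliseconds_bucket:increase5m stands in for
        # whatever rolled-up histogram rule is actually federated.
        expr: |
          histogram_quantile(0.95,
            sum by (le, destination_service_name, reporter, source_canonical_service) (
              istio_request_duration_milliseconds_bucket:increase5m{
                destination_service_name="payments.prod.svc.cluster.local"
              }
            )
          ) > 1000
        for: 5m
```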