fix(charts/istio-alerts): use federated rolled up metrics #231
We switched to using federated metric collection, so the `istio_requests_total` metric no longer exists. We need to be using `istio_requests:increase5m` to count the actual 5xx's.

Additionally, this chart needed more flexibility so it can be launched multiple times in a single namespace, targeting different `destination_service_name`s with different thresholds.
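As a rough sketch of that flexibility, two releases in the same namespace might carry values along these lines (the key names, service names, and numbers are purely illustrative, not necessarily the chart's actual interface):

```yaml
# Hypothetical values for a release like "istio-alerts-payments"
# (key names are illustrative; check the chart's values.yaml for the real ones)
destination_service_name: payments.prod.svc.cluster.local
five_xx_threshold: 5            # example threshold for the 5xx alert
latency_percentile: 0.99        # example percentile for the latency alert
latency_threshold_ms: 500       # example latency threshold
---
# Hypothetical values for a second release in the same namespace,
# e.g. "istio-alerts-search", with looser thresholds
destination_service_name: search.prod.svc.cluster.local
five_xx_threshold: 25
latency_percentile: 0.95
latency_threshold_ms: 2000
```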
Finally, the alerts need to be split out across `reporter=source` and `reporter=destination`, so that application owners are alerted both if their own `istio-proxy` sidecar is reporting errors (thus definitely happening in their own app), and also if the `istio-ingressgateway` pods are reporting errors (which could be network problems, no-healthy-upstream issues, etc).
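A rough sketch of what the reporter-split 5xx rules could look like; the `istio_requests:increase5m` rule name comes from this PR, while the alert names, the `response_code` and `source_canonical_service` labels, the example service, and the threshold of 5 are assumptions for illustration:

```yaml
groups:
  - name: istio-alerts.5xx.example
    rules:
      - alert: HTTP5xxRateHighDestination
        # Errors reported by the workload's own istio-proxy sidecar
        expr: |
          sum by (destination_service_name, source_canonical_service) (
            istio_requests:increase5m{
              reporter="destination",
              destination_service_name="payments.prod.svc.cluster.local",
              response_code=~"5.."
            }
          ) > 5
        for: 5m
      - alert: HTTP5xxRateHighSource
        # Errors reported by the client-side proxy (e.g. the istio-ingressgateway
        # pods), which can surface network problems, no-healthy-upstream, etc.
        expr: |
          sum by (destination_service_name, source_canonical_service) (
            istio_requests:increase5m{
              reporter="source",
              destination_service_name="payments.prod.svc.cluster.local",
              response_code=~"5.."
            }
          ) > 5
        for: 5m
```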
Proof: `istio_requests:total5m`
Here is an updated image with an example of the query working. We also happen to have a service throwing errors here at a high rate, which helps illustrate the labels that the alert would be sent out with:
Proof: Percentile-based Latency Alarm
Rather than alerting on the plain average, I've refactored the alarm to alert on a given percentile, which allows the operator to decide how sensitive they want to be. The alert is also now broken out by `destination_service_name`, `reporter`, and `source_canonical_service` to help them identify the source of the latency and traffic.
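For illustration, a percentile-based rule might look roughly like the sketch below; the rolled-up histogram rule name, the 0.95 quantile, the example service, and the 1000 ms threshold are placeholders rather than the chart's actual settings:

```yaml
groups:
  - name: istio-alerts.latency.example
    rules:
      - alert: RequestLatencyP95High
        # p95 latency per destination service, broken out by reporter and caller.
        # istio_request_duration_milliseconds_bucket:increase5m stands in for
        # whatever rolled-up histogram rule is actually federated.
        expr: |
          histogram_quantile(0.95,
            sum by (le, destination_service_name, reporter, source_canonical_service) (
              istio_request_duration_milliseconds_bucket:increase5m{
                destination_service_name="payments.prod.svc.cluster.local"
              }
            )
          ) > 1000
        for: 5m
```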