chore(charts/istio-alerts): update runbook for 5xx increase (#324)
Due to encoding issues, the direct link to the graph is not in the alerts...

So the query is added here for easier lookup.
schahal authored Aug 5, 2024
1 parent f2d3973 commit a5f4608
Showing 3 changed files with 25 additions and 3 deletions.
2 changes: 1 addition & 1 deletion charts/istio-alerts/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: istio-alerts
description: A Helm chart that provisions a series of alerts for istio VirtualServices
type: application
version: 0.5.2
version: 0.5.3
maintainers:
- name: diranged
email: [email protected]
2 changes: 1 addition & 1 deletion charts/istio-alerts/README.md
@@ -1,6 +1,6 @@
# istio-alerts

![Version: 0.5.2](https://img.shields.io/badge/Version-0.5.2-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
![Version: 0.5.3](https://img.shields.io/badge/Version-0.5.3-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)

A Helm chart that provisions a series of alerts for istio VirtualServices

24 changes: 23 additions & 1 deletion charts/istio-alerts/runbook.md
@@ -1,7 +1,7 @@
## 5xx-Rate-Too-High

This alert fires when the rate of 5xx responses from a service exceeds a
threshold (by default, 0.05%). A 5xx indicates that some sort of server-side
threshold (default, 0.05% for 5m). A 5xx indicates that some sort of server-side
error is occurring, and you should investigate which status codes are being
returned to investigate this alarm. A breakdown of responses by status code
can be found in grafana on the "Istio Service Dashboard". Be sure to navigate
@@ -10,6 +10,28 @@ service. Many services have custom dashboards in DataDog as well which may help
investigate this alert further, and most services also produce logs of requests
which may provide more context into what errors are being returned and why.

You can check the trend/graph by:

1. Going to your Grafana instance and navigating to the `Explore` tab
2. Entering the following Prometheus query (replace `<x>` and `<y>` with your `cluster` and `destination_service_namespace` values):

```
sum by (destination_service_name, reporter) (
rate(istio_requests_total{cluster="<x>", response_code=~"5.*", destination_service_namespace="<y>"}[5m])
)
/
sum by (destination_service_name, reporter) (
rate(istio_requests_total{cluster="<x>", destination_service_namespace="<y>"}[5m])
)
```
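
To see which specific 5xx codes are driving the increase (the runbook above suggests investigating the status codes being returned first), a variant of the same query grouped by `response_code` can help. This is a sketch reusing the placeholders from the query above:

```
sum by (destination_service_name, response_code) (
  rate(istio_requests_total{cluster="<x>", response_code=~"5.*", destination_service_namespace="<y>"}[5m])
)
```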

Action Items:

1. If the trend is expected, tweak your thresholds (away from the [default 0.05% for 5 minutes](https://github.com/Nextdoor/k8s-charts/blob/f2d3973a1a9292e7c59e3feb4eb49df93dea926d/charts/istio-alerts/values.yaml#L28-L41)); see the sketch below.
2. If the response codes are unexpected, debug your app to determine why error responses have increased.
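
For item 1, a values override passed to this chart could look roughly like the sketch below. The key names are hypothetical and purely illustrative; consult the linked values.yaml for the chart's actual keys and defaults.

```
# Hypothetical override values -- illustrative key names only; see the chart's
# values.yaml (linked above) for the real structure and defaults.
alerting:
  http5xxRate:
    threshold: 0.5   # percent of requests allowed to be 5xx before the alert fires
    for: 10m         # how long the rate must stay above the threshold
```

Apply the override through your usual Helm workflow (for example, `helm upgrade ... -f <override-file>`).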

## HighRequestLatency

TBD
