feat(istio-alerts): Add service selector validity alert (#287)

> [!WARNING]
> Reminder for @LaikaN57 - Update other charts in this repo after this PR is merged.

Split from #285

We ran into an issue where, unbeknownst to us, our pod selector was set
incorrectly, and therefore none of the rules deployed via
prometheus-alerts had any series to evaluate. Prometheus silently ignores
this lack of series data. We resolve that here by adding a rule that uses
the configured selectors to check whether a basic metric exists. If it
does not, we warn the user that none of the other rules will ever be
evaluated.
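
For illustration, this is roughly what the failure mode looks like (simplified from the 5xx rule in this chart, with a hypothetical mistyped service name): the selector matches no series, the query returns an empty result, and the rule silently never fires.

```
# Simplified 5xx-style rule with a mistyped selector (hypothetical name).
# No series match, so the expression returns nothing and never alerts.
sum by (destination_service_name) (
  istio_requests:increase5m{destination_service_name=~"my-typo-service-.*"}
) > 0
```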

Snippet from the other PR about using `kube_service_info` as our test
metric.

> I think I see what you're doing here ... but I see a risk. If there is
> no traffic to the service (for example, when it's being launched), there
> will be no metrics that match this query. I propose that instead we do
> this:
>
> ```
> count(kube_service_info{namespace="{{ $.Release.Namespace }}", name=~"{{ $.Values.serviceRules.destinationServiceName }}"})
> ```
>
> Sadly - I know that this means we aren't using the exact selectors
> (like I propose above) ... but this allows us to verify that the Service
> actually exists by that name. Ya know?
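
The committed expression (see the template diff below) builds on that proposal with one extra guard: `count()` over a selector that matches nothing returns an empty vector rather than `0`, so a bare `count(...) == 0` would itself never fire. Wrapping it in `or on() vector(0)` turns the empty case into an explicit zero:

```
(
  count(
    kube_service_info{
      namespace="{{ $.Release.Namespace }}",
      name=~"{{ $.Values.serviceRules.destinationServiceName }}"
    }
  ) or on() vector(0)
) == 0
```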

Lastly, I set the `for` duration of this alert to 1 hour by default. This
felt okay to me: we want to be somewhat resilient to Prometheus metric
collection outages and not accidentally page every service owner when the
platform itself has a centralized outage. Normally, a team should only be
paged by this once, when they first set up their application with
incorrectly configured selectors. This alert is not expected to go into an
alert state otherwise, short of heavy-handed naming changes (mostly done
during things like migrations).
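
If the defaults don't fit, the new keys (added to `values.yaml` below) can be overridden per release. A minimal sketch, assuming a hypothetical override file passed to Helm with `-f`:

```
# my-overrides.yaml (hypothetical file name)
serviceRules:
  destinationServiceSelectorValidity:
    # Set to false if kube-state-metrics is not installed in the cluster.
    enabled: true
    # Evaluate over a shorter window than the 1h default.
    for: 30m
    # Route to a lower-urgency receiver.
    severity: warning
```
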
LaikaN57 authored Apr 17, 2024
1 parent c8ddaed commit 8655ff4
Showing 6 changed files with 109 additions and 7 deletions.
4 changes: 3 additions & 1 deletion charts/istio-alerts/Chart.yaml
@@ -2,8 +2,10 @@ apiVersion: v2
name: istio-alerts
description: A Helm chart that provisions a series of alerts for istio VirtualServices
type: application
version: 0.2.0
version: 0.3.0
appVersion: 0.0.1
maintainers:
- name: diranged
email: [email protected]
- name: laikan57
email: [email protected]
19 changes: 18 additions & 1 deletion charts/istio-alerts/README.md
@@ -1,6 +1,6 @@
# istio-alerts

![Version: 0.2.0](https://img.shields.io/badge/Version-0.2.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
![Version: 0.3.0](https://img.shields.io/badge/Version-0.3.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)

A Helm chart that provisions a series of alerts for istio VirtualServices

@@ -9,6 +9,18 @@ A Helm chart that provisions a series of alerts for istio VirtualServices
| Name | Email | Url |
| ---- | ------ | --- |
| diranged | <[email protected]> | |
| laikan57 | <[email protected]> | |

## Upgrade Notes

### 0.2.x -> 0.3.x

**BREAKING: The DestinationServiceSelectorValidity alert rule requires kube-state-metrics.**

An alert was introduced in 0.3.x that requires kube-state-metrics to be installed in the cluster. If
you do not have kube-state-metrics installed, you will need to disable the alert by setting
`serviceRules.destinationServiceSelectorValidity.enabled` to `false`. This alert is used to detect
if the destinationServiceSelector is actually selecting series for a service that exists.

## Values

@@ -25,7 +37,12 @@ A Helm chart that provisions a series of alerts for istio VirtualServices
| defaults.additionalRuleLabels | object | `{}` | Additional custom labels attached to every PrometheusRule |
| defaults.runbookUrl | string | `"https://github.com/Nextdoor/k8s-charts/blob/main/charts/istio-alerts/runbook.md"` | The prefix URL to the runbook_urls that will be applied to each PrometheusRule |
| serviceRules.destinationServiceName | string | `".*"` | Narrow down the alerts to a particular Destination Service if there are multiple services that require different thresholds within the same namespace. |
| serviceRules.destinationServiceSelectorValidity | object | `{"enabled":true,"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| serviceRules.destinationServiceSelectorValidity.enabled | bool | `true` | Whether to enable the monitor on the selector for the VirtualService. |
| serviceRules.destinationServiceSelectorValidity.for | string | `"1h"` | How long to evaluate. |
| serviceRules.destinationServiceSelectorValidity.severity | string | `"warning"` | Severity of the monitor |
| serviceRules.enabled | bool | `true` | Whether to enable the service rules template |
| serviceRules.highRequestLatency | object | `{"enabled":true,"for":"15m","percentile":0.95,"severity":"warning","threshold":0.5}` | Configuration related to the latency monitor for the VirtualService. |
| serviceRules.highRequestLatency.enabled | bool | `true` | Whether to enable the monitor on latency returned by the VirtualService. |
| serviceRules.highRequestLatency.for | string | `"15m"` | How long to evaluate the latency of services. |
| serviceRules.highRequestLatency.percentile | float | `0.95` | Which percentile to monitor - should be between 0 and 1. Default is 95th percentile. |
24 changes: 24 additions & 0 deletions charts/istio-alerts/README.md.gotmpl
@@ -0,0 +1,24 @@
{{ template "chart.header" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

{{ template "chart.description" . }}

{{ template "chart.maintainersSection" . }}

## Upgrade Notes

### 0.2.x -> 0.3.x

**BREAKING: The DestinationServiceSelectorValidity alert rule requires kube-state-metrics.**

An alert was introduced in 0.3.x that requires kube-state-metrics to be installed in the cluster. If
you do not have kube-state-metrics installed, you will need to disable the alert by setting
`serviceRules.destinationServiceSelectorValidity.enabled` to `false`. This alert is used to detect
if the destinationServiceSelector is actually selecting series for a service that exists.

{{ template "chart.requirementsSection" . }}

{{ template "chart.valuesSection" . }}

{{ template "helm-docs.versionFooter" . }}
16 changes: 16 additions & 0 deletions charts/istio-alerts/runbook.md
@@ -9,3 +9,19 @@ to the grafana deployment for the correct EKS cluster and select the relevant
service. Many services have custom dashboards in DataDog as well which may help
investigate this alert further, and most services also produce logs of requests
which may provide more context into what errors are being returned and why.

## HighRequestLatency

TBD

## Alert-Rules-Selectors-Validity

This alert fires when there may be an error in setting the proper selectors used
by the other alerts in this chart. It attempts to read a basic metric using the
selector you provided. For instance, if you have a pod selector that looks for
`pod=~"foo-bar-.*"` but your pods are actually named `baz-.*`, this alert will
notify you of the misconfiguration. Read the alert description to see exactly
which selector is having an issue. Also note that you need to collect the
metrics that this alert uses. For instance, to test pod selectors, we use the
`kube_pod_info` metric. If you do not collect this metric, this alert will
continuously fire.
38 changes: 33 additions & 5 deletions charts/istio-alerts/templates/service-prometheusrule.yaml
@@ -22,7 +22,6 @@ spec:
description: >-
High rate of 5xx responses from the {{`{{$labels.destination_service_name}}`}} VirtualService
in namespace {{`{{$labels.namespace}}`}}.
expr: >
sum by (destination_service_name, reporter, source_workload) (
istio_requests:increase5m{
@@ -48,15 +47,13 @@ spec:
{{- end }}
{{- end }}
{{- end }}

# Adding latency alarm to trigger when request latency is above > threshold
{{- with .Values.serviceRules.highRequestLatency }}
{{- if .enabled }}
- alert: HighRequestLatency
annotations:
summary: >-
{{`{{$labels.destination_service_name}}`}} avg request latencies are above {{ .threshold }}s!
runbook_url: {{ $values.defaults.runbookUrl}} #HighRequestLatency
runbook_url: {{ $values.defaults.runbookUrl}}#HighRequestLatency
description: >-
Average request latency of {{`{{ $value | humanizePercentage }}`}} is above threshold ({{ .threshold }}s)
in namespace {{`{{ $labels.namespace }}`}} for pod {{`{{ $labels.pod }}`}} (container: {{`{{ $labels.container }}`}}).
@@ -85,5 +82,36 @@ spec:
{{ toYaml . | nindent 8}}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- with .Values.serviceRules.destinationServiceSelectorValidity }}
{{- if .enabled }}
- alert: DestinationServiceSelectorValidity
annotations:
summary: >-
DestinationServiceSelector for istio-alerts is invalid
runbook_url: {{ $values.defaults.runbookUrl}}#Alert-Rules-Selectors-Validity
description: >-
The DestinationServiceSelector used for service level alerts did not return any data.
Please check the DestinationServiceSelector applied in your istio-alerts chart
is correct to ensure you are properly selecting your service so that you
will be alerted for service issues. The current selector is
`{destination_service_namespace="{{ $release.Namespace }}", destination_service_name="{{ $.Values.serviceRules.destinationServiceName }}"}`.
expr: >-
(
count(
kube_service_info{
namespace="{{ $.Release.Namespace }}",
name=~"{{ $.Values.serviceRules.destinationServiceName }}"
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $release.Namespace }}
{{- with $values.defaults.additionalRuleLabels}}
{{ toYaml . | nindent 8}}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
15 changes: 15 additions & 0 deletions charts/istio-alerts/values.yaml
@@ -99,6 +99,7 @@ serviceRules:
# -- Severity of the 5xx monitor
severity: critical

# -- Configuration related to the latency monitor for the VirtualService.
highRequestLatency:
# -- Whether to enable the monitor on latency returned by the
# VirtualService.
@@ -117,3 +118,17 @@ serviceRules:

# -- How long to evaluate the latency of services.
for: 15m

# -- Does a basic lookup using the defined selectors to see if we can see any
# info for a given selector. This is the "watcher for the watcher". If we get
# alerted by this, we likely have a bad selector and our alerts are not going
# to ever fire.
destinationServiceSelectorValidity:
# -- Whether to enable the monitor on the selector for the VirtualService.
enabled: true

# -- Severity of the monitor
severity: warning

# -- How long to evaluate.
for: 1h
