feat(prometheus-alerts): Add selector validity alarms (#297)
We ran into an issue where, unbeknownst to us, our pod selector was set
incorrectly, and therefore none of the rules deployed via prometheus-alerts
had any series to evaluate. This lack of series data is silently ignored.
Here we resolve this by implementing a rule which uses each of the various
selectors to check whether a basic metric exists. If it does not exist, we
warn the user that none of the other rules will be evaluated.

I tried to choose metrics for each selector that were generic enough and did
not rely on a specific metric that might not be gathered. Most resource types
had an associated info or labels metric, which is a good candidate for this
use case.
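The resulting expression pattern (shown here with the pod-level `kube_pod_info` metric and hypothetical selector values substituted in) counts matching series and falls back to `vector(0)` when the selector matches nothing:

```promql
(
  count(
    kube_pod_info{
      pod=~"foo-bar-.*",
      namespace="default"
    }
  ) or on() vector(0)
) == 0
```

The `or on() vector(0)` clause is what makes the check possible: `count()` over an empty instant vector returns no data at all rather than zero, so without the fallback the `== 0` comparison would never produce a result and the alert could never fire.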

Lastly, I set the `for` duration of this alert to 1 hour by default. This felt
okay because we want to be somewhat resilient to Prometheus metric-collection
outages and not accidentally page every service owner if we have a centralized
outage of the platform itself. Normally, a team might only be paged for this
once, when they first set up their application without configuring their
selectors correctly. This alert is not expected to go into an alerting state
without some heavy-handed naming changes (mostly done during things like
migrations).
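Both the duration and the severity can be tuned per rule through the chart values; a minimal sketch for the pod-level rule (the same shape applies to the deployment, daemonset, statefulset, job, and HPA variants), with an illustrative override value:

```yaml
containerRules:
  pods:
    PodSelectorValidity:
      # Shorten the default 1h window for faster feedback, at the cost
      # of more sensitivity to metric-collection outages.
      for: 30m
      severity: warning
```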

See #285
LaikaN57 authored Apr 26, 2024
1 parent ba4838e commit 8b658d1
Showing 6 changed files with 256 additions and 4 deletions.
2 changes: 1 addition & 1 deletion charts/prometheus-alerts/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: prometheus-alerts
description: Helm Chart that provisions a series of common Prometheus Alerts
type: application
version: 1.5.0
version: 1.6.0
appVersion: 0.0.1
maintainers:
- name: diranged
18 changes: 16 additions & 2 deletions charts/prometheus-alerts/README.md
@@ -1,9 +1,8 @@

# prometheus-alerts

Helm Chart that provisions a series of common Prometheus Alerts

![Version: 1.5.0](https://img.shields.io/badge/Version-1.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
![Version: 1.6.0](https://img.shields.io/badge/Version-1.6.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)

[deployments]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[hpa]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
@@ -20,6 +19,15 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and

## Upgrade Notes

### 1.5.x -> 1.6.x

**CHANGE: New `*SelectorValidity` alert rules have been added.**

We have added new alert rules which attempt to detect whether you have
misconfigured your selectors. After upgrading, you may get alerted. You should
respond by reading the alert information and correcting your selectors.

### 1.4.x -> 1.5.x

**BREAKING: Values files schema has been updated to group alerts by resource type**
@@ -83,21 +91,25 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| alertManager.repeatInterval | string | `"1h"` | How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more). |
| chart_name | string | `"prometheus-rules"` | |
| chart_source | string | `"https://github.com/Nextdoor/k8s-charts"` | |
| containerRules.daemonsets.DaemonsetSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
| containerRules.daemonsets.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
| containerRules.daemonsets.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
| containerRules.daemonsets.enabled | bool | `true` | Enables the DaemonSet resource rules |
| containerRules.deployments.DeploymentSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.deployments.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
| containerRules.deployments.enabled | bool | `true` | Enables the Deployment resource rules |
| containerRules.enabled | bool | `true` | Whether or not to enable the container rules template |
| containerRules.hpas.HpaSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.hpas.KubeHpaMaxedOut.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaMaxedOut.severity | string | `"warning"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
| containerRules.hpas.enabled | bool | `true` | Enables the HorizontalPodAutoscaler resource rules |
| containerRules.jobs.JobSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.jobs.KubeJobCompletion.for | string | `"12h"` | |
| containerRules.jobs.KubeJobCompletion.severity | string | `"warning"` | |
| containerRules.jobs.KubeJobFailed.for | string | `"15m"` | |
@@ -110,13 +122,15 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| containerRules.pods.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
| containerRules.pods.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
| containerRules.pods.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
| containerRules.pods.PodSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.pods.enabled | bool | `true` | Enables the Pod resource rules |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
| containerRules.statefulsets.StatefulsetSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.statefulsets.enabled | bool | `true` | Enables the StatefulSet resource rules |
| defaults.additionalRuleLabels | `map` | `{}` | Additional custom labels attached to every PrometheusRule |
| defaults.daemonsetNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the DaemonSet alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the DaemonSets in the namespace. This string is run through the `tpl` function. |
11 changes: 10 additions & 1 deletion charts/prometheus-alerts/README.md.gotmpl
@@ -1,5 +1,5 @@

{{ template "chart.header" . }}

{{ template "chart.description" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}
@@ -19,6 +19,15 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and

## Upgrade Notes

### 1.5.x -> 1.6.x

**CHANGE: New `*SelectorValidity` alert rules have been added.**

We have added new alert rules which attempt to detect whether you have
misconfigured your selectors. After upgrading, you may get alerted. You should
respond by reading the alert information and correcting your selectors.

### 1.4.x -> 1.5.x

**BREAKING: Values files schema has been updated to group alerts by resource type**
12 changes: 12 additions & 0 deletions charts/prometheus-alerts/runbook.md
@@ -101,3 +101,15 @@ help you determine the root cause of the issue. Follow the instructions in the
into the relevant cluster and namespace, and use the `kubectl describe pod <podname>`
to see the status of the pod and any events related to it. The pod logs may also
provide hints as to what may be going wrong.

## Alert-Rules-Selectors-Validity

This alert fires when there may be an error in setting the proper selectors used
by the other alerts in this chart. It attempts to read a basic metric using the
selector you provided. For instance, if you have a pod selector that looks for
`pod=~"foo-bar-.*"` but your pods are actually named `baz-.*`, this alert will
notify you of the misconfiguration. Read the alert description to see exactly
which selector is having an issue. Also note that you need to collect the
metrics that this alert uses. For instance, to test pod selectors, we use the
`kube_pod_info` metric. If you do not collect this metric, this alert will
continuously fire.
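If this alert fires, one way to confirm the mismatch is to run the underlying query by hand in the Prometheus UI, assuming the `kube_pod_info` metric is collected and using a hypothetical selector pattern:

```promql
count(kube_pod_info{pod=~"foo-bar-.*", namespace="default"})
```

An empty result confirms the selector matches no pods; adjust the pattern until series are returned, then apply the corrected selector to your chart values.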
169 changes: 169 additions & 0 deletions charts/prometheus-alerts/templates/containers-prometheusrule.yaml
@@ -200,6 +200,35 @@ spec:
{{- end }}
{{- end }}

{{ with .PodSelectorValidity -}}
- alert: PodSelectorValidity
annotations:
summary: PodSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The PodSelector used for pod level alerts did not return any data.
Please check the PodSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your pods so that you
will be alerted for pod issues. The current selector is
`{{ $podSelector }}, {{ $namespaceSelector }}`.
expr: |-
(
count(
kube_pod_info{
{{ $podSelector }},
{{ $namespaceSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -228,6 +257,34 @@ spec:
{{- end }}
{{- end }}

{{ with .DeploymentSelectorValidity -}}
- alert: DeploymentSelectorValidity
annotations:
summary: DeploymentSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The DeploymentSelector used for deployment level alerts did not return any data.
Please check the DeploymentSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your deployments so that you
will be alerted for deployment issues. The current selector is
`{{ $deploymentSelector }}`.
expr: |-
(
count(
kube_deployment_labels{
{{ $deploymentSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -317,6 +374,34 @@ spec:
{{- end }}
{{- end }}

{{ with .StatefulsetSelectorValidity -}}
- alert: StatefulsetSelectorValidity
annotations:
summary: StatefulsetSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The StatefulsetSelector used for statefulset level alerts did not return any data.
Please check the StatefulsetSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your statefulsets so that you
will be alerted for statefulset issues. The current selector is
`{{ $statefulsetSelector }}`.
expr: |-
(
count(
kube_statefulset_created{
{{ $statefulsetSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -403,6 +488,34 @@ spec:
{{- end }}
{{- end }}

{{ with .DaemonsetSelectorValidity -}}
- alert: DaemonsetSelectorValidity
annotations:
summary: DaemonsetSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The DaemonsetSelector used for daemonset level alerts did not return any data.
Please check the DaemonsetSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your daemonsets so that you
will be alerted for daemonset issues. The current selector is
`{{ $daemonsetSelector }}`.
expr: |-
(
count(
kube_daemonset_labels{
{{ $daemonsetSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -448,6 +561,34 @@ spec:
{{- end }}
{{- end }}

{{ with .JobSelectorValidity -}}
- alert: JobSelectorValidity
annotations:
summary: JobSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The JobSelector used for job level alerts did not return any data.
Please check the JobSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your jobs so that you
will be alerted for job issues. The current selector is
`{{ $jobSelector }}`.
expr: |-
(
count(
kube_job_info{
{{ $jobSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -509,6 +650,34 @@ spec:
{{- end }}
{{- end }}

{{ with .HpaSelectorValidity -}}
- alert: HpaSelectorValidity
annotations:
summary: HpaSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The HpaSelector used for hpa level alerts did not return any data.
Please check the HpaSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your hpas so that you
will be alerted for hpa issues. The current selector is
`{{ $hpaSelector }}`.
expr: |-
(
count(
kube_horizontalpodautoscaler_info{
{{ $hpaSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}
