chore(prometheus-alerts): Group alerts by resource type (#295)
The values file schema has been updated to group alerts by resource type.

Motivation: We have regrouped the alerts so that they can be turned on and
off by resource type.

As an example:

> Value `.Values.containerRules.ContainerWaiting` has been migrated to
> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
> files.

The Helm chart will produce an error if you do not migrate your values files.
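
For concreteness, the migration for the example above looks like this in a
values file (a minimal sketch; the `for`/`severity` values shown are the
chart's documented defaults for this alert):

```yaml
# Before (1.4.x schema) -- fails to render under 1.5.x:
containerRules:
  ContainerWaiting:
    for: 1h
    severity: warning

# After (1.5.x schema) -- the alert now lives under its resource type:
containerRules:
  pods:
    ContainerWaiting:
      for: 1h
      severity: warning
```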

Proof that un-migrated values files now fail to render:
```
akennedy@ndm2a prometheus-alerts % VALUES=values-akennedy.yaml make template > new
install.go:214: [debug] Original chart version: ""
install.go:231: [debug] CHART PATH: /Users/akennedy/src/k8s-charts/charts/prometheus-alerts

Error: execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
helm.go:84: [debug] execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
make: *** [template] Error 1
```

Proof that the rendered alerts are otherwise unchanged (a diff of the old and
new template output shows only the chart version bump and comment/whitespace
cleanup):
```diff
--- orig        2024-04-22 18:47:11
+++ new 2024-04-22 19:14:18
@@ -47,7 +47,7 @@
     nextdoor.com/chart: prometheus-rules
     nextdoor.com/source: https://github.com/Nextdoor/k8s-charts
   labels:
-    helm.sh/chart: prometheus-alerts-1.4.1
+    helm.sh/chart: prometheus-alerts-1.5.0
     app.kubernetes.io/version: "0.0.1"
     app.kubernetes.io/managed-by: Helm
     
@@ -134,18 +134,8 @@
       for: 15m
       labels:
         severity: warning
-
-  #
-  # Original Source:
-  #    https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-13.3.0/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kubernetes-apps.yaml
-  #
-  # This file has been modified so that the individual alarms are configurable.
-  # The default values for thresholds, periods and severities made these alarms
-  # too limited for us.
-  #
   - name: prometheus-alerts.kube-system.kubernetesAppsRules
     rules:
-
     - alert: PodCrashLoopBackOff
       annotations:
         summary: Container inside pod {{ $labels.pod }} is crash looping
@@ -166,7 +156,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: PodNotReady
       annotations:
         summary: Pod has been in a non-ready state for more than 15m
@@ -195,7 +184,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDeploymentGenerationMismatch
       annotations:
         summary: Deployment generation mismatch due to possible roll-back
@@ -212,7 +200,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetReplicasMismatch
       annotations:
         summary: StatefulSet has not matched the expected number of replicas.
@@ -234,7 +221,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetGenerationMismatch
       annotations:
         summary: StatefulSet generation mismatch due to possible roll-back
@@ -251,7 +237,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetUpdateNotRolledOut
       annotations:
         summary: StatefulSet update has not been rolled out.
@@ -280,7 +265,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetRolloutStuck
       annotations:
         summary: DaemonSet rollout is stuck.
@@ -316,7 +300,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetNotScheduled
       annotations:
         summary: DaemonSet pods are not scheduled.
@@ -332,7 +315,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetMisScheduled
       annotations:
         summary: DaemonSet pods are misscheduled.
@@ -345,7 +327,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeJobCompletion
       annotations:
         summary: Job did not complete in time
@@ -361,7 +342,6 @@
       for: 12h
       labels:
         severity: warning
-
     - alert: KubeJobFailed
       annotations:
         summary: Job failed to complete.
@@ -374,7 +354,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaReplicasMismatch
       annotations:
         summary: HPA has not matched descired number of replicas.
@@ -388,31 +367,22 @@
          kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
           !=
          kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-          and
-
-        (
+        ) and (
          kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
           >
          kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-
-          and
-
-        (
+        ) and (
           kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
             <
           kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
+        ) and (
+          changes(kube_horizontalpodautoscaler_status_current_replicas[15m])
+            ==
+          0
         )
-
-          and
-
-        changes(kube_horizontalpodautoscaler_status_current_replicas[15m]) == 0
-
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaMaxedOut
       annotations:
         summary: HPA is running at max replicas
```
LaikaN57 authored Apr 23, 2024
1 parent bdb4e64 commit 8fb7d95
Showing 6 changed files with 345 additions and 239 deletions.
2 changes: 1 addition & 1 deletion charts/prometheus-alerts/Chart.yaml
```diff
@@ -2,7 +2,7 @@ apiVersion: v2
 name: prometheus-alerts
 description: Helm Chart that provisions a series of common Prometheus Alerts
 type: application
-version: 1.4.1
+version: 1.5.0
 appVersion: 0.0.1
 maintainers:
   - name: diranged
```
79 changes: 50 additions & 29 deletions charts/prometheus-alerts/README.md
```diff
@@ -3,7 +3,7 @@
 
 Helm Chart that provisions a series of common Prometheus Alerts
 
-![Version: 1.4.1](https://img.shields.io/badge/Version-1.4.1-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
+![Version: 1.5.0](https://img.shields.io/badge/Version-1.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
 
 [deployments]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
 [hpa]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
@@ -20,6 +20,21 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and
 
 ## Upgrade Notes
 
+### 1.4.x -> 1.5.x
+
+**BREAKING: Values files schema has been updated to group alerts by resource type**
+
+Motivation: We have regrouped alerts to be able to turn them on and off by
+resource type.
+
+As an example:
+
+> Value `.Values.containerRules.ContainerWaiting` has been migrated to
+> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
+> files.
+
+The helm chart will produce errors if you do not migrate your values files.
+
 ### 1.1.x -> 1.2.x
 
 **CHANGE: Resource Names have changed**
@@ -68,35 +83,41 @@ This behavior can be tuned via the `defaults.podNameSelector`,
 | alertManager.repeatInterval | string | `"1h"` | How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more). |
 | chart_name | string | `"prometheus-rules"` | |
 | chart_source | string | `"https://github.com/Nextdoor/k8s-charts"` | |
-| containerRules.CPUThrottlingHigh | object | `{"for":"15m","severity":"warning","threshold":5}` | Container is being throttled by the CGroup - needs more resources. This value is appropriate for applications that are highly sensitive to request latency. Insensitive workloads might need to raise this percentage to avoid alert noise. |
-| containerRules.ContainerWaiting.for | string | `"1h"` | |
-| containerRules.ContainerWaiting.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
-| containerRules.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
-| containerRules.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
-| containerRules.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
-| containerRules.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
-| containerRules.KubeHpaMaxedOut.for | string | `"15m"` | |
-| containerRules.KubeHpaMaxedOut.severity | string | `"warning"` | |
-| containerRules.KubeHpaReplicasMismatch.for | string | `"15m"` | |
-| containerRules.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
-| containerRules.KubeJobCompletion.for | string | `"12h"` | |
-| containerRules.KubeJobCompletion.severity | string | `"warning"` | |
-| containerRules.KubeJobFailed.for | string | `"15m"` | |
-| containerRules.KubeJobFailed.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
-| containerRules.PodContainerOOMKilled | object | `{"for":"1m","over":"60m","severity":"warning","threshold":0}` | Sums up all of the OOMKilled events per pod over the $over time (60m). If that number breaches the $threshold (0) for $for (1m), then it will alert. |
-| containerRules.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
-| containerRules.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
-| containerRules.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
+| containerRules.daemonsets.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
+| containerRules.daemonsets.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
+| containerRules.daemonsets.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
+| containerRules.daemonsets.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
+| containerRules.daemonsets.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
+| containerRules.daemonsets.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
+| containerRules.daemonsets.enabled | bool | `true` | Enables the DaemonSet resource rules |
+| containerRules.deployments.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
+| containerRules.deployments.enabled | bool | `true` | Enables the Deployment resource rules |
 | containerRules.enabled | bool | `true` | Whether or not to enable the container rules template |
+| containerRules.hpas.KubeHpaMaxedOut.for | string | `"15m"` | |
+| containerRules.hpas.KubeHpaMaxedOut.severity | string | `"warning"` | |
+| containerRules.hpas.KubeHpaReplicasMismatch.for | string | `"15m"` | |
+| containerRules.hpas.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
+| containerRules.hpas.enabled | bool | `true` | Enables the HorizontalPodAutoscaler resource rules |
+| containerRules.jobs.KubeJobCompletion.for | string | `"12h"` | |
+| containerRules.jobs.KubeJobCompletion.severity | string | `"warning"` | |
+| containerRules.jobs.KubeJobFailed.for | string | `"15m"` | |
+| containerRules.jobs.KubeJobFailed.severity | string | `"warning"` | |
+| containerRules.jobs.enabled | bool | `true` | Enables the Job resource rules |
+| containerRules.pods.CPUThrottlingHigh | object | `{"for":"15m","severity":"warning","threshold":5}` | Container is being throttled by the CGroup - needs more resources. This value is appropriate for applications that are highly sensitive to request latency. Insensitive workloads might need to raise this percentage to avoid alert noise. |
+| containerRules.pods.ContainerWaiting.for | string | `"1h"` | |
+| containerRules.pods.ContainerWaiting.severity | string | `"warning"` | |
+| containerRules.pods.PodContainerOOMKilled | object | `{"for":"1m","over":"60m","severity":"warning","threshold":0}` | Sums up all of the OOMKilled events per pod over the $over time (60m). If that number breaches the $threshold (0) for $for (1m), then it will alert. |
+| containerRules.pods.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
+| containerRules.pods.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
+| containerRules.pods.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
+| containerRules.pods.enabled | bool | `true` | Enables the Pod resource rules |
+| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
+| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
+| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
+| containerRules.statefulsets.enabled | bool | `true` | Enables the StatefulSet resource rules |
 | defaults.additionalRuleLabels | `map` | `{}` | Additional custom labels attached to every PrometheusRule |
 | defaults.daemonsetNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the DaemonSet alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the DaemonSets in the namespace. This string is run through the `tpl` function. |
 | defaults.deploymentNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the Deployment alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the Deployments in the namespace. This string is run through the `tpl` function. |
```
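
The payoff of the regrouping shows up in the new `enabled` flags above: a
whole resource type's alerts can now be switched off at once. A short sketch
(the keys come from the table above; the chosen values are illustrative, not
recommendations):

```yaml
containerRules:
  enabled: true     # master switch for the whole container rules template
  daemonsets:
    enabled: false  # disables every DaemonSet alert for this release
  pods:
    enabled: true
    ContainerWaiting:
      for: 30m      # illustrative override of the documented 1h default
      severity: warning
```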
15 changes: 15 additions & 0 deletions charts/prometheus-alerts/README.md.gotmpl
```diff
@@ -19,6 +19,21 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and
 
 ## Upgrade Notes
 
+### 1.4.x -> 1.5.x
+
+**BREAKING: Values files schema has been updated to group alerts by resource type**
+
+Motivation: We have regrouped alerts to be able to turn them on and off by
+resource type.
+
+As an example:
+
+> Value `.Values.containerRules.ContainerWaiting` has been migrated to
+> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
+> files.
+
+The helm chart will produce errors if you do not migrate your values files.
+
 ### 1.1.x -> 1.2.x
 
 **CHANGE: Resource Names have changed**
```
14 changes: 14 additions & 0 deletions charts/prometheus-alerts/templates/_migrations.tpl
```diff
@@ -0,0 +1,14 @@
+{{- /*
+These checks are in place to ensure that the values files are up-to-date with
+values schema changes in v1.5.0.
+*/}}
+
+{{- define "prometheus-alerts.check_migration_alerts_grouped_by_resource" }}
+{{- $rules := dict "ContainerWaiting" "pods" "CPUThrottlingHigh" "pods" "KubeDaemonSetMisScheduled" "daemonsets" "KubeDaemonSetNotScheduled" "daemonsets" "KubeDaemonSetRolloutStuck" "daemonsets" "KubeDeploymentGenerationMismatch" "deployments" "KubeHpaMaxedOut" "hpas" "KubeHpaReplicasMismatch" "hpas" "KubeJobCompletion" "jobs" "KubeJobFailed" "jobs" "KubeStatefulSetGenerationMismatch" "statefulsets" "KubeStatefulSetReplicasMismatch" "statefulsets" "KubeStatefulSetUpdateNotRolledOut" "statefulsets" "PodContainerOOMKilled" "pods" "PodContainerTerminated" "pods" "PodCrashLoopBackOff" "pods" "PodNotReady" "pods" }}
+
+{{- range $rule, $type := $rules }}
+{{- if index $.Values.containerRules $rule }}
+{{- printf "Value `.Values.containerRules.%s` has been migrated to `.Values.containerRules.%s.%s`. Please update your values files." $rule $type $rule | fail }}
+{{- end }}
+{{- end }}
+{{- end -}}
```
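
The failing location in the proof above (`containers-prometheusrule.yaml:1:4`)
indicates this named template is included at the very top of the rule
template. A hedged sketch of that wiring (the guard, metadata, and names below
are illustrative, not the chart's actual file contents):

```yaml
{{- include "prometheus-alerts.check_migration_alerts_grouped_by_resource" . }}
{{- /* The include renders nothing on success and aborts rendering via
       `fail` when a legacy flat key is still set under containerRules. */}}
{{- if .Values.containerRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-containers  # illustrative name
spec:
  groups: []  # the chart's rule groups render here
{{- end }}
```

Note that `index $.Values.containerRules $rule` returns the key's value
itself, so the check only fires when a legacy key is set to something truthy;
a legacy key explicitly set to `false` would pass silently.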
