chore(prometheus-alerts): Group alerts by resource type (#295)
The values file schema has been updated to group alerts by resource type.

Motivation: We have regrouped the alerts so that they can be turned on and
off by resource type.

As an example:

> Value `.Values.containerRules.ContainerWaiting` has been migrated to
> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
> files.

The Helm chart will produce an error if you do not migrate your values files.
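
For concreteness, the migration for the example above looks like this in a
values file (a minimal sketch; the `for`/`severity` values shown are the
chart's documented defaults for this alert):

```yaml
# Before (1.4.x schema) -- fails to render under 1.5.x:
containerRules:
  ContainerWaiting:
    for: 1h
    severity: warning

# After (1.5.x schema) -- the alert now lives under its resource type:
containerRules:
  pods:
    ContainerWaiting:
      for: 1h
      severity: warning
```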

Proof that un-migrated values files now fail to render:
```
akennedy@ndm2a prometheus-alerts % VALUES=values-akennedy.yaml make template > new
install.go:214: [debug] Original chart version: ""
install.go:231: [debug] CHART PATH: /Users/akennedy/src/k8s-charts/charts/prometheus-alerts

Error: execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
helm.go:84: [debug] execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
make: *** [template] Error 1
```

Proof that the rendered alerts are otherwise unchanged (a diff of the old and
new template output shows only the chart version bump and comment/whitespace
cleanup):
```diff
--- orig        2024-04-22 18:47:11
+++ new 2024-04-22 19:14:18
@@ -47,7 +47,7 @@
     nextdoor.com/chart: prometheus-rules
     nextdoor.com/source: https://github.com/Nextdoor/k8s-charts
   labels:
-    helm.sh/chart: prometheus-alerts-1.4.1
+    helm.sh/chart: prometheus-alerts-1.5.0
     app.kubernetes.io/version: "0.0.1"
     app.kubernetes.io/managed-by: Helm
     
@@ -134,18 +134,8 @@
       for: 15m
       labels:
         severity: warning
-
-  #
-  # Original Source:
-  #    https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-13.3.0/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kubernetes-apps.yaml
-  #
-  # This file has been modified so that the individual alarms are configurable.
-  # The default values for thresholds, periods and severities made these alarms
-  # too limited for us.
-  #
   - name: prometheus-alerts.kube-system.kubernetesAppsRules
     rules:
-
     - alert: PodCrashLoopBackOff
       annotations:
         summary: Container inside pod {{ $labels.pod }} is crash looping
@@ -166,7 +156,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: PodNotReady
       annotations:
         summary: Pod has been in a non-ready state for more than 15m
@@ -195,7 +184,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDeploymentGenerationMismatch
       annotations:
         summary: Deployment generation mismatch due to possible roll-back
@@ -212,7 +200,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetReplicasMismatch
       annotations:
         summary: StatefulSet has not matched the expected number of replicas.
@@ -234,7 +221,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetGenerationMismatch
       annotations:
         summary: StatefulSet generation mismatch due to possible roll-back
@@ -251,7 +237,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetUpdateNotRolledOut
       annotations:
         summary: StatefulSet update has not been rolled out.
@@ -280,7 +265,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetRolloutStuck
       annotations:
         summary: DaemonSet rollout is stuck.
@@ -316,7 +300,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetNotScheduled
       annotations:
         summary: DaemonSet pods are not scheduled.
@@ -332,7 +315,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetMisScheduled
       annotations:
         summary: DaemonSet pods are misscheduled.
@@ -345,7 +327,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeJobCompletion
       annotations:
         summary: Job did not complete in time
@@ -361,7 +342,6 @@
       for: 12h
       labels:
         severity: warning
-
     - alert: KubeJobFailed
       annotations:
         summary: Job failed to complete.
@@ -374,7 +354,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaReplicasMismatch
       annotations:
         summary: HPA has not matched descired number of replicas.
@@ -388,31 +367,22 @@
          kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
           !=
          kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-          and
-
-        (
+        ) and (
          kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
           >
          kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-
-          and
-
-        (
+        ) and (
           kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
             <
           kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
+        ) and (
+          changes(kube_horizontalpodautoscaler_status_current_replicas[15m])
+            ==
+          0
         )
-
-          and
-
-        changes(kube_horizontalpodautoscaler_status_current_replicas[15m]) == 0
-
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaMaxedOut
       annotations:
         summary: HPA is running at max replicas
```
LaikaN57 authored Apr 23, 2024
1 parent bdb4e64 commit 8fb7d95
Showing 6 changed files with 345 additions and 239 deletions.
2 changes: 1 addition & 1 deletion charts/prometheus-alerts/Chart.yaml
```diff
@@ -2,7 +2,7 @@ apiVersion: v2
 name: prometheus-alerts
 description: Helm Chart that provisions a series of common Prometheus Alerts
 type: application
-version: 1.4.1
+version: 1.5.0
 appVersion: 0.0.1
 maintainers:
   - name: diranged
```
79 changes: 50 additions & 29 deletions charts/prometheus-alerts/README.md
```diff
@@ -3,7 +3,7 @@
 
 Helm Chart that provisions a series of common Prometheus Alerts
 
-![Version: 1.4.1](https://img.shields.io/badge/Version-1.4.1-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
+![Version: 1.5.0](https://img.shields.io/badge/Version-1.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
 
 [deployments]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
 [hpa]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
@@ -20,6 +20,21 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and
 
 ## Upgrade Notes
 
+### 1.4.x -> 1.5.x
+
+**BREAKING: Values files schema has been updated to group alerts by resource type**
+
+Motivation: We have regrouped alerts to be able to turn them on and off by
+resource type.
+
+As an example:
+
+> Value `.Values.containerRules.ContainerWaiting` has been migrated to
+> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
+> files.
+
+The helm chart will produce errors if you do not migrate your values files.
+
 ### 1.1.x -> 1.2.x
 
 **CHANGE: Resource Names have changed**
@@ -68,35 +83,41 @@ This behavior can be tuned via the `defaults.podNameSelector`,
 | alertManager.repeatInterval | string | `"1h"` | How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more). |
 | chart_name | string | `"prometheus-rules"` | |
 | chart_source | string | `"https://github.com/Nextdoor/k8s-charts"` | |
-| containerRules.CPUThrottlingHigh | object | `{"for":"15m","severity":"warning","threshold":5}` | Container is being throttled by the CGroup - needs more resources. This value is appropriate for applications that are highly sensitive to request latency. Insensitive workloads might need to raise this percentage to avoid alert noise. |
-| containerRules.ContainerWaiting.for | string | `"1h"` | |
-| containerRules.ContainerWaiting.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
-| containerRules.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
-| containerRules.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
-| containerRules.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
-| containerRules.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
-| containerRules.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
-| containerRules.KubeHpaMaxedOut.for | string | `"15m"` | |
-| containerRules.KubeHpaMaxedOut.severity | string | `"warning"` | |
-| containerRules.KubeHpaReplicasMismatch.for | string | `"15m"` | |
-| containerRules.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
-| containerRules.KubeJobCompletion.for | string | `"12h"` | |
-| containerRules.KubeJobCompletion.severity | string | `"warning"` | |
-| containerRules.KubeJobFailed.for | string | `"15m"` | |
-| containerRules.KubeJobFailed.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
-| containerRules.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
-| containerRules.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
-| containerRules.PodContainerOOMKilled | object | `{"for":"1m","over":"60m","severity":"warning","threshold":0}` | Sums up all of the OOMKilled events per pod over the $over time (60m). If that number breaches the $threshold (0) for $for (1m), then it will alert. |
-| containerRules.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
-| containerRules.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
-| containerRules.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
+| containerRules.daemonsets.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
+| containerRules.daemonsets.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
+| containerRules.daemonsets.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
+| containerRules.daemonsets.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
+| containerRules.daemonsets.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
+| containerRules.daemonsets.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
+| containerRules.daemonsets.enabled | bool | `true` | Enables the DaemonSet resource rules |
+| containerRules.deployments.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
+| containerRules.deployments.enabled | bool | `true` | Enables the Deployment resource rules |
 | containerRules.enabled | bool | `true` | Whether or not to enable the container rules template |
+| containerRules.hpas.KubeHpaMaxedOut.for | string | `"15m"` | |
+| containerRules.hpas.KubeHpaMaxedOut.severity | string | `"warning"` | |
+| containerRules.hpas.KubeHpaReplicasMismatch.for | string | `"15m"` | |
+| containerRules.hpas.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
+| containerRules.hpas.enabled | bool | `true` | Enables the HorizontalPodAutoscaler resource rules |
+| containerRules.jobs.KubeJobCompletion.for | string | `"12h"` | |
+| containerRules.jobs.KubeJobCompletion.severity | string | `"warning"` | |
+| containerRules.jobs.KubeJobFailed.for | string | `"15m"` | |
+| containerRules.jobs.KubeJobFailed.severity | string | `"warning"` | |
+| containerRules.jobs.enabled | bool | `true` | Enables the Job resource rules |
+| containerRules.pods.CPUThrottlingHigh | object | `{"for":"15m","severity":"warning","threshold":5}` | Container is being throttled by the CGroup - needs more resources. This value is appropriate for applications that are highly sensitive to request latency. Insensitive workloads might need to raise this percentage to avoid alert noise. |
+| containerRules.pods.ContainerWaiting.for | string | `"1h"` | |
+| containerRules.pods.ContainerWaiting.severity | string | `"warning"` | |
+| containerRules.pods.PodContainerOOMKilled | object | `{"for":"1m","over":"60m","severity":"warning","threshold":0}` | Sums up all of the OOMKilled events per pod over the $over time (60m). If that number breaches the $threshold (0) for $for (1m), then it will alert. |
+| containerRules.pods.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
+| containerRules.pods.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
+| containerRules.pods.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
+| containerRules.pods.enabled | bool | `true` | Enables the Pod resource rules |
+| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
+| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
+| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
+| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
+| containerRules.statefulsets.enabled | bool | `true` | Enables the StatefulSet resource rules |
 | defaults.additionalRuleLabels | `map` | `{}` | Additional custom labels attached to every PrometheusRule |
 | defaults.daemonsetNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the DaemonSet alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the DaemonSets in the namespace. This string is run through the `tpl` function. |
 | defaults.deploymentNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the Deployment alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the Deployments in the namespace. This string is run through the `tpl` function. |
```
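
The payoff of the regrouping shows up in the new `enabled` flags above: a
whole resource type's alerts can now be switched off at once. A short sketch
(the keys come from the table above; the chosen values are illustrative, not
recommendations):

```yaml
containerRules:
  enabled: true     # master switch for the whole container rules template
  daemonsets:
    enabled: false  # disables every DaemonSet alert for this release
  pods:
    enabled: true
    ContainerWaiting:
      for: 30m      # illustrative override of the documented 1h default
      severity: warning
```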
15 changes: 15 additions & 0 deletions charts/prometheus-alerts/README.md.gotmpl
```diff
@@ -19,6 +19,21 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and
 
 ## Upgrade Notes
 
+### 1.4.x -> 1.5.x
+
+**BREAKING: Values files schema has been updated to group alerts by resource type**
+
+Motivation: We have regrouped alerts to be able to turn them on and off by
+resource type.
+
+As an example:
+
+> Value `.Values.containerRules.ContainerWaiting` has been migrated to
+> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
+> files.
+
+The helm chart will produce errors if you do not migrate your values files.
+
 ### 1.1.x -> 1.2.x
 
 **CHANGE: Resource Names have changed**
```
14 changes: 14 additions & 0 deletions charts/prometheus-alerts/templates/_migrations.tpl
```diff
@@ -0,0 +1,14 @@
+{{- /*
+These checks are in place to ensure that the values files are up-to-date with
+values schema changes in v1.5.0.
+*/}}
+
+{{- define "prometheus-alerts.check_migration_alerts_grouped_by_resource" }}
+{{- $rules := dict "ContainerWaiting" "pods" "CPUThrottlingHigh" "pods" "KubeDaemonSetMisScheduled" "daemonsets" "KubeDaemonSetNotScheduled" "daemonsets" "KubeDaemonSetRolloutStuck" "daemonsets" "KubeDeploymentGenerationMismatch" "deployments" "KubeHpaMaxedOut" "hpas" "KubeHpaReplicasMismatch" "hpas" "KubeJobCompletion" "jobs" "KubeJobFailed" "jobs" "KubeStatefulSetGenerationMismatch" "statefulsets" "KubeStatefulSetReplicasMismatch" "statefulsets" "KubeStatefulSetUpdateNotRolledOut" "statefulsets" "PodContainerOOMKilled" "pods" "PodContainerTerminated" "pods" "PodCrashLoopBackOff" "pods" "PodNotReady" "pods" }}
+
+{{- range $rule, $type := $rules }}
+{{- if index $.Values.containerRules $rule }}
+{{- printf "Value `.Values.containerRules.%s` has been migrated to `.Values.containerRules.%s.%s`. Please update your values files." $rule $type $rule | fail }}
+{{- end }}
+{{- end }}
+{{- end -}}
```
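
The failing location in the proof above (`containers-prometheusrule.yaml:1:4`)
indicates this named template is included at the very top of the rule
template. A hedged sketch of that wiring (the guard, metadata, and names below
are illustrative, not the chart's actual file contents):

```yaml
{{- include "prometheus-alerts.check_migration_alerts_grouped_by_resource" . }}
{{- /* The include renders nothing on success and aborts rendering via
       `fail` when a legacy flat key is still set under containerRules. */}}
{{- if .Values.containerRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-containers  # illustrative name
spec:
  groups: []  # the chart's rule groups render here
{{- end }}
```

Note that `index $.Values.containerRules $rule` returns the key's value
itself, so the check only fires when a legacy key is set to something truthy;
a legacy key explicitly set to `false` would pass silently.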
