chore(prometheus-alerts): Group alerts by resource type (#295)
The values file schema has been updated to group alerts by resource type.

Motivation: We have regrouped the alerts so that they can be turned on and off by resource type. As an example:

> Value `.Values.containerRules.ContainerWaiting` has been migrated to
> `.Values.containerRules.pods.ContainerWaiting`. Please update your values
> files.

The Helm chart will produce errors if you do not migrate your values files.

Proof:

```
akennedy@ndm2a prometheus-alerts % VALUES=values-akennedy.yaml make template > new
install.go:214: [debug] Original chart version: ""
install.go:231: [debug] CHART PATH: /Users/akennedy/src/k8s-charts/charts/prometheus-alerts

Error: execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
helm.go:84: [debug] execution error at (prometheus-alerts/templates/containers-prometheusrule.yaml:1:4): Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files.
make: *** [template] Error 1
```

Proof:

```diff
--- orig	2024-04-22 18:47:11
+++ new	2024-04-22 19:14:18
@@ -47,7 +47,7 @@
     nextdoor.com/chart: prometheus-rules
     nextdoor.com/source: https://github.com/Nextdoor/k8s-charts
   labels:
-    helm.sh/chart: prometheus-alerts-1.4.1
+    helm.sh/chart: prometheus-alerts-1.5.0
     app.kubernetes.io/version: "0.0.1"
     app.kubernetes.io/managed-by: Helm
@@ -134,18 +134,8 @@
       for: 15m
       labels:
         severity: warning
-
-  #
-  # Original Source:
-  # https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-13.3.0/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kubernetes-apps.yaml
-  #
-  # This file has been modified so that the individual alarms are configurable.
-  # The default values for thresholds, periods and severities made these alarms
-  # too limited for us.
-  #
   - name: prometheus-alerts.kube-system.kubernetesAppsRules
     rules:
-
     - alert: PodCrashLoopBackOff
       annotations:
         summary: Container inside pod {{ $labels.pod }} is crash looping
@@ -166,7 +156,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: PodNotReady
       annotations:
         summary: Pod has been in a non-ready state for more than 15m
@@ -195,7 +184,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDeploymentGenerationMismatch
       annotations:
         summary: Deployment generation mismatch due to possible roll-back
@@ -212,7 +200,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetReplicasMismatch
       annotations:
         summary: StatefulSet has not matched the expected number of replicas.
@@ -234,7 +221,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetGenerationMismatch
       annotations:
         summary: StatefulSet generation mismatch due to possible roll-back
@@ -251,7 +237,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeStatefulSetUpdateNotRolledOut
       annotations:
         summary: StatefulSet update has not been rolled out.
@@ -280,7 +265,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetRolloutStuck
       annotations:
         summary: DaemonSet rollout is stuck.
@@ -316,7 +300,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetNotScheduled
       annotations:
         summary: DaemonSet pods are not scheduled.
@@ -332,7 +315,6 @@
       for: 10m
       labels:
         severity: warning
-
     - alert: KubeDaemonSetMisScheduled
       annotations:
         summary: DaemonSet pods are misscheduled.
@@ -345,7 +327,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeJobCompletion
       annotations:
         summary: Job did not complete in time
@@ -361,7 +342,6 @@
       for: 12h
       labels:
         severity: warning
-
     - alert: KubeJobFailed
       annotations:
         summary: Job failed to complete.
@@ -374,7 +354,6 @@
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaReplicasMismatch
       annotations:
         summary: HPA has not matched descired number of replicas.
@@ -388,31 +367,22 @@
           kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
             !=
           kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-          and
-
-        (
+        ) and (
           kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
             >
           kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
-        )
-
-          and
-
-        (
+        ) and (
           kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
             <
           kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", horizontalpodautoscaler=~"prometheus-alerts-.*", namespace="kube-system"}
+        ) and (
+          changes(kube_horizontalpodautoscaler_status_current_replicas[15m])
+            ==
+          0
         )
-
-          and
-
-        changes(kube_horizontalpodautoscaler_status_current_replicas[15m]) == 0
-
       for: 15m
       labels:
         severity: warning
-
     - alert: KubeHpaMaxedOut
       annotations:
         summary: HPA is running at max replicas
```
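For chart users, the migration is mechanical: each alert key moves from directly under its rules section to a nested resource-type group. A minimal before/after sketch of a values file; only the `ContainerWaiting` key path is taken from this change, and the `severity` setting is a hypothetical per-alert option used for illustration:

```yaml
# Before (prometheus-alerts <= 1.4.x): the alert key sits directly
# under containerRules.
containerRules:
  ContainerWaiting:
    severity: warning   # hypothetical setting, for illustration only
---
# After (prometheus-alerts >= 1.5.0): the same alert is nested under
# its resource type ("pods").
containerRules:
  pods:
    ContainerWaiting:
      severity: warning
```

Any other per-alert settings move along with their key unchanged; only the extra resource-type level is new.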
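The hard failure in the first proof is the usual Helm pattern for renamed values: guard the old key and abort rendering with a migration hint via the template `fail` function. A sketch of such a guard, assuming this mechanism; the chart's actual template may be structured differently:

```yaml
{{- /* Sketch only: abort rendering if the pre-1.5.0 flat key is still set,
       so users get a migration hint instead of a silently ignored value. */}}
{{- if hasKey .Values.containerRules "ContainerWaiting" }}
  {{- fail "Value `.Values.containerRules.ContainerWaiting` has been migrated to `.Values.containerRules.pods.ContainerWaiting`. Please update your values files." }}
{{- end }}
```

Both `hasKey` and `fail` are standard Sprig functions available in Helm templates, which is why the error surfaces during `helm template` rather than at deploy time.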