chore(prometheus-alerts): allow opting out of the validity selector alerts (#307)

There are some cases where we know the selector is fine (i.e., `.*`) and
yet there will not ALWAYS be resources that match it. An example is any
`Job` created as part of a release process: Jobs come and go, and there
may be periods of time when no `Job` exists, so the `kube_job_info{}`
metric won't exist either.

In these cases, we need an escape hatch that lets the critical alarms
exist without alerting on-call engineers just because a job is
_missing_.

Co-authored-by: Matt Wise <[email protected]>
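
As a rough illustration of the escape hatch described above, a values override along these lines would turn off only the Job validity check. The key path follows the `containerRules.jobs.JobSelectorValidity` entry in this chart's README; the file name `my-values.yaml` is hypothetical.

```yaml
# my-values.yaml (hypothetical file name)
# Disable only the "watcher for the watcher" alert for Jobs, since Jobs
# created by a release process may legitimately be absent for a while.
containerRules:
  jobs:
    JobSelectorValidity:
      enabled: false
```

The other `*SelectorValidity` alerts keep their default of `enabled: true` unless they are overridden the same way.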
diranged and diranged authored May 28, 2024
1 parent c8f794b commit 8a5d38c
Showing 4 changed files with 28 additions and 10 deletions.
2 changes: 1 addition & 1 deletion charts/prometheus-alerts/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: prometheus-alerts
 description: Helm Chart that provisions a series of common Prometheus Alerts
 type: application
-version: 1.7.3
+version: 1.7.4
 appVersion: 0.0.1
 maintainers:
 - name: diranged
14 changes: 7 additions & 7 deletions charts/prometheus-alerts/README.md
@@ -2,7 +2,7 @@

Helm Chart that provisions a series of common Prometheus Alerts

-![Version: 1.7.3](https://img.shields.io/badge/Version-1.7.3-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
+![Version: 1.7.4](https://img.shields.io/badge/Version-1.7.4-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)

[deployments]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[hpa]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
@@ -101,7 +101,7 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| alertManager.repeatInterval | string | `"1h"` | How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more). |
| chart_name | string | `"prometheus-rules"` | |
| chart_source | string | `"https://github.com/Nextdoor/k8s-charts"` | |
-| containerRules.daemonsets.DaemonsetSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.daemonsets.DaemonsetSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.labels | object | `{}` | |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
@@ -112,19 +112,19 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.labels | object | `{}` | |
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
| containerRules.daemonsets.enabled | bool | `true` | Enables the DaemonSet resource rules |
-| containerRules.deployments.DeploymentSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.deployments.DeploymentSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.deployments.KubeDeploymentGenerationMismatch | object | `{"for":"15m","labels":{},"severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
| containerRules.deployments.enabled | bool | `true` | Enables the Deployment resource rules |
| containerRules.enabled | bool | `true` | Whether or not to enable the container rules template |
-| containerRules.hpas.HpaSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.hpas.HpaSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.hpas.KubeHpaMaxedOut.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaMaxedOut.labels | object | `{}` | |
| containerRules.hpas.KubeHpaMaxedOut.severity | string | `"warning"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.labels | object | `{}` | |
| containerRules.hpas.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
| containerRules.hpas.enabled | bool | `true` | Enables the HorizontalPodAutoscaler resource rules |
-| containerRules.jobs.JobSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.jobs.JobSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.jobs.KubeJobCompletion.for | string | `"12h"` | |
| containerRules.jobs.KubeJobCompletion.labels | object | `{}` | |
| containerRules.jobs.KubeJobCompletion.severity | string | `"warning"` | |
@@ -140,7 +140,7 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| containerRules.pods.PodContainerTerminated | object | `{"for":"1m","labels":{},"over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
| containerRules.pods.PodCrashLoopBackOff | object | `{"for":"10m","labels":{},"severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
| containerRules.pods.PodNotReady | object | `{"for":"15m","labels":{},"severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
-| containerRules.pods.PodSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.pods.PodSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.pods.enabled | bool | `true` | Enables the Pod resource rules |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.labels | object | `{}` | |
@@ -151,7 +151,7 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.labels | object | `{}` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
-| containerRules.statefulsets.StatefulsetSelectorValidity | object | `{"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
+| containerRules.statefulsets.StatefulsetSelectorValidity | object | `{"enabled":true,"for":"1h","labels":{},"severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.statefulsets.enabled | bool | `true` | Enables the StatefulSet resource rules |
| defaults.additionalRuleLabels | `map` | `{}` | Additional custom labels attached to every PrometheusRule |
| defaults.daemonsetNameSelector | `string` | `".*"` | Pattern used to scope down the DaemonSet alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the DaemonSets in the namespace. This string is run through the `tpl` function. |
16 changes: 14 additions & 2 deletions charts/prometheus-alerts/templates/containers-prometheusrule.yaml
@@ -219,6 +219,7 @@ spec:
 {{- end }}

 {{ with .PodSelectorValidity -}}
+{{- if .enabled }}
 - alert: PodSelectorValidity
 annotations:
 summary: PodSelector for prometheus-alerts is invalid
@@ -249,6 +250,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
@@ -281,7 +283,8 @@ spec:
 {{- end }}
 {{- end }}

-{{ with .DeploymentSelectorValidity -}}
+{{- with .DeploymentSelectorValidity -}}
+{{- if .enabled }}
 - alert: DeploymentSelectorValidity
 annotations:
 summary: DeploymentSelector for prometheus-alerts is invalid
@@ -311,6 +314,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
@@ -410,7 +414,8 @@ spec:
 {{- end }}
 {{- end }}

-{{ with .StatefulsetSelectorValidity -}}
+{{- with .StatefulsetSelectorValidity -}}
+{{- if .enabled }}
 - alert: StatefulsetSelectorValidity
 annotations:
 summary: StatefulsetSelector for prometheus-alerts is invalid
@@ -440,6 +445,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
@@ -537,6 +543,7 @@ spec:
 {{- end }}

 {{ with .DaemonsetSelectorValidity -}}
+{{- if .enabled }}
 - alert: DaemonsetSelectorValidity
 annotations:
 summary: DaemonsetSelector for prometheus-alerts is invalid
@@ -566,6 +573,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
@@ -619,6 +627,7 @@ spec:
 {{- end }}

 {{ with .JobSelectorValidity -}}
+{{- if .enabled }}
 - alert: JobSelectorValidity
 annotations:
 summary: JobSelector for prometheus-alerts is invalid
@@ -648,6 +657,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
@@ -717,6 +727,7 @@ spec:
 {{- end }}

 {{ with .HpaSelectorValidity -}}
+{{- if .enabled }}
 - alert: HpaSelectorValidity
 annotations:
 summary: HpaSelector for prometheus-alerts is invalid
@@ -746,6 +757,7 @@ spec:
 {{ toYaml . | nindent 8 }}
 {{- end }}
 {{- end }}
+{{- end }}

 {{- end }}
 {{- end }}
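
The guard added to each `*SelectorValidity` block follows the same shape. A condensed sketch for the Job case is shown here, with the rule body abbreviated and the indentation chosen for illustration only; the real `expr`, `for`, and label handling live in the chart template and are not reproduced.

```yaml
{{ with .JobSelectorValidity -}}
{{- if .enabled }}
# Rendered only when containerRules.jobs.JobSelectorValidity.enabled is true.
# Inside `with`, `.` is the JobSelectorValidity map, so `.enabled` reads its key.
- alert: JobSelectorValidity
  annotations:
    summary: JobSelector for prometheus-alerts is invalid
  # expr, for, and labels are elided in this sketch
{{- end }}
{{- end }}
```

Because `.enabled` is read from the merged values, the alert is rendered by default (`enabled: true` in `values.yaml`) and dropped only when a user sets it to `false`.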
6 changes: 6 additions & 0 deletions charts/prometheus-alerts/values.yaml
@@ -138,6 +138,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 PodSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}
@@ -202,6 +203,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 DeploymentSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}
@@ -221,6 +223,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 StatefulsetSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}
@@ -252,6 +255,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 DaemonsetSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}
@@ -283,6 +287,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 JobSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}
@@ -308,6 +313,7 @@ containerRules:
 # alerted by this, we likely have a bad selector and our alerts are not going
 # to ever fire.
 HpaSelectorValidity:
+enabled: true
 severity: warning
 for: 1h
 labels: {}