feat(prometheus-alerts): Add selector validity alarms (#297)
We ran into an issue where, unbeknownst to us, our pod selector was set
incorrectly, and therefore none of the rules deployed via prometheus-alerts
had any series to evaluate. This lack of series data is silently ignored.
Here we resolve this by implementing a rule which uses each of the various
selectors to check whether a basic metric exists. If it does not exist, we
warn the user that none of the other rules will be evaluated.

I tried to choose metrics for each selector that were generic enough and did
not rely on a specific metric that might not be gathered. Most resource types
had an associated info or labels metric, which is a good candidate for this
use case.
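The resulting expression pattern (shown here with the pod-level `kube_pod_info` metric and hypothetical selector values substituted in) counts matching series and falls back to `vector(0)` when the selector matches nothing:

```promql
(
  count(
    kube_pod_info{
      pod=~"foo-bar-.*",
      namespace="default"
    }
  ) or on() vector(0)
) == 0
```

The `or on() vector(0)` clause is what makes the check possible: `count()` over an empty instant vector returns no data at all rather than zero, so without the fallback the `== 0` comparison would never produce a result and the alert could never fire.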

Lastly, I set the `for` duration of this alert to 1 hour by default. This felt
okay because we want to be somewhat resilient to Prometheus metric-collection
outages and not accidentally page every service owner if we have a centralized
outage of the platform itself. Normally, a team might only be paged for this
once, when they first set up their application without configuring their
selectors correctly. This alert is not expected to go into an alerting state
without some heavy-handed naming changes (mostly done during things like
migrations).
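Both the duration and the severity can be tuned per rule through the chart values; a minimal sketch for the pod-level rule (the same shape applies to the deployment, daemonset, statefulset, job, and HPA variants), with an illustrative override value:

```yaml
containerRules:
  pods:
    PodSelectorValidity:
      # Shorten the default 1h window for faster feedback, at the cost
      # of more sensitivity to metric-collection outages.
      for: 30m
      severity: warning
```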

See #285
LaikaN57 authored Apr 26, 2024
1 parent ba4838e commit 8b658d1
Showing 6 changed files with 256 additions and 4 deletions.
2 changes: 1 addition & 1 deletion charts/prometheus-alerts/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: prometheus-alerts
description: Helm Chart that provisions a series of common Prometheus Alerts
type: application
version: 1.5.0
version: 1.6.0
appVersion: 0.0.1
maintainers:
- name: diranged
18 changes: 16 additions & 2 deletions charts/prometheus-alerts/README.md
@@ -1,9 +1,8 @@

# prometheus-alerts

Helm Chart that provisions a series of common Prometheus Alerts

![Version: 1.5.0](https://img.shields.io/badge/Version-1.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
![Version: 1.6.0](https://img.shields.io/badge/Version-1.6.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)

[deployments]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[hpa]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
@@ -20,6 +19,15 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and

## Upgrade Notes

### 1.5.x -> 1.6.x

**CHANGE: New `*SelectorValidity` alert rules have been added.**

We have added new alert rules which attempt to detect whether you have
misconfigured your selectors. After upgrading, you may get alerted. You should
respond by reading the alert information and correcting your selectors.

### 1.4.x -> 1.5.x

**BREAKING: Values files schema has been updated to group alerts by resource type**
@@ -83,21 +91,25 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| alertManager.repeatInterval | string | `"1h"` | How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more). |
| chart_name | string | `"prometheus-rules"` | |
| chart_source | string | `"https://github.com/Nextdoor/k8s-charts"` | |
| containerRules.daemonsets.DaemonsetSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.for | string | `"15m"` | |
| containerRules.daemonsets.KubeDaemonSetMisScheduled.severity | string | `"warning"` | |
| containerRules.daemonsets.KubeDaemonSetNotScheduled.for | string | `"10m"` | |
| containerRules.daemonsets.KubeDaemonSetNotScheduled.severity | string | `"warning"` | |
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.for | string | `"15m"` | |
| containerRules.daemonsets.KubeDaemonSetRolloutStuck.severity | string | `"warning"` | |
| containerRules.daemonsets.enabled | bool | `true` | Enables the DaemonSet resource rules |
| containerRules.deployments.DeploymentSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.deployments.KubeDeploymentGenerationMismatch | object | `{"for":"15m","severity":"warning"}` | Deployment generation mismatch due to possible roll-back |
| containerRules.deployments.enabled | bool | `true` | Enables the Deployment resource rules |
| containerRules.enabled | bool | `true` | Whether or not to enable the container rules template |
| containerRules.hpas.HpaSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.hpas.KubeHpaMaxedOut.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaMaxedOut.severity | string | `"warning"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.for | string | `"15m"` | |
| containerRules.hpas.KubeHpaReplicasMismatch.severity | string | `"warning"` | |
| containerRules.hpas.enabled | bool | `true` | Enables the HorizontalPodAutoscaler resource rules |
| containerRules.jobs.JobSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.jobs.KubeJobCompletion.for | string | `"12h"` | |
| containerRules.jobs.KubeJobCompletion.severity | string | `"warning"` | |
| containerRules.jobs.KubeJobFailed.for | string | `"15m"` | |
@@ -110,13 +122,15 @@ This behavior can be tuned via the `defaults.podNameSelector`,
| containerRules.pods.PodContainerTerminated | object | `{"for":"1m","over":"10m","reasons":["ContainerCannotRun","DeadlineExceeded"],"severity":"warning","threshold":0}` | Monitors Pods for Containers that are terminated either for unexpected reasons like ContainerCannotRun. If that number breaches the $threshold (1) for $for (1m), then it will alert. |
| containerRules.pods.PodCrashLoopBackOff | object | `{"for":"10m","severity":"warning"}` | Pod is in a CrashLoopBackOff state and is not becoming healthy. |
| containerRules.pods.PodNotReady | object | `{"for":"15m","severity":"warning"}` | Pod has been in a non-ready state for more than a specific threshold |
| containerRules.pods.PodSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.pods.enabled | bool | `true` | Enables the Pod resource rules |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetGenerationMismatch.severity | string | `"warning"` | |
| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetReplicasMismatch.severity | string | `"warning"` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.for | string | `"15m"` | |
| containerRules.statefulsets.KubeStatefulSetUpdateNotRolledOut.severity | string | `"warning"` | |
| containerRules.statefulsets.StatefulsetSelectorValidity | object | `{"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| containerRules.statefulsets.enabled | bool | `true` | Enables the StatefulSet resource rules |
| defaults.additionalRuleLabels | `map` | `{}` | Additional custom labels attached to every PrometheusRule |
| defaults.daemonsetNameSelector | `string` | `"{{ .Release.Name }}-.*"` | Pattern used to scope down the DaemonSet alerts to pods that are part of this general application. Set to `None` if you want to disable this selector and apply the rules to all the DaemonSets in the namespace. This string is run through the `tpl` function. |
11 changes: 10 additions & 1 deletion charts/prometheus-alerts/README.md.gotmpl
@@ -1,5 +1,5 @@

{{ template "chart.header" . }}

{{ template "chart.description" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}
@@ -19,6 +19,15 @@ those changes in the `charts/simple-app`, `charts/daemonset-app` and

## Upgrade Notes

### 1.5.x -> 1.6.x

**CHANGE: New `*SelectorValidity` alert rules have been added.**

We have added new alert rules which attempt to detect whether you have
misconfigured your selectors. After upgrading, you may get alerted. You should
respond by reading the alert information and correcting your selectors.

### 1.4.x -> 1.5.x

**BREAKING: Values files schema has been updated to group alerts by resource type**
12 changes: 12 additions & 0 deletions charts/prometheus-alerts/runbook.md
@@ -101,3 +101,15 @@ help you determine the root cause of the issue. Follow the instructions in the
into the relevant cluster and namespace, and use the `kubectl describe pod <podname>`
to see the status of the pod and any events related to it. The pod logs may also
provide hints as to what may be going wrong.

## Alert-Rules-Selectors-Validity

This alert fires when there may be an error in setting the proper selectors used
by the other alerts in this chart. It attempts to read a basic metric using the
selector you provided. For instance, if you have a pod selector that looks for
`pod=~"foo-bar-.*"` but your pods are actually named `baz-.*`, this alert will
notify you of the misconfiguration. Read the alert description to see exactly
which selector is having an issue. Also note that you need to collect the
metrics that this alert uses. For instance, to test pod selectors, we use the
`kube_pod_info` metric. If you do not collect this metric, this alert will
continuously fire.
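If this alert fires, one way to confirm the mismatch is to run the underlying query by hand in the Prometheus UI, assuming the `kube_pod_info` metric is collected and using a hypothetical selector pattern:

```promql
count(kube_pod_info{pod=~"foo-bar-.*", namespace="default"})
```

An empty result confirms the selector matches no pods; adjust the pattern until series are returned, then apply the corrected selector to your chart values.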
169 changes: 169 additions & 0 deletions charts/prometheus-alerts/templates/containers-prometheusrule.yaml
@@ -200,6 +200,35 @@ spec:
{{- end }}
{{- end }}

{{ with .PodSelectorValidity -}}
- alert: PodSelectorValidity
annotations:
summary: PodSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The PodSelector used for pod level alerts did not return any data.
Please check the PodSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your pods so that you
will be alerted for pod issues. The current selector is
`{{ $podSelector }}, {{ $namespaceSelector }}`.
expr: |-
(
count(
kube_pod_info{
{{ $podSelector }},
{{ $namespaceSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -228,6 +257,34 @@ spec:
{{- end }}
{{- end }}

{{ with .DeploymentSelectorValidity -}}
- alert: DeploymentSelectorValidity
annotations:
summary: DeploymentSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The DeploymentSelector used for deployment level alerts did not return any data.
Please check the DeploymentSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your deployments so that you
will be alerted for deployment issues. The current selector is
`{{ $deploymentSelector }}`.
expr: |-
(
count(
kube_deployment_labels{
{{ $deploymentSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -317,6 +374,34 @@ spec:
{{- end }}
{{- end }}

{{ with .StatefulsetSelectorValidity -}}
- alert: StatefulsetSelectorValidity
annotations:
summary: StatefulsetSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The StatefulsetSelector used for statefulset level alerts did not return any data.
Please check the StatefulsetSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your statefulsets so that you
will be alerted for statefulset issues. The current selector is
`{{ $statefulsetSelector }}`.
expr: |-
(
count(
kube_statefulset_created{
{{ $statefulsetSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -403,6 +488,34 @@ spec:
{{- end }}
{{- end }}

{{ with .DaemonsetSelectorValidity -}}
- alert: DaemonsetSelectorValidity
annotations:
summary: DaemonsetSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The DaemonsetSelector used for daemonset level alerts did not return any data.
Please check the DaemonsetSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your daemonsets so that you
will be alerted for daemonset issues. The current selector is
`{{ $daemonsetSelector }}`.
expr: |-
(
count(
kube_daemonset_labels{
{{ $daemonsetSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -448,6 +561,34 @@ spec:
{{- end }}
{{- end }}

{{ with .JobSelectorValidity -}}
- alert: JobSelectorValidity
annotations:
summary: JobSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The JobSelector used for job level alerts did not return any data.
Please check the JobSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your jobs so that you
will be alerted for job issues. The current selector is
`{{ $jobSelector }}`.
expr: |-
(
count(
kube_job_info{
{{ $jobSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}

@@ -509,6 +650,34 @@ spec:
{{- end }}
{{- end }}

{{ with .HpaSelectorValidity -}}
- alert: HpaSelectorValidity
annotations:
summary: HpaSelector for prometheus-alerts is invalid
runbook_url: {{ $.Values.defaults.runbookUrl }}#Alert-Rules-Selectors-Validity
description: >-
The HpaSelector used for hpa level alerts did not return any data.
Please check the HpaSelector applied in your prometheus-alerts chart
is correct to ensure you are properly selecting your hpas so that you
will be alerted for hpa issues. The current selector is
`{{ $hpaSelector }}`.
expr: |-
(
count(
kube_horizontalpodautoscaler_info{
{{ $hpaSelector }}
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $.Release.Namespace }}
{{- if $.Values.defaults.additionalRuleLabels }}
{{ toYaml $.Values.defaults.additionalRuleLabels | nindent 8 }}
{{- end }}
{{- end }}

{{- end }}
{{- end }}
