feat(istio-alerts): Add service selector validity alert (#287)

> [!WARNING]
> Reminder for @LaikaN57 - Update other charts in this repo after this PR is merged.

Split from #285

We ran into an issue where, unbeknownst to us, our pod selector was set
incorrectly, and therefore none of the rules deployed via
prometheus-alerts had any series to evaluate. Prometheus silently ignores
this lack of series data. We resolve that here by adding a rule that uses
the configured selectors to check whether a basic metric exists. If it
does not, we warn the user that none of the other rules will ever be
evaluated.
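
For illustration, this is roughly what the failure mode looks like (simplified from the 5xx rule in this chart, with a hypothetical mistyped service name): the selector matches no series, the query returns an empty result, and the rule silently never fires.

```
# Simplified 5xx-style rule with a mistyped selector (hypothetical name).
# No series match, so the expression returns nothing and never alerts.
sum by (destination_service_name) (
  istio_requests:increase5m{destination_service_name=~"my-typo-service-.*"}
) > 0
```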

Snippet from the other PR about using `kube_service_info` as our test
metric.

> I think I see what you're doing here ... but I see a risk. If there is
> no traffic to the service (for example, when it's being launched), there
> will be no metrics that match this query. I propose that instead we do
> this:
>
> ```
> count(kube_service_info{namespace="{{ $.Release.Namespace }}", name=~"{{ $.Values.serviceRules.destinationServiceName }}"})
> ```
>
> Sadly - I know that this means we aren't using the exact selectors
> (like I propose above) ... but this allows us to verify that the Service
> actually exists by that name. Ya know?
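
The committed expression (see the template diff below) builds on that proposal with one extra guard: `count()` over a selector that matches nothing returns an empty vector rather than `0`, so a bare `count(...) == 0` would itself never fire. Wrapping it in `or on() vector(0)` turns the empty case into an explicit zero:

```
(
  count(
    kube_service_info{
      namespace="{{ $.Release.Namespace }}",
      name=~"{{ $.Values.serviceRules.destinationServiceName }}"
    }
  ) or on() vector(0)
) == 0
```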

Lastly, I set the `for` duration of this alert to 1 hour by default. This
felt okay to me: we want to be somewhat resilient to Prometheus metric
collection outages and not accidentally page every service owner when the
platform itself has a centralized outage. Normally, a team should only be
paged by this once, when they first set up their application with
incorrectly configured selectors. This alert is not expected to go into an
alert state otherwise, short of heavy-handed naming changes (mostly done
during things like migrations).
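
If the defaults don't fit, the new keys (added to `values.yaml` below) can be overridden per release. A minimal sketch, assuming a hypothetical override file passed to Helm with `-f`:

```
# my-overrides.yaml (hypothetical file name)
serviceRules:
  destinationServiceSelectorValidity:
    # Set to false if kube-state-metrics is not installed in the cluster.
    enabled: true
    # Evaluate over a shorter window than the 1h default.
    for: 30m
    # Route to a lower-urgency receiver.
    severity: warning
```
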
LaikaN57 authored Apr 17, 2024
1 parent c8ddaed commit 8655ff4
Showing 6 changed files with 109 additions and 7 deletions.
4 changes: 3 additions & 1 deletion charts/istio-alerts/Chart.yaml
@@ -2,8 +2,10 @@ apiVersion: v2
name: istio-alerts
description: A Helm chart that provisions a series of alerts for istio VirtualServices
type: application
version: 0.2.0
version: 0.3.0
appVersion: 0.0.1
maintainers:
- name: diranged
email: [email protected]
- name: laikan57
email: [email protected]
19 changes: 18 additions & 1 deletion charts/istio-alerts/README.md
@@ -1,6 +1,6 @@
# istio-alerts

![Version: 0.2.0](https://img.shields.io/badge/Version-0.2.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)
![Version: 0.3.0](https://img.shields.io/badge/Version-0.3.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 0.0.1](https://img.shields.io/badge/AppVersion-0.0.1-informational?style=flat-square)

A Helm chart that provisions a series of alerts for istio VirtualServices

@@ -9,6 +9,18 @@ A Helm chart that provisions a series of alerts for istio VirtualServices
| Name | Email | Url |
| ---- | ------ | --- |
| diranged | <[email protected]> | |
| laikan57 | <[email protected]> | |

## Upgrade Notes

### 0.2.x -> 0.3.x

**BREAKING: The DestinationServiceSelectorValidity alert rule requires kube-state-metrics.**

An alert was introduced in 0.3.x that requires kube-state-metrics to be installed in the cluster. If
you do not have kube-state-metrics installed, you will need to disable the alert by setting
`serviceRules.destinationServiceSelectorValidity.enabled` to `false`. This alert is used to detect
if the destinationServiceSelector is actually selecting series for a service that exists.

## Values

@@ -25,7 +37,12 @@ A Helm chart that provisions a series of alerts for istio VirtualServices
| defaults.additionalRuleLabels | object | `{}` | Additional custom labels attached to every PrometheusRule |
| defaults.runbookUrl | string | `"https://github.com/Nextdoor/k8s-charts/blob/main/charts/istio-alerts/runbook.md"` | The prefix URL to the runbook_urls that will be applied to each PrometheusRule |
| serviceRules.destinationServiceName | string | `".*"` | Narrow down the alerts to a particular Destination Service if there are multiple services that require different thresholds within the same namespace. |
| serviceRules.destinationServiceSelectorValidity | object | `{"enabled":true,"for":"1h","severity":"warning"}` | Does a basic lookup using the defined selectors to see if we can see any info for a given selector. This is the "watcher for the watcher". If we get alerted by this, we likely have a bad selector and our alerts are not going to ever fire. |
| serviceRules.destinationServiceSelectorValidity.enabled | bool | `true` | Whether to enable the monitor on the selector for the VirtualService. |
| serviceRules.destinationServiceSelectorValidity.for | string | `"1h"` | How long to evaluate. |
| serviceRules.destinationServiceSelectorValidity.severity | string | `"warning"` | Severity of the monitor |
| serviceRules.enabled | bool | `true` | Whether to enable the service rules template |
| serviceRules.highRequestLatency | object | `{"enabled":true,"for":"15m","percentile":0.95,"severity":"warning","threshold":0.5}` | Configuration related to the latency monitor for the VirtualService. |
| serviceRules.highRequestLatency.enabled | bool | `true` | Whether to enable the monitor on latency returned by the VirtualService. |
| serviceRules.highRequestLatency.for | string | `"15m"` | How long to evaluate the latency of services. |
| serviceRules.highRequestLatency.percentile | float | `0.95` | Which percentile to monitor - should be between 0 and 1. Default is 95th percentile. |
24 changes: 24 additions & 0 deletions charts/istio-alerts/README.md.gotmpl
@@ -0,0 +1,24 @@
{{ template "chart.header" . }}

{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}

{{ template "chart.description" . }}

{{ template "chart.maintainersSection" . }}

## Upgrade Notes

### 0.2.x -> 0.3.x

**BREAKING: The DestinationServiceSelectorValidity alert rule requires kube-state-metrics.**

An alert was introduced in 0.3.x that requires kube-state-metrics to be installed in the cluster. If
you do not have kube-state-metrics installed, you will need to disable the alert by setting
`serviceRules.destinationServiceSelectorValidity.enabled` to `false`. This alert is used to detect
if the destinationServiceSelector is actually selecting series for a service that exists.

{{ template "chart.requirementsSection" . }}

{{ template "chart.valuesSection" . }}

{{ template "helm-docs.versionFooter" . }}
16 changes: 16 additions & 0 deletions charts/istio-alerts/runbook.md
@@ -9,3 +9,19 @@ to the grafana deployment for the correct EKS cluster and select the relevant
service. Many services have custom dashboards in DataDog as well which may help
investigate this alert further, and most services also produce logs of requests
which may provide more context into what errors are being returned and why.

## HighRequestLatency

TBD

## Alert-Rules-Selectors-Validity

This alert fires when there may be an error in setting the proper selectors used
by the other alerts in this chart. It attempts to read a basic metric using the
selector you provided. For instance, if you have a pod selector that looks for
`pod=~"foo-bar-.*"` but your pods are actually named `baz-.*`, this alert will
notify you of the misconfiguration. Read the alert description to see exactly
which selector is having an issue. Also note that you need to collect the
metrics that this alert uses. For instance, to test pod selectors, we use the
`kube_pod_info` metric. If you do not collect this metric, this alert will
continuously fire.
38 changes: 33 additions & 5 deletions charts/istio-alerts/templates/service-prometheusrule.yaml
@@ -22,7 +22,6 @@ spec:
description: >-
High rate of 5xx responses from the {{`{{$labels.destination_service_name}}`}} VirtualService
in namespace {{`{{$labels.namespace}}`}}.
expr: >
sum by (destination_service_name, reporter, source_workload) (
istio_requests:increase5m{
@@ -48,15 +47,13 @@ spec:
{{- end }}
{{- end }}
{{- end }}

# Adding latency alarm to trigger when request latency is above > threshold
{{- with .Values.serviceRules.highRequestLatency }}
{{- if .enabled }}
- alert: HighRequestLatency
annotations:
summary: >-
{{`{{$labels.destination_service_name}}`}} avg request latencies are above {{ .threshold }}s!
runbook_url: {{ $values.defaults.runbookUrl}} #HighRequestLatency
runbook_url: {{ $values.defaults.runbookUrl}}#HighRequestLatency
description: >-
Average request latency of {{`{{ $value | humanizePercentage }}`}} is above threshold ({{ .threshold }}s)
in namespace {{`{{ $labels.namespace }}`}} for pod {{`{{ $labels.pod }}`}} (container: {{`{{ $labels.container }}`}}).
@@ -85,5 +82,36 @@ spec:
{{ toYaml . | nindent 8}}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- with .Values.serviceRules.destinationServiceSelectorValidity }}
{{- if .enabled }}
- alert: DestinationServiceSelectorValidity
annotations:
summary: >-
DestinationServiceSelector for istio-alerts is invalid
runbook_url: {{ $values.defaults.runbookUrl}}#Alert-Rules-Selectors-Validity
description: >-
The DestinationServiceSelector used for service level alerts did not return any data.
Please check the DestinationServiceSelector applied in your istio-alerts chart
is correct to ensure you are properly selecting your service so that you
will be alerted for service issues. The current selector is
`{destination_service_namespace="{{ $release.Namespace }}", destination_service_name="{{ $.Values.serviceRules.destinationServiceName }}"}`.
expr: >-
(
count(
kube_service_info{
namespace="{{ $.Release.Namespace }}",
name=~"{{ $.Values.serviceRules.destinationServiceName }}"
}
) or on() vector(0)
) == 0
for: {{ .for }}
labels:
severity: {{ .severity }}
namespace: {{ $release.Namespace }}
{{- with $values.defaults.additionalRuleLabels}}
{{ toYaml . | nindent 8}}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
15 changes: 15 additions & 0 deletions charts/istio-alerts/values.yaml
@@ -99,6 +99,7 @@ serviceRules:
# -- Severity of the 5xx monitor
severity: critical

# -- Configuration related to the latency monitor for the VirtualService.
highRequestLatency:
# -- Whether to enable the monitor on latency returned by the
# VirtualService.
@@ -117,3 +118,17 @@ serviceRules:

# -- How long to evaluate the latency of services.
for: 15m

# -- Does a basic lookup using the defined selectors to see if we can see any
# info for a given selector. This is the "watcher for the watcher". If we get
# alerted by this, we likely have a bad selector and our alerts are not going
# to ever fire.
destinationServiceSelectorValidity:
# -- Whether to enable the monitor on the selector for the VirtualService.
enabled: true

# -- Severity of the monitor
severity: warning

# -- How long to evaluate.
for: 1h
