Add alert rules to training-operator based on the KF093 spec (#191) (#202)

* Add alert rules to training-operator based on the KF093 spec

* Delete src/prometheus_alert_rules/unit_unavailable.rule

Co-authored-by: Robert Gildein <[email protected]>
misohu and rgildein authored Oct 11, 2024
1 parent 7dafae8 commit 4e8908b
Showing 2 changed files with 24 additions and 10 deletions.
24 changes: 24 additions & 0 deletions src/prometheus_alert_rules/KubeflowTrainingOperatorServices.rules
@@ -0,0 +1,24 @@
groups:
- name: KubeflowTrainingOperatorServices
  rules:
  - alert: KubeflowServiceDown
    expr: up{} < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.juju_charm }} service is Down ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
      description: |
        One or more targets of {{ $labels.juju_charm }} charm are down on unit {{ $labels.juju_model }}/{{ $labels.juju_unit }}.
        LABELS = {{ $labels }}

  - alert: KubeflowServiceIsNotStable
    expr: avg_over_time(up{}[10m]) < 0.5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.juju_charm }} service is not stable ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
      description: |
        {{ $labels.juju_charm }} unit {{ $labels.juju_model }}/{{ $labels.juju_unit }} has been unreachable at least 50% of the time over the last 10 minutes.
        LABELS = {{ $labels }}
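
To see how the two rules above differ in practice, the following is a rough Python sketch of their firing conditions. It is not Prometheus code and not part of this commit. It assumes one `up` sample per one-minute evaluation and treats the 10m range as the last 10 samples, which only approximates how Prometheus selects samples in a range. The function names and sample series are hypothetical.

```python
from typing import List, Optional


def service_down_firing(up: List[int], for_steps: int = 5) -> List[bool]:
    """Approximate `up < 1` with `for: 5m` (assumes 1 sample per minute)."""
    firing = []
    pending_since: Optional[int] = None
    for t, value in enumerate(up):
        if value < 1:
            # The alert goes pending the first time the expression is true ...
            if pending_since is None:
                pending_since = t
            # ... and fires once it has stayed true for `for_steps` minutes.
            firing.append(t - pending_since >= for_steps)
        else:
            # Any successful scrape resets the pending timer.
            pending_since = None
            firing.append(False)
    return firing


def not_stable_firing(
    up: List[int], window: int = 10, threshold: float = 0.5
) -> List[bool]:
    """Approximate `avg_over_time(up[10m]) < 0.5` with `for: 0m`."""
    firing = []
    for t in range(len(up)):
        # Average over whatever samples exist in the trailing window.
        samples = up[max(0, t - window + 1) : t + 1]
        firing.append(sum(samples) / len(samples) < threshold)
    return firing


if __name__ == "__main__":
    # Hypothetical target that flaps: down three minutes out of every four.
    flapping = [1, 0, 0, 0] * 3
    # Hypothetical target that goes down and stays down.
    hard_down = [1, 1, 1] + [0] * 8

    for name, series in (("flapping", flapping), ("hard_down", hard_down)):
        print(name, "KubeflowServiceDown:", service_down_firing(series))
        print(name, "KubeflowServiceIsNotStable:", not_stable_firing(series))
```

Under these assumptions, a flapping target never stays down long enough to satisfy `for: 5m`, so only the warning-level rule catches it. A target that stays down trips the warning first and the critical alert a few minutes later.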
10 changes: 0 additions & 10 deletions src/prometheus_alert_rules/unit_unavailable.rule

This file was deleted.
