CARRY: Adding runbooks for alerts

Fiona-Waters committed Aug 12, 2024
1 parent bbcb742 commit e636617
Showing 5 changed files with 192 additions and 0 deletions.
4 changes: 4 additions & 0 deletions config/rhoai/prometheus_rule.yaml
@@ -15,6 +15,7 @@ spec:
annotations:
summary: "Kueue pod is down ({{ $labels.pod }})"
description: "The Kueue pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready."
triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/kueue-pod-down.md"
- name: kueue-info-alerts
rules:
- alert: LowClusterQueueResourceUsage
@@ -25,6 +26,7 @@ spec:
annotations:
summary: Low {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }}
description: The {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }} is below 20% of its nominal quota for more than 1 day.
triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/low-cluster-queue-resource-usage.md"
- alert: ResourceReservationExceedsQuota
expr: (sum(kueue_cluster_queue_resource_reservation) by (resource, cluster_queue)) / 10 > (sum(kueue_cluster_queue_nominal_quota) by (resource, cluster_queue))
for: 10m
@@ -33,6 +35,7 @@ spec:
annotations:
summary: Resource {{ $labels.resource }} reservation far exceeds the available quota in cluster queue {{ $labels.cluster_queue}}
description: Resource {{ $labels.resource }} reservation is 10 times the available quota in cluster queue {{ $labels.cluster_queue}}
triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/resource-reservation-exceeds-quota.md"
- alert: PendingWorkloadPods
expr: (sum by (namespace, pod) (sum_over_time(kube_pod_status_phase{phase="Pending"}[3d])) >= 3 * 24 * 60) >0
for: 1m
@@ -41,4 +44,5 @@ spec:
annotations:
summary: Pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been pending for more than 3 days
description: A pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been in the pending state for more than 3 days.
triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/pending-workload-pods.md"
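
Each alert now carries a `triage` annotation pointing at its runbook. As a quick sanity check, the URLs can be pulled out of the rule file and spot-checked, e.g. with `curl`; a minimal sketch (the heredoc stands in for `config/rhoai/prometheus_rule.yaml`):

```shell
# Extract the runbook URLs from the triage annotations so each one can be
# spot-checked. The heredoc below stands in for the real rule file.
rules=$(cat <<'EOF'
annotations:
  triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/kueue-pod-down.md"
annotations:
  triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/pending-workload-pods.md"
EOF
)
urls=$(echo "$rules" | grep -o 'https://[^"]*')
echo "$urls"
```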

42 changes: 42 additions & 0 deletions docs/alerts/runbooks/kueue-pod-down.md
@@ -0,0 +1,42 @@
# Kueue Pod Down

## Severity: Critical

## Impact

Any workloads running on the cluster will not be able to use the Kueue component.

## Summary

This alert is triggered when the `kube_pod_status_ready` query shows that the Kueue controller pod is not ready.

## Steps

1. Check whether the Kueue controller pod is running in the `redhat-ods-applications` namespace:

```bash
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue
```

2. If the pod is not running, inspect the pod's logs and events to see what may be causing the issue. Capture the logs and events so they can be shared with the engineering team later:

```bash
# Check pod logs
$ oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true

# Check events
$ oc -n redhat-ods-applications get events | grep pod/kueue-controller

# Check pod status fields
$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue -o jsonpath="{range .items[*]}{.status}{\"\n\n\"}{end}"
```

3. Redeploy the Kueue controller by restarting its deployment:

```bash
$ oc -n redhat-ods-applications rollout restart deployments/kueue-controller-manager
```

This should deploy a new pod. Repeat step 1 to confirm that the pod reaches the Running state.

4. If the problem persists, capture the logs and escalate to the RHOAI engineering team.
53 changes: 53 additions & 0 deletions docs/alerts/runbooks/low-cluster-queue-resource-usage.md
@@ -0,0 +1,53 @@
# Low Cluster Queue Resource Usage

## Severity: Info

## Impact

Resources that are consistently unused can be redistributed.

## Summary

This alert is triggered when the resource usage in a cluster queue is below 20% of its nominal quota for more than 1 day.

## Steps

1. Check the current resource usage for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `<cluster-queue-name>` below with the name of your cluster queue.
```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```
- If you would like to view just the Flavors and Nominal Quota, you can use the following command:
```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors:/,/^$/'
```
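
The `awk '/Flavors:/,/^$/'` range pattern prints from the first line matching `Flavors:` through the next blank line. A standalone illustration, with made-up sample text in place of the `oc describe` output:

```shell
# awk range pattern /start/,/end/: prints from the line matching the start
# pattern through the first subsequent line matching the end pattern.
printf 'Spec:\n  Cohort: team-a\n  Flavors:\n    Name: default-flavor\n    Nominal Quota:\n      cpu: 40\n\nStatus:\n' \
  | awk '/Flavors:/,/^$/'
```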

2. Review the workloads that are linked with the cluster queue to see if the assigned resources are required.
```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
namespace=$(echo $local_queue | cut -d '/' -f 1)
queue_name=$(echo $local_queue | cut -d '/' -f 2)

echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```

3. Review individual workloads. Replace the namespace and workload name below to view the details of a workload.
```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. If resource usage is consistently low, consider reducing the cluster queue's nominal quota.
You can patch the ClusterQueue with the following command. Adjust the indices and value to target the resource you want to change; as written, it sets the nominal quota of the first resource in the first flavor of the first resource group to 10:
```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```
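
For orientation, the JSON patch path `/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota` walks the ClusterQueue spec as sketched below (field names from the Kueue API; names and values are illustrative):

```yaml
spec:
  resourceGroups:            # index 0
  - coveredResources: ["cpu", "memory"]
    flavors:                 # index 0
    - name: default-flavor
      resources:             # index 0
      - name: cpu
        nominalQuota: 10     # value replaced by the patch
      - name: memory
        nominalQuota: 36Gi
```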
37 changes: 37 additions & 0 deletions docs/alerts/runbooks/pending-workload-pods.md
@@ -0,0 +1,37 @@
# Pending Workload Pods

## Severity: Info

## Impact
Identifying pods in a prolonged pending state lets users troubleshoot and fix the underlying issues so that their workloads can run successfully.

## Summary

This alert is triggered when a pod is in the pending state for more than 3 days.
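
The alert expression sums `kube_pod_status_phase{phase="Pending"}` over a 3-day window and compares the result against `3 * 24 * 60`, which corresponds to a pod reported Pending roughly once per minute for three full days. The threshold arithmetic:

```shell
# 3 days x 24 hours x 60 minutes = 4320 one-minute samples in the window.
threshold=$((3 * 24 * 60))
echo "$threshold"
```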

## Steps

1. Identify the pending pod in your project namespace. Replace `<project-namespace>` below with the name of your project namespace.
```bash
namespace=<project-namespace>
oc get pods -A --field-selector=status.phase=Pending            # all Pending pods in the cluster
oc get pods -n $namespace --field-selector=status.phase=Pending # Pending pods in the given namespace
```

2. Get further details on the pod. Replace `<pod-name>` with the name of the pending pod.
```bash
pod=<pod-name>
oc describe pod $pod -n $namespace
```

3. Review the pod logs to help determine why it is in a pending state. Note that a Pending pod may have no logs if its containers have not started.
```bash
oc logs $pod -n $namespace
```

4. Review the pod events in order to determine why it is in a pending state.
```bash
oc get events --field-selector involvedObject.name=$pod --namespace=$namespace
```

5. Review the results of the steps above to determine the best course of action for successfully running the workload.
56 changes: 56 additions & 0 deletions docs/alerts/runbooks/resource-reservation-exceeds-quota.md
@@ -0,0 +1,56 @@
# Resource Reservation Exceeds Quota

## Severity: Info

## Impact

Knowledge of over-requested resources allows the user to adjust the nominal quota or the resources requested by a workload.

## Summary

This alert is triggered when the resource reservation in a cluster queue is more than 10 times its available nominal quota.

## Steps

1. Check the current resource reservation for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `<cluster-queue-name>` below with the name of your cluster queue.
```bash
cluster_queue=<cluster-queue-name>
oc describe clusterqueue $cluster_queue
```

- If you would just like to view the Flavors Reservation and Flavors Usage, you can use the following command:
```bash
oc describe clusterqueue $cluster_queue | awk '/Flavors Reservation:/,/^$/'
```

2. Review the workloads that are linked with the cluster queue to see if the requested resources are required.
```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
namespace=$(echo $local_queue | cut -d '/' -f 1)
queue_name=$(echo $local_queue | cut -d '/' -f 2)

echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```
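
The `namespace/name` splitting used in the loop above can be exercised on its own; a minimal sketch with made-up queue names:

```shell
# Split "namespace/name" pairs with cut, the same way the loop above does.
for local_queue in team-a/default-queue team-b/gpu-queue; do
  namespace=$(echo "$local_queue" | cut -d '/' -f 1)
  queue_name=$(echo "$local_queue" | cut -d '/' -f 2)
  echo "queue=$queue_name namespace=$namespace"
done
```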

3. Review individual workloads. Replace the namespace and workload name below to view the details of a workload.
```bash
namespace=<namespace>
workload_name=<workload-name>
oc describe workload -n $namespace $workload_name
```

4. Consider increasing the cluster queue's nominal quota.
You can patch the ClusterQueue with the following command. Adjust the indices and value to target the resource you want to change; as written, it sets the nominal quota of the first resource in the first flavor of the first resource group to 10:
```bash
oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```

5. Alternatively, consider reducing the resources requested by the pending workloads, if possible.
