diff --git a/config/rhoai/prometheus_rule.yaml b/config/rhoai/prometheus_rule.yaml
index 2240977e98..4c887d1a5c 100644
--- a/config/rhoai/prometheus_rule.yaml
+++ b/config/rhoai/prometheus_rule.yaml
@@ -15,6 +15,7 @@ spec:
       annotations:
         summary: "Kueue pod is down ({{ $labels.pod }})"
         description: "The Kueue pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready."
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/kueue-pod-down.md"
   - name: kueue-info-alerts
     rules:
     - alert: LowClusterQueueResourceUsage
@@ -25,6 +26,7 @@ spec:
       annotations:
         summary: Low {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }}
         description: The {{ $labels.resource }} resource usage in cluster queue {{ $labels.cluster_queue }} is below 20% of its nominal quota for more than 1 day.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/low-cluster-queue-resource-usage.md"
     - alert: ResourceReservationExceedsQuota
       expr: (sum(kueue_cluster_queue_resource_reservation) by (resource, cluster_queue)) / 10 > (sum(kueue_cluster_queue_nominal_quota) by (resource, cluster_queue))
       for: 10m
@@ -33,6 +35,7 @@ spec:
       annotations:
         summary: Resource {{ $labels.resource }} reservation far exceeds the available quota in cluster queue {{ $labels.cluster_queue}}
         description: Resource {{ $labels.resource }} reservation is 10 times the available quota in cluster queue {{ $labels.cluster_queue}}
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/resource-reservation-exceeds-quota.md"
     - alert: PendingWorkloadPods
       expr: (sum by (namespace, pod) (sum_over_time(kube_pod_status_phase{phase="Pending"}[3d])) >= 3 * 24 * 60) >0
       for: 1m
@@ -41,4 +44,5 @@ spec:
       annotations:
         summary: Pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been pending for more than 3 days
         description: A pod {{ $labels.pod }} in the {{ $labels.namespace }} namespace has been in the pending state for more than 3 days.
+        triage: "https://github.com/opendatahub-io/kueue/tree/dev/docs/alerts/runbooks/pending-workload-pods.md"
diff --git a/docs/alerts/runbooks/kueue-pod-down.md b/docs/alerts/runbooks/kueue-pod-down.md
new file mode 100644
index 0000000000..695b8c4f72
--- /dev/null
+++ b/docs/alerts/runbooks/kueue-pod-down.md
@@ -0,0 +1,42 @@
+# Kueue Pod Down
+
+## Severity: Critical
+
+## Impact
+
+Workloads running on the cluster will not be able to use the Kueue component for queueing and admission while the pod is down.
+
+## Summary
+
+This alert is triggered when the `kube_pod_status_ready` query shows that the Kueue controller pod is not ready.
+
+## Steps
+
+1. Check whether the `kueue-controller` pod is running in the `redhat-ods-applications` namespace:
+
+```bash
+$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue
+```
+
+2. If the pod is not running, review the pod's logs and events to see what may be causing the issue. Make sure to capture the logs and events so they can be shared with the engineering team later:
+
+```bash
+# Check pod logs
+$ oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true
+
+# Check events
+$ oc -n redhat-ods-applications get events | grep pod/kueue-controller
+
+# Check pod status fields
+$ oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue -o jsonpath="{range .items[*]}{.status}{\"\n\n\"}{end}"
+```
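+
+If you need to attach this output when escalating, one option is to redirect it to local files; the file names below are only examples:
+
+```bash
+$ oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true > kueue-pod-logs.txt
+$ oc -n redhat-ods-applications get events | grep pod/kueue-controller > kueue-controller-events.txt
+```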
+
+3. Redeploy the Kueue Operator by restarting the deployment:
+
+```bash
+$ oc -n redhat-ods-applications rollout restart deployments/kueue-controller-manager
+```
+
+This should result in a new pod being deployed. Repeat step (1) and check whether the pod reaches the Running state.
+
+4. If the problem persists, capture the logs and escalate to the RHOAI engineering team.
diff --git a/docs/alerts/runbooks/low-cluster-queue-resource-usage.md b/docs/alerts/runbooks/low-cluster-queue-resource-usage.md
new file mode 100644
index 0000000000..c7f4ccdb22
--- /dev/null
+++ b/docs/alerts/runbooks/low-cluster-queue-resource-usage.md
@@ -0,0 +1,53 @@
+# Low Cluster Queue Resource Usage
+
+## Severity: Info
+
+## Impact
+
+Resources that are consistently unused can be redistributed.
+
+## Summary
+
+This alert is triggered when the resource usage in a cluster queue is below 20% of its nominal quota for more than 1 day.
+
+## Steps
+
+1. Check the current resource usage for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `<cluster-queue-name>` in the commands below with the name of the cluster queue, then describe it.
+```bash
+cluster_queue=<cluster-queue-name>
+oc describe clusterqueue $cluster_queue
+```
+   - If you would like to view just the Flavors and Nominal Quota, you can use the following command:
+```bash
+oc describe clusterqueue $cluster_queue | awk '/Flavors:/,/^$/'
+```
+
+2. Review the workloads linked to the cluster queue to see if the assigned resources are required.
+```bash
+# Find local queues linked to the cluster queue
+local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')
+
+# Find workloads linked to the local queues
+for local_queue in $local_queues; do
+    namespace=$(echo $local_queue | cut -d '/' -f 1)
+    queue_name=$(echo $local_queue | cut -d '/' -f 2)
+
+    echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."
+
+    oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
+done
+```
+
+3. Review individual workloads. Replace `<namespace>` and `<workload-name>` below to view the details of a workload.
+```bash
+namespace=<namespace>
+workload_name=<workload-name>
+oc describe workload -n $namespace $workload_name
+```
+
+4. Consider reducing the cluster queue's nominal quota if resource usage is consistently low.
+You can patch the ClusterQueue with the following command. Note that you must change the values to refer to the exact resource you want to change.
+This example sets the nominal quota for `cpu` to 10 in the first flavor of the first resource group of the cluster queue:
+```bash
+oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
+```
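+
+To confirm the patch was applied, you can read the value back. The jsonpath below assumes the same first resource group, flavor, and resource that the patch above targets:
+```bash
+oc get clusterqueue $cluster_queue -o jsonpath='{.spec.resourceGroups[0].flavors[0].resources[0].nominalQuota}'
+```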
diff --git a/docs/alerts/runbooks/pending-workload-pods.md b/docs/alerts/runbooks/pending-workload-pods.md
new file mode 100644
index 0000000000..8b6d6db232
--- /dev/null
+++ b/docs/alerts/runbooks/pending-workload-pods.md
@@ -0,0 +1,37 @@
+# Pending Workload Pods
+
+## Severity: Info
+
+## Impact
+Knowing that a pod has been in a prolonged pending state allows users to troubleshoot and fix the underlying issue so that their workloads can run successfully.
+
+## Summary
+
+This alert is triggered when a pod is in the pending state for more than 3 days.
+
+## Steps
+
+1. Identify the pending pod. Replace `<project-namespace>` below with the name of your project namespace.
+```bash
+namespace=<project-namespace>
+oc get pods -A --field-selector=status.phase=Pending # This will show all pods in the cluster with Pending status
+oc get pods -n $namespace --field-selector=status.phase=Pending # This will show all pods in the specified namespace with Pending status
+```
+
+2. Get further details on the pod. Replace `<pod-name>` with the name of the pending pod.
+```bash
+pod=<pod-name>
+oc describe pod $pod -n $namespace
+```
+
+3. Review the pod logs, if any are available, to help determine why the pod is pending.
+```bash
+oc logs $pod -n $namespace
+```
+
+4. Review the pod events to determine why the pod is in a pending state.
+```bash
+oc get events --field-selector involvedObject.name=$pod --namespace=$namespace
+```
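+
+One common cause of a prolonged Pending state is that the scheduler cannot place the pod. As a quick check, you can filter the events for scheduling failures; the `FailedScheduling` reason below is the event reason reported by the default scheduler:
+```bash
+oc get events --namespace=$namespace --field-selector involvedObject.name=$pod,reason=FailedScheduling
+```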
+
+5. Review the results of the steps above to determine the best course of action for successfully running the workload.
diff --git a/docs/alerts/runbooks/resource-reservation-exceeds-quota.md b/docs/alerts/runbooks/resource-reservation-exceeds-quota.md
new file mode 100644
index 0000000000..02fab78fa4
--- /dev/null
+++ b/docs/alerts/runbooks/resource-reservation-exceeds-quota.md
@@ -0,0 +1,56 @@
+# Resource Reservation Exceeds Quota
+
+## Severity: Info
+
+## Impact
+
+Knowledge of over-requested resources allows the user to adjust the nominal quota or the resources requested by a workload.
+
+## Summary
+
+This alert is triggered when the resource reservation is 10 times the available nominal quota in a cluster queue.
+
+## Steps
+
+1. Check the current resource reservation for the cluster queue and ensure that the nominal quota for the resource in question is correctly configured. Replace `<cluster-queue-name>` in the commands below with the name of the cluster queue, then describe it.
+```bash
+cluster_queue=<cluster-queue-name>
+oc describe clusterqueue $cluster_queue
+```
+
+   - If you would like to view just the Flavors Reservation and Flavors Usage, you can use the following command:
+```bash
+oc describe clusterqueue $cluster_queue | awk '/Flavors Reservation:/,/^$/'
+```
+
+2. Review the workloads linked to the cluster queue to see if the requested resources are required.
+```bash
+# Find local queues linked to the cluster queue
+local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')
+
+# Find workloads linked to the local queues
+for local_queue in $local_queues; do
+    namespace=$(echo $local_queue | cut -d '/' -f 1)
+    queue_name=$(echo $local_queue | cut -d '/' -f 2)
+
+    echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."
+
+    oc get workloads --namespace $namespace -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
+done
+```
+
+3. Review individual workloads. Replace `<namespace>` and `<workload-name>` below to view the details of a workload.
+```bash
+namespace=<namespace>
+workload_name=<workload-name>
+oc describe workload -n $namespace $workload_name
+```
+
+4. Consider increasing the cluster queue's nominal quota.
+You can patch the ClusterQueue with the following command. Note that you must change the values to refer to the exact resource you want to change.
+This example sets the nominal quota for `cpu` to 10 in the first flavor of the first resource group of the cluster queue:
+```bash
+oc patch clusterqueue $cluster_queue --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
+```
+
+5. Alternatively, consider reducing the resources requested by the pending workloads, if possible.
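+
+To see what a pending workload is requesting before changing it, you can print its pod set resource requests. The jsonpath below assumes the standard Kueue Workload layout of `spec.podSets[].template.spec`:
+```bash
+oc get workload -n $namespace $workload_name -o jsonpath='{.spec.podSets[*].template.spec.containers[*].resources.requests}'
+```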