forked from kubernetes-sigs/kueue
Commit e636617 (1 parent: bbcb742). Showing 5 changed files with 192 additions and 0 deletions.
@@ -0,0 +1,42 @@
# Kueue Pod Down

## Severity: Critical

## Impact

While the Kueue controller pod is down, workloads on the cluster cannot use the Kueue component.

## Summary

This alert is triggered when the `kube_pod_status_ready` query shows that the Kueue controller pod is not ready.

## Steps
1. Check whether the `kueue-controller` pod is running in the `redhat-ods-applications` namespace:

```bash
oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue
```
2. If the pod is not running, inspect its logs and events to find the cause. Save the logs and events so they can be shared with the engineering team later:

```bash
# Check pod logs
oc -n redhat-ods-applications logs -l app.kubernetes.io/name=kueue --prefix=true

# Check events
oc -n redhat-ods-applications get events | grep pod/kueue-controller

# Check pod status fields
oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue -o jsonpath="{range .items[*]}{.status}{\"\n\n\"}{end}"
```
3. Redeploy the Kueue controller by restarting its deployment:

```bash
oc -n redhat-ods-applications rollout restart deployments/kueue-controller-manager
```

This should deploy a new pod. Repeat step 1 and check whether the pod reaches the Running state.

4. If the problem persists, capture the logs and escalate to the RHOAI engineering team.
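
As a rough sketch of how the check in step 1 could be scripted, the helper below reads the default `oc get pods` table from standard input and prints every pod whose READY column is incomplete or whose STATUS is not `Running`. The function name `not_ready_pods` is hypothetical and not part of `oc` or Kueue, and the column layout (NAME, READY, STATUS, RESTARTS, AGE) is an assumption about the default output.

```bash
# Hypothetical helper (not part of oc or Kueue): list pods that are not fully ready.
# Assumes the default `oc get pods` columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '$1 != "NAME" {
    split($2, ready, "/")   # READY looks like "1/1": ready containers / total containers
    if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
  }'
}

# Example usage against a live cluster:
#   oc -n redhat-ods-applications get pods -l app.kubernetes.io/name=kueue | not_ready_pods
```

An empty result suggests that every matching pod is running and ready; any printed line is a candidate for the log and event checks in step 2.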
@@ -0,0 +1,53 @@
# Low Cluster Queue Resource Usage

## Severity: Info

## Impact

Resources that are consistently unused can be redistributed.

## Summary

This alert is triggered when the resource usage in a cluster queue is below 20% of its nominal quota for more than 1 day.
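
To make the 20% threshold concrete, here is a minimal sketch of the comparison. It assumes that usage and quota have already been read from the cluster queue and converted to plain integers in the same unit (for example whole CPUs); the function name `below_threshold` is hypothetical, and the real alert expression may differ.

```bash
# Hypothetical helper: succeed (exit 0) when usage is below a percentage of the nominal quota.
# Assumes usage and quota are non-negative integers in the same unit (e.g. whole CPUs).
below_threshold() {
  local usage=$1 quota=$2 percent=${3:-20}
  # usage/quota < percent/100, rewritten without division to stay in integer arithmetic
  [ $(( usage * 100 )) -lt $(( quota * percent )) ]
}

# Worked example: 3 CPUs used out of a nominal quota of 20 is 15%, which is below 20%.
if below_threshold 3 20; then
  echo "usage is below 20% of the nominal quota"
fi
```

Quantities such as `500m` or `10Gi` would need to be normalized before they can be compared this way.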
## Steps

1. Check the current resource usage for the cluster queue and confirm that the nominal quota for the resource in question is configured correctly. Replace `<cluster-queue-name>` in the script below with the name of the cluster queue.
```bash
cluster_queue="<cluster-queue-name>"  # replace with the name of your cluster queue
oc describe clusterqueue "$cluster_queue"
```
- To view only the Flavors and Nominal Quota, use the following command:
```bash
oc describe clusterqueue "$cluster_queue" | awk '/Flavors:/,/^$/'
```
2. Review the workloads linked to the cluster queue to see whether the assigned resources are actually required.
```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo "$local_queue" | cut -d '/' -f 1)
  queue_name=$(echo "$local_queue" | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace "$namespace" -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```
3. Review individual workloads. Replace `<namespace>` and `<workload-name>` in the script below to view the details of a workload.
```bash
namespace="<namespace>"          # replace with the workload's namespace
workload_name="<workload-name>"  # replace with the workload's name
oc describe workload -n "$namespace" "$workload_name"
```
4. Consider reducing the cluster queue nominal quota if resource usage is consistently low.
You can patch the cluster queue with the following command. Adjust the indices in the path and the value so that they refer to the exact resource you want to change.
As written, the command sets the nominal quota of the first resource (assumed here to be cpu) of the first flavor in the first resource group to 10:
```bash
oc patch clusterqueue "$cluster_queue" --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```
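
Because the patch path depends on three zero-based indices, it can help to build the patch from parameters rather than edit the JSON by hand. The sketch below is one way to do that; `quota_patch` is a hypothetical helper, and the indices must match the order of `resourceGroups`, `flavors`, and `resources` in your cluster queue, which you can inspect with `oc get clusterqueue "$cluster_queue" -o yaml`.

```bash
# Hypothetical helper: build the JSON Patch document for one nominalQuota entry.
# Arguments: resource group index, flavor index, resource index (all zero-based), new value.
quota_patch() {
  local group=$1 flavor=$2 resource=$3 value=$4
  printf '[{"op": "replace", "path": "/spec/resourceGroups/%s/flavors/%s/resources/%s/nominalQuota", "value": "%s"}]\n' \
    "$group" "$flavor" "$resource" "$value"
}

# Example usage: second resource of the first flavor in the first resource group.
#   oc patch clusterqueue "$cluster_queue" --type='json' -p="$(quota_patch 0 0 1 64Gi)"
```

Calling `quota_patch 0 0 0 10` reproduces the patch used in step 4.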
@@ -0,0 +1,37 @@
# Pending Workload Pods

## Severity: Info

## Impact
Knowing which pods have been pending for a prolonged period lets users troubleshoot and fix the underlying issues so that their workloads can run successfully.

## Summary

This alert is triggered when a pod has been in the Pending state for more than 3 days.

## Steps
1. Identify the pending pod in your project namespace. Replace `<project-namespace>` below with the name of your project namespace.
```bash
namespace="<project-namespace>"  # replace with the name of your project namespace
oc get pods -A --field-selector=status.phase=Pending # This will show all pods in the cluster with Pending status
oc get pods -n "$namespace" --field-selector=status.phase=Pending # This will show all pods in the specified namespace with Pending status
```
2. Get further details on the pod.
```bash
pod="<pod-name>"  # replace with the name of the pending pod
oc describe pod "$pod" -n "$namespace"
```
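3. Review the pod logs to determine why it is in a pending state. A pod whose containers have not started yet has no logs, so this command may return an error; in that case, rely on the events in the next step.
```bash
oc logs "$pod" -n "$namespace"
```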
4. Review the pod events to determine why it is in a pending state.
```bash
oc get events --field-selector involvedObject.name="$pod" --namespace="$namespace"
```

5. Use the results of the steps above to decide how to get the workload running successfully.
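
When a namespace produces many events, narrowing the output to scheduling failures can speed up the review in step 4. The sketch below assumes the default `oc get events` columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE), so the reason is the third whitespace-separated field of each data row; `scheduling_failures` is a hypothetical helper name, and other reasons may be just as relevant for your workload.

```bash
# Hypothetical helper: keep only events whose REASON column is FailedScheduling.
# Assumes the default `oc get events` columns: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
# The header row is dropped automatically because its third field is "TYPE".
scheduling_failures() {
  awk '$3 == "FailedScheduling"'
}

# Example usage against a live cluster:
#   oc get events --namespace="$namespace" | scheduling_failures
```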
docs/alerts/runbooks/resource-reservation-exceeds-quota.md (56 changes: 56 additions & 0 deletions)
@@ -0,0 +1,56 @@
# Resource Reservation Exceeds Quota

## Severity: Info

## Impact

Knowing which resources are over-requested lets the user adjust the nominal quota or the resources requested by a workload.

## Summary

This alert is triggered when resource reservation is 10 times the available nominal quota in a cluster queue.

## Steps
1. Check the current resource reservation for the cluster queue and confirm that the nominal quota for the resource in question is configured correctly. Replace `<cluster-queue-name>` in the script below with the name of the cluster queue.
```bash
cluster_queue="<cluster-queue-name>"  # replace with the name of your cluster queue
oc describe clusterqueue "$cluster_queue"
```

- To view only the Flavors Reservation and Flavors Usage, use the following command:
```bash
oc describe clusterqueue "$cluster_queue" | awk '/Flavors Reservation:/,/^$/'
```
2. Review the workloads linked to the cluster queue to see whether the requested resources are actually required.
```bash
# Find local queues linked to the cluster queue
local_queues=$(oc get localqueues --all-namespaces -o json | jq -r --arg clusterQueue "$cluster_queue" '.items[] | select(.spec.clusterQueue == $clusterQueue) | "\(.metadata.namespace)/\(.metadata.name)"')

# Find workloads linked to the local queues
for local_queue in $local_queues; do
  namespace=$(echo "$local_queue" | cut -d '/' -f 1)
  queue_name=$(echo "$local_queue" | cut -d '/' -f 2)

  echo "Checking workloads linked to local queue $queue_name in namespace $namespace..."

  oc get workloads --namespace "$namespace" -o json | jq -r --arg queueName "$queue_name" '.items[] | select(.spec.queueName == $queueName) | "\(.metadata.namespace)/\(.metadata.name)"'
done
```
3. Review individual workloads. Replace `<namespace>` and `<workload-name>` in the script below to view the details of a workload.
```bash
namespace="<namespace>"          # replace with the workload's namespace
workload_name="<workload-name>"  # replace with the workload's name
oc describe workload -n "$namespace" "$workload_name"
```
4. Consider increasing the cluster queue nominal quota.
You can patch the cluster queue with the following command. Adjust the indices in the path and the value so that they refer to the exact resource you want to change.
As written, the command sets the nominal quota of the first resource (assumed here to be cpu) of the first flavor in the first resource group to 10:
```bash
oc patch clusterqueue "$cluster_queue" --type='json' -p='[{"op": "replace", "path": "/spec/resourceGroups/0/flavors/0/resources/0/nominalQuota", "value": "10"}]'
```

5. Alternatively, consider adjusting the resources requested by the pending workloads, if possible.
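
The loop in step 2 splits each `namespace/name` reference by calling `cut` twice. A minimal alternative sketch, assuming the references arrive one per line on standard input, lets `read` do the split with `IFS=/`; the function name `split_queue_refs` is hypothetical.

```bash
# Hypothetical helper: read "namespace/name" lines from stdin and print "namespace name".
split_queue_refs() {
  while IFS=/ read -r namespace queue_name; do
    [ -n "$namespace" ] || continue   # skip empty lines
    printf '%s %s\n' "$namespace" "$queue_name"
  done
}

# Example usage with the variable from step 2:
#   printf '%s\n' $local_queues | split_queue_refs
```

Reading line by line avoids spawning two subprocesses per queue and keeps the namespace and queue name paired in a single step.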