Hello, NVIDIA team.

I recently faced an issue where GPU resources (`nvidia.com/gpu`) reported to kubelet are not recovered (e.g. 7 -> 8) even after an XID error is resolved. I got the `nvidia-device-plugin-daemonset` from gpu-operator, and I'm using gpu-operator v23.9.2. Here are more details:

I found that only 7 GPU cards were shown from Kubernetes, even though I'm using 8 GPU cards in an H100 node. `nvidia-device-plugin-daemonset` reported that an XID 94 error was coming from one of the GPU cards.
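For reference, here's a minimal sketch of how I check the allocatable count (this assumes the official `kubernetes` Python client; the node name below is just a placeholder):

```python
# Sketch: read a node's nvidia.com/gpu allocatable/capacity via the Kubernetes API.
# Assumes the official `kubernetes` Python client and a valid kubeconfig;
# "h100-node-1" is a placeholder node name.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

node = v1.read_node("h100-node-1")
allocatable = node.status.allocatable.get("nvidia.com/gpu", "0")
capacity = node.status.capacity.get("nvidia.com/gpu", "0")
print(f"nvidia.com/gpu allocatable={allocatable} capacity={capacity}")
```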
After some time elapsed, the XID error seems to have been resolved (I think the application was restarted or removed), and I can no longer find the XID error in `nvidia-smi` output.

But even though the XID error is resolved, `nvidia-device-plugin-daemonset` doesn't fetch the new status of the GPU cards and report it to kubelet, so kubelet thinks that only some of the GPU cards can be used. After I restarted the `nvidia-device-plugin-daemonset` pod, it reported to kubelet that 8 GPU cards can be used (the number of `nvidia.com/gpu` changed in `Allocatable`).

I think `nvidia-device-plugin-daemonset` should fetch the GPU status correctly and report it to kubelet. Could you please take a look at this issue?

Thanks.
I agree that it seems like Xid 94 is essentially an application error and should not disable the device. But as a workaround, you can tell the device plugin to ignore it by setting its `DP_DISABLE_HEALTHCHECKS` environment variable to `94`.
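Since the daemonset here comes from gpu-operator, that env var would typically be set through the ClusterPolicy rather than on the daemonset directly. Below is a rough, hedged sketch using the Kubernetes Python client; it assumes the default ClusterPolicy name `cluster-policy` and the `spec.devicePlugin.env` field, and note that a JSON merge patch like this replaces the whole `devicePlugin.env` list, so merge in any entries you already have:

```python
# Sketch: set DP_DISABLE_HEALTHCHECKS=94 on the device plugin via the gpu-operator ClusterPolicy.
# Assumptions: the ClusterPolicy CR is named "cluster-policy" (the gpu-operator default) and
# exposes spec.devicePlugin.env; a dict body is sent as a JSON merge patch, which replaces
# the existing devicePlugin.env list entirely.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

patch = {
    "spec": {
        "devicePlugin": {
            "env": [
                {"name": "DP_DISABLE_HEALTHCHECKS", "value": "94"},
            ]
        }
    }
}

api.patch_cluster_custom_object(
    group="nvidia.com",
    version="v1",
    plural="clusterpolicies",
    name="cluster-policy",
    body=patch,
)
print("ClusterPolicy patched; gpu-operator should roll out the updated device plugin daemonset.")
```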