GPU resources are not recovered even after the XID error is resolved #1065

Open
jslouisyou opened this issue Oct 25, 2024 · 2 comments

Comments

@jslouisyou

Hello, NVIDIA team.

I recently faced an issue where the GPU resources (nvidia.com/gpu) reported by the kubelet are not recovered (e.g. from 7 back to 8) even after the XID error is resolved.

The nvidia-device-plugin-daemonset is deployed by gpu-operator, and I'm using gpu-operator v23.9.2.

Here are more details:

I found that only 7 GPU cards were shown in Kubernetes, even though there are 8 GPU cards in the H100 node:

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             7       <=========== here
  pods:                       110

nvidia-device-plugin-daemonset reports that an XID 94 error occurred on one of the GPU cards:

I1025 02:19:08.002792       1 health.go:151] Skipping non-nvmlEventTypeXidCriticalError event: {Device:{Handle:0x7f0dcf40bdf8} EventType:2 EventData:0 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048144       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048185       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.048239       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.049436       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.049451       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.049483       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.059938       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.059948       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.059980       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.074343       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.074366       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.074389       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
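
For context, my understanding is that this health check follows the usual NVML event-watching pattern. The minimal sketch below (my own illustration using github.com/NVIDIA/go-nvml, not the plugin's actual health.go) shows how an Xid critical event ends up marking a device unhealthy:

// Minimal sketch of the health-check pattern behind the log lines above:
// register every GPU for Xid critical events and flag the affected device
// when one arrives. Illustration only, not the device plugin's actual code.
package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", ret)
	}
	defer nvml.Shutdown()

	eventSet, ret := nvml.EventSetCreate()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to create event set: %v", ret)
	}

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device count: %v", ret)
	}
	for i := 0; i < count; i++ {
		device, _ := nvml.DeviceGetHandleByIndex(i)
		// Subscribe each GPU to Xid critical error events.
		if ret := device.RegisterEvents(nvml.EventTypeXidCriticalError, eventSet); ret != nvml.SUCCESS {
			log.Printf("could not register events for device %d: %v", i, ret)
		}
	}

	for {
		// Wait up to 5 seconds for the next event.
		event, ret := eventSet.Wait(5000)
		if ret == nvml.ERROR_TIMEOUT {
			continue
		}
		if ret != nvml.SUCCESS {
			log.Printf("event wait failed: %v", ret)
			continue
		}
		if event.EventType != nvml.EventTypeXidCriticalError {
			// Corresponds to the "Skipping non-nvmlEventTypeXidCriticalError event" log line.
			continue
		}
		uuid, _ := event.Device.GetUUID()
		// The plugin marks the device unhealthy at this point and, as reported
		// in this issue, never re-checks it afterwards.
		log.Printf("XidCriticalError: Xid=%d on Device=%s; marking device as unhealthy", event.EventData, uuid)
	}
}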

But after some time elapsed, it seems that the XID error was resolved (I think the application was restarted or removed). I can't find any XID error in nvidia-smi:

$ nvidia-smi
Fri Oct 25 11:35:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1A:00.0 Off |                    2 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:40:00.0 Off |                    0 |
| N/A   31C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:53:00.0 Off |                    0 |
| N/A   31C    P0              74W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:66:00.0 Off |                    0 |
| N/A   33C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9C:00.0 Off |                    0 |
| N/A   35C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   32C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D2:00.0 Off |                    0 |
| N/A   34C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:E4:00.0 Off |                    0 |
| N/A   31C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

But even though the XID error is resolved, nvidia-device-plugin-daemonset doesn't fetch the new status of the GPU cards and report it to the kubelet, so the kubelet thinks that only some of the GPU cards can be used.

After I restarted the nvidia-device-plugin-daemonset pod, it reported to the kubelet that 8 GPU cards can be used (the number of nvidia.com/gpu changed in Allocatable):

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             8       <=========== here is changed
  pods:                       110

I think nvidia-device-plugin-daemonset should re-fetch the GPU status once the error is gone and report it to the kubelet; see the sketch below for one possible approach.
Could you please take a look at this issue?
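
For example, here is a rough sketch of the kind of recovery logic I have in mind (purely hypothetical; markUnhealthy, markHealthy, and the 10-minute quiet period are my own names and assumptions, not existing device plugin code):

// Hypothetical sketch of the requested recovery behavior: after a device is
// marked unhealthy due to an Xid event, re-mark it healthy once no further
// Xid events have been seen for a configurable quiet period.
package main

import (
	"log"
	"sync"
	"time"
)

// healthTracker remembers when each device last reported an Xid event.
type healthTracker struct {
	mu      sync.Mutex
	lastXid map[string]time.Time // device UUID -> time of last Xid event
}

// markUnhealthy records an Xid event; the NVML event loop would call this.
func (t *healthTracker) markUnhealthy(uuid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastXid[uuid] = time.Now()
	log.Printf("device %s marked unhealthy", uuid)
}

// recoverLoop re-marks a device healthy once no Xid events have arrived for
// the quiet period, so an updated device list can be sent to the kubelet.
func (t *healthTracker) recoverLoop(quiet time.Duration, markHealthy func(uuid string)) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		t.mu.Lock()
		for uuid, last := range t.lastXid {
			if time.Since(last) > quiet {
				delete(t.lastXid, uuid)
				markHealthy(uuid)
			}
		}
		t.mu.Unlock()
	}
}

func main() {
	tracker := &healthTracker{lastXid: make(map[string]time.Time)}
	go tracker.recoverLoop(10*time.Minute, func(uuid string) {
		log.Printf("device %s marked healthy again; resending device list to kubelet", uuid)
	})
	// The NVML event loop would call markUnhealthy whenever an Xid event arrives.
	tracker.markUnhealthy("GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41")
	select {} // keep running (illustration only)
}

With something like this, the plugin could resend an updated device list to the kubelet instead of requiring a manual pod restart.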

Thanks.

@jslouisyou
Author

@nwilliams-bdai

I agree that it seems like Xid 94 is essentially an application error and should not disable the device.
But as a workaround you can tell it to ignore this by setting the device plugin's environment variable DP_DISABLE_HEALTHCHECKS to 94.
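
To illustrate the idea, here is a hypothetical sketch of how such an Xid skip list could be applied (this is not the device plugin's actual parsing of DP_DISABLE_HEALTHCHECKS, just an outline of the filtering):

// Hypothetical illustration of skipping selected Xids in a health check based
// on an environment variable. NOT the device plugin's actual implementation.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSkippedXids reads a comma-separated list of Xid numbers from envVar.
func parseSkippedXids(envVar string) map[uint64]bool {
	skipped := make(map[uint64]bool)
	for _, field := range strings.Split(os.Getenv(envVar), ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		if xid, err := strconv.ParseUint(field, 10, 64); err == nil {
			skipped[xid] = true
		}
	}
	return skipped
}

func main() {
	os.Setenv("DP_DISABLE_HEALTHCHECKS", "94") // the workaround suggested above
	skipped := parseSkippedXids("DP_DISABLE_HEALTHCHECKS")

	// In the health loop, an Xid critical event would only mark the device
	// unhealthy if its Xid is not in the skip list.
	xid := uint64(94)
	if skipped[xid] {
		fmt.Printf("Xid=%d is in the skip list; device stays healthy\n", xid)
	} else {
		fmt.Printf("Xid=%d would mark the device unhealthy\n", xid)
	}
}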
