Since the new version of the NVIDIA open GPU kernel modules, which differentiates between production and LTS versions, the NVIDIA-related .so files have moved to /usr/local/glibc/lib
from /usr/local/lib in previous versions.
One of these is libnvidia-ml.so, which is required by NVIDIA cloud native stack components such as the NVIDIA Device Plugin and DCGM exporter. Both load this shared object at runtime and fail to start without it. The object files are loaded by the go-nvml library, which as far as I know only looks for them under /usr/local/bin, so either go-nvml should be modified or the .so files need to be moved back to their original place.
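To help narrow this down on an affected node or container, here is a rough diagnostic sketch. It is not taken from go-nvml or from any NVIDIA component. It only lists files matching libnvidia-ml.so in the two directories mentioned above, and checks whether the dynamic loader can resolve the library by its bare name. The soname libnvidia-ml.so.1 and the helper names `find_shared_objects` and `can_dlopen` are assumptions made for illustration.

```python
import ctypes
import glob
import os

# The two directories named in this issue. The search order actually used
# by go-nvml is not confirmed here.
CANDIDATE_DIRS = ["/usr/local/glibc/lib", "/usr/local/lib"]


def find_shared_objects(name_prefix, dirs):
    """Return paths of files in `dirs` whose names start with `name_prefix`."""
    matches = []
    for d in dirs:
        matches.extend(sorted(glob.glob(os.path.join(d, name_prefix + "*"))))
    return matches


def can_dlopen(name):
    """Return True if the dynamic loader can load `name` (e.g. via its default search path)."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print("found on disk:", find_shared_objects("libnvidia-ml.so", CANDIDATE_DIRS))
    # Assumed soname; adjust if your driver ships a different one.
    print("loadable by name:", can_dlopen("libnvidia-ml.so.1"))
```

If the file is found on disk but is not loadable by name, the directory is probably missing from the loader's search path (for example LD_LIBRARY_PATH or the ld.so cache) inside the container.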
Can you give us a minimal reproducer of the problem please? E.g. you kubectl apply this, and it fails to run.
Sure, if you run the following:
kubectl run --image nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04 dcgm-exporter --namespace default
it fails with the following error message:
Error: Failed to initialize NVML
time="2024-12-09T15:07:24Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"