Nvidia cloud native stack fails to start with new nvidia open gpu kernel modules #544

Open
Tracked by #9825
tmojzes opened this issue Dec 9, 2024 · 4 comments

@tmojzes

tmojzes commented Dec 9, 2024

Since the new version of the NVIDIA open GPU kernel modules, which differentiates between production and LTS versions, the NVIDIA shared object (.so) files have moved from /usr/local/lib (their location in previous versions) to /usr/local/glibc/lib.

Some of these, e.g. libnvidia-ml.so, are required by NVIDIA cloud native stack components such as the NVIDIA Device Plugin and DCGM exporter. Both load this shared object at runtime and fail to start without it. The loading is done by the go-nvml library, which as far as I know only looks for these files under /usr/local/bin, so either go-nvml should be modified or the .so files need to be moved back to their original location.
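
One way to confirm where the library actually ended up is to check both directories directly. The following is only a sketch: the helper name find_nvml is made up for illustration, and it has to be run from an environment where those directories are visible.

```shell
#!/bin/sh
# Print every libnvidia-ml.so* file found in the directories given as arguments.
# find_nvml is a hypothetical helper name, not part of any NVIDIA tooling.
find_nvml() {
    for dir in "$@"; do
        for lib in "$dir"/libnvidia-ml.so*; do
            # An unmatched glob stays literal, so confirm the path really exists.
            if [ -e "$lib" ]; then
                echo "$lib"
            fi
        done
    done
}

# Compare the old and new locations mentioned in this issue.
find_nvml /usr/local/lib /usr/local/glibc/lib
```

If only the second directory produces output, the files have moved as described and anything hard-coded to the old path will miss them.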

@smira
Member

smira commented Dec 9, 2024

Can you give us a minimal reproducer of the problem please? E.g. you kubectl apply this, and it fails to run.

@tmojzes
Author

tmojzes commented Dec 9, 2024

> Can you give us a minimal reproducer of the problem please? E.g. you kubectl apply this, and it fails to run.

Sure. If you run the following:
kubectl run --image nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04 dcgm-exporter --namespace default
it fails with the following error message:
Error: Failed to initialize NVML
time="2024-12-09T15:07:24Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

@frezbo
Member

frezbo commented Dec 11, 2024

@tmojzes are you sure the pod was run with runtimeClassName set?

The DCGM pod seems to run fine in testing.

@frezbo
Member

frezbo commented Dec 11, 2024

I can reproduce the same error when runtimeClassName is not set.
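
For anyone retrying the reproducer, a rough sketch of the same kubectl run command with runtimeClassName set through --overrides follows. The runtime class name nvidia is an assumption (use whatever RuntimeClass your cluster defines), and pod_overrides is just an illustrative helper. The script prints the command instead of running it.

```shell
#!/bin/sh
# RUNTIME_CLASS is an assumption: replace it with the RuntimeClass defined in your cluster.
RUNTIME_CLASS="${RUNTIME_CLASS:-nvidia}"

# Build the JSON override that kubectl run merges into the generated pod spec.
# pod_overrides is a hypothetical helper name used only in this sketch.
pod_overrides() {
    printf '{"apiVersion":"v1","spec":{"runtimeClassName":"%s"}}' "$1"
}

# Print the full command for review rather than executing it.
echo kubectl run dcgm-exporter --namespace default \
    --image nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04 \
    --overrides "'$(pod_overrides "$RUNTIME_CLASS")'"
```

Copy the printed command into a terminal once the runtime class name matches your setup.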
