
cudaPackages: improve the handling of cuda_compat #273797

Open · yannham opened this issue Dec 12, 2023 · 6 comments
Labels: 6.topic: cuda (Parallel computing platform and API)

Comments

yannham (Contributor) commented Dec 12, 2023

@NixOS/cuda-maintainers

Issue description

#267247 is a first step toward enabling cuda_compat by default on platforms that support it (currently, the Jetson). However, other solutions, potentially better in the longer term, were mentioned there. This issue gathers them so they aren't forgotten.

Current situation

#267247 adds cuda_compat to the DT_RUNPATH of the members of the CUDA package set, using a mechanism very similar to autoAddOpenGLRunpathHook (which, as of now, would more accurately be called autoAddDriverPath).
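For illustration, here is a minimal sketch of what such an autoAddCudaCompatRunpath-style fixup hook boils down to. The function name, the @cudaCompatDir@ placeholder and the file selection are assumptions for the example, not the actual code from #267247:

```bash
# Minimal sketch of a fixup hook that prepends cuda_compat's library directory
# to the DT_RUNPATH of every ELF shared object in the output (illustrative only).
addCudaCompatRunpath() {
  local lib
  local compatDir="@cudaCompatDir@"  # would be substituted with cuda_compat's lib dir
  while IFS= read -r -d '' lib; do
    isELF "$lib" || continue
    # Prepend cuda_compat so its libcuda.so takes precedence over the driver's
    # copy found via /run/opengl-driver.
    patchelf --set-rpath "$compatDir:$(patchelf --print-rpath "$lib")" "$lib"
  done < <(find "${prefix:-$out}" -type f -name '*.so*' -print0)
}
postFixupHooks+=(addCudaCompatRunpath)
```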

Limitations

As mentioned in #267247 (comment), things can get hairy if we actually want not to use the current cuda_compat for whatever reason (this should be rare, but isn't impossible). The story also isn't entirely clear on non-NixOS systems (Ubuntu-based JetPack).

Alternatives or improvements

Impure binding: put it in /run/opengl-driver

On jetpack-nixos, one possible alternative is to make jetpack-nixos responsible for making cuda_compat available in the driver path: that is, both the decision and the mechanics of putting cuda_compat/lib/libcuda.so in /run/opengl-driver in place of the original driver would live there. This is easier to change dynamically, without impacting Nixpkgs. It also means CUDA packages don't have to care about cuda_compat and autoAddCudaCompatRuntimePath at all (besides making cuda_compat available as a package).

This can't be done currently because cuda_compat isn't available in a released NixOS at the time of writing (though it will probably be backported to 23.11), and jetpack-nixos is still based on 22.11. As long as jetpack-nixos isn't based on a NixOS version recent enough to include cudaPackages.cuda_compat, this approach isn't possible.
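As a rough, hedged sketch of the end state this alternative aims for (the compat subdirectory, the lib layout under /run/opengl-driver, and doing it by hand are all assumptions; in practice a jetpack-nixos module would arrange this, not a shell session):

```bash
# Illustration only: the impure binding means the driver path exposes
# cuda_compat's libcuda.so instead of the Tegra driver's copy.
cuda_compat=$(nix build --print-out-paths nixpkgs#cudaPackages.cuda_compat)
ln -sf "$cuda_compat"/compat/libcuda.so* /run/opengl-driver/lib/
# CUDA programs keep loading libcuda via /run/opengl-driver as usual, so
# cuda_compat can be swapped in or out without rebuilding the CUDA package set.
```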

Pure binding: use stubs for missing libs

This is a variant of the approach of #267247, and what we actually tried first. The idea is simply to add cuda_compat to the required buildInputs of the core CUDA packages, possibly combined with a patchelf --add-needed (since most of the time libcuda is dlopened), and to let autoPatchelfHook do its magic for the rest.

Unfortunately, libcuda itself dlopens two other impure libraries, libnvrm_mem and libnvrm_gpu, which are provided by the driver located in /run/opengl-driver. They can't be found at build time, and Nix complains. Those libraries are provided by .deb-based packages built as part of jetpack-nixos, and aren't currently available in Nixpkgs. One possibility would be to build against stubs of those libraries instead, which could then be included in Nixpkgs at a relatively low cost. Currently, jetpack-nixos has been patched instead to make those libraries available as part of /run/opengl-driver, but a direct dependency on the actual store location is probably better. This is blocked on knowing whether those stubs exist, and on getting them from Nvidia.
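For concreteness, a hedged sketch of this variant; the exact file being patched and the stub handling are assumptions, not the actual Nixpkgs code:

```bash
# Sketch of the "pure binding" variant.
# 1. Make the libcuda dependency explicit, since it is normally only dlopen'ed:
patchelf --add-needed libcuda.so.1 "$out/lib/libcudart.so"
# 2. With cuda_compat in buildInputs, autoPatchelfHook can then resolve
#    libcuda.so.1 to the cuda_compat store path.
# 3. cuda_compat's libcuda in turn needs the impure libnvrm_mem / libnvrm_gpu,
#    which only exist under /run/opengl-driver at runtime; at build time they
#    would have to be satisfied by stub libraries (if Nvidia provides them), or
#    explicitly ignored, e.g. via autoPatchelfIgnoreMissingDeps.
```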

ConnorBaker added the 6.topic: cuda label on Dec 15, 2023
nixos-discourse commented:

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/on-nixpkgs-and-the-ai-follow-up-to-2023-nix-developer-dialogues/37087/2

SomeoneSerge (Contributor) commented:

Actually, instead of adjusting nixglhost we could just push #248547 forward. Then, instead of nixglhost saxpy, one would just set LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra. I'll see if I can afford to rebase that on master and build some samples to run on the Jetson (not sure how long the bootstrap chain is).

SomeoneSerge (Contributor) commented:

> Actually, instead of adjusting nixglhost we could just push #248547 forward. Then, instead of nixglhost saxpy, one would just set LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra. I'll see if I can afford to rebase that on master and build some samples to run on the Jetson (not sure how long the bootstrap chain is).

A working demo: #248547 (comment)
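For readers unfamiliar with the proposal, a usage sketch (the exact command form is an assumption; it assumes the patched glibc from #248547, the saxpy sample from cudaPackages, and the Tegra path mentioned above):

```bash
# On a non-NixOS Jetson today, the host driver is exposed through a wrapper:
nixglhost saxpy
# With the LD_FALLBACK_PATH proposal from #248547, no wrapper is needed; the host
# Tegra libraries are only used for what the Nix closure does not already provide:
LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra saxpy
```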

yannham (Contributor, Author) commented Mar 5, 2024

@SomeoneSerge sorry if my memory has become fuzzy, but why does LD_FALLBACK_PATH work but not LD_LIBRARY_PATH for the cuda_compat use-case?

SomeoneSerge (Contributor) commented:

> @SomeoneSerge sorry if my memory has become fuzzy, but why does LD_FALLBACK_PATH work but not LD_LIBRARY_PATH for the cuda_compat use-case?

Because of search priorities. In the Jetson case there was an older libcuda.so deployed in the location exposed by LD_{LIBRARY,FALLBACK}_PATH, and it didn't work with our cudart; with LD_FALLBACK_PATH we were still loading cuda_compat, but with LD_LIBRARY_PATH we were loading the old driver. See #248547 (comment).
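A rough illustration of the search-order difference (assumes a Jetson where the old driver libcuda.so lives in /usr/lib/aarch64-linux-gnu/tegra and the binary's DT_RUNPATH points at cuda_compat; LD_DEBUG is standard glibc, LD_FALLBACK_PATH is the #248547 proposal):

```bash
tegra=/usr/lib/aarch64-linux-gnu/tegra

# LD_LIBRARY_PATH is searched *before* DT_RUNPATH, so the host's older libcuda.so
# shadows cuda_compat and the program fails against the newer cudart:
LD_LIBRARY_PATH=$tegra LD_DEBUG=libs saxpy 2>&1 | grep libcuda

# LD_FALLBACK_PATH is consulted only after the normal search paths, so
# cuda_compat's libcuda.so (found via DT_RUNPATH) wins and only the genuinely
# missing libnvrm_mem / libnvrm_gpu come from the host:
LD_FALLBACK_PATH=$tegra LD_DEBUG=libs saxpy 2>&1 | grep libnvrm
```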
