
cudaPackages: improve the handling of cuda_compat #273797

Open · yannham opened this issue Dec 12, 2023 · 6 comments
Labels: 6.topic: cuda (Parallel computing platform and API)

Comments

yannham (Contributor) commented Dec 12, 2023

@NixOS/cuda-maintainers

Issue description

#267247 is a first step toward enabling cuda_compat by default on platforms that support it (currently, the Jetson). However, other solutions, potentially better in the longer term, were mentioned there. This issue gathers them so they aren't forgotten.

Current situation

#267247 adds cuda_compat to the DT_RUNPATH of the members of the CUDA package set, using a mechanism very similar to autoAddOpenGLRunpathHook (which, as of now, would more accurately be called autoAddDriverPath).
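For illustration, here is a minimal sketch of what such an autoAddCudaCompatRunpath-style fixup hook boils down to. The function name, the @cudaCompatDir@ placeholder and the file selection are assumptions for the example, not the actual code from #267247:

```bash
# Minimal sketch of a fixup hook that prepends cuda_compat's library directory
# to the DT_RUNPATH of every ELF shared object in the output (illustrative only).
addCudaCompatRunpath() {
  local lib
  local compatDir="@cudaCompatDir@"  # would be substituted with cuda_compat's lib dir
  while IFS= read -r -d '' lib; do
    isELF "$lib" || continue
    # Prepend cuda_compat so its libcuda.so takes precedence over the driver's
    # copy found via /run/opengl-driver.
    patchelf --set-rpath "$compatDir:$(patchelf --print-rpath "$lib")" "$lib"
  done < <(find "${prefix:-$out}" -type f -name '*.so*' -print0)
}
postFixupHooks+=(addCudaCompatRunpath)
```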

Limitations

As mentioned in #267247 (comment), things can get hairy if we actually want not to use the current cuda_compat for whatever reason (this should be rare, but isn't impossible). The story also isn't entirely clear on non-NixOS systems (Ubuntu-based JetPack).

Alternatives or improvements

Impure binding: put it in /run/opengl-driver

On jetpack-nixos, one possible alternative is to make jetpack-nixos responsible for making cuda_compat available in the driver path: that is, both the decision and the mechanics of putting cuda_compat/lib/libcuda.so in /run/opengl-driver in place of the original driver would live there. This is easier to change dynamically, without impacting Nixpkgs. It also means CUDA packages don't have to care about cuda_compat and autoAddCudaCompatRuntimePath at all (besides making cuda_compat available as a package).

This can't be done currently because cuda_compat isn't available in a released NixOS at the time of writing (though it will probably be backported to 23.11), and jetpack-nixos is still based on 22.11. As long as jetpack-nixos isn't based on a NixOS version recent enough to include cudaPackages.cuda_compat, this approach isn't possible.
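As a rough, hedged sketch of the end state this alternative aims for (the compat subdirectory, the lib layout under /run/opengl-driver, and doing it by hand are all assumptions; in practice a jetpack-nixos module would arrange this, not a shell session):

```bash
# Illustration only: the impure binding means the driver path exposes
# cuda_compat's libcuda.so instead of the Tegra driver's copy.
cuda_compat=$(nix build --print-out-paths nixpkgs#cudaPackages.cuda_compat)
ln -sf "$cuda_compat"/compat/libcuda.so* /run/opengl-driver/lib/
# CUDA programs keep loading libcuda via /run/opengl-driver as usual, so
# cuda_compat can be swapped in or out without rebuilding the CUDA package set.
```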

Pure binding: use stubs for missing libs

This is a variant of the approach of #267247, and what we actually tried first. The idea is simply to add cuda_compat to the required buildInputs of the core CUDA packages, possibly combined with a patchelf --add-needed (since most of the time libcuda is dlopened), and to let autoPatchelfHook do its magic for the rest.

Unfortunately, libcuda itself dlopens two other impure libraries, libnvrm_mem and libnvrm_gpu, which are provided by the driver located in /run/opengl-driver. They can't be found at build time, and Nix complains. Those libraries are provided by .deb-based packages built as part of jetpack-nixos, and aren't currently available in Nixpkgs. One possibility would be to build against stubs of those libraries instead, which could then be included in Nixpkgs at a relatively low cost. Currently, jetpack-nixos has been patched instead to make those libraries available as part of /run/opengl-driver, but a direct dependency on the actual store location is probably better. This is blocked on knowing whether those stubs exist, and on getting them from Nvidia.
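For concreteness, a hedged sketch of this variant; the exact file being patched and the stub handling are assumptions, not the actual Nixpkgs code:

```bash
# Sketch of the "pure binding" variant.
# 1. Make the libcuda dependency explicit, since it is normally only dlopen'ed:
patchelf --add-needed libcuda.so.1 "$out/lib/libcudart.so"
# 2. With cuda_compat in buildInputs, autoPatchelfHook can then resolve
#    libcuda.so.1 to the cuda_compat store path.
# 3. cuda_compat's libcuda in turn needs the impure libnvrm_mem / libnvrm_gpu,
#    which only exist under /run/opengl-driver at runtime; at build time they
#    would have to be satisfied by stub libraries (if Nvidia provides them), or
#    explicitly ignored, e.g. via autoPatchelfIgnoreMissingDeps.
```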

ConnorBaker added the 6.topic: cuda label on Dec 15, 2023
nixos-discourse commented:

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/on-nixpkgs-and-the-ai-follow-up-to-2023-nix-developer-dialogues/37087/2

SomeoneSerge (Contributor) commented:

Actually, instead of adjusting nixglhost we could just push #248547 forward. Then, instead of nixglhost saxpy, one would just set LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra. I'll see if I can afford to rebase that on master and build some samples to run on the Jetson (not sure how long the bootstrap chain is).

SomeoneSerge (Contributor) commented:

> Actually, instead of adjusting nixglhost we could just push #248547 forward. Then, instead of nixglhost saxpy, one would just set LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra. I'll see if I can afford to rebase that on master and build some samples to run on the Jetson (not sure how long the bootstrap chain is).

A working demo: #248547 (comment)
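For readers unfamiliar with the proposal, a usage sketch (the exact command form is an assumption; it assumes the patched glibc from #248547, the saxpy sample from cudaPackages, and the Tegra path mentioned above):

```bash
# On a non-NixOS Jetson today, the host driver is exposed through a wrapper:
nixglhost saxpy
# With the LD_FALLBACK_PATH proposal from #248547, no wrapper is needed; the host
# Tegra libraries are only used for what the Nix closure does not already provide:
LD_FALLBACK_PATH=/usr/lib/aarch64-linux-gnu/tegra saxpy
```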

yannham (Contributor, Author) commented Mar 5, 2024

@SomeoneSerge sorry if my memory has become fuzzy, but why does LD_FALLBACK_PATH work but not LD_LIBRARY_PATH for the cuda_compat use-case?

SomeoneSerge (Contributor) commented:

> @SomeoneSerge sorry if my memory has become fuzzy, but why does LD_FALLBACK_PATH work but not LD_LIBRARY_PATH for the cuda_compat use-case?

Because of search priorities. In the Jetson case there was an older libcuda.so deployed in the location exposed by LD_{LIBRARY,FALLBACK}_PATH, and it didn't work with our cudart; with LD_FALLBACK_PATH we were still loading cuda_compat, but with LD_LIBRARY_PATH we were loading the old driver. See #248547 (comment).
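A rough illustration of the search-order difference (assumes a Jetson where the old driver libcuda.so lives in /usr/lib/aarch64-linux-gnu/tegra and the binary's DT_RUNPATH points at cuda_compat; LD_DEBUG is standard glibc, LD_FALLBACK_PATH is the #248547 proposal):

```bash
tegra=/usr/lib/aarch64-linux-gnu/tegra

# LD_LIBRARY_PATH is searched *before* DT_RUNPATH, so the host's older libcuda.so
# shadows cuda_compat and the program fails against the newer cudart:
LD_LIBRARY_PATH=$tegra LD_DEBUG=libs saxpy 2>&1 | grep libcuda

# LD_FALLBACK_PATH is consulted only after the normal search paths, so
# cuda_compat's libcuda.so (found via DT_RUNPATH) wins and only the genuinely
# missing libnvrm_mem / libnvrm_gpu come from the host:
LD_FALLBACK_PATH=$tegra LD_DEBUG=libs saxpy 2>&1 | grep libnvrm
```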
