Compilation error with type mismatch, when building with PyTorch and Kokkos #55

Open
moravveji opened this issue Oct 25, 2024 · 1 comment

Comments

@moravveji

Dear maintainers,

At a user's request, I am trying to install LAMMPS-allegro on two different generations of NVIDIA GPU nodes. We use Rocky 8 as the OS and NVIDIA driver version 560.x.x. The two node types are:

  1. NVIDIA A100 GPU on an Intel Icelake node (CUDA compute capability is fixed to 8.0)
  2. NVIDIA H100 GPU on an AMD Zen4 node (hence kokkos_arch='ZEN3', and CUDA compute capability is set to 9.0)

In both cases, the build later fails with the same compilation error. I have trimmed the error message heavily, but the essence of the issue is:

            function "__half::operator unsigned long long() const" (declared at line 250 of /apps/leuven/rocky8/icelake/2023a/software/CUDA/12.1.1/include/cuda_fp16.hpp)
            function "__half::operator bool() const" (declared at line 254 of /apps/leuven/rocky8/icelake/2023a/software/CUDA/12.1.1/include/cuda_fp16.hpp)
          __A28, __A29, __A30, __A31 };
                               ^

/vsc-hard-mounts/leuven-apps/rocky8/icelake/2023a/software/GCCcore/12.3.0/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/avx512fp16intrin.h(2765): error: argument of type "const __half *" is incompatible with parameter of type "const unsigned *"
    return __builtin_ia32_loadsh_mask (__C, __A, __B);
                                       ^

nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
make[2]: *** [CMakeFiles/lammps.dir/build.make:2981: CMakeFiles/lammps.dir/dev/shm/x0090231/eb/LAMMPS/2Aug2023_update2/foss-2023a-pair_allegro-kokkos-PyTorch-2.1.2-CUDA-12.1.1/lammps-stable_2Aug2023_update2/src/force.cpp.o] Error 9

I should mention that an unpatched installation of exactly the same LAMMPS release, with the same toolchain on the same node, went very smoothly. For clarity, I have attached the EasyBuild easyconfig file used for the installation, together with the EasyBuild compilation logfile.

Furthermore, the following error also occurs, e.g. when compiling src/force.cpp (please see the logfile):

nvcc_wrapper - *warning* you have set multiple optimization flags (-O*), only the last is used because nvcc can only accept a single optimization setting.
/vsc-hard-mounts/leuven-apps/rocky8/icelake/2023a/software/GCCcore/12.3.0/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type
  typedef __half __v8hf __attribute__ ((__vector_size__ (16)));

Given that this issue happens only when patching with allegro and eventually building against Kokkos/CUDA, I decided to post it here. I hope this is the right place for it.

Please let me know if any additional information is needed.
lammps-torch.tar.gz

@anjohan
Collaborator

anjohan commented Nov 4, 2024

Hi,

Sorry for the late reply! This issue didn't look fun. The fact that it fails on compiler header files etc. is a bad sign and points to an environment issue.

I don't have a direct answer, but here are a few random thoughts:

  • Your GCC 12.3 is too new for CUDA 12.1 (see the compatibility table).
  • Your LAMMPS version is quite old.
  • I have no experience with EasyBuild. Could you try to just do the usual building sequence interactively?
  • How do you get your PyTorch? Was it built with the CXX11 ABI? It looks like your version is 2.1, which is scary. 1.11 works, 1.12 and 1.13 don't (at least on NVIDIA), and the early 2.x versions also don't (but I'm not sure exactly which ones). I would recommend using something recent (2.4/2.5).
  • Setting the CPU arch for Kokkos is unnecessary (won't affect your performance unless you're using a lot of CPU functionality) and complicates things with extra flags.
  • You're enabling a lot of LAMMPS packages, maybe these affect compiler flags? Try a base version of LAMMPS w/Allegro and no extra packages.
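
Several of the points above (GCC version, CUDA release, PyTorch version and ABI) can be collected in one go before rebuilding. The following is only a minimal sketch, assuming Python is available in the build environment and that `gcc` and `nvcc` are the compilers actually picked up by the build. The helper names (`parse_gcc_version`, `parse_nvcc_release`, `tool_output`, `torch_build_info`) are hypothetical and not part of LAMMPS, Allegro, or EasyBuild. `torch.compiled_with_cxx11_abi()` is an existing PyTorch function. The script only reports versions; whether a given GCC is supported by a given CUDA release still has to be checked against NVIDIA's documentation.

```python
import re
import shutil
import subprocess


def parse_gcc_version(text):
    """Return (major, minor, patch) from `gcc --version` output, or None."""
    match = re.search(r"\b(\d+)\.(\d+)\.(\d+)\b", text)
    return tuple(int(g) for g in match.groups()) if match else None


def parse_nvcc_release(text):
    """Return (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return tuple(int(g) for g in match.groups()) if match else None


def tool_output(cmd):
    """Run `cmd --version` if the tool is on PATH; return its stdout, else None."""
    if shutil.which(cmd) is None:
        return None
    result = subprocess.run([cmd, "--version"], capture_output=True, text=True)
    return result.stdout


def torch_build_info():
    """Return PyTorch version, its CUDA version and the CXX11 ABI flag.

    Returns None if PyTorch cannot be imported in this environment.
    """
    try:
        import torch
    except (ImportError, OSError):
        return None
    return {
        "version": torch.__version__,
        "cuda": torch.version.cuda,
        "cxx11_abi": torch.compiled_with_cxx11_abi(),
    }


if __name__ == "__main__":
    for tool, parser in (("gcc", parse_gcc_version), ("nvcc", parse_nvcc_release)):
        out = tool_output(tool)
        print(tool, parser(out) if out else "not found on PATH")
    print("torch", torch_build_info() or "not importable")
```

Running this inside the same module environment that EasyBuild uses, and again in an interactive shell, should make it easier to see whether the two setups differ.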
