In our cluster environment, the prebuilt libtorch 1.11.0 doesn't work properly with OpenMPI. I built LAMMPS-Allegro with the prebuilt libtorch 1.11.0, but when I submit a job with multiple GPUs, nothing is written to the output folder even though Slurm reports the simulation as running.
So I built PyTorch 1.11.0 with CMake from a virtual environment, using the following CMake settings:
Then I tried to configure LAMMPS-Allegro (with Kokkos and OpenMP) with CMake, using the PyTorch I compiled in the same virtual environment. These are the CMake settings I used for LAMMPS-Allegro with Kokkos and OpenMP:
However, I see the following error messages when I try to configure LAMMPS-Allegro with OpenMP and Kokkos:
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:14 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/utils.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:17 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/threads.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:88 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/cuda.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:109 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/mkl.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:112 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/public/mkldnn.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:116 (include):
include could not find requested file:
/home/Sourcecode_Pytorch1110/build/Caffe2Targets.cmake
Call Stack (most recent call first):
/home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:186 (set_target_properties):
set_target_properties Can not find target to add properties to: torch
Call Stack (most recent call first):
CMakeLists.txt:1082 (find_package)
CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:191 (set_property):
set_property could not find TARGET torch. Perhaps it has not yet been
created.
Call Stack (most recent call first):
CMakeLists.txt:1082 (find_package)
-- Found Torch: /home/Sourcecode_Pytorch1110/build/lib/libtorch.so
-- Configuring incomplete, errors occurred!
See also "/home/Sourcecode_LAMMPS_Allegro_cuda113_custompytorch1110_zeusgpu_20240725/build01/CMakeFiles/CMakeOutput.log".
I don't know what these errors mean. Does this mean my PyTorch 1.11.0 build is wrong?
The modules I loaded to compile PyTorch 1.11.0 and LAMMPS-Allegro in this virtual environment are: module load gcc/8.5.0-gcc-milan-a100 cuda11.3 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 git cmake python39
I didn't specify any CXX, C, MPI_CXX, or MPI_C compilers in the CMake settings for LAMMPS-Allegro, only for PyTorch, but PyTorch didn't use the MPI C++ and C compilers I set. Could this be related to the errors I see?
Thanks.
For running with LAMMPS, PyTorch should not interact with or need to know anything about MPI, so PyTorch can safely be built with -DUSE_DISTRIBUTED=OFF. If your simulation is hanging, you may want to try with Kokkos, which can sometimes make device assignment more reliable. We've also seen esoteric hang-ups related to modules on certain clusters.
As for your self-built PyTorch, you may need to specify an install prefix and run make install, then point -DCMAKE_PREFIX_PATH to that install folder, which will have the correct/expected directory structure, when configuring LAMMPS. But since you have CUDA 11.3 available, the prebuilt PyTorch 1.11 with the CXX11 ABI should work (link).
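To make the install-prefix suggestion concrete, here is a minimal sketch of a sanity check to run before configuring LAMMPS. The function name `torch_prefix_ok` and the placeholder paths in the comments are hypothetical. The file list reflects the layout of the prebuilt libtorch archives, which I'm assuming a `make install` of a source build reproduces. The errors above are consistent with pointing CMake at the raw build directory, where `Caffe2Config.cmake` sits in the build root without a `public/` folder next to it.

```shell
#!/bin/sh
# Typical sequence (not executed here; <...> are placeholders to fill in):
#   cmake -S <pytorch-src> -B <pytorch-src>/build \
#         -DCMAKE_INSTALL_PREFIX=<prefix> -DUSE_DISTRIBUTED=OFF
#   cmake --build <pytorch-src>/build --target install
#   # then, from the LAMMPS build directory:
#   cmake ../cmake -DCMAKE_PREFIX_PATH=<prefix> <your other LAMMPS options>

# Hypothetical helper: succeeds only if "$1" looks like an installed
# libtorch prefix rather than a raw build directory.
# Assumption: the layout matches the prebuilt libtorch archives.
torch_prefix_ok() {
    prefix="$1"
    for f in share/cmake/Torch/TorchConfig.cmake \
             share/cmake/Caffe2/Caffe2Config.cmake \
             share/cmake/Caffe2/public/utils.cmake \
             lib/libtorch.so; do
        if [ ! -f "$prefix/$f" ]; then
            echo "missing: $prefix/$f"
            return 1
        fi
    done
    echo "ok: $prefix looks like an installed libtorch prefix"
}
```

Calling `torch_prefix_ok <prefix>` reports the first missing file, if any. If it fails on the folder you pass to -DCMAKE_PREFIX_PATH, the install step probably did not run, or the path points at the build tree instead of the install tree.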
Hmm, I think I built LAMMPS-Allegro with the prebuilt libtorch and Kokkos, but maybe I messed something up. Let me try both suggestions from scratch again; I'll post the results after I build test executables. Thanks.