LAMMPS-Allegro compile failed with pytorch 1.11.0 I build... #50

Open
turbosonics opened this issue Jul 26, 2024 · 3 comments

Comments

@turbosonics

turbosonics commented Jul 26, 2024

Hi,

On our cluster, the prebuilt libtorch 1.11.0 doesn't work properly with OpenMPI. I built LAMMPS-Allegro against the prebuilt libtorch 1.11.0, but when I submit a multi-GPU job, nothing is written to the output folder even though Slurm reports the simulation as running.

So I built PyTorch 1.11.0 from source with cmake, inside a virtual environment, using the following settings:

cmake \
-D BUILD_SHARED_LIBS:BOOL=ON -D CMAKE_BUILD_TYPE:STRING=Release -D BUILD_PYTHON:BOOL=OFF \
-D CMAKE_INSTALL_PREFIX=/home/Sourcecode_Pytorch1110 \
-D CMAKE_MPI_CXX_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicxx \
-D CMAKE_MPI_C_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicc \
-D PYTHON_LIBRARY='' -D USE_CUDA=ON -D BUILD_SHARED_LIBS=ON -D USE_DISTRIBUTED=ON ../ 2>&1| tee configure.log

Then I tried to configure LAMMPS-Allegro (with Kokkos and OpenMP) against that self-built PyTorch, from the same virtual environment. These are the cmake settings I used:

cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D PKG_OPENMP=ON \
-D PKG_KOKKOS=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_ZEN=ON \
-D CMAKE_PREFIX_PATH=/home/Sourcecode_Pytorch1110/build \
-D LD_LIBRARY_PATH=/home/Sourcecode_Pytorch1110/build/lib \
-D MKL_INCLUDE_DIR=`python -c "import sysconfig;from pathlib import Path;print(Path(sysconfig.get_paths()[\"include\"]).parent)"` \
../cmake 2>&1| tee configure.log

However, the configure step fails with the following errors:

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:14 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/utils.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:17 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/threads.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:88 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/cuda.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:109 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/mkl.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:112 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/mkldnn.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:116 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/Caffe2Targets.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:186 (set_target_properties):
  set_target_properties Can not find target to add properties to: torch
Call Stack (most recent call first):
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:191 (set_property):
  set_property could not find TARGET torch.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:1082 (find_package)

-- Found Torch: /home/Sourcecode_Pytorch1110/build/lib/libtorch.so
-- Configuring incomplete, errors occurred!
See also "/home/Sourcecode_LAMMPS_Allegro_cuda113_custompytorch1110_zeusgpu_20240725/build01/CMakeFiles/CMakeOutput.log".

I don't know what these errors mean. Do they mean my PyTorch 1.11.0 build went wrong?

The modules I loaded to compile PyTorch 1.11.0 and LAMMPS-Allegro in this virtual environment are:
module load gcc/8.5.0-gcc-milan-a100 cuda11.3 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 git cmake python39

I didn't set any CXX, C, MPI_CXX, or MPI_C compiler in the LAMMPS-Allegro cmake settings, only in the PyTorch ones, but PyTorch didn't use the MPI C++ and C compilers I set... Could this be related to the errors I see?

Thanks.

@anjohan
Collaborator

anjohan commented Jul 26, 2024

Hi,

For running with LAMMPS, PyTorch should not interact with or need to know anything about MPI, and PyTorch can safely be built with -DUSE_DISTRIBUTED=OFF. If your simulation is hanging, you may want to try with Kokkos - this can sometimes make device assignment more reliable. We've also seen esoteric hang-ups related to modules on certain clusters.
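
One way to confirm how an existing PyTorch build tree was configured is to read the option back from its CMakeCache.txt, where entries are stored as NAME:TYPE=VALUE. A minimal sketch, assuming the build directory from the original post; the helper name cmake_cache_value is hypothetical and not part of CMake or PyTorch:

```shell
# Print the value of a CMake cache entry (lines look like NAME:TYPE=VALUE).
# cmake_cache_value is a hypothetical helper for illustration only.
# $1 = path to CMakeCache.txt, $2 = variable name
cmake_cache_value() {
    # Match "NAME:" at the start of a line and print what follows the first "=".
    sed -n "s/^$2:[^=]*=//p" "$1"
}

# Example (path taken from the original post; adjust to your build tree):
#   cmake_cache_value /home/Sourcecode_Pytorch1110/build/CMakeCache.txt USE_DISTRIBUTED
```

If this prints ON, the tree was configured with distributed support; reconfiguring with -DUSE_DISTRIBUTED=OFF, as suggested above, should change the cached value.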

As for your self-built PyTorch, you may need to specify an install prefix and run make install, then point -DCMAKE_PREFIX_PATH to that install folder, which will have the correct/expected directory structure, when configuring LAMMPS. But since you have CUDA 11.3 available, the prebuilt PyTorch 1.11 with the CXX11 ABI should work (link).
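
The self-build route above amounts to installing PyTorch into its own prefix and pointing LAMMPS at that prefix rather than at the build tree. A rough sketch, assuming the usual libtorch layout in which the CMake package files sit under share/cmake/; the helper check_torch_prefix and the /path/to/... locations are placeholders made up for illustration:

```shell
# Report whether a prefix contains the CMake package files that
# find_package(Torch) looks for (the kind reported missing in the errors above).
# check_torch_prefix is a hypothetical helper; $1 = install prefix to inspect.
check_torch_prefix() {
    for f in share/cmake/Torch/TorchConfig.cmake \
             share/cmake/Caffe2/Caffe2Config.cmake \
             share/cmake/Caffe2/Caffe2Targets.cmake \
             share/cmake/Caffe2/public/utils.cmake; do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f"
            return 1
        fi
    done
    echo "ok: $1"
}

# Intended sequence (commented out so the snippet is safe to source):
#   cd /path/to/pytorch/build
#   cmake -D CMAKE_INSTALL_PREFIX=/path/to/torch-install -D USE_DISTRIBUTED=OFF ..
#   make install
#   check_torch_prefix /path/to/torch-install
#   cd /path/to/lammps/build
#   cmake -D CMAKE_PREFIX_PATH=/path/to/torch-install <other LAMMPS options> ../cmake
```

Running the check against the build directory instead of the install prefix should report a missing file, which matches the "include could not find requested file" errors in the configure log.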

@turbosonics
Author

Hmm, I think I built LAMMPS-Allegro with the prebuilt libtorch and Kokkos, but maybe I messed something up. Let me try both suggestions from scratch; I'll post the results after I build the test executables. Thanks.

@anjohan
Collaborator

anjohan commented Jul 26, 2024

Remember to also add the appropriate run-time command line flags. For two nodes with 4 GPUs each, it should be

mpirun/srun/etc /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
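
For the two-node, 4-GPU-per-node case, one rank per GPU gives 8 ranks in total, while -k on g 4 stays at the per-node GPU count. A small sketch that assembles such a command line; lmp_cmd is a hypothetical helper, -np assumes an mpirun-style launcher, and the one-rank-per-GPU mapping is an assumption rather than something stated in this thread:

```shell
# Print a launch command that uses one MPI rank per GPU.
# lmp_cmd is a hypothetical helper for illustration only.
# $1 = number of nodes, $2 = GPUs per node, $3 = path to lmp, $4 = input script
lmp_cmd() {
    ranks=$(( $1 * $2 ))
    echo "mpirun -np $ranks $3 -sf kk -k on g $2 -pk kokkos newton on neigh full -in $4"
}

lmp_cmd 2 4 /path/to/lmp in.script
# prints: mpirun -np 8 /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
```

With srun or another launcher, the rank count would go through that launcher's own options instead of -np.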
