Using MPICH on Crusher@OLCF
Yanfei Guo edited this page Apr 1, 2023 · 5 revisions
This page describes how to build and use MPICH on the 'Crusher' machine at Oak Ridge. Crusher is an AMD CPU/GPU machine with a Slingshot interconnect. MPICH's 'Libfabric' device (ch4:ofi) works best here, and its performance is on par with Cray MPI on Crusher.
- MPICH dev
- Libfabric 1.15.0.0 (part of Cray PE)
- Cray PMI (part of Cray PE, only required if building with srun support)
MPICH needs the following tools to build on Crusher with GPU support. The versions shown are the Crusher defaults as of 09/27/2022.
- gcc (gcc/7.5.0)
- ROCm (rocm/5.1.0)
# Build with Cray PMI so that jobs can be launched with srun.
# $ROCM_PATH is set by the rocm module.
module load rocm
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
    --with-pmi=pmi2 --with-pmilib=oldcray --with-craypmi=/opt/cray/pe/pmi/default \
    --with-hip=$ROCM_PATH/hip
make -j 8
make install
# Build without Cray PMI; jobs are then launched with MPICH's mpiexec.
# $ROCM_PATH is set by the rocm module.
module load rocm
./configure --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.0.0 \
    --with-hip=$ROCM_PATH/hip
make -j 8
make install
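Both configure invocations rely on `$ROCM_PATH`, which is only defined after `module load rocm`. A guard along the following lines can stop the build early when the module is not loaded, rather than passing an empty HIP path to configure. This is a minimal sketch: the function name `hip_prefix` is hypothetical, and the check assumes the HIP installation lives in a `hip` subdirectory of `$ROCM_PATH`, as the configure lines above imply.

```shell
# Print the HIP prefix to pass to --with-hip, or fail with a message.
# hip_prefix is a hypothetical helper name, not part of MPICH or ROCm.
hip_prefix() {
    if [ -z "${ROCM_PATH:-}" ]; then
        echo "ROCM_PATH is not set; run 'module load rocm' first" >&2
        return 1
    fi
    if [ ! -d "$ROCM_PATH/hip" ]; then
        echo "no hip directory under $ROCM_PATH" >&2
        return 1
    fi
    printf '%s\n' "$ROCM_PATH/hip"
}
```

One way to use it is `HIP_DIR=$(hip_prefix) || exit 1` before running configure, followed by `--with-hip="$HIP_DIR"` on the configure line.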
A correctly configured MPICH build should print the following message in the configure output.
*****************************************************
***
*** device : ch4:ofi
*** shm feature : auto
*** gpu support : HIP
***
*****************************************************
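The summary above can also be checked automatically if the configure output is saved to a file, for example with `./configure ... 2>&1 | tee configure.log`. The sketch below assumes such a log exists; the function name `check_mpich_summary` and the log file name are hypothetical. It only looks for the `device` and `gpu support` lines shown above and tolerates any amount of spacing around the colon.

```shell
# Return 0 if a saved configure log reports the ch4:ofi device with HIP support.
# check_mpich_summary is a hypothetical helper name; $1 is the path to the log.
check_mpich_summary() {
    grep -Eq 'device[[:space:]]*:[[:space:]]*ch4:ofi' "$1" &&
        grep -Eq 'gpu support[[:space:]]*:[[:space:]]*HIP' "$1"
}
```

For example, `check_mpich_summary configure.log || echo "MPICH was not configured with ch4:ofi and HIP" >&2` flags a misconfigured build before `make` is started.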
module load rocm
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
# Launch two ranks each on a separate node and a separate GPU
srun -n2 --ntasks-per-node=1 --gpus-per-node=1 --gpu-bind=closest \
./test/mpi/pt2pt/pingping \
-type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4 -sendmem=device -recvmem=device
For more srun options, see the "Running Jobs" section of the Crusher User Guide.
module load rocm
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
mpiexec -np 2 -ppn 1 -gpus-per-proc=1 \
-genv MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0 \
./test/mpi/pt2pt/pingping \
-type=MPI_INT -sendcnt=512 -recvcnt=1024 -seed=78 -testsize=4 -sendmem=device -recvmem=device
- "key [-NONEXIST-KEY] was not found" message. Runs commonly print messages like the following. They are expected.
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_0]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_0]:PMI2_KVS_Get:_pmi2_kvs_get failed
Wed Sep 28 12:01:00 2022: [PE_1]:_pmi2_kvs_get:key [-NONEXIST-KEY] was not found.
Wed Sep 28 12:01:00 2022: [PE_1]:PMI2_KVS_Get:_pmi2_kvs_get failed
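If these expected messages clutter the job output, they can be filtered out after the fact. The following is a minimal sketch, not an MPICH or Cray PMI feature: the function name `filter_pmi_noise` is hypothetical, and the patterns are taken from the sample messages above. Note that it drops every `PMI2_KVS_Get:_pmi2_kvs_get failed` line, so a genuine lookup failure for some other key would lose its second line too.

```shell
# Drop the expected Cray PMI "key not found" lines and pass everything else through.
# filter_pmi_noise is a hypothetical helper name.
filter_pmi_noise() {
    # grep exits with 1 when every input line was filtered out; treat that as success.
    grep -v -e '_pmi2_kvs_get:key \[-NONEXIST-KEY\] was not found' \
            -e 'PMI2_KVS_Get:_pmi2_kvs_get failed' || [ "$?" -eq 1 ]
}
```

Assuming stdout and stderr are merged, the launch command can be piped through it, for example `srun ... 2>&1 | filter_pmi_noise`.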