Detect MVAPICH 3 #810

Merged: 4 commits into JuliaParallel:master from Keluaa:mvapich3_fix on Apr 23, 2024

Conversation

@Keluaa (Contributor) commented on Jan 30, 2024

I recently started working on a cluster that uses MVAPICH version 3.0rc, but MPIPreferences.jl couldn't detect it correctly because the version string format changed. Here is a sample of it, taken from the output of MPIPreferences.identify_abi("libmpi"):

MVAPICH Version        :	3.0rc
MVAPICH Release date   :	11/09/2023
MVAPICH ABI            :	13:12:1
MVAPICH Device         :	ch4:ucx
MVAPICH configure      :	--prefix=/packages/mvapich/mvapich-3.0rc-ucx CC=gcc CXX=g++ FC=gfortran F77=gfortran --with-device=ch4:ucx --with-ucx=/usr/local/packages/ucx/ucx-git-01192024 --disable-rdma-cm --disable-mcast --enable-fast=O3 --enable-debuginfo --enable-cxx --enable-mpit-pvars=all MPICHLIB_CFLAGS=-Werror=implicit-function-declaration -Werror=discarded-qualifiers -Werror=incompatible-pointer-types CFLAGS=-Wall -pipe -fPIC -fopenmp CXXFLAGS=-fPIC -fopenmp LDFLAGS=-fopenmp FFLAGS=-fPIC -fallow-argument-mismatch -fopenmp FCFLAGS=-fPIC -fallow-argument-mismatch -fopenmp
MVAPICH CC             :	gcc -Wall -pipe -fPIC -fopenmp -Werror=implicit-function-declaration -Werror=discarded-qualifiers -Werror=incompatible-pointer-types  -O3
MVAPICH CXX            :	g++ -fPIC -fopenmp  -O3
MVAPICH F77            :	gfortran -fPIC -fallow-argument-mismatch -fopenmp  -O3
MVAPICH FC             :	gfortran -fPIC -fallow-argument-mismatch -fopenmp  -O3
MPICH Custom Information:	@MVAPICH_CUSTOM_STRING@

The fix removes only a single character. Judging by the version regex, MVAPICH version 2 shouldn't be affected.
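
To illustrate the idea (a hypothetical sketch, not the actual MPIPreferences source): anchoring the detection on the substring "MVAPICH" instead of "MVAPICH2" accepts both banners, since "MVAPICH" is a substring of "MVAPICH2", while a version regex with an optional 2 keeps extracting the version number in both cases.

# Hypothetical illustration of the one-character relaxation; names are made up.
function detect_mvapich(version_string::AbstractString)
    occursin("MVAPICH", version_string) || return nothing  # previously "MVAPICH2"
    m = match(r"MVAPICH2? Version\s*:\s*(\S+)", version_string)
    return m === nothing ? nothing : ("MVAPICH", m.captures[1])
end

detect_mvapich("MVAPICH Version        :\t3.0rc")   # => ("MVAPICH", "3.0rc")
detect_mvapich("MVAPICH2 Version       :\t2.3.7")   # => ("MVAPICH", "2.3.7")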

@vchuravy (Member)

That looks correct to me. I just noticed that we don't have mvapich in the test matrix.
Would you be interested in adding that?

@vchuravy (Member)

Or a test/runtests.jl file in MPIPreferences that uses different observed MPI_VERSION_STRINGs to check that identify_implementation_version_abi returns the right things?
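
Something along these lines, for example (just a sketch: the exact signature and return value of identify_implementation_version_abi should be checked against the MPIPreferences source; here it is assumed to map a version string to an (implementation, version, abi) tuple, and the expected ABI for MVAPICH is assumed to be "MPICH"):

using Test
using MPIPreferences

# Sample MPI_Get_library_version strings and the expected (implementation, ABI).
samples = [
    "MVAPICH2 Version       :\t2.3.7\n" => ("MVAPICH", "MPICH"),
    "MVAPICH Version        :\t3.0rc\n" => ("MVAPICH", "MPICH"),
]

@testset "identify_implementation_version_abi" begin
    for (version_string, (impl, abi)) in samples
        got_impl, _, got_abi =
            MPIPreferences.identify_implementation_version_abi(version_string)
        @test got_impl == impl
        @test got_abi == abi
    end
end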

@giordano (Member)

I just noticed that we don't have mvapich in the test matrix.

We do:

test-spack-mvapich:
  timeout-minutes: 20
  strategy:
    matrix:
      julia_version:
        - "1"
    fail-fast: false
  runs-on: ubuntu-22.04
  container: ghcr.io/juliaparallel/github-actions-buildcache:mvapich2-2.3.7-1-hs7gkcclsnk55kqm52a4behdnt3dug6b.spack
  env:
    JULIA_MPI_TEST_BINARY: system
    JULIA_MPI_TEST_EXCLUDE: test_spawn.jl
    MV2_SMP_USE_CMA: 0
  steps:
    - name: Checkout
      uses: actions/checkout@v4
    - uses: julia-actions/setup-julia@v1
      with:
        version: ${{ matrix.julia_version }}
    - uses: julia-actions/cache@v1
    - name: add MPIPreferences
      shell: julia --color=yes --project=. {0}
      run: |
        using Pkg
        Pkg.develop(path="lib/MPIPreferences")
        Pkg.precompile()
    - name: use system MPI
      shell: julia --color=yes --project=. {0}
      run: |
        using MPIPreferences
        MPIPreferences.use_system_binary()
    - uses: julia-actions/julia-runtest@latest

But tests are currently broken because julia-actions/cache now requires jq to be available (see for example julia-actions/cache#105). I can look into it, but not right now.

@giordano (Member)

The problem with jq not being available in custom containers was fixed by #811.

We can also add a test with MVAPICH 3 in this PR, but we need to build it in https://github.com/JuliaParallel/github-actions-buildcache first. That will take a bit; I can try to do it later tonight.

@giordano (Member) commented on Feb 1, 2024

I built MVAPICH 3.0b with JuliaParallel/github-actions-buildcache@134f6ae, but tests are failing very early:

[][mvp_generate_implicit_cpu_mapping] WARNING: This configuration might lead to oversubscription of cores !!!
Warning! : Core id 33 does not exist on this architecture! 
CPU Affinity is undefined 
Abort(2141583) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(175)...........: 
MPID_Init(597)..................: 
MPIDI_MVP_mpi_init_hook(268)....: 
MPIDI_MVP_CH4_set_affinity(3745): 
smpi_setaffinity(2808)..........: CPU Affinity is undefined.
ERROR: failed process: Process(`mpiexec -n 4 /__t/julia/1.10.0/x64/bin/julia -C native -J/__t/julia/1.10.0/x64/lib/julia/sys.so --depwarn=yes --check-bounds=yes -g1 --code-coverage=@/__w/MPI.jl/MPI.jl --color=yes --startup-file=no --pkgimages=no --startup-file=no -q /__w/MPI.jl/MPI.jl/test/../docs/examples/01-hello.jl`, ProcessExited(143)) [143]

@giordano (Member)

I found https://github.com/ciemat-tic/codec/wiki/MPI-Libraries#errors, which suggests we may have to set an environment variable to work around that error. On that page they use

export MV2_ENABLE_AFFINITY=0

I presume for MVAPICH 3 it'll be slightly different; I need to find the official one.

@Keluaa (Contributor, Author) commented on Apr 22, 2024

On the cluster, the environment variable MVP_ENABLE_AFFINITY is set to 0; I guess that is what you need for MVAPICH 3. There are only two other variables defined in the mvapich modulefile, but they are unrelated to CPU affinity (MVP_USE_SHARED_MEM and MVP_USE_OSU_COLLECTIVES).
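
For a quick local check, something like the following should reproduce the workaround (a sketch: it assumes MVAPICH 3 reads the variable at initialization time, as MVAPICH2 does with MV2_ENABLE_AFFINITY; in CI it would rather be added to the job's env block next to MV2_SMP_USE_CMA):

# Sketch: disable MVAPICH 3 CPU affinity from within each rank before MPI.Init(),
# assuming the library reads the variable when it initializes.
ENV["MVP_ENABLE_AFFINITY"] = "0"

using MPI
MPI.Init()
println("rank $(MPI.Comm_rank(MPI.COMM_WORLD)) initialized without affinity errors")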

@giordano (Member)

Thank you! You saved me from digging into the MVAPICH source code: the only v3-specific documentation at the moment is the super slim Quick Start, which doesn't mention affinity at all; the rest refers to MVAPICH2. I don't know how users are supposed to find the necessary environment variable without reading the code.

@giordano (Member)

That did the trick, thank you so much! And all tests for MVAPICH 3 are passing! 🥳

Unfortunately there are errors in the mpitrampoline_jll jobs on macOS, unrelated to this PR:

dyld[2612]: Library not loaded: '@rpath/libhwloc.15.dylib'
  Referenced from: '/Users/runner/.julia/artifacts/01e9e6b7a0b08179eab75aab324f842218191aea/lib/mpich/bin/mpiexec.hydra'
  Reason: tried: '/Users/runner/.julia/artifacts/01e9e6b7a0b08179eab75aab324f842218191aea/lib/mpich/bin/../../libhwloc.15.dylib' (no such file), '/Users/runner/.julia/artifacts/01e9e6b7a0b08179eab75aab324f842218191aea/lib/mpich/bin/../../libhwloc.15.dylib' (no such file), '/usr/local/lib/libhwloc.15.dylib' (no such file), '/usr/lib/libhwloc.15.dylib' (no such file)

@eschnett would you please be able to have a look at that? Apparently the macOS build of MPICH inside mpitrampoline isn't functioning well.

@eschnett (Contributor)

@giordano I cannot reproduce the problem. I am on Julia 1.11-beta1, using MPIPreferences to choose MPItrampoline_jll. All is fine:

julia> using MPIPreferences

julia> MPIPreferences.abi
"MPItrampoline"

julia> using MPI

julia> MPI.Init()
MPI.ThreadLevel(2)

julia> MPI.Get_library_version()
"MPIwrapper 2.10.4, using MPIABI 2.9.0, wrapping:\nMPICH Version:      4.2.0\nMPICH Release date: Fri Feb  9 12:29:21 CST 2024\nMPICH ABI:          16:0:4\nMPICH Device:       ch3:nemesis\nMPICH configure:    --build=x86_64-linux-musl --host=x86_64-apple-darwin14 --disable-dependency-tracking --docdir=/tmp --mandir=/tmp --enable-shared=no --enable-static=yes --enable-threads=multiple --enable-opencl=no --with-device=ch3 --with-hwloc=/workspace/destdir --prefix=/workspace/destdir/lib/mpich --enable-two-level-namespace\nMPICH CC:           cc -fPIC -DPIC  -fno-common  -O2\nMPICH CXX:          c++ -fPIC -DPIC  -O2\nMPICH F77:          gfortran -fPIC -DPIC  -O2\nMPICH FC:           gfortran -fPIC -DPIC  -O2\nMPICH features:     \n"

I am not setting any environment variables. This is an Intel Mac:

julia> versioninfo()
Julia Version 1.11.0-beta1
Commit 08e1fc0abb9 (2024-04-10 08:40 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 16 × Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

@giordano (Member)

@eschnett the problem is in running the mpiexecjl tests. I wonder if we're running into macOS SIP madness.

@eschnett (Contributor)

It seems the problem is that MPICH's mpiexec is a binary (here /Users/eschnett/.julia/artifacts/fd0e01a2b2f7482859cc6dc50091c6637a483184/lib/mpich/bin/mpiexec.hydra), and it can't find its libhwloc dependency.

$ otool -L /Users/eschnett/.julia/artifacts/fd0e01a2b2f7482859cc6dc50091c6637a483184/lib/mpich/bin/mpiexec.hydra
/Users/eschnett/.julia/artifacts/fd0e01a2b2f7482859cc6dc50091c6637a483184/lib/mpich/bin/mpiexec.hydra:
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.0.0)
	@rpath/libhwloc.15.dylib (compatibility version 23.0.0, current version 23.0.0)

I don't think Julia's artifacts support calling their binaries "as is": nothing would ensure that Hwloc_jll is available, and nothing would point this binary to the respective libhwloc.so library. Is there a standard way to handle this?

In the meantime I'll try to generate a self-contained MPICH build that doesn't look for an external hwloc. Maybe static linking will work for this binary.

@giordano (Member)

I don't think Julia's artifacts support calling their binaries "as is": nothing would ensure that Hwloc_jll is available, and nothing would point this binary to the respective libhwloc.so library. Is there a standard way to handle this?

We can when using the executable wrappers in JLLs: https://docs.binarybuilder.org/stable/jll/#ExecutableProduct. But I'm not sure we can wire that up in mpiexecjl/mpitrampoline?
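
For reference, this is roughly how those wrappers are meant to be used (a sketch, assuming mpiexec is exposed as an ExecutableProduct by MPICH_jll): the returned Cmd carries the environment the dynamic loader needs to resolve libraries such as libhwloc, unlike calling mpiexec.hydra from the artifact directory directly.

# Sketch of invoking a JLL ExecutableProduct wrapper rather than the raw binary.
using MPICH_jll

# mpiexec() returns a Cmd with the JLL's library paths set up, so mpiexec.hydra
# can find its dylib dependencies (e.g. libhwloc from Hwloc_jll).
run(`$(MPICH_jll.mpiexec()) -n 2 hostname`)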

@eschnett (Contributor)

See JuliaPackaging/Yggdrasil#8515.

@giordano (Member)

It's finally all green! Thanks everyone for the help and the patience!

@giordano merged commit 77b935c into JuliaParallel:master on Apr 23, 2024.
49 of 50 checks passed.
@Keluaa deleted the mvapich3_fix branch on April 24, 2024 at 07:41.