Problems when using Allreduce on GPUs #823

Closed
ZenanH opened this issue Feb 29, 2024 · 4 comments

ZenanH commented Feb 29, 2024

Hi 👋, I'm using this test case to try MPI.Allreduce!() on GPUs:

# t10.jl
using MPI
using CUDA

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    CUDA.device!(rank)

    data1 = cu([1, 2, 3, 4])
    data2 = cu([1 2; 4 5])

    CUDA.synchronize()
    MPI.Barrier(comm)
    
    MPI.Allreduce!(data1, MPI.SUM, comm)
    MPI.Allreduce!(data2, MPI.SUM, comm)

    CUDA.synchronize()
    MPI.Barrier(comm)
end

main()

Question 1:

I launched 2 processes to run this file, and it works. However, profiling with NVTX.jl shows that during the reduce and broadcast steps the data is first copied from the GPU to the CPU and then copied back to the GPU. Since I'm using CUDA-aware MPI, which supports Allreduce, why isn't it using P2P between the two GPUs?

mpirun -n 2 julia -O3 --color=yes t10.jl

# modules:
# 1) openmpi/gcc83-316-c112   2) cuda/12.1
# Sets the environment for OpenMPI 3.1.6 with CUDA 11.2
#    built with the GNU Compilers Suite 8.3.1 under MOFED 4.9-3
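
One thing that can be ruled out early is whether the MPI library that MPI.jl actually loaded reports CUDA support at all. The following is a minimal sketch, assuming Open MPI (the only implementation for which `MPI.has_cuda()` can query this); the file name `check_cuda.jl` is just a placeholder:

```julia
# check_cuda.jl (hypothetical file name)
using MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

if rank == 0
    # Version string of the MPI library that MPI.jl loaded;
    # it should match the module you expect to be using.
    println(MPI.Get_library_version())
    # true only if the loaded Open MPI build reports CUDA support
    println("CUDA-aware: ", MPI.has_cuda())
end
```

A `true` here only says that the build accepts device buffers. It does not by itself show which transport is chosen at run time, so host staging can still happen.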

Question 2:

Since the CUDA version that OpenMPI was built with (11.2) does not match the CUDA version I use in Julia (12.1), I switched to another MPI module, openmpi/gcc83-ucx115-c121, and ran the same code. I then got this error:

Error message
[1709244556.238623] [node40:1054569:0]          parser.c:2040 UCX  WARN  unused environment variables: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?); UCX_ERROR_SIGNALS (maybe: UCX_ERROR_SIGNALS?)
[1709244556.238623] [node40:1054569:0]          parser.c:2040 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1709244556.238623] [node40:1054568:0]          parser.c:2040 UCX  WARN  unused environment variables: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?); UCX_ERROR_SIGNALS (maybe: UCX_ERROR_SIGNALS?)
[1709244556.238623] [node40:1054568:0]          parser.c:2040 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[node40.octopoda:1054568][:33:hmca_rcache_ucs_query]  UCS version mismatch. Libhcoll binary was compiled with UCS 1.8 while the runtime version of UCS is 1.15. UCS Rcache framework will be disabled. Performance of ZCOPY BCAST algorithm may be degraded. Add -x HCOLL_RCACHE=^ucs in order to suppress this message.

[node40:1054569:0:1054569] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1510f3a00000)
[node40:1054568:0:1054568] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14864da00000)
==== backtrace (tid:1054568) ====
 0  /soft/ucx/115-cuda121/lib/libucs.so.0(ucs_handle_error+0x294) [0x14872b896224]
 1  /soft/ucx/115-cuda121/lib/libucs.so.0(+0x2d3dc) [0x14872b8963dc]
 2  /soft/ucx/115-cuda121/lib/libucs.so.0(+0x2d688) [0x14872b896688]
 3  /lib64/libc.so.6(+0x15dbb7) [0x148864541bb7]
 4  /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_allreduce+0x59a1) [0x14872d4e8c11]
 5  /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so(mca_coll_hcoll_allreduce+0xf4) [0x14872d90a724]
 6  /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so(PMPI_Allreduce+0xea) [0x14872d8725da]
==== backtrace (tid:1054569) ====
 0  /soft/ucx/115-cuda121/lib/libucs.so.0(ucs_handle_error+0x294) [0x15124eaf5224]
 1  /soft/ucx/115-cuda121/lib/libucs.so.0(+0x2d3dc) [0x15124eaf53dc]
 2  /soft/ucx/115-cuda121/lib/libucs.so.0(+0x2d688) [0x15124eaf5688]
 3  /lib64/libc.so.6(+0x15dbb7) [0x15130b92fbb7]
 4  /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_allreduce+0x59a1) [0x1512547f0c11]
 5  /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so(mca_coll_hcoll_allreduce+0xf4) [0x151254c12724]
 6  /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so(PMPI_Allreduce+0xea) [0x151254b7a5da]
 7  [0x1512f4bdb65b]
 8  [0x1512f4bdc4f3]
 9  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x15130a9d8a0e]
10  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x64835) [0x15130a9f6835]
11  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x64345) [0x15130a9f6345]
12  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x65084) [0x15130a9f7084]
13  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x65e5e) [0x15130a9f7e5e]
14  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x81d15) [0x15130aa13d15]
15  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x824ca) [0x15130aa144ca]
16  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_toplevel_eval_in+0x8c) [0x15130aa158dc]
17  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x6d1292) [0x1512f53a1292]
18  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x15130a9d8a0e]
19  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1d0f07b) [0x1512f69df07b]
20  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x14359ef) [0x1512f61059ef]
21  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1435a3b) [0x1512f6105a3b]
22  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x15130a9d8a0e]
23  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0xedcbb4) [0x1512f5bacbb4]
24  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1686a7e) [0x1512f6356a7e]
25  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x776b54) [0x1512f5446b54]
26  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x15130a9d8a0e]
27  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0xb1866) [0x15130aa43866]
28  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(jl_repl_entrypoint+0x8f) [0x15130aa442ef]
29  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/julia(main+0x9) [0x401089]
30  /lib64/libc.so.6(__libc_start_main+0xf3) [0x15130b7f56a3]
31  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/julia() [0x4010b9]
=================================
 7  [0x14884d7ed684]
 8  [0x14884d7ee543]
 9  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x1488635eaa0e]
10  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x64835) [0x148863608835]
11  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x64345) [0x148863608345]
12  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x65084) [0x148863609084]
13  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x65e5e) [0x148863609e5e]
14  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x81d15) [0x148863625d15]
15  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0x824ca) [0x1488636264ca]
16  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_toplevel_eval_in+0x8c) [0x1488636278dc]
17  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x6d1292) [0x14884dfb3292]
18  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x1488635eaa0e]
19  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1d0f07b) [0x14884f5f107b]
20  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x14359ef) [0x14884ed179ef]
21  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1435a3b) [0x14884ed17a3b]
22  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x1488635eaa0e]
23  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0xedcbb4) [0x14884e7bebb4]
24  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x1686a7e) [0x14884ef68a7e]
25  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so(+0x776b54) [0x14884e058b54]
26  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(ijl_apply_generic+0x2ae) [0x1488635eaa0e]
27  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(+0xb1866) [0x148863655866]
28  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10(jl_repl_entrypoint+0x8f) [0x1488636562ef]
29  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/julia(main+0x9) [0x401089]
30  /lib64/libc.so.6(__libc_start_main+0xf3) [0x1488644076a3]
31  /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/bin/julia() [0x4010b9]
=================================

[1054568] signal (11.-6): Segmentation fault

[1054569] signal (11.-6): Segmentation fault
in expression starting at /home/zhuo/Workbench/t10.jl:31
in expression starting at /home/zhuo/Workbench/t10.jl:31
__memcpy_avx_unaligned_erms at /lib64/libc.so.6 (unknown line)
__memcpy_avx_unaligned_erms at /lib64/libc.so.6 (unknown line)
hmca_coll_ml_allreduce at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)
hmca_coll_ml_allreduce at /opt/mellanox/hcoll/lib/libhcoll.so.1 (unknown line)
mca_coll_hcoll_allreduce at /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so (unknown line)
mca_coll_hcoll_allreduce at /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so (unknown line)
PMPI_Allreduce at /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so (unknown line)
PMPI_Allreduce at /soft/openmpi/gcc83/502_ucx115_c121/lib/libmpi.so (unknown line)
MPI_Allreduce at /home/zhuo/.julia/packages/MPI/z2owj/src/api/generated_api.jl:288 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:745 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:750 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:754 [inlined]
main at /home/zhuo/Workbench/t10.jl:17
MPI_Allreduce at /home/zhuo/.julia/packages/MPI/z2owj/src/api/generated_api.jl:288 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:745 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:750 [inlined]
Allreduce! at /home/zhuo/.julia/packages/MPI/z2owj/src/collective.jl:754 [inlined]
main at /home/zhuo/Workbench/t10.jl:17
unknown function (ip: 0x1512f4bdc4f2)
unknown function (ip: 0x14884d7ee542)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:126
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_value at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:617
eval_stmt_value at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_interpret_toplevel_thunk at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:877
jl_toplevel_eval_flex at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:985
ijl_toplevel_eval_in at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
_include at ./loading.jl:2136
_include at ./loading.jl:2136
include at ./Base.jl:495
include at ./Base.jl:495
jfptr_include_46380.1 at /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jfptr_include_46380.1 at /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
exec_options at ./client.jl:318
exec_options at ./client.jl:318
_start at ./client.jl:552
_start at ./client.jl:552
jfptr__start_82662.1 at /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jfptr__start_82662.1 at /home/zhuo/.julia/juliaup/julia-1.10.1+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
_jl_invoke at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_apply at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:731
jl_repl_entrypoint at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
main at /cache/build/default-maughin-0/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 2599354 (Pool: 2597514; Big: 1840); GC: 3
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 2599357 (Pool: 2597517; Big: 1840); GC: 3
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 1054569 on node node40 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

By the way, it works well on CPU.

mpirun -n 2 julia -O3 --color=yes t10.jl

# modules:
# 1) openmpi/gcc83-ucx115-c121   2) cuda/12.1
# Sets the environment for OpenMPI 5.0.2 with CUDA 12.1
#          built with the GNU Compilers Suite 8.3.1
#                   with UCX 1.15 support

ZenanH commented Mar 1, 2024

It seems that Question 2 is a problem with our machine, so I'll put it aside for now.
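
For anyone who lands here with the same crash: both backtraces above fail inside `hmca_coll_ml_allreduce` in `libhcoll.so`, and the log reports that libhcoll was compiled with UCS 1.8 while UCS 1.15 is loaded at run time. One experiment that may be worth trying (an untested guess, not something confirmed in this thread) is to turn off Open MPI's hcoll collective component so that Allreduce takes a different code path. This is a minimal sketch, assuming the `coll_hcoll_enable` MCA parameter is available in this Open MPI build and that setting it through `ENV` before `MPI.Init()` is early enough for it to be picked up:

```julia
using MPI
using CUDA

# Assumption: Open MPI reads MCA parameters from OMPI_MCA_* environment
# variables during MPI_Init, so this has to be set before MPI.Init().
ENV["OMPI_MCA_coll_hcoll_enable"] = "0"

MPI.Init()
comm = MPI.COMM_WORLD
CUDA.device!(MPI.Comm_rank(comm))

data = cu([1, 2, 3, 4])
CUDA.synchronize()

# Same in-place reduction as in t10.jl, now with the hcoll component disabled
MPI.Allreduce!(data, MPI.SUM, comm)
println(Array(data))
```

The same variable can instead be exported in the shell before calling mpirun. If the segfault disappears, that points at the hcoll/UCX installation rather than at MPI.jl.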

simonbyrne commented

> Since I'm using CUDA-aware MPI, which supports Allreduce, why isn't it using P2P between the two GPUs?

It could be any number of things, most likely unrelated to Julia or MPI.jl. One cause I ran into recently: if you're using Slurm with GPU task binding, it can prevent CUDA IPC from working correctly, which then causes MPI to fall back on host buffering. See https://bugs.schedmd.com/show_bug.cgi?id=17875. The solution was to disable GPU binding (--gpu-bind=none) and allocate the GPUs to tasks manually.
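
Allocating the GPUs manually can be done inside the Julia script itself. The following is a minimal sketch, assuming every task on a node can see all of that node's GPUs (which is what disabling GPU binding gives you); `pick_device` is a hypothetical helper, not part of MPI.jl or CUDA.jl:

```julia
using MPI
using CUDA

# Hypothetical helper: map a node-local rank onto one of the visible GPUs.
pick_device(local_rank::Integer, ndevices::Integer) = local_rank % ndevices

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)

    # Ranks that share a node end up in the same sub-communicator,
    # so the rank within it is the node-local rank.
    local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
    local_rank = MPI.Comm_rank(local_comm)

    # Each rank selects its own device instead of relying on the scheduler.
    dev = pick_device(local_rank, length(CUDA.devices()))
    CUDA.device!(dev)

    println("rank $rank (local $local_rank) -> GPU $dev")
end

main()
```

Unlike `CUDA.device!(rank)`, using the node-local rank also works when the job spans several nodes, because the global rank can exceed the number of GPUs on a node.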

If that isn't the cause, then I suggest asking your system admins. Each HPC system is frustrating and slightly broken in its own way. Sometimes it helps to have a simple reproducer in C to show them.


simonbyrne commented Mar 2, 2024

If you are the sort of persistent person who likes to debug these things yourself, some useful environment variables for OpenMPI + UCX are:

export OMPI_MCA_btl_base_verbose=10
export UCX_LOG_LEVEL=info  # can increase to debug or trace
export UCX_PROTO_ENABLE=y
export UCX_PROTO_INFO=y

It can give a lot of information, but sometimes there is a clue lurking in there.


ZenanH commented Mar 2, 2024

Thank you 😊, I'll add more information if I find the cause later.

ZenanH closed this as completed Mar 2, 2024