Address Registration Error in CUDA Aware MPICH 4.2.2 + UCX 1.17.0 Application #10085
Comments
Yes, it's enabled. I tried using the
To provide more context, I'm testing the implementation of an MPI plugin for LLVM/offload (the first PR of the plugin is available here). The goal is to offload OpenMP tasks to remote devices. The application I'm testing is XSBench, compiled with clang + the MPI plugin; in the test, XSBench acts as the client that offloads tasks to the remote server.

1 and 2: The example I'm running does not use CUDA Aware MPI; all the buffers being sent are in host memory. I'm using
While debugging the application, I noticed that the error occurs on the first

3: I uploaded the logs obtained from UCX here. Here are the last lines of the logs:
@cl3to can you please share the application code, or at least the part that allocates the buffer?
I’ve found the root cause of the issue. The address causing the failure in the UCX
For context, my application sends the `__tgt_device_image` structure from LLVM/OpenMP Offloading to the server process so it can run kernels from the image. The issue arises because the section holding this
I worked around this by copying the
However, I noticed that when the
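For illustration, a workaround of that shape might look roughly like the sketch below. It is a minimal sketch only: it assumes the fix was to stage the image bytes in a heap buffer before sending, and the simplified struct and the `send_device_image` helper are hypothetical, not the plugin's actual code.

```c
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Simplified view of the LLVM/OpenMP offload image descriptor: the image
 * bytes live between ImageStart and ImageEnd, typically inside a read-only,
 * file-backed section of the executable. */
typedef struct {
  void *ImageStart;
  void *ImageEnd;
} DeviceImage;

/* Hypothetical helper: copy the image into heap memory before handing it to
 * MPI, so the buffer UCX has to register is an ordinary anonymous mapping
 * rather than the executable's own section. */
static int send_device_image(const DeviceImage *Img, int Dst, MPI_Comm Comm) {
  size_t Size = (size_t)((char *)Img->ImageEnd - (char *)Img->ImageStart);
  void *Staging = malloc(Size);
  if (!Staging)
    return MPI_ERR_NO_MEM;
  memcpy(Staging, Img->ImageStart, Size);

  int RC = MPI_Send(Staging, (int)Size, MPI_BYTE, Dst, /*tag=*/0, Comm);
  free(Staging);
  return RC;
}
```

Sending from a freshly malloc'd buffer trades one extra copy for a registration target that is plain heap memory instead of the binary's own mapping.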
@raffenet WDYT about MPICH MPI_Testall handling (see above)?
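For anyone following along, the pattern presumably under discussion is the usual nonblocking-send-plus-`MPI_Testall` progress loop; a generic sketch (not the plugin's or MPICH's actual code) looks like this:

```c
#include <stddef.h>
#include <mpi.h>

/* Generic sketch: post several nonblocking sends and poll them with
 * MPI_Testall until all of them complete. MPICH progresses the underlying
 * UCX requests while these test calls run, which is where a memory
 * registration failure would surface. */
static void send_all(void *bufs[], const int counts[], size_t n,
                     int dst, MPI_Comm comm) {
  MPI_Request reqs[16]; /* assume n <= 16 for the sketch */
  for (size_t i = 0; i < n; ++i)
    MPI_Isend(bufs[i], counts[i], MPI_BYTE, dst, /*tag=*/(int)i, comm,
              &reqs[i]);

  int done = 0;
  while (!done)
    MPI_Testall((int)n, reqs, &done, MPI_STATUSES_IGNORE);
}
```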
Hi @cl3to, could you please provide the output when running the application with

Thanks
I'm running an application on a cluster that uses CUDA Aware MPICH (v4.2.2) and UCX (v1.17.0). My application consists of two binaries, a server and a client, so I use the MPMD mode of `mpirun` to execute it: `mpirun -np 1 server : -np 1 client`. The problem is that when I try to run the application, either intra-node or inter-node, I get the following error and the application hangs:

After some research, I found that setting the environment variable `UCX_RCACHE_ENABLE=n` allows my application to run without errors. However, the application’s runtime performance is not as expected. Profiling the application revealed that most of the time is spent on data transfer between the nodes. When running the OSU 7.4 benchmark, I observed that the bandwidth between nodes over InfiniBand is approximately 5.75 times lower when I set `UCX_RCACHE_ENABLE=n`.

Any suggestions on why the application might be failing to register addresses?
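For reference, a stripped-down skeleton of the setup described above would look roughly like the following. This is hypothetical code, not the actual server/client binaries: under MPMD both programs share MPI_COMM_WORLD (server = rank 0, client = rank 1), so the sketch folds the two roles into one file and branches on the rank.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical minimal skeleton of the server/client exchange. The real
 * application uses two separate binaries launched as
 *   mpirun -np 1 server : -np 1 client
 * which places them in one MPI_COMM_WORLD. */
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int count = 1 << 20;   /* 1 MiB payload */
  char *buf = malloc(count);   /* plain host memory, no CUDA buffers */

  if (rank == 1) {             /* "client": offloads work to the server */
    MPI_Send(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
  } else {                     /* "server": receives the offloaded task */
    MPI_Recv(buf, count, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("server received %d bytes\n", count);
  }

  free(buf);
  MPI_Finalize();
  return 0;
}
```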
Setup and versions
OS version:
- `cat /etc/redhat-release`: Red Hat Enterprise Linux Server release 7.9 (Maipo)
- `uname -r`: 3.10.0-1160.49.1.el7.x86_64

RDMA/IB version:
- `rpm -q libibverbs`: libibverbs-54mlnx1-1.54310.x86_64
- `rpm -q rdma-core`: rdma-core-devel-54mlnx1-1.54310.x86_64

IB HW:
- `ibstat`:

CUDA 12.0:
- Each node has four 32GB V100 GPUs
- cuda libraries: cuda-toolkit-12-0-12.0.0-1.x86_64
- cuda drivers: cuda-driver-devel-12-0-12.0.107-1.x86_64
- `lsmod | grep nv_peer_mem`:
- `lsmod | grep gdrdrv`:
- `ucx_info -v`: