Problems when using Allreduce on GPUs #823
It seems like Question 2 is a problem with our machine; I'll set that question aside for now.
It could be any number of things, most likely unrelated to Julia or MPI.jl. One cause I ran into recently: if you're using Slurm with GPU task binding, it can prevent CUDA IPC from working correctly, which then causes MPI to fall back on host buffering. See https://bugs.schedmd.com/show_bug.cgi?id=17875. The solution there was to disable GPU binding.

If that isn't the cause, then I suggest asking your system admins. Each HPC system is frustrating and slightly broken in its own way. Sometimes it helps to have a simple reproducer in C to show them.
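As a quick sanity check before chasing scheduler issues, a minimal sketch like the following can confirm on each rank that the MPI build reports CUDA support and which GPU the rank ended up on. The round-robin device assignment is an assumption for illustration, not part of the original report:

```julia
# Hypothetical diagnostic sketch: print per-rank CUDA/MPI information.
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Pick a GPU per rank round-robin (assumption: ranks share a node with >=1 visible GPU each).
CUDA.device!(rank % length(CUDA.devices()))

# MPI.has_cuda() reports whether the underlying MPI library advertises CUDA support.
println("rank $rank: MPI CUDA-aware = $(MPI.has_cuda()), device = $(CUDA.device())")

MPI.Finalize()
```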
If you are the sort of persistent person who likes to debug these things yourself, turning up the logging for OpenMPI + UCX via their environment variables can help. It gives a lot of information, but sometimes there is a clue lurking in there.
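A minimal sketch of setting such a variable from Julia before MPI initializes; `UCX_LOG_LEVEL` is a standard UCX variable, but which variables are most useful on a given system is an assumption here, and in practice they are usually exported in the job script instead:

```julia
# Hypothetical debugging sketch: raise UCX log verbosity before MPI/UCX initialize.
using MPI

# UCX reads its environment at initialization time, so set this before MPI.Init().
# "info" and "debug" are progressively more verbose levels.
ENV["UCX_LOG_LEVEL"] = "info"

MPI.Init()
# ... run the Allreduce test here and inspect the UCX log output ...
MPI.Finalize()
```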
Thank you 😊, I will add more information if I find the reason later.
Hi 👋, I use this case to test `MPI.Allreduce!()` on GPU:

Question 1:

I launched 2 processes to run this file, and it works. However, I found (via NVTX.jl) that during the reduce and broadcast the data is first downloaded from GPU to CPU and then uploaded back to the GPU. Since I'm using CUDA-aware MPI, which supports `Allreduce`, I want to know why it is not using peer-to-peer transfers between the two GPUs?

Question 2:

Since the CUDA version of the OpenMPI build [11.2] is mismatched with the CUDA version [12.1] I use in Julia, I changed to another MPI module, `openmpi/gcc83-ucx115-c121`,
with the same code, and then I got this error:

Error message

By the way, it works well on CPU.
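For reference, a minimal sketch of the kind of GPU `Allreduce!` test described above; the actual script attached to the issue is not shown here, so the buffer size, element type, and reduction operator are assumptions:

```julia
# Hypothetical reproducer: in-place Allreduce! on a GPU buffer with CUDA-aware MPI.
# Run with e.g. 2 ranks: mpiexec -n 2 julia --project allreduce_gpu.jl
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Assign one GPU per rank (assumption: ranks on the same node, enough visible GPUs).
CUDA.device!(rank % length(CUDA.devices()))

# Device buffer passed directly to MPI; with CUDA-aware MPI no manual host copy is needed.
buf = CUDA.fill(Float64(rank + 1), 1024)

# Sum-reduce across all ranks; the result is written back into buf on every rank.
MPI.Allreduce!(buf, +, comm)

println("rank $rank: first element = $(Array(buf)[1])")

MPI.Finalize()
```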