P4est MPI simulation gets stuck when not all ranks have BC faces #1878
Comments
I share your conclusion. It seems to be a good explanation for the blocking, but it does not explain why it works with the out-of-the-box MPI. Maybe you can verify (using a "wololo" statement or just plain output to stderr) that indeed the [...]
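As an aside, a rank-tagged print to stderr for such a check could look like the following minimal MPI.jl sketch (the variable n_boundaries is only a hypothetical placeholder for whatever quantity is being inspected):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Hypothetical quantity to inspect on each rank (placeholder).
n_boundaries = 0

# Rank-tagged output to stderr so the lines of different ranks can be told apart.
println(stderr, "rank $rank: n_boundaries = $n_boundaries")
```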
I double-checked this: [...]
This becomes more and more baffling... Alas, the joys of MPI-based parallelism and debugging 😱 What happens if you put an MPI.Barrier there? I do not see how a simulation might continue if a collective communication is not issued from all ranks, unless it is mixed with later collective calls that somehow "match" with one MPI implementation and do not match with others. Incidentally, are the system MPI and the Julia MPI the same MPI implementation? If not, can you check what happens if you use the same one? That is, maybe it's not a Julia thing but really an MPI thing...
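As a minimal illustration of this point (a standalone MPI.jl sketch, not Trixi.jl code; the rank-dependent condition is made up), the following program is erroneous MPI because one rank skips a collective call based on a purely local condition; with most implementations, the participating ranks block inside the Allreduce:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Purely local condition: pretend only some ranks "have boundary faces".
has_boundaries = rank != 1

if has_boundaries
    # Collective call: every rank of comm must reach this line.
    # If any rank skips it, the participating ranks typically block here.
    total = MPI.Allreduce(1, +, comm)
    println(stderr, "rank $rank: Allreduce returned $total")
else
    println(stderr, "rank $rank: skipped the Allreduce and moved on")
end

# A later collective may then "match" the pending Allreduce with one
# implementation and not with another, which is exactly the murky
# territory described above.
MPI.Barrier(comm)
```

Run on at least 2 ranks, e.g. via mpiexecjl -n 3.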
Thanks for the suggestions! Here is a small update: When I put [...]. When I put the barriers within [...]. I also tried to change the MPI implementation: MPItrampoline_jll showed the same behavior. Unfortunately, I could not get OpenMPI_jll to work so far; it led to errors while precompiling P4est.jll and T8code.jll.
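For switching between the system MPI and the Julia-provided binaries to compare, the usual route is MPIPreferences (the package names below are just the ones mentioned in this thread); after changing the preference, Julia needs to be restarted and the MPI-dependent packages re-precompiled:

```julia
using MPIPreferences

# Use the MPI library installed on the system ("system MPI" above):
MPIPreferences.use_system_binary()

# ...or switch to one of the Julia-provided binaries, e.g.:
# MPIPreferences.use_jll_binary("MPItrampoline_jll")
# MPIPreferences.use_jll_binary("OpenMPI_jll")
```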
I first encountered this issue for the baroclinic instability test case. Then I found a reduced example based on elixir_advection_basic but with nonperiodic boundary conditions (see the MWE below).
Running this with system MPI and on 3 ranks makes the simulation hang. Running with tmpi and looking at the backtraces when aborting shows that two ranks have called init_boundaries and eventually p4est_iterate, while the third (the middle one) is already somewhere in rhs!.

It seems to be caused by the check in Trixi.jl/src/solvers/dgsem_p4est/containers.jl, lines 266 to 268 (at commit 2dfde7f).
Consequently, only two ranks eventually call p4est_iterate. On the p4est side, p4est_iterate first calls p4est_is_valid, which calls MPI_Allreduce. This would explain the blocking. What I do not understand is why it works when not using system MPI.
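Purely as an illustration of the pattern (this is neither a quote of the Trixi.jl code nor a proposed patch), a rank-local guard in front of a collective call can be made globally consistent by agreeing on the condition first, so that either all ranks or no rank enters the branch containing the collective:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Rank-local condition, e.g. whether this rank owns any boundary faces (placeholder).
local_has_boundaries = rank != 1

# Problematic pattern: `if local_has_boundaries ... MPI_Allreduce ... end`
# diverges across ranks. Globally consistent alternative: agree on the condition first.
any_has_boundaries = MPI.Allreduce(Int(local_has_boundaries), +, comm) > 0

if any_has_boundaries
    # All ranks reach this branch together, so a collective call in here is safe.
    println(stderr, "rank $rank: entering the collective code path")
end
```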
MWE
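As a rough sketch only (not the original MWE; resolution, CFL number, and time span are assumptions), a reduced elixir of the kind described above could look like:

```julia
# Sketch of a reduced setup: advection on a nonperiodic P4estMesh with Dirichlet
# boundary conditions (not the original MWE; all concrete numbers are assumptions).
using OrdinaryDiffEq
using Trixi

advection_velocity = (0.2, -0.7)
equations = LinearScalarAdvectionEquation2D(advection_velocity)

solver = DGSEM(polydeg = 3, surface_flux = flux_lax_friedrichs)

coordinates_min = (-1.0, -1.0)
coordinates_max = (1.0, 1.0)

# Fully nonperiodic mesh; depending on the partitioning, some ranks may end up
# without any boundary faces, which is the situation described in this issue.
trees_per_dimension = (8, 8)
mesh = P4estMesh(trees_per_dimension,
                 polydeg = 3,
                 coordinates_min = coordinates_min,
                 coordinates_max = coordinates_max,
                 initial_refinement_level = 1,
                 periodicity = false)

boundary_condition = BoundaryConditionDirichlet(initial_condition_convergence_test)
boundary_conditions = Dict(:x_neg => boundary_condition,
                           :x_pos => boundary_condition,
                           :y_neg => boundary_condition,
                           :y_pos => boundary_condition)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition_convergence_test,
                                    solver, boundary_conditions = boundary_conditions)

ode = semidiscretize(semi, (0.0, 1.0))

summary_callback = SummaryCallback()
stepsize_callback = StepsizeCallback(cfl = 1.4)
callbacks = CallbackSet(summary_callback, stepsize_callback)

sol = solve(ode, CarpenterKennedy2N54(williamson_condition = false),
            dt = 1.0, # overwritten by the stepsize_callback
            save_everystep = false, callback = callbacks)
summary_callback()
```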