stable MPI.COMM_WORLD for scaling out to hundreds of nodes #100
Previously, `initialize()` allowed creating MPI.COMM_WORLD after `import distributed`. However, as I tested on large-scale clusters with hundreds of nodes, once the number of workers grows (typically beyond 32), the `initialize()` function gets stuck, because some worker processes fail to receive the bcast'd scheduler address.

After a long time of debugging, I found that this is strongly related to the order of `import mpi4py` and `import distributed` (or `from distributed import ...`). My guess is that `distributed` makes some communication environment settings first, which then conflict with `mpi4py` when it tries to bootstrap MPI.COMM_WORLD afterwards.

By strictly requiring the user to create MPI.COMM_WORLD before calling `initialize()`, the problem no longer appears. In my tests, this scales out to more than 128 workers (maybe more, as my resources are limited) without any hanging issues.
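For clarity, here is a minimal sketch of the import/call order this change requires. It assumes the `initialize()` in question is `dask_mpi.initialize` and that the script is launched under `mpirun`/`mpiexec`; both the import path and the launch command are assumptions from context, not something stated in this PR.

```python
# Minimal sketch: bootstrap MPI.COMM_WORLD BEFORE anything that imports
# `distributed`, so mpi4py sets up its communication environment first.

# 1) Import mpi4py first; this initializes MPI and makes COMM_WORLD available.
from mpi4py import MPI

comm = MPI.COMM_WORLD  # communicator is created up front, before distributed

# 2) Only afterwards import packages that pull in `distributed`.
from dask_mpi import initialize  # assumed import path for initialize()
from distributed import Client

# 3) Call initialize() with COMM_WORLD already established; the scheduler
#    address is then bcast to workers over the existing communicator.
initialize()
client = Client()  # connect to the scheduler started by initialize()

# ... submit work through `client` as usual ...
```

Such a script would typically be launched with something like `mpirun -np 128 python script.py`, with rank 0 hosting the scheduler and the remaining ranks becoming workers.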