Add support for distributed training (multi-node and multi-GPU) via PyTorch DDP #2018
Overview
This PR adds support for distributed training (multi-node and multi-GPU) via PyTorch DDP.
It can be used in the following ways:

- Automatically, when `Learner.train()` is called.
- Via the `torchrun` CLI command. For example, to run on a single machine with 4 GPUs:

  ```
  torchrun --standalone --nnodes=1 --nproc-per-node=4 --no-python \
  rastervision run local rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py
  ```

Checklist

- Added the `needs-backport` label if the change should be back-ported to the previous release

Notes

- `Learner.eval_model()` has been renamed to `Learner.validate()`.
- New environment variables (a sketch of how they might be consumed follows this list):
  - `RASTERVISION_USE_DDP`: `YES` by default. Set to `NO` to disable distributed training.
  - `RASTERVISION_DDP_BACKEND`: `nccl` by default. This is the recommended backend for CUDA GPUs.
  - `RASTERVISION_DDP_START_METHOD`: One of `spawn`, `fork`, or `forkserver`. Default: `spawn`.
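
A minimal sketch of how these variables might be read. The `get_ddp_settings` helper below is hypothetical and not part of the actual implementation; only the variable names and defaults come from this PR.

```python
import os

def get_ddp_settings() -> dict:
    """Read the DDP-related env vars described above (illustrative only)."""
    use_ddp = os.environ.get('RASTERVISION_USE_DDP', 'YES').upper() != 'NO'
    backend = os.environ.get('RASTERVISION_DDP_BACKEND', 'nccl')
    start_method = os.environ.get('RASTERVISION_DDP_START_METHOD', 'spawn')
    if start_method not in ('spawn', 'fork', 'forkserver'):
        raise ValueError(f'invalid start method: {start_method}')
    return dict(use_ddp=use_ddp, backend=backend, start_method=start_method)
```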

- `spawn` is what the PyTorch documentation recommends (in fact, it doesn't even mention the alternatives), but it has the disadvantage that it requires everything to be pickleable, which rasterio dataset objects are not. This is also true for `forkserver`, which needs to spawn a server process. However, `fork` does not have the same limitation.
- When not using `fork`, we avoid building the dataset in the base process and instead delay it until the worker processes are created; a sketch of this pattern follows this list.
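
The pattern, roughly: pass a picklable factory to the spawned workers and build the rasterio-backed dataset inside each worker. This is only an illustration under those assumptions; `build_dataset` and `worker` are hypothetical stand-ins, not actual Raster Vision code.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def build_dataset():
    # Hypothetical factory: opens rasterio files and returns a torch Dataset.
    # Because only this callable is sent to the workers, the unpicklable
    # rasterio objects never have to cross the process boundary.
    ...

def worker(rank: int, world_size: int):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    dataset = build_dataset()  # built inside the worker process
    ...  # wrap in a DataLoader with a DistributedSampler and train

if __name__ == '__main__':
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```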

- When using `fork` or `forkserver`, the CUDA runtime must not be initialized before the fork happens; otherwise, a `RuntimeError: Cannot re-initialize CUDA in forked subprocess.` error will be raised. We avoid this by not calling any `torch.cuda` functions or creating tensors on the GPU before the worker processes are created. The snippet below illustrates the constraint.
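
A generic illustration of the constraint (not Raster Vision code): the parent process must stay CUDA-free until after the fork.

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int):
    # The first CUDA call happens here, in the child process, so the CUDA
    # runtime is initialized after the fork and no error is raised.
    x = torch.zeros(1, device=f'cuda:{rank}')

if __name__ == '__main__':
    # Anything that initializes CUDA in the parent first, e.g.
    #   torch.cuda.init()  or  torch.zeros(1, device='cuda')
    # would make the forked children fail with
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess."
    mp.start_processes(worker, nprocs=2, start_method='fork')
```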

- To keep multiple processes from downloading files (e.g. pretrained weights) to the same location at the same time, each process sets `TORCH_HOME` to `$TORCH_HOME/<local rank>`. And only the master process copies the downloaded files to the training directory.
- The world size is read from the `WORLD_SIZE` env var (which `torchrun` sets). A sketch of these last two points follows this list.
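
A sketch of both ideas, assuming the standard `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` variables that `torchrun` sets; the helper names are illustrative, not the actual implementation.

```python
import os
import shutil

def setup_torch_home(base_torch_home: str) -> None:
    # Give each process its own cache directory so concurrent downloads of
    # pretrained weights don't clobber each other.
    local_rank = os.environ.get('LOCAL_RANK', '0')
    os.environ['TORCH_HOME'] = os.path.join(base_torch_home, local_rank)

def is_distributed() -> bool:
    # torchrun sets WORLD_SIZE; more than one process means distributed mode.
    return int(os.environ.get('WORLD_SIZE', '1')) > 1

def copy_downloads_to_train_dir(train_dir: str) -> None:
    # Only the master process (rank 0) copies downloaded files into the
    # training directory, so the copies don't race with each other.
    if os.environ.get('RANK', '0') == '0':
        shutil.copytree(os.environ['TORCH_HOME'], train_dir, dirs_exist_ok=True)
```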

Testing Instructions

Run the tiny_spacenet example via `torchrun`:

```
torchrun --standalone --nnodes=1 --nproc-per-node=1 --no-python \
rastervision run local rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py
```

Much of the new code is excluded from test coverage (`# pragma: no cover`) since the CI machine does not have a GPU.

Sample output from running `isprs_potsdam.py` on a p3.8xlarge instance.

Closes #1352