group convolution error when using DDP training #1992
Unanswered · AnonymousAccount6688 asked this question in Contributing · 1 comment, 2 replies
@AnonymousAccount6688 Depthwise (DW) models work fine for me. This is probably caused by modifications to the train script, or by added special cases for rank 0 that are breaking things.
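One way to check the suggestion above is to compare parameter shapes and strides across ranks just before the model is wrapped in DDP. The following is a minimal sketch, not part of the original train script: the helper names `param_layouts`, `layout_mismatches`, and `report_rank_mismatches` are hypothetical, and it assumes the process group is already initialized when it is called under DDP.

```python
import torch
import torch.distributed as dist


def param_layouts(model):
    """Map each parameter name to its (shape, stride) pair."""
    return {
        name: (tuple(p.shape), tuple(p.stride()))
        for name, p in model.named_parameters()
    }


def layout_mismatches(reference, other):
    """Return the names whose (shape, stride) in `other` differ from `reference`."""
    return [name for name, layout in reference.items() if other.get(name) != layout]


def report_rank_mismatches(model):
    """Hypothetical helper: call on every rank right before the DDP wrapper."""
    local = param_layouts(model)
    if not (dist.is_available() and dist.is_initialized()):
        return []  # single-process run: there is no other rank to compare with
    gathered = [None] * dist.get_world_size()
    # With the NCCL backend, the CUDA device for this rank must be set beforehand.
    dist.all_gather_object(gathered, local)  # gathered[0] holds rank 0's layouts
    bad = layout_mismatches(gathered[0], local)
    for name in bad:
        print(
            f"rank {dist.get_rank()}: {name} is {local.get(name)}, "
            f"rank 0 has {gathered[0][name]}"
        )
    return bad
```

Calling `report_rank_mismatches(model)` immediately before the `NativeDDP(...)` line should show which parameter differs and on which rank, which can help locate a rank-0-only code path.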
I tried to use group convolution with the following line of code:
dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)
But got the following error:
`
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1024, in <module>
main(args)
File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 514, in main
model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.
[The same traceback, ending in the same RuntimeError for params[6] with sizes [64, 1, 3, 3], is repeated (interleaved) by the other failing processes.]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022927 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022928 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022929 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022930 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022931 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 1022932) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python311/bin/python
`
When I changed it to
dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)
Everything works fine.
Is there anything wrong with the DDP training of GroupConv?
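For context on why only the `groups=64` variant is affected, here is a small sketch of what changes between the two layers. It assumes the layer is a `torch.nn.Conv2d`, which matches the 4-D sizes `[64, 1, 3, 3]` in the error. With `groups=64` the weight has a size-1 channel dimension, and the stride of a size-1 dimension is arbitrary: two different stride tuples can describe exactly the same values. Judging by the error message, the DDP check compares strides as well as sizes, so such a difference between ranks could plausibly trigger it. The alternative stride tuple below is chosen purely for illustration.

```python
import torch

dw = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)   # depthwise: one input channel per group
full = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)  # ordinary convolution

print(tuple(dw.weight.shape))    # (64, 1, 3, 3)
print(tuple(full.weight.shape))  # (64, 64, 3, 3)

w = dw.weight.detach()
# Same sizes and values, but a different stride for the size-1 dimension.
alt = torch.empty_strided(w.shape, (9, 1, 3, 1)).copy_(w)

print(w.stride())           # (9, 9, 3, 1)
print(alt.stride())         # (9, 1, 3, 1)
print(torch.equal(w, alt))  # True
```

The `groups=1` weight has no size-1 dimension, so its strides are fully determined by its layout, which is consistent with that variant not hitting the error.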