group convolution error when using DDP training #1992

AnonymousAccount6688 · 2023-10-17T14:47:25Z

I tried to use group convolution with the following line of code:

dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)

But got the following error:

```
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1024, in <module>
    main(args)
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 514, in main
    model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python311/lib/python3.11/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022927 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022928 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022929 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022930 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1022931 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 1022932) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python311/bin/python
```

When I changed it to

dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1)

everything works fine.

Is there anything wrong with the DDP training of GroupConv?

rwightman · 2023-10-17T18:19:26Z

Maintainer

@AnonymousAccount6688 Depthwise (DW) models train fine under DDP for me. The cause is probably a modification in your train script, or an added special case for rank 0, that is breaking things.
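One way to look for a rank-dependent difference is to print each parameter's shape and stride on every rank just before the model is wrapped in DDP, then compare against rank 0. A minimal sketch, assuming the index in `params[6]` follows `named_parameters()` order over trainable parameters; `describe_params` and the two-layer `model` are hypothetical stand-ins, not part of the train script:

```python
import torch


def describe_params(model):
    """Return (name, shape, stride) for every parameter, in registration order."""
    return [
        (name, tuple(p.shape), p.stride())
        for name, p in model.named_parameters()
    ]


# Hypothetical stand-in model: a pointwise conv followed by a depthwise conv.
model = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64, 1),
    torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64),
)

# Run this on each rank before the DDP wrapper is constructed.
for idx, (name, shape, stride) in enumerate(describe_params(model)):
    print(idx, name, shape, stride)
```

If the output for one index differs between ranks, the code path that builds or converts that layer is probably not identical on all ranks.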

2 replies

AnonymousAccount6688 Oct 17, 2023
Author

Thank you for the reply.

I just tried to add a convolution after a Transformer block. I added one line, self.dw_conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=1), in __init__, plus the call x = self.dw_conv(x), and everything works fine. Changing nothing else except setting groups=64, I get the error above: params[6] in this process with sizes [64, 1, 3, 3] appears not to match strides of the same param in process 0.

May I know which part I should modify?
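Note that the error complains about strides, not sizes. A depthwise weight of shape [64, 1, 3, 3] has a size-1 dimension, so more than one stride layout describes the same memory, and PyTorch treats both as contiguous. The sketch below only demonstrates that ambiguity; whether it is what happens in this training run is an assumption (it could arise, for example, if a memory-format conversion were applied on some ranks only):

```python
import torch

shape = (64, 1, 3, 3)

# Default (row-major) strides for this shape.
a = torch.empty_strided(shape, (9, 9, 3, 1))
# Channels-last style strides for the same shape.
b = torch.empty_strided(shape, (9, 1, 3, 1))

# Same sizes, both report as contiguous, yet the stride tuples differ.
print(a.shape == b.shape)
print(a.is_contiguous(), b.is_contiguous())
print(a.stride(), b.stride())
```

With groups=1 the weight has shape [64, 64, 3, 3], with no size-1 dimension, so the two layouts are no longer interchangeable in this way.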

AnonymousAccount6688 Oct 19, 2023
Author

This seems to be a problem with PyTorch's NativeDDP:

from torch.nn.parallel import DistributedDataParallel as NativeDDP

When I used NVIDIA Apex DDP instead, everything worked fine.
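If the ranks really differ only in stride layout, one thing to try with native DDP is rewriting every parameter into the default contiguous layout before wrapping the model. This is an untested sketch for this setup, not a confirmed fix; `normalize_param_strides` is a hypothetical helper, and the odd layout is simulated here with `torch.empty_strided`:

```python
import torch


def normalize_param_strides(model):
    """Rewrite each parameter in place so it uses default (row-major) strides."""
    with torch.no_grad():
        for p in model.parameters():
            # clone() with contiguous_format allocates a fresh tensor with
            # default strides, even if the source already counts as contiguous.
            p.data = p.data.clone(memory_format=torch.contiguous_format)
    return model


conv = torch.nn.Conv2d(64, 64, 3, 1, 1, groups=64)
# Simulate a weight with channels-last style strides (values are uninitialized).
conv.weight.data = torch.empty_strided((64, 1, 3, 3), (9, 1, 3, 1))

normalize_param_strides(conv)
print(conv.weight.stride())
```

Calling the helper on every rank right before the DDP wrapper is built would make the layouts agree, assuming a stride-only mismatch is the actual problem.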
