Error when training the model with two RTX 3090 GPUs #23

gushengbo · 2024-03-31T07:41:58Z

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[INFO] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

[INFO] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[INFO] LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Traceback (most recent call last):
  File "/home/shengbo/HumanGaussian-main/launch.py", line 239, in <module>
    main(args, extras)
  File "/home/shengbo/HumanGaussian-main/launch.py", line 182, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 963, in _run
    self.strategy.setup(self)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 171, in setup
    self.configure_ddp()
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 283, in configure_ddp
    self.model = self._setup_model(self.model)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 195, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 678, in __init__
    self._log_and_throw(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.

Traceback (most recent call last):
  File "/home/shengbo/HumanGaussian-main/launch.py", line 239, in <module>
    main(args, extras)
  File "/home/shengbo/HumanGaussian-main/launch.py", line 182, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 963, in _run
    self.strategy.setup(self)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 171, in setup
    self.configure_ddp()
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 283, in configure_ddp
    self.model = self._setup_model(self.model)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 195, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 678, in __init__
    self._log_and_throw(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
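
The error message says that the module handed to DistributedDataParallel has no parameter with `requires_grad=True`. One way to confirm this before calling `trainer.fit` might be a small check like the sketch below. It is only an illustration: `FakeParam` and `FakeModule` are hypothetical stand-ins so the snippet runs without PyTorch, and the helper names `count_trainable` and `ddp_would_reject` are made up here. The only assumption about the real model is that it exposes a `parameters()` iterator, as `torch.nn.Module` does.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter; only the flag matters here."""
    requires_grad: bool = True


@dataclass
class FakeModule:
    """Hypothetical stand-in for torch.nn.Module, exposing parameters()."""
    params: List[FakeParam] = field(default_factory=list)

    def parameters(self) -> Iterable[FakeParam]:
        return iter(self.params)


def count_trainable(module) -> int:
    """Count parameters with requires_grad=True on any object that has parameters()."""
    return sum(1 for p in module.parameters() if p.requires_grad)


def ddp_would_reject(module) -> bool:
    """True when no parameter requires a gradient, the condition named in the error."""
    return count_trainable(module) == 0


if __name__ == "__main__":
    empty = FakeModule()                        # no registered parameters at all
    frozen = FakeModule([FakeParam(False)])     # parameters exist but are all frozen
    mixed = FakeModule([FakeParam(True), FakeParam(False)])
    for name, m in [("empty", empty), ("frozen", frozen), ("mixed", mixed)]:
        print(name, count_trainable(m), ddp_would_reject(m))
```

With the real model, the same helper could be called as `count_trainable(system)` just before `trainer.fit(system, ...)` in `launch.py`. A result of zero would match the condition in the error message, either because nothing is registered as a parameter at that point or because everything is frozen.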
