My server has four NVIDIA RTX 4090 GPUs. Single-GPU training runs without errors, but when I change the batch size to 2 (with no other parameters changed), training fails after completing just one epoch. I also wanted to try multi-GPU training, but it keeps throwing the same error. I searched online for solutions, but none of them resolve the issue. The error message is as follows:
Traceback (most recent call last):
  File "./train.py", line 186, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 280, in train
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 391, in train_epoch
    output = model(in_vol)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
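For context, this RuntimeError is raised by torch.nn.DataParallel: before each forward pass it checks that every parameter and buffer of the wrapped module lives on device_ids[0] (cuda:0 by default), and it fails if any of them ended up on another GPU. Below is a minimal sketch of the usual setup, assuming the trainer wraps the model with DataParallel; the toy model and tensor shapes are placeholders rather than SalsaNext's actual network, and device_ids=[0, 1, 2, 3] assumes all four GPUs should be used.

```python
import torch
import torch.nn as nn

# Placeholder network -- SalsaNext's real model is built elsewhere in the repo;
# this stand-in only illustrates the device placement DataParallel expects.
model = nn.Sequential(nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU())

if torch.cuda.device_count() > 1:
    # 1) Move ALL parameters and buffers to device_ids[0] (cuda:0) first.
    model = model.to("cuda:0")
    # 2) Only then wrap the model; DataParallel replicates it onto the other
    #    GPUs on every forward pass and scatters the input batch across them.
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
elif torch.cuda.is_available():
    model = model.to("cuda:0")

# The input only needs to be on the source device; DataParallel splits it.
in_vol = torch.randn(4, 5, 64, 2048, device="cuda:0")
output = model(in_vol)
print(output.shape)
```

If the wrapping order is already correct, another common cause is code inside the model or the trainer that explicitly places a layer or buffer on a different GPU (e.g. `.cuda(1)` or `.to("cuda:1")`); that also triggers this check. PyTorch's documentation also recommends DistributedDataParallel (one process per GPU) over DataParallel for multi-GPU training.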