About the issue of multi-GPU training. #79

do1nothing · 2023-12-31T17:46:05Z

My server has four NVIDIA 4090 GPUs. Single-GPU training runs without errors, but when I change only the batch size to 2 (still on a single GPU), it throws an error after completing just one epoch. No other parameters have been changed. I also wanted to try multi-GPU training, but it keeps throwing errors. I searched online for solutions, but none of them resolve the issue. The error message from the multi-GPU run is as follows:
Traceback (most recent call last):
  File "./train.py", line 186, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 280, in train
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 391, in train_epoch
    output = model(in_vol)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
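
For context, this RuntimeError comes from the check at the start of `nn.DataParallel.forward`, which requires every parameter and buffer of the wrapped module to sit on `device_ids[0]` before the inputs are scattered to the other GPUs. The sketch below is only an illustration of that requirement, not the repository's trainer code: `wrap_for_data_parallel` and `params_off_device` are hypothetical helper names, and the `nn.Linear` model is a stand-in for the real network. It moves the model to the primary device before wrapping it and lists any tensors that are still elsewhere.

```python
import itertools

import torch
import torch.nn as nn


def wrap_for_data_parallel(model, device_ids=None):
    """Move the model to device_ids[0], then wrap it in nn.DataParallel.

    Hypothetical helper: returns (model, primary_device).
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        # No CUDA device visible: keep the model on the CPU, unwrapped.
        return model, torch.device("cpu")
    if device_ids is None:
        device_ids = list(range(n_gpus))
    primary = torch.device("cuda", device_ids[0])
    # DataParallel expects all parameters and buffers on device_ids[0].
    model = model.to(primary)
    if len(device_ids) > 1:
        model = nn.DataParallel(model, device_ids=device_ids)
    return model, primary


def params_off_device(model, device):
    """Return names of parameters/buffers that are not on `device`."""
    tensors = itertools.chain(model.named_parameters(), model.named_buffers())
    return [name for name, t in tensors if t.device != device]


# Stand-in model; the real network would be built by the training script.
model, device = wrap_for_data_parallel(nn.Linear(4, 2))
print("tensors off the primary device:", params_off_device(model, device))

in_vol = torch.randn(8, 4).to(device)  # inputs also start on the primary device
output = model(in_vol)
```

If `params_off_device` reports any names, those tensors were probably created with an explicit device or moved after the model was wrapped, which is one common way to end up with the message above. Restricting the visible GPUs with the `CUDA_VISIBLE_DEVICES` environment variable is another way to control which physical card becomes `cuda:0`.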
