When I train with moby_main, memory usage on the Linux host keeps growing until the process crashes. What causes this, and how can I fix it?
The error is:
Traceback (most recent call last):
  File "moby_main.py", line 236, in <module>
    main(config)
  File "moby_main.py", line 121, in main
    train_one_epoch(config, model, data_loader_train, optimizer, epoch, lr_scheduler)
  File "moby_main.py", line 151, in train_one_epoch
    scaled_loss.backward()
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2605) is killed by signal: Killed.
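For anyone trying to narrow this down: a worker killed by "signal: Killed" is often the Linux out-of-memory killer, though the traceback alone does not prove that. Below is a minimal diagnostic sketch, not part of moby_main.py, for checking whether resident memory really grows from step to step. The helper names `rss_mib` and `log_rss` are hypothetical, and it assumes a Linux host where `/proc/<pid>/status` is readable.

```python
import os


def rss_mib(pid="self"):
    """Current resident set size of a process in MiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:   123456 kB".
                return int(line.split()[1]) / 1024
    return 0.0  # no VmRSS entry, e.g. for kernel threads


def log_rss(step, every=100):
    """Print the main process RSS every `every` steps (hypothetical helper)."""
    if step % every == 0:
        print(f"step {step}: pid {os.getpid()} RSS = {rss_mib():.1f} MiB")


if __name__ == "__main__":
    # Stand-in for a training loop; call log_rss(step) inside train_one_epoch instead.
    for step in range(300):
        log_rss(step)
```

DataLoader workers run as separate processes, so their memory does not show up in the main process's RSS. Passing a worker's pid, for example `rss_mib(2605)` while that worker is still alive, shows its usage. If the numbers climb steadily, the kernel log (`dmesg`) may confirm whether the OOM killer was involved.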