I am trying to train the model on the Vimeo90K dataset, but I get an "EOFError: Ran out of input" error. I was able to train the flow estimator successfully, but this error occurs when training the whole framework. I ran the model on one A6000 GPU with the default num_workers set to 2. Any ideas?
  File "/data/projects/chaeyun/VFIformer/models/archs/VFIformer_arch.py", line 346, in __init__
    self.load_networks('flownet', args.resume_flownet)
  File "/data/projects/chaeyun/VFIformer/models/archs/VFIformer_arch.py", line 354, in load_networks
    load_net = torch.load(load_path, map_location=torch.device(self.device))
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 199954) of binary: /home/chaeyun/.conda/envs/vfiformer/bin/python
Traceback (most recent call last):
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chaeyun/.conda/envs/vfiformer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Here is the command I ran, with its arguments:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=4178 train.py --launcher pytorch --gpu_ids 0 --loss_l1 --loss_ter --loss_flow --use_tb_logger --batch_size 128 --net_name VFIformer --name train_VFIformer --max_iter 300 --crop_size 192 --save_epoch_freq 5 --resume_flownet ./weights/train_IFNet/snapshot/net_final.pth
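One thing that may be worth checking: the traceback fails at the very first `pickle_module.load` call inside `torch.load`, and Python's pickle raises exactly "EOFError: Ran out of input" when it is handed a zero-byte file. That suggests, though it does not prove, that the file passed to --resume_flownet is empty, for example because an earlier save was interrupted. Below is a minimal sketch of that check using only the standard library. `describe_checkpoint` is a hypothetical helper, not part of VFIformer or PyTorch, and the temporary file is a stand-in for the real checkpoint.

```python
import os
import pickle
import tempfile


def describe_checkpoint(path):
    """Return a short diagnosis of a checkpoint file without loading it.

    Hypothetical helper for illustration; not part of VFIformer or PyTorch.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


# Reproduce the failure mode with the standard library only:
# unpickling a zero-byte file raises the same EOFError as in the traceback.
with tempfile.TemporaryDirectory() as tmp:
    # Stand-in file, assumed for the demo; it is not the real checkpoint.
    empty_path = os.path.join(tmp, "net_final.pth")
    open(empty_path, "wb").close()

    status = describe_checkpoint(empty_path)

    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)
        error_message = None
    except EOFError as exc:
        error_message = str(exc)

print(status, "|", error_message)
```

Running `describe_checkpoint("./weights/train_IFNet/snapshot/net_final.pth")` from the directory where train.py is launched would show whether the flow estimator weights are missing or empty. If they are, re-saving or re-copying that checkpoint would be the first thing to try.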