This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
I followed the tutorial to launch torchelastic and ran the same command on two nodes; both started successfully. But when I killed the process on one node with Ctrl-C, the other node also aborted. Here is the traceback, in case it helps:
Traceback (most recent call last):
File "mnmc_ddp_launch.py", line 119, in <module>
main()
File "mnmc_ddp_launch.py", line 90, in main
outputs = net(inputs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 696, in forward
self._sync_params()
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1222, in _sync_params
self._distributed_broadcast_coalesced(
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted.
[ERROR] 2021-06-03 19:39:40,572 api: failed (exitcode: 1) local_rank: 0 (pid: 22262) of binary: /home/zhongyinmin/anaconda3/bin/python
[ERROR] 2021-06-03 19:39:40,572 local_elastic_agent: [default] Worker group failed
[INFO] 2021-06-03 19:39:40,572 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Stopping worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Rendezvous'ing worker group
INFO 2021-06-03 19:39:40,573 Attempting to join next rendezvous
INFO 2021-06-03 19:39:40,582 Observed existing rendezvous state: {'status': 'closed', 'version': '1', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_1/rdzv/v_1/rank_1', '/torchelastic/p2p/run_1/rdzv/v_1/rank_0'], 'num_workers_waiting': 0}
INFO 2021-06-03 19:39:40,582 Rendezvous for run_id=1 was observed to be closed
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "1", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "k8s-master", "state": "FAILED", "total_run_time": 80, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n run_result = elastic_agent.run(spec.role)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n result = self._invoke_run(role)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n self._restart_workers(self._worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n self._initialize_workers(worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File 
\"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 154, in next_rendezvous\n rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n return self.init_phase()\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n", "metadata": "{\"group_world_size\": 2, \"entry_point\": \"python\"}", "agent_restarts": 1}}
[ERROR] 2021-06-03 19:39:40,588 error_handler: {
"message": {
"message": "RendezvousClosedException: ",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py\", line 320, in wrapper\n return f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n run_result = elastic_agent.run(spec.role)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n result = self._invoke_run(role)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n self._restart_workers(self._worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n self._initialize_workers(worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n result = f(*args, **kwargs)\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 154, in next_rendezvous\n rdzv_version, rank, 
world_size = self._rdzv_impl.rendezvous_barrier()\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n return self.init_phase()\n File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n",
"timestamp": "1622720380"
}
}
}
Traceback (most recent call last):
File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 561, in <module>
main()
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py", line 320, in wrapper
return f(*args, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 531, in main
run_result = elastic_agent.run(spec.role)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
result = f(*args, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 831, in _invoke_run
self._restart_workers(self._worker_group)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
result = f(*args, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 674, in _restart_workers
self._initialize_workers(worker_group)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
result = f(*args, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 654, in _initialize_workers
self._rendezvous(worker_group)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
result = f(*args, **kwargs)
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 518, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 154, in next_rendezvous
rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 287, in rendezvous_barrier
return self.init_phase()
File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 349, in init_phase
raise RendezvousClosedException()
torchelastic.rendezvous.api.RendezvousClosedException
I searched for this exception, and its description says: "This Exception is raised when a rendezvous for the specified run_id is closed. This is used to signal completion to nodes that arrive late." I don't understand what that means. Also, when I ran the same command again with --rdzv_id still set to 1, the error occurred again until I changed --rdzv_id to another number. Can't I reuse the same --rdzv_id across different jobs?
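To make the quoted description concrete, here is a minimal toy model of the behavior seen in the log, where the state for run_id=1 is observed as 'closed' and a later join fails. It is only a sketch under stated assumptions: `ToyRendezvousStore`, `RendezvousClosedError`, and `fresh_run_id` are hypothetical names invented for illustration, and the in-memory dictionary stands in for the etcd backend. It does not reproduce torchelastic's actual implementation.

```python
import uuid


class RendezvousClosedError(Exception):
    """Toy stand-in for torchelastic's RendezvousClosedException (hypothetical)."""


class ToyRendezvousStore:
    """Hypothetical in-memory model of per-run_id rendezvous state.

    Assumption: the state is keyed by run_id and outlives any single
    process, as the etcd-backed state in the log appears to.
    """

    def __init__(self):
        self._state = {}  # run_id -> {"status": str, "participants": list}

    def join(self, run_id):
        state = self._state.setdefault(
            run_id, {"status": "joinable", "participants": []}
        )
        if state["status"] == "closed":
            # A node arriving after the rendezvous was closed is rejected.
            raise RendezvousClosedError(run_id)
        rank = len(state["participants"])
        state["participants"].append(rank)
        return rank

    def close(self, run_id):
        # Mark the rendezvous for this run_id as permanently finished.
        self._state[run_id]["status"] = "closed"


def fresh_run_id(prefix="job"):
    """Build a run id that is very unlikely to collide with an earlier job."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


if __name__ == "__main__":
    store = ToyRendezvousStore()
    store.join("1")
    store.join("1")
    store.close("1")
    try:
        store.join("1")  # reusing the closed id fails
    except RendezvousClosedError as exc:
        print("rendezvous closed for run_id", exc)
    print("joined new run with rank", store.join(fresh_run_id()))
```

In this model, a run id that has been closed stays closed, which matches the observation that the error persisted until --rdzv_id was changed. Generating a fresh id per job, as `fresh_run_id` does, is one way to avoid hitting leftover state; whether the old state can instead be cleared from the backend is not something the log above settles.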