This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Cannot reuse --rdzv_id between different elastic launches? #151

Open
PKUFlyingPig opened this issue Jun 3, 2021 · 0 comments

Question

I followed the tutorial and used the following command to launch a torchelastic job:

export NUM_TRAINERS=2
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=$NUM_TRAINERS \
    --rdzv_id=1 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=162.105.19.156:2379 \
    mnmc_ddp_launch.py

I ran the same command on two nodes, and both started successfully. But when I killed the process on one node with Ctrl-C, the other node also aborted. Here is the traceback from the surviving node, in case it helps:

Traceback (most recent call last):
  File "mnmc_ddp_launch.py", line 119, in <module>
    main()
  File "mnmc_ddp_launch.py", line 90, in main
    outputs = net(inputs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 696, in forward
    self._sync_params()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1222, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted.
[ERROR] 2021-06-03 19:39:40,572 api: failed (exitcode: 1) local_rank: 0 (pid: 22262) of binary: /home/zhongyinmin/anaconda3/bin/python
[ERROR] 2021-06-03 19:39:40,572 local_elastic_agent: [default] Worker group failed
[INFO] 2021-06-03 19:39:40,572 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Stopping worker group
[INFO] 2021-06-03 19:39:40,573 api: [default] Rendezvous'ing worker group
INFO 2021-06-03 19:39:40,573 Attempting to join next rendezvous
INFO 2021-06-03 19:39:40,582 Observed existing rendezvous state: {'status': 'closed', 'version': '1', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_1/rdzv/v_1/rank_1', '/torchelastic/p2p/run_1/rdzv/v_1/rank_0'], 'num_workers_waiting': 0}
INFO 2021-06-03 19:39:40,582 Rendezvous for run_id=1 was observed to be closed
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "1", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "k8s-master", "state": "FAILED", "total_run_time": 80, "rdzv_backend": "etcd", "raw_error": "Traceback (most recent call last):\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n    run_result = elastic_agent.run(spec.role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n    result = self._invoke_run(role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n    self._restart_workers(self._worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n    self._initialize_workers(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File 
\"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 154, in next_rendezvous\n    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n    return self.init_phase()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n    raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n", "metadata": "{\"group_world_size\": 2, \"entry_point\": \"python\"}", "agent_restarts": 1}}
[ERROR] 2021-06-03 19:39:40,588 error_handler: {
  "message": {
    "message": "RendezvousClosedException: ",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py\", line 320, in wrapper\n    return f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py\", line 531, in main\n    run_result = elastic_agent.run(spec.role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 680, in run\n    result = self._invoke_run(role)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 831, in _invoke_run\n    self._restart_workers(self._worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 674, in _restart_workers\n    self._initialize_workers(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 654, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py\", line 126, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py\", line 518, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 
154, in next_rendezvous\n    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 287, in rendezvous_barrier\n    return self.init_phase()\n  File \"/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py\", line 349, in init_phase\n    raise RendezvousClosedException()\ntorchelastic.rendezvous.api.RendezvousClosedException\n",
      "timestamp": "1622720380"
    }
  }
}
Traceback (most recent call last):
  File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhongyinmin/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 561, in <module>
    main()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/multiprocessing/errors/__init__.py", line 320, in wrapper
    return f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line 531, in main
    run_result = elastic_agent.run(spec.role)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 831, in _invoke_run
    self._restart_workers(self._worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 674, in _restart_workers
    self._initialize_workers(worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 654, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/metrics/api.py", line 126, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/agent/server/api.py", line 518, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 154, in next_rendezvous
    rdzv_version, rank, world_size = self._rdzv_impl.rendezvous_barrier()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 287, in rendezvous_barrier
    return self.init_phase()
  File "/home/zhongyinmin/anaconda3/lib/python3.8/site-packages/torchelastic/rendezvous/etcd_rendezvous.py", line 349, in init_phase
    raise RendezvousClosedException()
torchelastic.rendezvous.api.RendezvousClosedException
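
For readers following the log: the agent observed a rendezvous state with 'status': 'closed' for run_id=1, and init_phase then raised RendezvousClosedException. The snippet below is only a toy model of that behavior, not torchelastic's code. The class names ToyRendezvousStore and RendezvousClosedError are hypothetical, the state is kept in a plain dict instead of etcd, and whether or when the real closed marker expires is not modeled.

```python
class RendezvousClosedError(Exception):
    """Stand-in for torchelastic's RendezvousClosedException (toy model only)."""


class ToyRendezvousStore:
    """Hypothetical in-memory store keyed by run_id; the real state lives in etcd."""

    def __init__(self):
        self._status = {}  # run_id -> "joinable" or "closed"

    def join(self, run_id):
        # A run_id that has never been seen starts out joinable.
        status = self._status.setdefault(run_id, "joinable")
        if status == "closed":
            # A late or restarted node sees the closed marker and gives up.
            raise RendezvousClosedError(f"rendezvous for run_id={run_id} is closed")
        return status

    def close(self, run_id):
        # Once closed, the marker stays for as long as this store keeps it.
        self._status[run_id] = "closed"


store = ToyRendezvousStore()
store.join("1")          # first launch joins fine
store.close("1")         # the rendezvous for run "1" gets marked closed
try:
    store.join("1")      # a relaunch (or restarted agent) with the same id
    reused = True
except RendezvousClosedError:
    reused = False
fresh = store.join("2")  # a different id starts from a clean state
```

In this model, any attempt to join run "1" after it is closed fails, while a new id succeeds, which matches the symptom reported here.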

I searched for this exception, and its documentation says: "This Exception is raised when a rendezvous for the specified run_id is closed. This is used to signal completion to nodes that arrive late." I don't understand what that means. Also, when I rerun the same command with --rdzv_id still set to 1, the error occurs again, and it only goes away once I change --rdzv_id to another number. Is it not possible to reuse the same --rdzv_id across different jobs?
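
The workaround described above, changing --rdzv_id for each new job, can be sketched as a small helper that builds the launch command with a fresh id. This is a minimal sketch under assumptions: build_launch_command is a hypothetical helper, the flags are copied from the command earlier in this issue, and using a UUID as the id is just one possible choice. Since every node of one job has to pass the same id, it should be generated once and then handed to all nodes.

```python
import shlex
import uuid


def build_launch_command(script, rdzv_endpoint, nproc_per_node=2, rdzv_id=None):
    """Return the launcher argv; a fresh rdzv_id is generated when none is given."""
    # Assumption: a random hex UUID is acceptable as a job id for the backend.
    rdzv_id = rdzv_id or uuid.uuid4().hex
    return [
        "python", "-m", "torchelastic.distributed.launch",
        "--nnodes=1:4",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=etcd",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
    ]


# Generate the command once, then run the same printed line on every node.
cmd = build_launch_command("mnmc_ddp_launch.py", "162.105.19.156:2379")
print(shlex.join(cmd))
```

Passing rdzv_id explicitly (for example, an id chosen by a job scheduler) skips the UUID generation and keeps the rest of the command unchanged.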
