This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Is this from torch-1.10 or torchelastic-0.1.0rc1? If the former, can you move this issue to pytorch, tag it with the "elastic" tag, and assign it to me for now? Thanks!
🐛 Bug
Component (check all that apply):
state api
train_step api
train_loop
rendezvous
checkpoint
rollback
metrics
petctl
examples
docker
To Reproduce
Steps to reproduce the behavior:
LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello
The hostname is sh-db2kkt73p534vd-sh-0-0 but Volcano gives the address sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd. Between hosts, resolv.conf and hostname there's all the information required to realize that these addresses are equivalent, but the current logic isn't sufficient:
https://github.com/pytorch/pytorch/blob/1b745efbe8ee0ac3bae594ea88ff27e71a734c88/torch/distributed/elastic/rendezvous/utils.py#L110
We may want to do a full DNS resolution on the address and check whether it matches any of the local IP addresses.
Expected behavior
It realizes the hostname is the current node and starts the c10d server.
Environment
How installed (conda, pip, source, docker): docker
Additional context