This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

rendezvous: _matches_machine_hostname doesn't resolve hostnames fully #165

Open
1 of 11 tasks
d4l3k opened this issue Feb 24, 2022 · 2 comments

Comments

@d4l3k
Member

d4l3k commented Feb 24, 2022

🐛 Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Launch a 2 node job on Kubernetes+Volcano
  2. LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello
  3. rendezvous times out since the rank 0 host doesn't realize it's the master due to insufficient hostname resolution
root@sh-db2kkt73p534vd-sh-0-0:/app# echo $VC_SH_0_HOSTS
sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd
root@sh-db2kkt73p534vd-sh-0-0:/app# hostname
sh-db2kkt73p534vd-sh-0-0
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/resolv.conf 
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/hosts    
# Kubernetes-managed hosts file.
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
fe00::0	ip6-mcastprefix
fe00::1	ip6-allnodes
fe00::2	ip6-allrouters
192.168.15.246	sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local	sh-db2kkt73p534vd-sh-0-0

The hostname is sh-db2kkt73p534vd-sh-0-0, but Volcano gives the address sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd. Between /etc/hosts, /etc/resolv.conf, and the hostname, there is enough information to determine that these addresses are equivalent, but the current logic isn't sufficient.
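
To illustrate why the two names are equivalent, here is a minimal sketch that expands the endpoint with the resolv.conf search domains and compares the results against the names on the /etc/hosts line for the local hostname. The function names (parse_search_domains, names_sharing_entry, refers_to_this_host) are hypothetical and are not part of torch; the file contents are copied from the output above.

```python
from typing import List, Set

# File contents taken from the reproduction output in this issue.
RESOLV_CONF = """\
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
"""

HOSTS = """\
# Kubernetes-managed hosts file.
127.0.0.1 localhost
192.168.15.246 sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local sh-db2kkt73p534vd-sh-0-0
"""


def parse_search_domains(resolv_conf_text: str) -> List[str]:
    """Return the domains from the last `search` line of resolv.conf text."""
    domains: List[str] = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            domains = fields[1:]
    return domains


def names_sharing_entry(hosts_text: str, hostname: str) -> Set[str]:
    """Return every name on an /etc/hosts line that also lists `hostname`."""
    names: Set[str] = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if hostname in fields[1:]:              # fields[0] is the IP address
            names.update(fields[1:])
    return names


def refers_to_this_host(endpoint: str, hostname: str,
                        hosts_text: str, resolv_conf_text: str) -> bool:
    """Hypothetical check: does `endpoint`, as given or with a search domain
    appended, appear among the names /etc/hosts assigns to `hostname`?"""
    candidates = {endpoint}
    candidates.update(f"{endpoint}.{d}" for d in parse_search_domains(resolv_conf_text))
    return bool(candidates & names_sharing_entry(hosts_text, hostname))


print(refers_to_this_host(
    "sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd",  # $VC_SH_0_HOSTS
    "sh-db2kkt73p534vd-sh-0-0",                    # output of `hostname`
    HOSTS,
    RESOLV_CONF,
))
```

This only mirrors what the system resolver would do with these two files; it ignores ndots ordering and case-insensitive matching, so it is an illustration rather than a proposed fix.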

https://github.com/pytorch/pytorch/blob/1b745efbe8ee0ac3bae594ea88ff27e71a734c88/torch/distributed/elastic/rendezvous/utils.py#L110

We may want to do a full DNS resolution on the address and check whether it matches any of the local IP addresses.
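
A rough sketch of that idea, using only the standard library: resolve the endpoint with socket.getaddrinfo and intersect the result with addresses known to belong to this machine. The helper names (resolve_ips, local_ips, is_local_endpoint) are hypothetical and this is not the existing torch implementation. It assumes that resolving the machine's own hostname yields its primary address, which is true for the /etc/hosts shown above but does not enumerate every network interface.

```python
import socket
from typing import Set


def resolve_ips(host: str) -> Set[str]:
    """Resolve `host` through the system resolver (/etc/hosts, search domains, DNS)."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()  # an unresolvable name matches nothing
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def local_ips() -> Set[str]:
    """Addresses assumed to be local: loopback plus whatever our hostname resolves to."""
    ips = {"127.0.0.1", "::1"}
    ips |= resolve_ips(socket.gethostname())
    return ips


def is_local_endpoint(host: str) -> bool:
    """Hypothetical check: does `host` resolve to any address of this machine?"""
    return bool(resolve_ips(host) & local_ips())
```

On the node above, is_local_endpoint("sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd") would be expected to match, because the resolver appends default.svc.cluster.local and finds 192.168.15.246, the same address the short hostname maps to in /etc/hosts.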

Expected behavior

The launcher recognizes that the rendezvous endpoint hostname refers to the current node and starts the c10d server.

Environment

  • torchelastic version (e.g. 0.1.0rc1):
  • OS (e.g., Linux): Linux sh-db2kkt73p534vd-sh-0-0 4.14.241-184.433.amzn2.x86_64 #1 SMP Wed Aug 4 14:35:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • How you installed torchelastic (conda, pip, source, docker): docker
  • Docker image and tag (if using docker): https://github.com/pytorch/torchx/pkgs/container/torchx/15644476?tag=0.1.2dev0
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version: 3.7.11
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc): EKS + Volcano
  • Any other relevant information:

Additional context

@kiukchung
Contributor

kiukchung commented Feb 24, 2022

is this from torch-1.10 or torchelastic-0.1.0rc1? if the former, then can you move this issue to pytorch and tag it with the "elastic" tag and assign it to me for now? thanks!

@d4l3k
Member Author

d4l3k commented Feb 24, 2022

1.10.0
