Can you paste the full YAML of the PyTorchJob and the YAML of the master pod that you said has the wrong value?
First of all, thank you for looking into this. Here are the two pieces of information you asked for. To make them easier to read, I have attached them as a file: file.md
Hi @Jmengfei, this is strange, since the MASTER_ADDR in the master pod should also be injected by the Training Operator controller. Can you try submitting this simple example to see whether the env vars are correct?
What happened?
When I use pytorch-operator for distributed training, my worker node reports the following error:
I checked the worker's environment variables:
This looks correct.
Then I checked the env of the master and found the problem:
What did you expect to happen?
I expected the master's MASTER_ADDR to be the same as the worker's.
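For anyone trying to reproduce the comparison described above, here is a minimal sketch that prints the variables read by `torch.distributed`'s `env://` initialization (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) from inside a pod. The helper names `collect_dist_env` and `same_rendezvous` are hypothetical and only for illustration; they are not part of the Training Operator or PyTorch. Whether the master and worker values are supposed to match is exactly the question in this issue, so the script only reports what it finds.

```python
import os

# Variables read by torch.distributed's env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def collect_dist_env(environ=None):
    """Return the rendezvous variables, with None for any that are unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in REQUIRED_VARS}


def same_rendezvous(env_a, env_b):
    """True when both environments set the same MASTER_ADDR and MASTER_PORT."""
    return all(
        env_a.get(key) is not None and env_a.get(key) == env_b.get(key)
        for key in ("MASTER_ADDR", "MASTER_PORT")
    )


if __name__ == "__main__":
    # Run this inside each pod (hypothetical usage) and compare the output.
    for name, value in collect_dist_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Running it in both the master and a worker pod and diffing the output should show at a glance which variable differs.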
Environment
Kubernetes version:
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:v1-5a5f92d
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.