How can I change the default MASTER_ADDR in PyTorchJob? #2331

Open
Jmengfei opened this issue Nov 22, 2024 · 3 comments
@Jmengfei

What happened?

When I use pytorch-operator for distributed training, my worker node reports the following error:
[screenshot: worker error message]

I checked the environment variables of the worker pod:
[screenshot: worker pod environment variables]
These look correct.

Then I checked the env of the master pod and found the problem:
[screenshot: master pod environment variables]

What did you expect to happen?

I expect the master's MASTER_ADDR to be the same as the worker's.
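
For reference, here is a minimal sketch (not my actual training script; the init_distributed helper is only for illustration) of how PyTorch's default env:// rendezvous consumes these variables. If MASTER_ADDR differs between the master and the workers, init_process_group cannot rendezvous and fails:

import os
import torch.distributed as dist

def init_distributed():
    # The Training Operator injects these on every replica; if MASTER_ADDR
    # differs between the master and the workers, the rendezvous below
    # cannot complete and init_process_group fails.
    print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
    print("MASTER_PORT =", os.environ.get("MASTER_PORT"))
    print("RANK        =", os.environ.get("RANK"))
    print("WORLD_SIZE  =", os.environ.get("WORLD_SIZE"))

    # The default "env://" init method reads MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE from the environment.
    dist.init_process_group(backend="gloo", init_method="env://")

if __name__ == "__main__":
    init_distributed()
    print("rank", dist.get_rank(), "of", dist.get_world_size(), "initialized")
    dist.destroy_process_group()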

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:v1-5a5f92d

Training Operator Python SDK version:

$ pip show kubeflow-training

The Go module version of the training operator I use is:
github.com/kubeflow/training-operator v1.6.0


@kuizhiqing
Member

Can you paste the full YAML of the PyTorchJob and the pod YAML of the master that you mentioned was wrong?

@Jmengfei
Author

> Can you paste the full YAML of the PyTorchJob and the pod YAML of the master that you mentioned was wrong?

First of all, thank you for your attention to this problem. Below are two pieces of supplementary information. For better readability, I have attached them as a file for you to view.
file.md

@andreyvelich
Member

Hi @Jmengfei, that is strange, since the MASTER_ADDR in the master pod should also be injected by the Training Operator controller. Can you try to submit this simple example to see whether the env vars are set correctly?

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
