pytorchjob didn't create worker pod, seems to hang #2327
Comments
Thanks for creating this @Twilighter9527!
Yes, just like the demo. I even set the node label to sidecar.istio.io/inject: "false", and nothing happened.
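For reference, sidecar.istio.io/inject: "false" is normally applied as an annotation on the job's pod template rather than as a node label. A minimal sketch of where it would go, assuming the kubeflow.org/v1 PyTorchJob API (the image and command are illustrative placeholders, not necessarily what the linked example uses):

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # pod-template annotation, not a node label
        spec:
          containers:
            - name: pytorch
              # Illustrative placeholder; use the image from the linked simple.yaml
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              command: ["python3", "/opt/pytorch-mnist/mnist.py", "--epochs=1"]
```

The same annotation would go on the Master replica's pod template as well.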
The error log in
Have you tried the original YAML file without changing anything?
I tried with CHANGING NOTHING; even an image-not-found error would only appear after the pod has been created.
Hi @Twilighter9527, it looks like your cluster doesn't have access to the public DockerHub registry.
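One way to confirm a registry-access problem is to look for pull-related events in the namespace. A sketch, assuming the kubeflow namespace and the pod name from the example job:

```sh
# Look for image-pull failures (ErrImagePull / ImagePullBackOff) in the kubeflow namespace
kubectl -n kubeflow get events --sort-by=.lastTimestamp | grep -iE 'pull|back-off'

# Inspect the master pod's events directly
kubectl -n kubeflow describe pod pytorch-simple-master-0
```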
What happened?
I followed this example to create a PyTorchJob: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml
- pytorchjob yaml
- pod status
- kubectl describe pytorchjob -n kubeflow
- training operator log
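A sketch of how this kind of output can be collected; the job name is taken from the example, and the pod label key may differ by operator version:

```sh
# Job and pod status
kubectl -n kubeflow get pytorchjob pytorch-simple
kubectl -n kubeflow get pods -l training.kubeflow.org/job-name=pytorch-simple

# Detailed job status and events
kubectl -n kubeflow describe pytorchjob pytorch-simple

# Training-operator logs (selector from the Environment section below)
kubectl -n kubeflow logs -l control-plane=kubeflow-training-operator
```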
What did you expect to happen?
This should create pytorch-simple-master-0 and pytorch-simple-worker-0, but pytorch-simple-worker-0 seems to hang. The log shows something wrong with the YAML-to-JSON conversion, but I don't think that is the cause. First, kubectl create -f simple.yaml ran with no errors. Second, I used Python to read the YAML and convert it to JSON, and that worked fine. I had a similar issue a few weeks ago and solved it by setting sidecar.istio=false, but this job is not in the default namespace and I have labeled the node with sidecar.istio=false.
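A sketch of a quick way to check that the manifest itself parses and passes the PyTorchJob CRD schema, independent of the Python round-trip (filename taken from the example):

```sh
# Client-side parse/convert to JSON (catches pure YAML problems)
kubectl create -f simple.yaml --dry-run=client -o json > /dev/null && echo "YAML parses"

# Server-side validation against the PyTorchJob CRD schema, without creating the job
kubectl create -f simple.yaml --dry-run=server
```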
Environment
Kubernetes version:
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.