This repository has been archived by the owner on Feb 1, 2022. It is now read-only.
When we submit a job and assign it a namespace, the job does not work.
We submit it like this:
"
kubectl create -f xgboost-operator/config/samples/xgboost-dist/xgboostjob_v1_iris_train.yaml -n aisys
"
and the error message looks like this:
"
starting the train job
starting to extract system env
extract the Rabit env from cluster : xgboost-dist-iris-test-train-master-0, port: 9991, rank: 0, word_size: 3
start the master node
start listen on 0.0.0.0:9991
RabitTracker Setup Finished
Rabit rank setup with below envs
DMLC_NUM_WORKER=3
DMLC_TRACKER_URI=xgboost-dist-iris-test-train-master-0
DMLC_TRACKER_PORT=9991
DMLC_TASK_ID=0
retry connect to ip(retry time 1): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 2): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 3): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 4): [xgboost-dist-iris-test-train-master-0]
connect to (failed): [xgboost-dist-iris-test-train-master-0]
Socket Connect Error:Connection refused
"
This is not related to the operator. The connection failure is caused by network isolation in your environment. You can ping the worker in your environment before starting a new job.
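Besides ping, a TCP-level check can show whether the tracker port itself is reachable. The following is a minimal sketch, not part of the operator: the helper name `can_connect` is hypothetical, and the host and port in the usage comment are taken from the log above. It assumes the check is run from a place that is expected to reach the master, such as a worker pod in the same namespace.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. DNS failures, refused
    connections, and timeouts are all reported as False.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror and socket.timeout are subclasses of OSError.
        return False


# Example usage (host and port copied from the log in this issue):
#   ok = can_connect("xgboost-dist-iris-test-train-master-0", 9991)
#   print("tracker reachable" if ok else "tracker not reachable")
```

If this returns False from a worker pod while the master log shows `start listen on 0.0.0.0:9991`, the problem is more likely in name resolution or network policy between the pods than in the training code.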