This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

cannot work in namespace #121

Open
daniel985 opened this issue Jul 2, 2021 · 2 comments

Comments

daniel985 commented Jul 2, 2021

When we submit a job and assign it a namespace, it does not work.
We submit it like this:
"
kubectl create -f xgboost-operator/config/samples/xgboost-dist/xgboostjob_v1_iris_train.yaml -n aisys
"

The error message looks like this:
"
starting the train job
starting to extract system env
extract the Rabit env from cluster : xgboost-dist-iris-test-train-master-0, port: 9991, rank: 0, word_size: 3
start the master node
start listen on 0.0.0.0:9991

RabitTracker Setup Finished
Rabit rank setup with below envs

DMLC_NUM_WORKER=3
DMLC_TRACKER_URI=xgboost-dist-iris-test-train-master-0
DMLC_TRACKER_PORT=9991
DMLC_TASK_ID=0
retry connect to ip(retry time 1): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 2): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 3): [xgboost-dist-iris-test-train-master-0]
retry connect to ip(retry time 4): [xgboost-dist-iris-test-train-master-0]
connect to (failed): [xgboost-dist-iris-test-train-master-0]
Socket Connect Error:Connection refused
"

eunjilee commented Aug 4, 2021

My job doesn't work in a namespace either.

My error message looks like this:

$ kubectl -n ranking logs lightgbm-dist-train-test-ej-master-0 -c xgboostjob

extract cluster info from env variables
master_addr: lightgbm-dist-train-test-ej-master-0
master_port: 9991
worker_addrs: lightgbm-dist-train-test-ej-worker-0,lightgbm-dist-train-test-ej-worker-1
worker_port: 9991
world_size: 3
rank: 0

starting the train job
extra args:
 ['--boosting_type=gbdt', '--objective=binary', '--metric=binary_logloss,auc', '--metric_freq=1', '--is_training_metric=true', '--max_bin=255', '--data=data/binary.train', '--valid_data=data/binary.test', '--num_trees=100', '--learning_rate=01', '--num_leaves=63', '--tree_learner=feature', '--feature_fraction=0.8', '--bagging_freq=5', '--bagging_fraction=0.8', '--min_data_in_leaf=50', '--min_sum_hessian_in_leaf=50', '--is_enable_sparse=true', '--use_two_round_loading=false', '--is_save_binary_file=false']
starting to extract system env
machine list generated in: /tmp/tmpw6aieq34
config generated in: /tmp/tmprwnjp8dw
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Trying to bind port 9991...
[LightGBM] [Info] Binding port 9991 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Fatal] Socket recv error, code: 104
Met Exceptions:
[LightGBM] [Info] Finished linking network in 0.000000 seconds
Socket recv error, code: 104
Finish distributed job

Can I get some advice on this log?
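
One way to read the `Socket recv error, code: 104` line is to treat the number as an operating-system errno value. That is an assumption, since the log itself does not say what the code means. Under that assumption, a minimal Python sketch for decoding it looks like this (`describe_errno` is a hypothetical helper name, not part of LightGBM or the operator):

```python
import errno
import os
from typing import Tuple


def describe_errno(code: int) -> Tuple[str, str]:
    """Map a numeric errno to (symbolic name, human-readable message).

    Assumption: the code printed in the log is an OS errno value.
    The mapping is platform-specific, so run this on the same OS as the pod.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the code from the log line above.
    print(describe_errno(104))
```

On Linux, errno 104 is `ECONNRESET` (connection reset by peer), which would point at one of the other ranks closing its socket, so the worker pods' logs are a reasonable next place to look.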

merlintang (Contributor) commented

This is not related to the operator. The connection failure is related to network isolation in your environment. You can ping the worker in your environment before starting a new job.
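
The pre-flight check suggested here can go one step beyond ping by testing the actual TCP port. The following is a minimal sketch, not something shipped with the operator: `check_endpoint` is a hypothetical helper name, and the host names and port 9991 would be taken from the logs earlier in this thread. It is meant to be run from a pod in the same namespace as the job.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify reachability of host:port.

    Returns one of: "ok", "dns_failure", "refused", "timeout",
    or "error: <details>" for any other socket error.
    """
    # Step 1: does the name resolve at all from this pod?
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: can a TCP connection be opened on the port?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # Packets are probably being dropped on the way.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

For example, `check_endpoint("xgboost-dist-iris-test-train-master-0", 9991)` separates a name that does not resolve (`dns_failure`) from a reachable host with no listener (`refused`) and from silently dropped traffic (`timeout`).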
