(base) asaha-mbp151:maven asaha$ kubectl logs -f xgboost-asaha-rfw4as3le3u-master-0 -n asaha
starting the train job
starting to extract system env
extract the Rabit env from cluster : xgboost-asaha-rfw4as3le3u-master-0, port: 9991, rank: 0, world_size: 3
start the master node
start listen on 0.0.0.0:9991
###### RabitTracker Setup Finished ######
##### Rabit rank setup with below envs #####
DMLC_NUM_WORKER=3
DMLC_TRACKER_URI=xgboost-asaha-rfw4as3le3u-master-0
DMLC_TRACKER_PORT=9991
DMLC_TASK_ID=0
worker(ip_address=10.46.85.245) connected!
worker(ip_address=10.46.95.126) connected!
##### Rabit rank = 1
@tracker All of 3 nodes getting started
worker(ip_address=10.44.239.26) connected!
Read data from IRIS data source with range from 50 to 100
starting to train xgboost at node with rank 1
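For context, here is a rough sketch of what the "extract the Rabit env from cluster" and "Read data from IRIS data source" steps in the log above correspond to. It is not the sample's actual implementation: the environment variable names (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) and the helper names are assumptions for illustration only.

```python
import os


def extract_cluster_env():
    """Illustrative sketch of the 'extract the Rabit env from cluster' step:
    read the tracker address, port, rank and world size from the pod's
    environment. The variable names here are assumptions, not necessarily
    what extract_xgbooost_cluster_env() in the sample reads."""
    addr = os.environ.get("MASTER_ADDR", "localhost")    # e.g. xgboost-asaha-rfw4as3le3u-master-0
    port = int(os.environ.get("MASTER_PORT", "9991"))
    rank = int(os.environ.get("RANK", "0"))              # exported to Rabit as DMLC_TASK_ID
    world_size = int(os.environ.get("WORLD_SIZE", "1"))  # exported to Rabit as DMLC_NUM_WORKER
    return addr, port, rank, world_size


def iris_range_for(rank, world_size, n_rows=150):
    """Split the 150-row Iris dataset into equal contiguous slices per rank;
    with world_size=3, rank 1 gets rows 50-100, matching the log line above."""
    chunk = n_rows // world_size
    start = rank * chunk
    return start, start + chunk


if __name__ == "__main__":
    addr, port, rank, world_size = extract_cluster_env()
    start, end = iris_range_for(rank, world_size)
    print(f"tracker={addr}:{port} rank={rank}/{world_size} -> iris rows [{start}, {end})")
```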
On Wed, Mar 17, 2021 at 4:43 AM Anindya Saha wrote:
Exactly. I also see the same problem today.
@terrytangyuan <https://github.com/terrytangyuan> if you could please comment.
I ran distributed training on k8s. The `rank` number was obtained by `extract_xgbooost_cluster_env()` as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L29. However, `xgb.rabit.get_rank()` returned another `rank` number, as in https://github.com/kubeflow/xgboost-operator/blob/master/config/samples/xgboost-dist/train.py#L57. There are two things confusing me:

1. Since `extract_xgbooost_cluster_env()` had already obtained the `rank` number, why call `xgb.rabit.get_rank()` to get the `rank` number again?
2. Why are the two `rank` numbers different?
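To see the two values side by side, here is a minimal, hedged sketch (not the sample's train.py): it assumes an XGBoost version that still ships the `xgb.rabit` module, the DMLC_* variables being set in the pod's environment as shown in the logs above, and a tracker already listening at DMLC_TRACKER_URI:DMLC_TRACKER_PORT.

```python
import os

import xgboost as xgb  # xgb.rabit is available in the older XGBoost versions used by this sample


def compare_ranks():
    """Print the rank taken from the operator-provided environment next to the
    rank Rabit reports after init(); the logs above show they can disagree
    (DMLC_TASK_ID=0 on the master pod, yet 'Rabit rank = 1')."""
    env_rank = int(os.environ.get("DMLC_TASK_ID", "0"))
    rabit_env = [
        # Rabit expects the arguments as a list of KEY=VALUE byte strings.
        f"DMLC_NUM_WORKER={os.environ.get('DMLC_NUM_WORKER', '1')}".encode(),
        f"DMLC_TRACKER_URI={os.environ.get('DMLC_TRACKER_URI', '127.0.0.1')}".encode(),
        f"DMLC_TRACKER_PORT={os.environ.get('DMLC_TRACKER_PORT', '9991')}".encode(),
        f"DMLC_TASK_ID={env_rank}".encode(),
    ]
    xgb.rabit.init(rabit_env)  # connects this process to the tracker on the master pod
    try:
        print(f"rank from env (DMLC_TASK_ID): {env_rank}")
        print(f"rank from xgb.rabit.get_rank(): {xgb.rabit.get_rank()}")
    finally:
        xgb.rabit.finalize()


if __name__ == "__main__":
    compare_ranks()
```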