questions about applying for nodes and gpus #558

ThomaswellY · 2023-05-24T02:39:48Z

Hi, i have been using mpi-operator to achieve distributed training recently。
the most command i used is “kubectl apply -f yaml”. Let me take the mpi-operator yaml for example
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
name: cifar
spec:
slotsPerWorker: 1
runPolicy:
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
restartPolicy: Never
template:
spec:
nodeName:
containers:
- image: 10.252.39.13:5000/deepspeed_ms:v2
name: mpijob-cifar-deepspeed-container
imagePullPolicy: Always
command:
- mpirun
- --allow-run-as-root
- python
- cifar/cifar10_deepspeed.py
- --epochs=100
- --deepspeed_mpi
- --deepspeed
- --deepspeed_config
- cifar/ds_config.json
env:
- name: OMP_NUM_THREADS
value: "1"
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
nodeName:
containers:
- image: 10.252.39.13:5000/deepspeed_ms:v2
name: deepspeed-mpijob-container
resources:
limits:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 2
there are some questions i'm confused about:

the content about applying for gpu-resources seems in "Worker". Does the cifar-worker-0 and cifar-worker-1 pods are separatly applying for an node(in k8s cluster) with 2 gpu? Then what role does "slotsPerWorker" play?
I have excuted the "apply -f yaml" on the example yaml, with different replicas like "replicas: 1" ,"replicas: 4", and the resources limits was fixed at "nvidia.com/gpu: 1". I found interesting results :
*When replicas is set to large numerber, It takes a bit more time for the cifar-launcher pod to complete.
*the logs printed in cifar-launcher pod (when replicas: 4) were just like the result ( when replicas: 1) repeated 4 times.
so does these mean, the four pods have separately applyed for one gpu (from node in k8s cluster, and preferentially from the same node if gpus are enough), and printed out the average result. the whole process had nothing to do with distribution？
*by the way, when setting "repicas: 3" , there is error reported in my case:
train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 !=21 * 1 * 3
this did confuse me.
If i have node-A with 1 gpus and node-B with 3 gpus, and wanna apply for 4 gpus, then how should i modify the "Worker" part?
Thank in advance for your apply~

tenzen-y · 2023-05-25T19:34:43Z

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

alculquicondor · 2023-05-25T19:42:52Z

or you can consider upgrading to the v2beta API :)

To answer some of your questions:
Ideally the number of workers should match the number of nodes you want to run on. The slotsPerWorker filed denotes how many tasks will be run in each worker. Generally, this should match the number of GPUs you have per worker.
You don't need to set the OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

If i have node-A with 1 gpus and node-B with 3 gpus, and wanna apply for 4 gpus, then how should i modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.

ThomaswellY · 2023-05-26T00:52:08Z

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

Thanks for your reply~
the api-resources of my k8s clusters in shown below:
(base) [root@gpu-233 operator]# kubectl api-resources | grep jobs
cronjobs cj batch/v1 true CronJob
jobs batch/v1 true Job
mpijobs kubeflow.org/v1 true MPIJob
mxjobs kubeflow.org/v1 true MXJob
pytorchjobs kubeflow.org/v1 true PyTorchJob
tfjobs kubeflow.org/v1 true TFJob
xgboostjobs kubeflow.org/v1 true XGBoostJob
doesn't that indicates that, in my k8s cluster env, mpijobs is supported by kubeflow.org/v1 API ?
I have applied configs of the example yaml with kubeflow.org/v1 API successfully, and have seen no siginificant errors in pod logs.
@tenzen-y

ThomaswellY · 2023-05-26T01:32:14Z

Thanks for your reply~
I am a little confused about which type of API can support my resource (mpijob in my case).
The command "kubectl api-resources" shows mpijobs in my k8s cluster is supported by kubeflow.org/v1,
if not, what is the suitable way to confirm which API can support my mpijobs-resource？ any official docs would be helpful~

or you can consider upgrading to the v2beta API :)

To answer some of your questions: Ideally the number of workers should match the number of nodes you want to run on. The slotsPerWorker filed denotes how many tasks will be run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set the OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

If i have node-A with 1 gpus and node-B with 3 gpus, and wanna apply for 4 gpus, then how should i modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.

I have applied the example yaml in this way successfully, but it seems that 4 gpus are separately be used by 4 pods, and what each worker excuted is a single-gpu training. So it's not distributed training( in this case, i means multi-node with singe-gpu per node training), and whole process takes more time than single-gpu training in one pod which set "replicas=1". What confused me is that, the value of "replicas" seems to only serve as a multiplier for "nvidia.com/gpu".
in general, there are some things i wanna confirm:

How to confirm which API can support the mpi-operator, if "kubectl api-resources" did not work, then which command should be submitted?
when resource limit sets gpu number to 1 ( because one node of k8s cluster has only one gpu available in this case), then distributed training can not be launched, even multi-pod can separately executes single-gpu training when set replicas>1, it's in fact a repetitive behavior of single-training.
If i have node-1 with 2 gpus and node-2 with 4 gpus, the most effective distributed training that mpi-operator can launcher is about 2 nodes with 2 gpus per node, and the ideal config is that setting "slotsPerWorker: 2","replicas: 2",and "nvidia.com/gpu: 2".
The questions are a little too many, I am sorry if that troubles you.
Thanks you in advance~
@alculquicondor

alculquicondor · 2023-05-26T12:34:16Z

doesn't that indicates that, in my k8s cluster env, mpijobs is supported by kubeflow.org/v1 API ?

That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo.
If you wish to use the newer v2beta1 version, you have to disable training-operator and install the operator in this repo https://github.com/kubeflow/mpi-operator#installation

The rest of the questions:

the command did work, you are running v1.
It sounds like a problem in your application, not mpi-operator. Did you miss any parameters in your command? I'm not familiar with deepspeed.
yes

tenzen-y · 2023-05-26T17:41:49Z

@ThomaswellY
Thanks @alculquicondor.
Yes, I meant this repo doesn't support kubeflow.org/v1, and this repo supports only kubeflow.org/v2beta1.
Currently, the kubeflow.org/v1 is supported in https://github.com/kubeflow/training-operator.

Also, I would suggest v2beta1 MPIJob for the deepspeed since kubeflow/training-operator#1792 (comment).

alculquicondor · 2023-05-26T17:55:31Z

Also it seems that #549 has proof that v2beta1 can run deepspeed

ThomaswellY · 2023-05-29T00:54:58Z

@alculquicondor @tenzen-y thanks for your kind help! maybe i should use v2beta1 for deepspeed.
Anyway, I have executed #549 successfully even in v1, however it seems only cifar10_deepspeed.py needs no modifications, as for gan_deepspeed_train.py, the extra modification is necessary (like args.local_rank = int(os.environ['LOCAL_RANK'])).
So #549 is only an example for applying mpi-operator with deepspeed, maybe we can do more for normally applying other script with deepspeed.

tenzen-y · 2023-05-29T04:19:49Z

@ThomaswellY Thank you for the report!

So #549 is only an example for applying mpi-operator with deepspeed, maybe we can do more for normally applying other script with deepspeed.

Feel free to open PRs. I'm happy to review them :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

questions about applying for nodes and gpus #558

questions about applying for nodes and gpus #558

ThomaswellY commented May 24, 2023 •

edited

Loading

tenzen-y commented May 25, 2023

alculquicondor commented May 25, 2023

ThomaswellY commented May 26, 2023 •

edited

Loading

ThomaswellY commented May 26, 2023 •

edited

Loading

alculquicondor commented May 26, 2023

tenzen-y commented May 26, 2023

alculquicondor commented May 26, 2023

ThomaswellY commented May 29, 2023

tenzen-y commented May 29, 2023

questions about applying for nodes and gpus #558

questions about applying for nodes and gpus #558

Comments

ThomaswellY commented May 24, 2023 • edited Loading

tenzen-y commented May 25, 2023

alculquicondor commented May 25, 2023

ThomaswellY commented May 26, 2023 • edited Loading

ThomaswellY commented May 26, 2023 • edited Loading

alculquicondor commented May 26, 2023

tenzen-y commented May 26, 2023

alculquicondor commented May 26, 2023

ThomaswellY commented May 29, 2023

tenzen-y commented May 29, 2023

ThomaswellY commented May 24, 2023 •

edited

Loading

ThomaswellY commented May 26, 2023 •

edited

Loading

ThomaswellY commented May 26, 2023 •

edited

Loading