Proposal: add priority and queue in scheduling for the common operator #46

Open
YesterdayxD opened this issue Aug 26, 2019 · 14 comments

@YesterdayxD

YesterdayxD commented Aug 26, 2019

Problem

1. Currently, kube-batch's PodGroupSpec carries scheduling-policy fields such as MinAvailable, Queue, and PriorityClassName, but the Kubeflow operators do not expose these parameters to kube-batch today.

2. mpi-operator and tf-operator do not use the common operator, while pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.

Proposed Solution

1. Add these attributes to RunPolicy.SchedulingPolicy so that, when Kubeflow is used together with kube-batch, the operators can pass the parameters through to kube-batch (see the illustrative sketch after item 2 below).

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
    // MinAvailable is the minimum number of pods required for gang-scheduling.
    MinAvailable *int32 `json:"minAvailable,omitempty"`

    // PriorityClassName is the name of a PriorityClass resource (kubectl get priorityclass).
    PriorityClassName *string `json:"priorityClassName,omitempty"`

    // Queue is the kube-batch queue the job is submitted to.
    Queue *string `json:"queue,omitempty"`
}

2. Make all operators use the common operator, since tf, pytorch, and mxnet are similar. The bad news is that mpi may need more changes.
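
Below is a minimal, illustrative sketch (not part of the proposal itself) of how an operator could translate the proposed SchedulingPolicy into the pod group it creates for kube-batch. The podGroupSpec type is a simplified local stand-in for kube-batch's PodGroupSpec; its field names and the default queue name are assumptions for illustration only.

package main

import "fmt"

// SchedulingPolicy mirrors the struct proposed above.
type SchedulingPolicy struct {
    MinAvailable      *int32
    PriorityClassName *string
    Queue             *string
}

// podGroupSpec is a hypothetical stand-in for kube-batch's PodGroupSpec.
type podGroupSpec struct {
    MinMember         int32
    PriorityClassName string
    Queue             string
}

// applySchedulingPolicy fills a pod group spec from the job's scheduling policy,
// falling back to the total replica count and a default queue when a field is unset.
func applySchedulingPolicy(sp *SchedulingPolicy, replicas int32) podGroupSpec {
    pg := podGroupSpec{MinMember: replicas, Queue: "default"}
    if sp == nil {
        return pg
    }
    if sp.MinAvailable != nil {
        pg.MinMember = *sp.MinAvailable
    }
    if sp.PriorityClassName != nil {
        pg.PriorityClassName = *sp.PriorityClassName
    }
    if sp.Queue != nil {
        pg.Queue = *sp.Queue
    }
    return pg
}

func main() {
    min, pc, q := int32(2), "high", "default"
    fmt.Printf("%+v\n", applySchedulingPolicy(&SchedulingPolicy{MinAvailable: &min, PriorityClassName: &pc, Queue: &q}, 3))
}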

Advantages

Unify all operators with respect to runPolicy and the packages they import.

Frameworks Support

pytorch

mxnet

mpi

tensorflow

Rough API Spec (pytorch-operator)

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1

@YesterdayxD YesterdayxD changed the title add priority in scheduling for the common operator Proposal:add priority in scheduling for the common operator Aug 27, 2019
@YesterdayxD YesterdayxD changed the title Proposal:add priority in scheduling for the common operator Proposal:add priority and queue in scheduling for the common operator Aug 27, 2019
@YesterdayxD
Author

/cc @gaocegege
/cc @k82cn

@gaocegege
Member

gaocegege commented Aug 28, 2019

/cc @richardsliu @johnugeorge @hougangliu

Thanks for the proposal!

@k82cn
Contributor

k82cn commented Aug 30, 2019

Are we going to inline SchedulingPolicy in PyTorchJobSpec? What's the suggestion for the other operators?

@davidstack

Maybe all operators should add SchedulingPolicy, so we can add SchedulingPolicy to this common package.

@gaocegege
Member

Yeah, we should add SchedulingPolicy to common. But pytorch-operator and tf-operator do not use common now. We should re-implement the logic in these operators, too.

@4everming

Yes, we should implement the logic in MXNet-Operator too.

@gaocegege
Member

@johnugeorge @richardsliu

Do you have any suggestions?

@YesterdayxD
Author

Retire wg-machine-learning? That would be too bad.

@k82cn
Contributor

k82cn commented Oct 21, 2019

Retire wg-machine-learning? That would be too bad.

Nope, not for now. I'll help maintain ML-WG for a while; if there are still no work items, we'll retire it :)

@gaocegege
Member

Hi, is there any update?

@k82cn
Contributor

k82cn commented Apr 15, 2020

Hmm... are we going to do this feature?

@terrytangyuan
Member

Yes, it's part of our roadmap, so contributions are welcome.

@kerthcet

I think we can close this; refer to:

type SchedulingPolicy struct {
    MinAvailable  *int32           `json:"minAvailable,omitempty"`
    Queue         string           `json:"queue,omitempty"`
    MinResources  *v1.ResourceList `json:"minResources,omitempty"`
    PriorityClass string           `json:"priorityClass,omitempty"`
}
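
For context, here is a minimal usage sketch of the struct quoted above. The SchedulingPolicy declaration is a local stand-in mirroring the kubeflow/common definition, and the concrete values ("default", "high", the resource amounts) are placeholders, not anything prescribed in this issue.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// SchedulingPolicy is a local stand-in mirroring the struct quoted above.
type SchedulingPolicy struct {
    MinAvailable  *int32           `json:"minAvailable,omitempty"`
    Queue         string           `json:"queue,omitempty"`
    MinResources  *v1.ResourceList `json:"minResources,omitempty"`
    PriorityClass string           `json:"priorityClass,omitempty"`
}

func main() {
    min := int32(2)
    // Placeholder resource requirements for the whole gang.
    res := v1.ResourceList{
        v1.ResourceCPU:    resource.MustParse("4"),
        v1.ResourceMemory: resource.MustParse("8Gi"),
    }
    sp := SchedulingPolicy{
        MinAvailable:  &min,
        Queue:         "default",
        MinResources:  &res,
        PriorityClass: "high",
    }
    fmt.Printf("%+v\n", sp)
}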
