Proposal: add priority and queue in scheduling for the common operator #46

Open
YesterdayxD opened this issue Aug 26, 2019 · 14 comments

@YesterdayxD

YesterdayxD commented Aug 26, 2019

Problem

1. Currently, kube-batch's PodGroupSpec carries scheduling-policy fields such as MinAvailable, Queue, and PriorityClassName, but the Kubeflow operators do not expose these parameters to kube-batch today.

2. mpi-operator and tf-operator do not use the common operator, while pytorch-operator and mxnet-operator use the tf-operator/pkg/common package.

Proposed Solution

1. Add these attributes to RunPolicy.SchedulingPolicy so that, when Kubeflow is used together with kube-batch, the operators can pass the parameters through to kube-batch (see the illustrative sketch after item 2 below).

// SchedulingPolicy encapsulates various scheduling policies of the distributed training
// job, for example `minAvailable` for gang-scheduling.
type SchedulingPolicy struct {
    // MinAvailable is the minimum number of pods required for gang-scheduling.
    MinAvailable *int32 `json:"minAvailable,omitempty"`

    // PriorityClassName is the name of a PriorityClass resource (kubectl get priorityclass).
    PriorityClassName *string `json:"priorityClassName,omitempty"`

    // Queue is the kube-batch queue the job is submitted to.
    Queue *string `json:"queue,omitempty"`
}

2. Make all operators use the common operator, since tf, pytorch, and mxnet are similar. The bad news is that mpi may need more changes.
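
Below is a minimal, illustrative sketch (not part of the proposal itself) of how an operator could translate the proposed SchedulingPolicy into the pod group it creates for kube-batch. The podGroupSpec type is a simplified local stand-in for kube-batch's PodGroupSpec; its field names and the default queue name are assumptions for illustration only.

package main

import "fmt"

// SchedulingPolicy mirrors the struct proposed above.
type SchedulingPolicy struct {
    MinAvailable      *int32
    PriorityClassName *string
    Queue             *string
}

// podGroupSpec is a hypothetical stand-in for kube-batch's PodGroupSpec.
type podGroupSpec struct {
    MinMember         int32
    PriorityClassName string
    Queue             string
}

// applySchedulingPolicy fills a pod group spec from the job's scheduling policy,
// falling back to the total replica count and a default queue when a field is unset.
func applySchedulingPolicy(sp *SchedulingPolicy, replicas int32) podGroupSpec {
    pg := podGroupSpec{MinMember: replicas, Queue: "default"}
    if sp == nil {
        return pg
    }
    if sp.MinAvailable != nil {
        pg.MinMember = *sp.MinAvailable
    }
    if sp.PriorityClassName != nil {
        pg.PriorityClassName = *sp.PriorityClassName
    }
    if sp.Queue != nil {
        pg.Queue = *sp.Queue
    }
    return pg
}

func main() {
    min, pc, q := int32(2), "high", "default"
    fmt.Printf("%+v\n", applySchedulingPolicy(&SchedulingPolicy{MinAvailable: &min, PriorityClassName: &pc, Queue: &q}, 3))
}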

Advantages

Unify all operators with respect to runPolicy and the packages they import.

Frameworks Support

pytorch

mxnet

mpi

tensorflow

Rough API Spec (pytorch-operator)

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  priorityClassName: high
  queue: default
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: gcr.io/<your_project>/pytorch_dist_mnist:latest
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1

@YesterdayxD YesterdayxD changed the title add priority in scheduling for the common operator Proposal:add priority in scheduling for the common operator Aug 27, 2019
@YesterdayxD YesterdayxD changed the title Proposal:add priority in scheduling for the common operator Proposal:add priority and queue in scheduling for the common operator Aug 27, 2019
@YesterdayxD
Author

/cc @gaocegege
/cc @k82cn

@gaocegege
Member

gaocegege commented Aug 28, 2019

/cc @richardsliu @johnugeorge @hougangliu

Thanks for the proposal!

@k82cn
Contributor

k82cn commented Aug 30, 2019

Are we going to inline SchedulingPolicy in PyTorchJobSpec? What's the suggestion for the other operators?

@davidstack

Maybe all operators should add SchedulingPolicy, so we can add SchedulingPolicy to this common package.

@gaocegege
Member

Yeah, we should add SchedulingPolicy to common. But pytorch-operator and tf-operator do not use common now. We should re-implement the logic in these operators, too.

@4everming

Yes, we should implement the logic in MXNet-Operator too.

@gaocegege
Member

@johnugeorge @richardsliu

Do you have any suggestions?

@YesterdayxD
Author

Retire wg-machine-learning? That would be too bad.

@k82cn
Contributor

k82cn commented Oct 21, 2019

Retire wg-machine-learning? That would be too bad.

Nope, not for now. I'll help maintain ML-WG for a while; if there are still no work items, we'll retire it :)

@gaocegege
Member

Hi, is there any update?

@k82cn
Contributor

k82cn commented Apr 15, 2020

Hmm... are we going to do this feature?

@terrytangyuan
Member

Yes, it's part of our roadmap, so contributions are welcome.

@kerthcet

I think we can close this; refer to:

type SchedulingPolicy struct {
    MinAvailable  *int32           `json:"minAvailable,omitempty"`
    Queue         string           `json:"queue,omitempty"`
    MinResources  *v1.ResourceList `json:"minResources,omitempty"`
    PriorityClass string           `json:"priorityClass,omitempty"`
}
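
For context, here is a minimal usage sketch of the struct quoted above. The SchedulingPolicy declaration is a local stand-in mirroring the kubeflow/common definition, and the concrete values ("default", "high", the resource amounts) are placeholders, not anything prescribed in this issue.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// SchedulingPolicy is a local stand-in mirroring the struct quoted above.
type SchedulingPolicy struct {
    MinAvailable  *int32           `json:"minAvailable,omitempty"`
    Queue         string           `json:"queue,omitempty"`
    MinResources  *v1.ResourceList `json:"minResources,omitempty"`
    PriorityClass string           `json:"priorityClass,omitempty"`
}

func main() {
    min := int32(2)
    // Placeholder resource requirements for the whole gang.
    res := v1.ResourceList{
        v1.ResourceCPU:    resource.MustParse("4"),
        v1.ResourceMemory: resource.MustParse("8Gi"),
    }
    sp := SchedulingPolicy{
        MinAvailable:  &min,
        Queue:         "default",
        MinResources:  &res,
        PriorityClass: "high",
    }
    fmt.Printf("%+v\n", sp)
}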
