
Implement PyTorch controller for multi node GPU scaling #155

Open
jaystary opened this issue Feb 10, 2022 · 1 comment


@jaystary

Use Case

We want to use multi-node GPU scaling with PyTorch, both for a benchmark and for potential larger-scale model training.

Ideas of Implementation

Implement the Kubeflow (KF) Training Operator, which as an added benefit should unlock all the other relevant training frameworks as well (see the sketch after the link below).
https://github.com/kubeflow/training-operator
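For context, here is a minimal sketch of the per-replica training entrypoint such a setup would launch. It assumes the operator injects the standard `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` environment variables into each pod (as the Kubeflow PyTorchJob does); the model, batch shapes, and training loop are placeholders, and the `LOCAL_RANK` fallback is an assumption for single-GPU pods.

```python
# Sketch of a distributed entrypoint for each PyTorchJob replica.
# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are injected by the operator.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # "env://" reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    # LOCAL_RANK fallback assumes one GPU per pod (hypothetical default).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):  # placeholder training loop
        inputs = torch.randn(32, 10, device=local_rank)
        targets = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across replicas here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```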


Message from the maintainers:

Excited about this feature? Give it a 👍. We factor engagement into prioritization.

@davidspek
Contributor

The PyTorch operator resource can already be used in the current Kubeflow deployment, so development doesn't need to pause while the deployment is migrated to the training operator.
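To illustrate, a hedged sketch of submitting a `kubeflow.org/v1` PyTorchJob with the official Kubernetes Python client follows; the job name, namespace, image, and replica counts are placeholders.

```python
# Sketch: creating a PyTorchJob custom resource via the Kubernetes Python client.
# Image, namespace, and replica counts are placeholders for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

replica_template = {
    "spec": {
        "containers": [{
            # The container is named "pytorch" so the operator can target it.
            "name": "pytorch",
            "image": "example.com/train:latest",  # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }]
    }
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "multi-node-benchmark", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": replica_template,
            },
            "Worker": {
                "replicas": 3,  # placeholder: 3 workers + 1 master
                "restartPolicy": "OnFailure",
                "template": replica_template,
            },
        }
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", plural="pytorchjobs",
    namespace="default", body=pytorch_job,
)
```

The operator then creates one pod per replica and wires up the `torch.distributed` environment variables the entrypoint above relies on.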
