
Implement PyTorch controller for multi node GPU scaling #155

Open
jaystary opened this issue Feb 10, 2022 · 1 comment


@jaystary

Use Case

We want to use multi-node GPU scaling with PyTorch, both for a benchmark and for potential larger-scale model training.

Ideas of Implementation

Implement the Kubeflow (KF) Training Operator, which as an added benefit should unlock all the other relevant training frameworks as well (see the sketch after the link below).
https://github.com/kubeflow/training-operator
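For context, here is a minimal sketch of the per-replica training entrypoint such a setup would launch. It assumes the operator injects the standard `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` environment variables into each pod (as the Kubeflow PyTorchJob does); the model, batch shapes, and training loop are placeholders, and the `LOCAL_RANK` fallback is an assumption for single-GPU pods.

```python
# Sketch of a distributed entrypoint for each PyTorchJob replica.
# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are injected by the operator.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # "env://" reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    # LOCAL_RANK fallback assumes one GPU per pod (hypothetical default).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):  # placeholder training loop
        inputs = torch.randn(32, 10, device=local_rank)
        targets = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across replicas here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```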


Message from the maintainers:

Excited about this feature? Give it a 👍. We factor engagement into prioritization.

@davidspek
Contributor

The PyTorch operator resource can already be used in the current Kubeflow deployment, so development doesn't need to pause while the deployment is migrated to the training operator.
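To illustrate, a hedged sketch of submitting a `kubeflow.org/v1` PyTorchJob with the official Kubernetes Python client follows; the job name, namespace, image, and replica counts are placeholders.

```python
# Sketch: creating a PyTorchJob custom resource via the Kubernetes Python client.
# Image, namespace, and replica counts are placeholders for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

replica_template = {
    "spec": {
        "containers": [{
            # The container is named "pytorch" so the operator can target it.
            "name": "pytorch",
            "image": "example.com/train:latest",  # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }]
    }
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "multi-node-benchmark", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": replica_template,
            },
            "Worker": {
                "replicas": 3,  # placeholder: 3 workers + 1 master
                "restartPolicy": "OnFailure",
                "template": replica_template,
            },
        }
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", plural="pytorchjobs",
    namespace="default", body=pytorch_job,
)
```

The operator then creates one pod per replica and wires up the `torch.distributed` environment variables the entrypoint above relies on.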
