Add kfp pipeline for running a pytorch job #14

Draft · wants to merge 2 commits into base: main
Conversation

Shreyanand (Member)

This PR adds the kfp pipeline for launching the InstructLab train PyTorch job. It is based on the kfp PyTorch launcher component; however, that component is three years old, so I had to make some changes for it to work.

In its current state the pipeline works: the kfp workflow triggers a PyTorch job. However, it is not able to run multi-node training yet. The fixes found in #10 should make this run completely.

One caveat with the current launcher is that it does not have a way to run the PyTorch job with just the master node; it has to have a worker node. That is not a limitation of a vanilla PyTorch job.
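For context, here is a minimal sketch (as a Python dict, with a hypothetical job name) of a vanilla PyTorchJob spec that the training operator accepts with only a Master replica; the launcher component used here additionally insists on a Worker section:

pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "ilab-train"},  # hypothetical name
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "pytorch",
                                "image": "quay.io/michaelclifford/test-train:0.0.11",
                            }
                        ]
                    }
                },
            }
            # No "Worker" section is required for a single-node run.
        }
    },
}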

Putting it in draft for now, but it can be tested if you are curious.

@cooktheryan (Collaborator)

@tumido do you want to review this in its current state? And @MichaelClifford, maybe on Monday we pair this with Tom's PR and start generating the MVP run.

@tumido (Member) left a comment

Can we please remove the create_worker_spec component? It seems to be unnecessary here.

Comment on lines 55 to 57
worker_spec_output = namedtuple(
    "MyWorkerOutput", ["worker_spec"]
)

Why do you want to output a named tuple here? This is useful only if you output multiple params. I don't think it's needed here at all.

https://www.kubeflow.org/docs/components/pipelines/user-guides/data-handling/parameters/#output-parameters
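A minimal sketch of the alternative, assuming the KFP v2 SDK: a component that produces a single output can declare the dict as its return type directly, with no NamedTuple needed (the component name and body here are illustrative only):

from kfp import dsl

@dsl.component(base_image="python:3.11")
def create_worker_spec(worker_num: int = 0) -> dict:
    # Single output parameter: just return the dict.
    worker = {}
    if worker_num > 0:
        worker = {"replicas": worker_num}
    return worker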

Comment on lines 12 to 13
worker = {}
if worker_num > 0:

The whole thing can be rewritten as:

if worker_num <= 0:
    return {}

return {...}  # the populated worker spec

Or, even better, not a component at all. After all, it is a single if statement plus setting a single value in a dict, so it doesn't have to be a component. Remember that for each component we create, we start a container; this only slows the workflow down, especially when it's just simple data formatting.
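A minimal sketch of what that could look like, assuming worker_num is known when the pipeline is built (a plain Python value rather than a runtime pipeline parameter), so the spec can be assembled by an ordinary helper function instead of a containerized component; the field names are illustrative:

def build_worker_spec(worker_num: int) -> dict:
    # Plain Python helper; no container is started for this.
    if worker_num <= 0:
        return {}
    return {
        "replicas": worker_num,
        "restartPolicy": "OnFailure",
    }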

Comment on lines +97 to +111
"image": "quay.io/michaelclifford/test-train:0.0.11",
"name": "pytorch",
"resources": {
"requests": {
"memory": "8Gi",
"cpu": "2000m",
# Uncomment for GPU
"nvidia.com/gpu": 1,
},
"limits": {
"memory": "8Gi",
"cpu": "2000m",
# Uncomment for GPU
"nvidia.com/gpu": 1,
},

Should we parametrize this?
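One possible shape for that, sketched as a plain helper with hypothetical names; the defaults mirror the values currently hard-coded in the PR:

def container_resources(memory: str = "8Gi", cpu: str = "2000m", gpus: int = 1) -> dict:
    # Build the requests/limits block from parameters instead of literals.
    resources = {
        "requests": {"memory": memory, "cpu": cpu},
        "limits": {"memory": memory, "cpu": cpu},
    }
    if gpus > 0:
        resources["requests"]["nvidia.com/gpu"] = gpus
        resources["limits"]["nvidia.com/gpu"] = gpus
    return resources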

@tumido tumido self-requested a review September 9, 2024 12:07
@tumido (Member) left a comment

Wrong button 😄 I meant to request changes... 😄



if __name__ == "__main__":
    import kfp.compiler as compiler

I don't think you need to nest the import here. This goes against https://pylint.readthedocs.io/en/latest/user_guide/messages/convention/import-outside-toplevel.html
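A minimal sketch of the suggested layout: hoist the import to module level next to the other imports and keep only the compile call under the main guard (ilab_train is the pipeline function defined in this file):

from kfp import compiler

if __name__ == "__main__":
    pipeline_file = "pipeline.yaml"
    print(f"Compiling pipeline as {pipeline_file}")
    compiler.Compiler().compile(ilab_train, pipeline_file)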

Comment on lines 142 to 148
pipeline_file = "pipeline.yaml"
print(
    f"Compiling pipeline as {pipeline_file}"
)
compiler.Compiler().compile(
    ilab_train, pipeline_file
)

This is somewhat weirdly formatted. I think the simpler:

Suggested change
-pipeline_file = "pipeline.yaml"
-print(
-    f"Compiling pipeline as {pipeline_file}"
-)
-compiler.Compiler().compile(
-    ilab_train, pipeline_file
-)
+pipeline_file = "pipeline.yaml"
+print(f"Compiling pipeline as {pipeline_file}")
+compiler.Compiler().compile(ilab_train, pipeline_file)

would be fully compliant with PEP 8.

@leseb (Collaborator) commented Oct 3, 2024

Please install ruff (https://docs.astral.sh/ruff/installation/) on your system and run ruff format . once #57 is merged. Then rebase your PR and everything should go fine :).
