Add kfp pipeline for running a pytorch job #14
base: main
Conversation
Signed-off-by: Shreyanand <[email protected]>
@tumido do you want to review this in the current state? @MichaelClifford maybe Monday we pair this with Tom's PR and start generating the MVP run.
Can we please remove the create_worker_spec component? It seems to be unnecessary here.
training/pipeline.py
Outdated
    worker_spec_output = namedtuple(
        "MyWorkerOutput", ["worker_spec"]
    )
Why do you want to output a named tuple here? This is useful only if you output multiple params. I don't think it's needed here at all.
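For context, a minimal sketch of the distinction being made here (plain Python in the KFP v1 lightweight-component style; the bodies are illustrative, not the PR's code):

    from collections import namedtuple
    from typing import NamedTuple

    # Single output: a plain return annotation is enough.
    def create_worker_spec(worker_num: int = 0) -> dict:
        return {"replicas": worker_num}

    # A NamedTuple return type only pays off with several outputs:
    def create_specs(worker_num: int = 0) -> NamedTuple(
        "Outputs", [("worker_spec", dict), ("worker_count", int)]
    ):
        Outputs = namedtuple("Outputs", ["worker_spec", "worker_count"])
        return Outputs({"replicas": worker_num}, worker_num)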
training/pipeline.py
Outdated
    worker = {}
    if worker_num > 0:
The whole thing can be rewritten as:

    if worker_num <= 0:
        return {}
    return {}
or even better, not a component at all. After all, it is a single if statement plus setting a single value in a dict. This doesn't have to be a component at all. Remember that each component we create starts a container; this only slows down the workflow, especially in cases where it's simple data formatting.
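A hedged sketch of that suggestion, assuming worker_num is an ordinary Python value at pipeline-definition time rather than a runtime PipelineParam (the field names follow the PyTorchJob replica-spec shape in this diff; the helper name is made up):

    def worker_spec(worker_num: int, image: str) -> dict:
        # Plain helper evaluated while the pipeline is compiled,
        # so no extra container is started at run time.
        if worker_num <= 0:
            return {}
        return {
            "replicas": worker_num,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{"name": "pytorch", "image": image}],
                }
            },
        }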
"image": "quay.io/michaelclifford/test-train:0.0.11", | ||
"name": "pytorch", | ||
"resources": { | ||
"requests": { | ||
"memory": "8Gi", | ||
"cpu": "2000m", | ||
# Uncomment for GPU | ||
"nvidia.com/gpu": 1, | ||
}, | ||
"limits": { | ||
"memory": "8Gi", | ||
"cpu": "2000m", | ||
# Uncomment for GPU | ||
"nvidia.com/gpu": 1, | ||
}, |
Should we parametrize this?
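One hedged way to do that, as a plain helper with assumed parameter names (not the PR's code):

    def container_resources(cpu: str = "2000m", memory: str = "8Gi",
                            gpus: int = 0) -> dict:
        # Build requests/limits from pipeline arguments instead of
        # hard-coding values and commented-out GPU lines.
        resources = {
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        }
        if gpus > 0:
            resources["requests"]["nvidia.com/gpu"] = gpus
            resources["limits"]["nvidia.com/gpu"] = gpus
        return resources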
Wrong button 😄 I meant to request changes... 😄
training/pipeline.py
Outdated
    if __name__ == "__main__":
        import kfp.compiler as compiler
I don't think you need to nest the import here. This is against https://pylint.readthedocs.io/en/latest/user_guide/messages/convention/import-outside-toplevel.html
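That is, something like this sketch (ilab_train is the pipeline function defined in this PR's pipeline.py):

    # top of training/pipeline.py
    import kfp.compiler as compiler

    # ... pipeline definition (ilab_train) ...

    if __name__ == "__main__":
        compiler.Compiler().compile(ilab_train, "pipeline.yaml")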
training/pipeline.py
Outdated
    pipeline_file = "pipeline.yaml"
    print(
        f"Compiling pipeline as {pipeline_file}"
    )
    compiler.Compiler().compile(
        ilab_train, pipeline_file
    )
This is somewhat weirdly formatted. I think a simple:

    pipeline_file = "pipeline.yaml"
    print(f"Compiling pipeline as {pipeline_file}")
    compiler.Compiler().compile(ilab_train, pipeline_file)

would be fully compliant with PEP8.
Signed-off-by: Shreyanand <[email protected]>
Please install https://docs.astral.sh/ruff/installation/ on your system and run it.
This PR adds the kfp pipeline for launching the InstructLab train PyTorch job. It is based on the kfp PyTorch launcher component; however, that component is three years old, and I had to make some changes for it to work.
In its current state, the pipeline works: the kfp workflow triggers a PyTorch job. However, it is not able to run multi-node training yet; the fixes found in #10 should make this run completely.
One caveat with the current launcher is that it has no way to run the PyTorch job with just the master node; it has to have a worker node. That is not a limitation of a vanilla PyTorch job.
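For illustration, a hedged sketch of the shapes involved (field names follow the PyTorchJob CRD; the image is the one from this diff):

    # With the current launcher, the "Worker" spec cannot be omitted even for
    # a single-node run, although the PyTorchJob CRD allows a Master-only job.
    image = "quay.io/michaelclifford/test-train:0.0.11"
    replica_template = {
        "spec": {"containers": [{"name": "pytorch", "image": image}]}
    }
    master_spec = {"replicas": 1, "restartPolicy": "OnFailure",
                   "template": replica_template}
    worker_spec = {"replicas": 1,  # launcher requires at least one worker
                   "restartPolicy": "OnFailure",
                   "template": replica_template}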
Putting it in draft for now, but it can be tested if you're curious.