Add kfp pipeline for running a pytorch job #14
base: main
Conversation
Signed-off-by: Shreyanand <[email protected]>
@tumido do you want to review this in the current state? @MichaelClifford maybe Monday we pair this with Tom's PR and start generating the MVP run.
Can we please remove the create_worker_spec component? It seems to be unnecessary here.
training/pipeline.py
Outdated
    worker_spec_output = namedtuple(
        "MyWorkerOutput", ["worker_spec"]
    )
Why do you want to output a named tuple here? This is useful only if you output multiple params. I don't think it's needed here at all.
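For context, a minimal sketch of the distinction being made here (plain Python in the KFP v1 lightweight-component style; the bodies are illustrative, not the PR's code):

    from collections import namedtuple
    from typing import NamedTuple

    # Single output: a plain return annotation is enough.
    def create_worker_spec(worker_num: int = 0) -> dict:
        return {"replicas": worker_num}

    # A NamedTuple return type only pays off with several outputs:
    def create_specs(worker_num: int = 0) -> NamedTuple(
        "Outputs", [("worker_spec", dict), ("worker_count", int)]
    ):
        Outputs = namedtuple("Outputs", ["worker_spec", "worker_count"])
        return Outputs({"replicas": worker_num}, worker_num)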
training/pipeline.py
Outdated
    worker = {}
    if worker_num > 0:
The whole thing can be rewritten as:

    if worker_num <= 0:
        return {}
    return {}
or even better, not a component at all. After all, it is a single if statement plus setting a single value in a dict. This doesn't have to be a component at all. Remember that each component we create starts a container; this only slows down the workflow, especially in cases where it's simple data formatting.
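A hedged sketch of that suggestion, assuming worker_num is an ordinary Python value at pipeline-definition time rather than a runtime PipelineParam (the field names follow the PyTorchJob replica-spec shape in this diff; the helper name is made up):

    def worker_spec(worker_num: int, image: str) -> dict:
        # Plain helper evaluated while the pipeline is compiled,
        # so no extra container is started at run time.
        if worker_num <= 0:
            return {}
        return {
            "replicas": worker_num,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{"name": "pytorch", "image": image}],
                }
            },
        }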
"image": "quay.io/michaelclifford/test-train:0.0.11", | ||
"name": "pytorch", | ||
"resources": { | ||
"requests": { | ||
"memory": "8Gi", | ||
"cpu": "2000m", | ||
# Uncomment for GPU | ||
"nvidia.com/gpu": 1, | ||
}, | ||
"limits": { | ||
"memory": "8Gi", | ||
"cpu": "2000m", | ||
# Uncomment for GPU | ||
"nvidia.com/gpu": 1, | ||
}, |
Should we parametrize this?
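One hedged way to do that, as a plain helper with assumed parameter names (not the PR's code):

    def container_resources(cpu: str = "2000m", memory: str = "8Gi",
                            gpus: int = 0) -> dict:
        # Build requests/limits from pipeline arguments instead of
        # hard-coding values and commented-out GPU lines.
        resources = {
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        }
        if gpus > 0:
            resources["requests"]["nvidia.com/gpu"] = gpus
            resources["limits"]["nvidia.com/gpu"] = gpus
        return resources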
Wrong button 😄 I meant to request changes... 😄
training/pipeline.py
Outdated
    if __name__ == "__main__":
        import kfp.compiler as compiler
I don't think you need to nest the import here. This is against https://pylint.readthedocs.io/en/latest/user_guide/messages/convention/import-outside-toplevel.html
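That is, something like this sketch (ilab_train is the pipeline function defined in this PR's pipeline.py):

    # top of training/pipeline.py
    import kfp.compiler as compiler

    # ... pipeline definition (ilab_train) ...

    if __name__ == "__main__":
        compiler.Compiler().compile(ilab_train, "pipeline.yaml")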
training/pipeline.py
Outdated
    pipeline_file = "pipeline.yaml"
    print(
        f"Compiling pipeline as {pipeline_file}"
    )
    compiler.Compiler().compile(
        ilab_train, pipeline_file
    )
This is somewhat weirdly formatted. I think a simple:

    pipeline_file = "pipeline.yaml"
    print(f"Compiling pipeline as {pipeline_file}")
    compiler.Compiler().compile(ilab_train, pipeline_file)

would be fully compliant with PEP8.
Signed-off-by: Shreyanand <[email protected]>
Please install https://docs.astral.sh/ruff/installation/ on your system and run it.
This PR adds the kfp pipeline for launching the InstructLab train PyTorch job. It is based on the kfp PyTorch launcher component; however, that component is three years old, and I had to make some changes for it to work.
In its current state, the pipeline works: the kfp workflow triggers a PyTorch job. However, it is not able to run multi-node training yet; the fixes found in #10 should make this run completely.
One caveat with the current launcher is that it has no way to run the PyTorch job with just the master node; it has to have a worker node. That is not a limitation of a vanilla PyTorch job.
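For illustration, a hedged sketch of the shapes involved (field names follow the PyTorchJob CRD; the image is the one from this diff):

    # With the current launcher, the "Worker" spec cannot be omitted even for
    # a single-node run, although the PyTorchJob CRD allows a Master-only job.
    image = "quay.io/michaelclifford/test-train:0.0.11"
    replica_template = {
        "spec": {"containers": [{"name": "pytorch", "image": image}]}
    }
    master_spec = {"replicas": 1, "restartPolicy": "OnFailure",
                   "template": replica_template}
    worker_spec = {"replicas": 1,  # launcher requires at least one worker
                   "restartPolicy": "OnFailure",
                   "template": replica_template}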
Putting it in draft for now, but it can be tested if you're curious.