feat(kfp): add train stage #20

Merged 6 commits on Sep 13, 2024

Member

@tumido tumido commented Sep 11, 2024

Competing PR to #14

This brings the training stage to the pipeline.

This is how it looks when run with a faked SDG step (I didn't want to wait for the PyTorchJob, so I made it time out on the wait step):

image

What works:

  • Scheduling the PyTorchJob
  • Waiting for it to finish
  • Consuming SDG output and passing it to a PVC
  • Downloading the base model from HuggingFace and storing it in an Artifact
  • Populating a PVC with the base model
  • Creating a PVC for output collection
  • Collecting outputs from the PVC into artifacts (Model + other outputs)
  • Mounting all PVCs to the PyTorchJob
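
As a rough illustration of the last item, here is a minimal sketch of how a PyTorchJob manifest that mounts several PVCs could be assembled. It is not the code from this PR: `build_pytorchjob`, the claim names, the mount paths, and the image are hypothetical. Only the `kubeflow.org/v1` `PyTorchJob` layout (`pytorchReplicaSpecs` with `Master` and `Worker`, container named `pytorch`) follows the Kubeflow training operator.

```python
from typing import Dict, List


def build_pytorchjob(name: str, image: str, command: List[str],
                     pvc_mounts: Dict[str, str], workers: int = 1) -> dict:
    """Return a PyTorchJob manifest that mounts every PVC in pvc_mounts.

    pvc_mounts maps a PVC claim name to its mount path in the container.
    All names passed in are illustrative, not taken from the PR.
    """
    volumes = [
        {"name": claim, "persistentVolumeClaim": {"claimName": claim}}
        for claim in pvc_mounts
    ]
    volume_mounts = [
        {"name": claim, "mountPath": path}
        for claim, path in pvc_mounts.items()
    ]

    def replica(count: int) -> dict:
        # Every replica gets the same container spec and the same PVC mounts.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # name the training operator expects
                        "image": image,
                        "command": list(command),
                        "volumeMounts": list(volume_mounts),
                    }],
                    "volumes": list(volumes),
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical usage: one PVC each for SDG data, the base model, and outputs.
job = build_pytorchjob(
    name="train-example",
    image="example.invalid/train:latest",
    command=["python", "train.py"],
    pvc_mounts={"sdg-data": "/input_data", "base-model": "/input_model",
                "train-output": "/output"},
)
```

Keeping the manifest as a plain dictionary means it can be serialized to JSON or YAML and submitted by whichever client the pipeline step uses.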

TODO:

@tumido tumido marked this pull request as ready for review September 12, 2024 17:56
@tumido tumido changed the title WIP: feat(kfp): add train stage feat(kfp): add train stage Sep 12, 2024
Signed-off-by: Tomas Coufal <[email protected]>

# We don't expect to keep Triton cache after pod is finished
ENV TRITON_CACHE_DIR="/tmp"

Collaborator

Training fails with the error "[rank1]: PermissionError: [Errno 13] Permission denied: '/.cache'", as torch attempts to write to the .cache dir. We can fix it by setting the following environment variable.

Suggested change
XDG_CACHE_HOME=/tmp
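
To illustrate why this variable helps, here is a minimal sketch of the XDG-style cache lookup that many libraries follow: use `XDG_CACHE_HOME` if it is set, otherwise fall back to `.cache` under the home directory. This is not torch's actual code; `resolve_cache_dir` is a hypothetical helper, and the assumption that the pod runs with `HOME=/` (or with no `HOME` at all) is inferred from the `/.cache` path in the error message.

```python
import posixpath
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str]) -> str:
    """Pick a cache directory following the XDG Base Directory convention."""
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        # An explicit override wins over the home-directory default.
        return xdg
    # Assumption: with no usable home directory, HOME resolves to "/".
    return posixpath.join(env.get("HOME", "/"), ".cache")


# Without the override the cache lands in the read-only root filesystem.
print(resolve_cache_dir({"HOME": "/"}))  # /.cache
# With the suggested variable it lands in a writable temporary directory.
print(resolve_cache_dir({"HOME": "/", "XDG_CACHE_HOME": "/tmp"}))  # /tmp
```

Like the Triton cache above, anything written under /tmp is discarded when the pod finishes.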

Member

@Shreyanand Shreyanand left a comment

LGTM in the current state. The PR adds most of the SDG + training pipeline skeleton. We could merge and iterate on it.

@cooktheryan cooktheryan merged commit 0fb529e into opendatahub-io:main Sep 13, 2024