feat(kfp): add train stage #20
Signed-off-by: Tomas Coufal <[email protected]>
…mplate properly Signed-off-by: Tomas Coufal <[email protected]>
# We don't expect to keep Triton cache after pod is finished
ENV TRITON_CACHE_DIR="/tmp"
There is an error during training, as torch attempts to write to the .cache directory:
[rank1]: PermissionError: [Errno 13] Permission denied: '/.cache'
We can fix it by setting the following environment variable:
XDG_CACHE_HOME=/tmp
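For illustration, the same fix could also be applied from the training entrypoint instead of the image. This is only a sketch: `configure_cache_dirs` is a hypothetical helper, not something in this PR, and it assumes the failing code resolves its cache location through `XDG_CACHE_HOME`, as the error suggests.

```python
import os
import tempfile


def configure_cache_dirs(base_dir=None, env=None):
    """Point cache-related variables at a writable directory.

    Existing values are kept, so anything already set by the container
    image (for example TRITON_CACHE_DIR) takes precedence.
    """
    env = os.environ if env is None else env
    base_dir = base_dir or tempfile.gettempdir()
    names = ("XDG_CACHE_HOME", "TRITON_CACHE_DIR")
    for name in names:
        # setdefault only writes the value when the variable is missing.
        env.setdefault(name, base_dir)
    return {name: env[name] for name in names}


if __name__ == "__main__":
    # Call this before importing torch so cache paths are resolved
    # against the new values.
    print(configure_cache_dirs("/tmp"))
```

Setting the variable in the image is still the simpler option, since it covers every process in the pod without code changes.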
LGTM in the current state. The PR adds most of the SDG + training pipeline skeleton. We could merge and iterate on it.
Competing PR to #14
This brings the training stage to the pipeline.
This is how it looks when run with faked SDG (I didn't want to wait for the PyTorchJob, so I made it time out on the wait step):
What works:
TODO:
- Fix passing data to PytorchJob, PVC requirement may not be the best option here
- `kubectl wait` to allow timeout/namespace params, I was initially not successful here due to some KFP templating issues...
- `metadata.name` until `dsl.PIPELINE_*` is available ([sdk] Cannot get real job_id by using PIPELINE_JOB_ID_PLACEHOLDER in component kubeflow/pipelines#10453)
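As a rough sketch of the `kubectl wait` item above, the command could be assembled from plain parameters so that timeout and namespace are not hardcoded. `build_wait_command` is a hypothetical helper, not code from this PR, and it assumes the job is addressable as `pytorchjob/<name>` and reports a `Succeeded` condition. How these values get passed through KFP templating is exactly the open question and is not addressed here.

```python
import shlex


def build_wait_command(job_name, namespace="default", timeout_seconds=600,
                       condition="Succeeded"):
    """Return the argv list for a `kubectl wait` call on a PyTorchJob.

    The resource kind and condition name are assumptions; adjust them to
    match what the cluster actually reports.
    """
    return [
        "kubectl",
        "wait",
        f"pytorchjob/{job_name}",
        f"--for=condition={condition}",
        f"--timeout={timeout_seconds}s",
        f"--namespace={namespace}",
    ]


if __name__ == "__main__":
    cmd = build_wait_command("train-job", namespace="demo", timeout_seconds=60)
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems if the job name or namespace ever comes from pipeline parameters.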