Difficulty Debugging the Vertex AI Training Pipeline #49

Open
clopezhrimac opened this issue May 24, 2023 · 3 comments
@clopezhrimac

I am facing difficulties debugging the Vertex AI training pipeline. The problem is that I cannot run the pipeline locally for testing and debugging; instead, I have to submit it to Vertex AI and wait for it to execute to obtain any debugging information.

The current debugging process involves submitting the pipeline with multiple print statements or logging messages to trace the execution flow and pinpoint the exact location of the error. This becomes a slow and tedious cycle, as the pipeline must be resubmitted after every adjustment or attempt to identify the error.

Steps to Reproduce the Problem:

  • Create a training pipeline in Vertex AI.
  • Submit the pipeline to Vertex AI for execution.
  • Wait for the pipeline to execute and retrieve the results.
  • Analyze the generated messages or logs to identify the error.
  • Make adjustments or corrections to the pipeline.
  • Repeat steps 2-5 until the problem is identified and resolved.

What would be the best way to handle this training component development cycle?

@felix-datatonic
Contributor

Hi @clopezhrimac, thanks for your question! Kubeflow offers only limited support for executing pipelines locally; the alternatives suggested by the community are:

  1. creating a local Kubernetes cluster and submitting your pipeline to that cluster instead of Vertex AI
  2. isolating individual Kubeflow operations into Python- or container-based components so you can test your business logic locally (see the sketch after this list), though only one component at a time
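
To illustrate option (2), here is a minimal sketch assuming KFP v2 (the component name and logic are hypothetical, not from this repo): a lightweight Python component keeps its business logic in a plain function, which a local unit test can call directly through the component's python_func attribute, with no submission to Vertex AI.

from kfp import dsl

@dsl.component(base_image="python:3.10")
def split_ratio(train_rows: int, test_rows: int) -> float:
    """Toy business logic: fraction of rows used for training."""
    return train_rows / (train_rows + test_rows)

def test_split_ratio():
    # Call the wrapped Python function directly, no pipeline run needed.
    assert split_ratio.python_func(train_rows=80, test_rows=20) == 0.8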

In this project, we've optimised components in line with option (2). Commands which help you with testing locally are:

make setup-all-components
make test-all-components

or

make setup-component GROUP=<e.g. vertex-components>
make test-component GROUP=<e.g. vertex-components>

Further, we recently replaced the Python-based training component in the pipelines with a CustomTrainingJob, which allows you to run your training script locally before submitting it to Vertex AI.
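
As a rough sketch of that workflow (the project ID, bucket, script path, and container image below are all illustrative, not this repo's actual values): the same training script you debug locally can be handed to a CustomTrainingJob unchanged.

# First, debug the script locally:
#   python training/train.py --epochs 2
# Then submit the identical script to Vertex AI:
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                      # illustrative
    location="europe-west2",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="train-model",
    script_path="training/train.py",           # the script you ran locally
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",
)
job.run(args=["--epochs", "2"], replica_count=1)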

While this doesn't provide full parity between local runs and pipelines submitted to Vertex AI, it will help you iterate locally on any changes to custom Python-based components and your training code.

We're currently evaluating the use of CustomPythonPackageTrainingJob, too, and are open to any suggestions you might have!

@clopezhrimac
Author

What is the difference between CustomTrainingJob and CustomPythonPackageTrainingJob?

@felix-datatonic
Contributor

felix-datatonic commented Nov 15, 2023

Hi @clopezhrimac,

Thanks for this issue. Please check out the most recent PR and release.

We've moved away from CustomTrainingJob and CustomPythonPackageTrainingJob since Kubeflow Pipelines 2.0 now supports container components.

You can cd into the model folder and run your training and prediction code locally before triggering a pipeline in Vertex AI. However, this will only test the training and prediction steps, not the pipeline end-to-end.
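For context, a KFP 2.x container component looks roughly like this (the image name, module path, and arguments are illustrative): the component just runs a command inside a training image, and that same image and command can be exercised locally, e.g. with docker run or plain python, before any pipeline is triggered.

from kfp import dsl

@dsl.container_component
def train_model(data_path: str, model_output: dsl.OutputPath(str)):
    # Runs the training entrypoint inside the training image; the same
    # "python -m training.train" command can be executed locally first.
    return dsl.ContainerSpec(
        image="europe-docker.pkg.dev/my-project/images/training:latest",
        command=["python", "-m", "training.train"],
        args=["--data", data_path, "--model-output", model_output],
    )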
