Feature/pred2bq bulk update #230
base: main
Conversation
Thanks for the PR! 🚀 Instructions: Approve using
@michaelwsherman fyi
```
schema=schema_gen.outputs['schema'],
transform_graph=transform.outputs['transform_graph'],
bq_table_name='my_bigquery_table',
gcs_temp_dir='gs://bucket/temp-dir',
```
Should this just use the temp_dir from beam_pipeline_args instead?
Yes, I just realized that the `WriteToBigQuery` transform used by the Beam pipeline defaults to `temp_location` if `custom_gcs_temp_location` is not specified, and `temp_location` should be an argument supported by `beam_pipeline_args`. Although I'm not quite sure if `temp_location` needs to be explicitly set or if Beam will create one by default otherwise.
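A minimal sketch of the fallback being discussed, written as a standalone Beam pipeline (project, dataset, table, and bucket names are placeholders): `temp_location` is passed as an ordinary pipeline option, the kind of flag `beam_pipeline_args` forwards, and `WriteToBigQuery` uses it because `custom_gcs_temp_location` is not set.

```
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# temp_location is a standard Beam option, e.g. forwarded from a TFX
# component's beam_pipeline_args.
options = PipelineOptions(['--temp_location=gs://bucket/temp-dir'])

with beam.Pipeline(options=options) as pipeline:
  _ = (
      pipeline
      | beam.Create([{'label': 'cat', 'score': 0.9}])
      | beam.io.WriteToBigQuery(
          'my-project:my_dataset.my_bigquery_table',
          schema='label:STRING, score:FLOAT',
          # custom_gcs_temp_location is omitted, so WriteToBigQuery falls
          # back to the pipeline's temp_location for its staging files.
      ))
```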
I think this would be a more involved change, as the Vertex integration test needs a custom container with the pred2bq component in Artifact Registry to run. Perhaps we could do that in a separate PR instead?
- Adds unit tests
- Also adds credits to original code author
Adds a test that runs the executor module's Beam pipeline using a DirectRunner and exports prediction data to an actual BigQuery table.
Changes:
- Refactors the component integration test and adds a test to run the component on Vertex AI Pipelines.
- Adds a Dockerfile to package the component code into a Docker image based on the OSS TFX image. The image can then be used as the base image when running a pipeline on Vertex AI.
- Updates the `bigquery_export` output of the predictions-to-bigquery component to store the generated BigQuery table name. This aids with checking the output of the component during testing, and also allows any downstream component to receive this component's output.
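A hypothetical sketch of a downstream step reading that table name. The `bigquery_export` output key matches the commit message above, but the custom property name `generated_bq_table_name` is an assumption rather than the component's confirmed API.

```
from tfx import types

def read_exported_table_name(bq_export: types.Artifact) -> str:
  # The property key is an assumed example; check the component's executor
  # for the key it actually writes.
  return bq_export.get_string_custom_property('generated_bq_table_name')
```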
Adds a test that integrates the transform component into the pipeline. The test is implemented for the local runner only.
Adds a container component stub to represent the TFX Transform component for integration testing on Vertex AI.
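A rough sketch of such a stub using TFX's experimental container-component API; the image URI, command, and the idea of copying a pre-built transform graph are illustrative assumptions, not the PR's actual stub.

```
from tfx.dsl.component.experimental import container_component
from tfx.dsl.component.experimental import placeholders
from tfx.types import standard_artifacts

# Stand-in for the real Transform component: emits a pre-built transform
# graph baked into the test image instead of running a full Transform.
transform_stub = container_component.create_container_component(
    name='TransformStub',
    outputs={'transform_graph': standard_artifacts.TransformGraph},
    image='gcr.io/my-project/pred2bq-test:latest',  # assumed test image
    command=[
        'sh', '-c', 'cp -r /stub_transform_graph/. "$0"',
        placeholders.OutputUriPlaceholder('transform_graph'),
    ],
)
```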
Replaces create_tempfile and create_tempdir calls from abseil's absltest.TestCase and parameterized.TestCase with equivalent methods from the tempfile package. The reason is that the abseil methods require parsing of the FLAGS variable, which may not happen if absltest.main() is not invoked. This can occur when test filtering is performed, e.g. ``` python -m unittest path.to.test ```
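A minimal sketch of the substitution, assuming a plain `unittest.TestCase` (class and test names are illustrative):

```
import os
import tempfile
import unittest

class ExecutorTest(unittest.TestCase):

  def setUp(self):
    super().setUp()
    # Replacement for absltest's self.create_tempdir(): tempfile does not
    # depend on abseil FLAGS being parsed, so it also works when the test
    # is run via `python -m unittest path.to.test`.
    temp_dir = tempfile.TemporaryDirectory()
    self.addCleanup(temp_dir.cleanup)
    self.temp_dir = temp_dir.name

  def test_writes_output(self):
    output_path = os.path.join(self.temp_dir, 'output.txt')
    with open(output_path, 'w') as f:
      f.write('ok')
    self.assertTrue(os.path.exists(output_path))
```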
Mentions the predictions-to-bigquery component in top-level readme.
- Fix issues in pred2bq readme
- Revert version change in setup.py
- Add absl-py test prerequisite in setup.py
Force-pushed from 20f4068 to d091922.
@casassg thanks for the comments! Made some changes and replied to your comments, ptal.
Looks good. Nits are optional (good for later as well); undo the change in version.py as instructed before merging. Also fixing CI so you can merge.
May need to rebase to use #244 so we can run all of CI end to end.
Overall LGTM. Carlos, I trust you to fix anything I flagged that's worth fixing so I'm approving this PR.
I didn't review or run the tests, and I didn't make sure that everything in utils is used. If you want the tests run/reviewed, let me know and I'll delegate it. If you've run them I'm good.
Overall this is great. I appreciate that you've written out a full example in the integration test, documented everything well, and included an example.
tfx_addons/version.py (outdated)

```
@@ -16,7 +16,7 @@
 # We follow Semantic Versioning (https://semver.org/)
 _MAJOR_VERSION = "0"
-_MINOR_VERSION = "6"
+_MINOR_VERSION = "7"
```
Heads up to other reviewers: this matches where we're currently at with the releases; not sure why GitHub still shows it as a diff.
I think it may be because the branch hasn't been rebased onto master.
Remove this file.
@michaelwsherman why do we need to remove the Dockerfile? It's currently used to define the tfx-addons container that's needed by the integration test.
Thanks for the additional comments @casassg and @michaelwsherman. Made some fixes.
```
-def _get_compress_type(file_path):
+def _get_compress_type(file_path: str) -> Optional[str]:
```
This is from the original code by Hannes and I decided to reuse it. I saw a Python library that may provide similar functionality: https://pypi.org/project/filetype. I could create an issue for it; it might be a good first issue for a new contributor to take on.
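For illustration, a hypothetical sketch of a function matching the signature in the diff above, inferring the compression type from a file's magic bytes; the actual implementation reused from the original code may differ.

```
from typing import Optional

def _get_compress_type(file_path: str) -> Optional[str]:
  """Returns the file's compression type based on magic bytes, if known."""
  with open(file_path, 'rb') as f:
    header = f.read(4)
  if header[:2] == b'\x1f\x8b':          # gzip magic number
    return 'GZIP'
  if header[:4] == b'\x28\xb5\x2f\xfd':  # zstd magic number
    return 'ZSTD'
  return None
```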
Fixes #78
Includes changes in PR #225.
This pull request contains bulk updates to the Predictions to BigQuery component.
Changes:
Checks: