CopyExampleGen component #66
I will take lead on this (if anyone wants to join, feel free to). Working on a project proposal and will bring it to the SIG TFX-Addons bi-weeklies.
Just adding some comments here since we have a component internally that does this, but it's not particularly robust (it has sat around for about 3 years as-is because it's been good enough). I've also helped at least one team develop a variation of the component to suit their needs. Over time we've realized a few inefficiencies with our implementation that may help with the development of a component like this.

**Overarching problem with TFX's split implementation**

The reason why this component is useful is due to TFX's implementation of splits, where each split of an `Examples` artifact is stored in its own subdirectory under the artifact URI.
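Concretely, recent TFX versions lay out an `Examples` artifact with one `Split-<name>` subdirectory per split (the file names below are illustrative):

```text
<examples_artifact_uri>/
├── Split-train/
│   └── data_tfrecord-00000-of-00001.gz
└── Split-eval/
    └── data_tfrecord-00000-of-00001.gz
```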
This means that you can't just take a dataset or set of datasets from anywhere, use an `Importer`, and pass that data to a component, unless your source data is already formatted that way. In many cases it's not too hard to get the team to output their data in this format, but there are cases where it's not possible. Currently, the only way to handle this is to use an `ImportExampleGen`. If you have pre-existing splits, you have to do a kind of hacky approach with the `input_config`, declaring one split pattern per pre-existing split.

In addition to being non-obvious, this is also very inefficient, especially for large datasets (TB+), as a whole Dataflow job with a shuffle has to be kicked off. For this reason, a copy-based component is much preferred and simpler.

**Typical User Journeys**
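For reference, the pre-existing-splits workaround amounts to an `input_config` along these lines (a sketch of the `Input` proto passed to `ImportExampleGen`; the split names and patterns are hypothetical):

```proto
# input_config for ImportExampleGen: one pattern per pre-existing split,
# so ExampleGen reads each split as-is instead of re-splitting the data.
splits {
  name: "train"
  pattern: "train/*"
}
splits {
  name: "eval"
  pattern: "eval/*"
}
```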
**Insights from existing work**

While I didn't work on the component myself, we had an engineer who did some basic benchmarking of a few approaches. This happened in 2019, so it may be out of date. He compared the Python gfile API, in combination with multi-threading/processing, against shelling out to `gsutil`. I don't have stats on the test dataset, but the results are summarized as follows:
There were some significant cons noted for using `gsutil`, which made it much less appealing despite its superior performance:
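For illustration, "shelling out" typically means building a `gsutil` command line and running it with `subprocess`. A minimal sketch (the `-m`, `cp`, and `-r` flags are standard `gsutil` options, but the wrapper function itself is hypothetical):

```python
import subprocess

def gsutil_copy_cmd(src: str, dst: str, parallel: bool = True) -> list[str]:
    """Build a gsutil copy command; -m enables parallel transfers."""
    cmd = ["gsutil"]
    if parallel:
        cmd.append("-m")
    cmd += ["cp", "-r", src, dst]
    return cmd

# Actually running it requires gsutil on PATH, e.g.:
# subprocess.run(gsutil_copy_cmd("gs://bucket/train/*", "/tmp/Split-train"), check=True)
```

This also shows the maintainability concern: the copy logic lives in an external CLI rather than in Python, and it cannot fall back to local files.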
Our component went with a multi-process gfile approach, but we learned later that this has a significant downside: for very large datasets (one example dataset was multiple TB with 4.4GB shards), we would either OOM or hit the IOPS limit of the Docker container, since (I assume) it writes the data being copied temporarily to disk. This can be mitigated by allowing user control of the thread count (which defaulted to a multiple of machine cores).

**Conclusion Summary (TL;DR)**

In my experience, the ideal component would allow you to pass a dict mapping split names to source URIs.

Choose the underlying copy implementation wisely, and test with large datasets if possible, for robustness. gfile has issues with performance, and `gsutil` has issues with maintainability (not Python-native) and flexibility (i.e. it can't be used in local mode with local files). It's possible that the APIs have improved; for example, this thread was unresolved at the time of our implementation.
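A sketch of the multi-threaded copy with user-controlled parallelism (the component described above used TF's gfile API; stdlib `shutil` stands in here, and the function name is hypothetical):

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_files(srcs, dest_dir, max_workers=4):
    """Copy each source file into dest_dir using a bounded thread pool.

    Exposing max_workers lets users tune parallelism down, which is one
    mitigation for the OOM / IOPS issues seen with very large shards.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(shutil.copy, src, dest / Path(src).name)
                   for src in srcs]
        for f in futures:
            f.result()  # re-raise any copy errors
```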
Hey @rclough, thanks for the helpful input and suggestions as we try to develop this project proposal. I have a question regarding your conclusion, that the ideal component would allow the user to:
I think these questions both highlight a need for clarity of input that I'd glossed over:
I don't aim to answer these conclusively, but to bring up some considerations. I would argue that it is more helpful to avoid expecting an

My mistake on mentioning

So ultimately my thought was a dict input like this example:

```json
{
    "train": "gs://some/path/to/train_data",
    "eval": "gs://golden_eval_data",
    "extra_holdout": "gs://somewhere/else/data"
}
```

Or alternatively, the same shape where those URIs are actually artifacts. The component would loop through and create a split for each key, copying the data from the value URI into the corresponding split directory of the output artifact.

Lastly, regarding other cloud providers: that's actually a really good question, and it is probably part of a tradeoff that must be made when choosing the copy implementation. I'm sure it would be great if many common cloud APIs were supported, but you probably need to draw a reasonable scope. In my case, we only use local directories and GCS, but I'm not sure how much is realistic to support, multiplied by the performance considerations (i.e. something like gfile might be more generic and support S3 etc., but may not perform as well as `gsutil` for GCS).
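That per-key loop could be sketched as follows (a minimal local-filesystem sketch; the `Split-<name>` directory convention follows TFX's on-disk layout, and the function name is hypothetical):

```python
import shutil
from pathlib import Path

def materialize_splits(split_uris: dict[str, str], output_uri: str) -> list[str]:
    """Copy each source directory into a Split-<name> subdirectory of output_uri.

    Returns the list of split names written, for recording on the artifact.
    """
    for name, src in split_uris.items():
        dest = Path(output_uri) / f"Split-{name}"
        shutil.copytree(src, dest)  # plain recursive copy; no shuffle, no Beam job
    return list(split_uris)

# e.g. materialize_splits({"train": ".../train_data", "eval": ".../eval_data"}, artifact_uri)
```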
In cases where the data does not need to be shuffled, this component will avoid launching a Beam job and instead do a simple copy of the data to create a dataset artifact. It will need to be a completely custom ExampleGen, not an extension of BaseExampleGen, in order to implement this behavior.
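One detail such a custom component has to handle is recording the split names on the output `Examples` artifact. TFX stores `split_names` as a JSON-encoded list string; a minimal sketch of that encoding (the helper name here is hypothetical, mirroring what `tfx.types.artifact_utils` provides):

```python
import json

def encode_split_names(splits: list[str]) -> str:
    # Examples.split_names is stored as a JSON-encoded list string,
    # e.g. '["train", "eval"]'.
    return json.dumps(splits)

# examples_artifact.split_names = encode_split_names(["train", "eval"])
```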
@rclough @1025KB