[Pipelines] Design - Data orchestration input #1005
TBH I'm not sure I understand the concept of how Parsons could implement orchestration. As I think of it, orchestration really requires cloud infrastructure to be provisioned and configured, including:

- code storage in the cloud (dockerizing and pushing to a Docker registry, or copying code to S3)
- cloud compute
- cloud secret storage for access in production
- a healthy layer of IAM roles for development access and appropriately scoped execution privileges
- billing information / a credit card on file
- etc.

For the Prefect example, wrapping a Python script in a flow covers only a small part of this. Most of this feels outside the scope of what a Python package (Parsons) can really implement.
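For context, this is roughly all the "wrapping" amounts to: a minimal sketch using Prefect 2.x's `flow`/`task` decorators. The `extract`/`load` steps are hypothetical stand-ins, and everything else in the list above (compute, secrets, IAM, billing) still has to exist in cloud infrastructure.

```python
# Minimal sketch of wrapping a script in a Prefect flow (Prefect 2.x).
# `extract` and `load` are hypothetical stand-ins for real pipeline steps.
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]  # stand-in for a Parsons connector call

@task
def load(rows):
    print(f"loaded {len(rows)} rows")

@flow
def my_pipeline():
    load(extract())

if __name__ == "__main__":
    # Runs locally; scheduling and remote execution still require a
    # Prefect server or Prefect Cloud plus provisioned compute.
    my_pipeline()
```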
Austin, those are some good points. Here is what the current Prefect implementation does and doesn't provide.

What it doesn't provide:

What it does:
I'm not convinced that it couldn't help with scheduling and some of that other stuff, although that would depend heavily on whatever plugins we built. An option I looked into was Apache Airflow, which I think could be integrated in a similar way to how Prefect is currently handled (see the sketch below). I think you raise three really good questions for this part of the design:
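To make the Airflow comparison above concrete, here is a hedged sketch of the same hypothetical extract/load pipeline using Airflow's TaskFlow API (assuming Airflow 2.4+ for the `schedule` argument). None of this reflects actual pipelines-branch code.

```python
# The same hypothetical pipeline as an Airflow DAG (TaskFlow API, Airflow 2.4+).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 2, 27), catchup=False)
def my_pipeline():
    @task
    def extract():
        return [1, 2, 3]  # stand-in for a Parsons connector call

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())

# Instantiating the DAG registers it with the Airflow scheduler.
my_pipeline()
```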
Overview
The Pipelines project targets two user groups. One of these groups is advanced users who are already fluent in Python. For them, one of the main value-add features of pipelines is easy data orchestration integration. Data orchestration provides many benefits, such as error logging and data visibility. It is a key goal of the pipelines system that you get drop-in data orchestration of your pipelines "for free."
As of 2/27/2024, the pipelines branch has a hard-coded Prefect integration. This is a good proof of concept, since the Prefect integration happens entirely behind the scenes. However, because Prefect is closed source and cloud based, it is not acceptable to lock pipelines into that tool.
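To illustrate what "entirely behind the scenes" could mean, here is a hypothetical sketch (not the actual pipelines-branch code): a decorator that wraps the user's function in a Prefect flow when Prefect is available and degrades to a plain function otherwise.

```python
# Hypothetical sketch of a behind-the-scenes integration. `pipeline` is an
# illustrative name, not the real pipelines API.
def pipeline(fn):
    try:
        from prefect import flow
        return flow(fn)  # user gets Prefect run logging/visibility for free
    except ImportError:
        return fn        # no orchestrator installed: behave as a plain function

@pipeline
def sync_contacts():
    print("moving data...")

sync_contacts()
```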
Discussion
The initial goal of this discussion is to gather input from the community about:
Once we have collected data about a wide variety of tools, we will design an abstraction that allows the pipelines system to work with as many data orchestration tools as possible. Then, data orchestration "plugins" that target the abstraction can be added either inside or outside of Parsons, allowing pipelines to be used with any data orchestration platform (one possible shape for this abstraction is sketched below).
Without a thorough discussion of different data orchestration use cases, we risk designing an abstraction that cannot accommodate many of the tools that pipelines users will want to target in their code.
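For concreteness, here is one possible shape for such an abstraction. Every name here is hypothetical, and the real design should come out of the discussion above: orchestration backends implement a tiny interface, and pipelines code targets the interface rather than any one tool.

```python
# Hypothetical sketch of an orchestration-backend abstraction. All names
# are illustrative; the point is that pipelines code targets the interface,
# and each orchestration tool gets a small plugin implementing it.
from abc import ABC, abstractmethod
from typing import Callable

class OrchestrationBackend(ABC):
    @abstractmethod
    def wrap(self, fn: Callable) -> Callable:
        """Wrap a pipeline function so the backend can track its runs."""

class NoOpBackend(OrchestrationBackend):
    def wrap(self, fn):
        return fn

class PrefectBackend(OrchestrationBackend):
    def wrap(self, fn):
        from prefect import flow
        return flow(fn)

# Plugins inside or outside Parsons could register additional backends here.
_BACKENDS = {"none": NoOpBackend(), "prefect": PrefectBackend()}

def pipeline(fn=None, *, backend="none"):
    def decorate(f):
        return _BACKENDS[backend].wrap(f)
    return decorate(fn) if fn is not None else decorate

@pipeline(backend="none")
def sync_contacts():
    print("moving data...")

sync_contacts()
```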