
Streaming pipelines subscribed to source data arriving in a cache bucket #598

Open
cisaacstern opened this issue Sep 1, 2023 · 2 comments

Comments

@cisaacstern (Member)

Over in leap-stc/data-management#49 (comment), @alxmrs suggested this as a way to integrate Pangeo Forge and weather-dl. This seems like a great pattern to adopt, and it would be broadly useful for any slow caching operation (with any out-of-band caching tool, not exclusively weather-dl). ECMWF here serves as a motivating (extreme) example of a general problem for which this could be a very desirable solution.

From his experience with stream windowing in the context of https://github.com/bytewax/bytewax, @rabernat suggested a very elegant idea: timers could be configured such that the timestamps used to label events are not the wall time at which the data arrives in the cache, but rather a key corresponding to the indexed position the cached data represents in the FilePattern. (A less general case would be the timestep the data represents in the concat dim. We could start this way, using a concat-only recipe, but to accommodate n-dimensionality it would probably need to be an n-dimensional key, not unlike the one used by the Rechunk transform's GroupByKey.)
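A minimal sketch of what that keying could look like, assuming cached files follow a naming scheme from which the concat-dim index can be recovered (the `index_from_path` helper and the `part-NNNN` filename convention are assumptions for illustration, not existing pangeo-forge-recipes API):

```python
# Hypothetical sketch: label bucket events by the indexed position the cached
# file occupies in the FilePattern, not by wall-clock arrival time.
import re

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


def index_from_path(path: str) -> int:
    # Assumption: cached files are named like ".../part-0042.nc", so the
    # concat-dim position can be parsed straight from the filename.
    return int(re.search(r"part-(\d+)", path).group(1))


class KeyByPatternPosition(beam.DoFn):
    def process(self, path: str):
        idx = index_from_path(path)
        # Emit the element with the logical index as both its key and its
        # timestamp, so downstream windowing/triggering reasons in FilePattern
        # positions rather than in arrival order.
        yield TimestampedValue((idx, path), idx)
```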

Processing would then be configured to begin once a logically complete set of IndexedPositions had been cached; for the first processing group in the stream, that would be the same set as would otherwise be generated by pattern.items(). Subsequent triggers could fire for smaller append-able units.
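One way to express "begin once a logically complete set is cached" with stock Beam primitives could be a per-key count trigger. This is a hedged sketch only: `POSITIONS_PER_UNIT`, the unit keying, and the upstream `cached_events` collection (the output of the keying sketch above) are all assumptions.

```python
# Hypothetical sketch: fire processing for an append-able unit once all of its
# expected positions have arrived, using a count-based trigger per unit key.
import apache_beam as beam
from apache_beam.transforms import trigger, window

POSITIONS_PER_UNIT = 24  # assumption: e.g. 24 hourly files per daily append unit


def unit_key(element):
    idx, path = element
    # Assumption: consecutive positions map onto the same append-able unit.
    return (idx // POSITIONS_PER_UNIT, (idx, path))


complete_units = (
    cached_events  # PCollection of (idx, path) from the keying step above
    | "KeyByUnit" >> beam.Map(unit_key)
    | "WaitForCompleteUnit" >> beam.WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterCount(POSITIONS_PER_UNIT)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
    )
    | "GroupUnit" >> beam.GroupByKey()
    # Each emitted group is a logically complete set of IndexedPositions that
    # downstream transforms could open, combine, and append to the target store.
)
```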

xref #447 for appending and #570 for caching

@alxmrs (Contributor)

alxmrs commented Sep 1, 2023 via email

@cisaacstern (Member, Author)

> Quick question for now: why not use Beam's built-in streaming primitives? The KISS solution here, IMO, is to build off of Beam's PubSub and Kafka connectors to react to bucket event updates.

💯 we should do this.

In that case, this may be more a matter of documenting best practices than of adding (much, or any?) code here.
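For reference, the "react to bucket events with stock Beam" version could be as small as the following. This is a sketch only; the project ID, subscription name, and the GCS-bucket-to-Pub/Sub notification setup are assumptions, not anything this repo provides today.

```python
# Hypothetical sketch: a streaming Beam pipeline subscribed to GCS
# OBJECT_FINALIZE notifications delivered via Pub/Sub, yielding the path of
# each newly cached source file.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadBucketEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/cache-bucket-events"
        )
        | "Parse" >> beam.Map(json.loads)
        # GCS notifications include the bucket and object name of the file that
        # just landed in the cache; map that back to a cache URL for downstream use.
        | "ToCacheUrl" >> beam.Map(lambda msg: f"gs://{msg['bucket']}/{msg['name']}")
        | "Debug" >> beam.Map(print)
    )
```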
