
Streaming pipelines subscribed to source data arriving in a cache bucket #598

Open
cisaacstern opened this issue Sep 1, 2023 · 2 comments

Comments

@cisaacstern (Member)

Over in leap-stc/data-management#49 (comment), @alxmrs suggested this as a way to integrate Pangeo Forge and weather-dl. This seems like a great pattern to adopt, and it would be broadly useful for any slow caching operation (with any out-of-band caching tool, not exclusively weather-dl). ECMWF here serves as a motivating (extreme) example of a general problem for which this could be a very desirable solution.

From his experience with stream windowing in the context of https://github.com/bytewax/bytewax, @rabernat suggested a very elegant idea: timers could be configured such that the timestamps used to label events are not the wall time at which the data arrives in the cache, but rather a key corresponding to the indexed position the cached data represents in the FilePattern. (A less general case would be the timestep the data represents in the concat dim. We could start this way, using a concat-only recipe, but to accommodate n-dimensionality it would probably need to be an n-dimensional key, not unlike the one used by the Rechunk transform's GroupByKey.)
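A minimal sketch of what that keying could look like, assuming cached files follow a naming scheme from which the concat-dim index can be recovered (the `index_from_path` helper and the `part-NNNN` filename convention are assumptions for illustration, not existing pangeo-forge-recipes API):

```python
# Hypothetical sketch: label bucket events by the indexed position the cached
# file occupies in the FilePattern, not by wall-clock arrival time.
import re

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


def index_from_path(path: str) -> int:
    # Assumption: cached files are named like ".../part-0042.nc", so the
    # concat-dim position can be parsed straight from the filename.
    return int(re.search(r"part-(\d+)", path).group(1))


class KeyByPatternPosition(beam.DoFn):
    def process(self, path: str):
        idx = index_from_path(path)
        # Emit the element with the logical index as both its key and its
        # timestamp, so downstream windowing/triggering reasons in FilePattern
        # positions rather than in arrival order.
        yield TimestampedValue((idx, path), idx)
```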

Processing would then be configured to begin once a logically complete set of IndexedPositions had been cached; for the first processing group in the stream, that would be the same set as would otherwise be generated by pattern.items(). Subsequent triggers could fire for smaller append-able units.
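One way to express "begin once a logically complete set is cached" with stock Beam primitives could be a per-key count trigger. This is a hedged sketch only: `POSITIONS_PER_UNIT`, the unit keying, and the upstream `cached_events` collection (the output of the keying sketch above) are all assumptions.

```python
# Hypothetical sketch: fire processing for an append-able unit once all of its
# expected positions have arrived, using a count-based trigger per unit key.
import apache_beam as beam
from apache_beam.transforms import trigger, window

POSITIONS_PER_UNIT = 24  # assumption: e.g. 24 hourly files per daily append unit


def unit_key(element):
    idx, path = element
    # Assumption: consecutive positions map onto the same append-able unit.
    return (idx // POSITIONS_PER_UNIT, (idx, path))


complete_units = (
    cached_events  # PCollection of (idx, path) from the keying step above
    | "KeyByUnit" >> beam.Map(unit_key)
    | "WaitForCompleteUnit" >> beam.WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterCount(POSITIONS_PER_UNIT)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
    )
    | "GroupUnit" >> beam.GroupByKey()
    # Each emitted group is a logically complete set of IndexedPositions that
    # downstream transforms could open, combine, and append to the target store.
)
```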

xref #447 for appending and #570 for caching

@alxmrs (Contributor)

alxmrs commented Sep 1, 2023 via email

@cisaacstern (Member, Author)

> Quick question for now: why not use Beam's built-in streaming primitives? The KISS solution here, IMO, is to build off of Beam's PubSub and Kafka connectors to react to bucket event updates.

💯 we should do this.

In that case, this may be more a matter of documenting best practices than of adding (much, or any?) code here.
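For reference, the "react to bucket events with stock Beam" version could be as small as the following. This is a sketch only; the project ID, subscription name, and the GCS-bucket-to-Pub/Sub notification setup are assumptions, not anything this repo provides today.

```python
# Hypothetical sketch: a streaming Beam pipeline subscribed to GCS
# OBJECT_FINALIZE notifications delivered via Pub/Sub, yielding the path of
# each newly cached source file.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadBucketEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/cache-bucket-events"
        )
        | "Parse" >> beam.Map(json.loads)
        # GCS notifications include the bucket and object name of the file that
        # just landed in the cache; map that back to a cache URL for downstream use.
        | "ToCacheUrl" >> beam.Map(lambda msg: f"gs://{msg['bucket']}/{msg['name']}")
        | "Debug" >> beam.Map(print)
    )
```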
