Streaming pipelines subscribed to source data arriving in a cache bucket #598
Comments
Love this direction. Happy to discuss another time (I see a meeting link in
another thread). Quick question for now: why not use Beam's built-in
streaming primitives? The KISS solution here, IMO, is to build on
Beam's Pub/Sub and Kafka connectors to react to bucket event updates.
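For concreteness, here's a minimal sketch of what that could look like. This is not pangeo-forge code; it assumes a GCS bucket already configured to publish `OBJECT_FINALIZE` notifications to a Pub/Sub topic, and the project/subscription names are hypothetical:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run as a streaming pipeline

with beam.Pipeline(options=options) as p:
    (
        p
        # Subscribe to bucket event notifications delivered via Pub/Sub.
        | "ReadBucketEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/cache-bucket-events"
        )
        | "ParseNotification" >> beam.Map(json.loads)
        # GCS JSON notifications carry the object path in the "name" field.
        | "CachedObjectName" >> beam.Map(lambda event: event["name"])
        | "Log" >> beam.Map(print)
    )
```

Downstream transforms would then react to each newly cached object as it arrives, rather than polling the bucket.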
On Fri, Sep 1, 2023 at 4:35 PM Charles Stern wrote:

Over in leap-stc/data-management#49 (comment), @alxmrs suggested this as a way
to integrate Pangeo Forge and weather-dl
(https://github.com/google/weather-tools/tree/main/weather_dl#readme).
This seems like a *great* pattern to adopt, and would be broadly useful
for any slow caching operation (any out-of-band caching operation,
not exclusively weather-dl). ECMWF serves here as a motivating (extreme)
example of a generalized problem for which this could be a very desirable
solution.
From his experience with stream windowing in the context of
https://github.com/bytewax/bytewax, @rabernat suggested a very elegant idea:
timers could be configured such that the timestamps used to label events are not
the wall time when the data arrives in the cache, but rather a key
corresponding to the indexed position the cached data represents in the
FilePattern. (A less general case would be the timestep the data represents
in the concat dim. We could start this way, using a concat-only recipe, but
to accommodate n-dimensionality it would probably need to be an
n-dimensional key, not unlike what is used for the Rechunk transform's
GroupByKey.)
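As a rough illustration of the timestamp idea (hypothetical; `index_for` is an assumed helper, and the `part-NNNN` naming convention is invented for the example):

```python
import re

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


def index_for(object_name: str) -> int:
    # Assumed convention: cached objects are named like ".../part-0017.nc",
    # where 0017 is the file's position along the concat dimension.
    return int(re.search(r"part-(\d+)", object_name).group(1))


def with_logical_timestamp(object_name: str) -> TimestampedValue:
    # Use the FilePattern index (not wall-clock arrival time) as the event
    # timestamp, so Beam's windows and timers fire in "pattern index space".
    return TimestampedValue(object_name, index_for(object_name))


# ... | "LogicalTime" >> beam.Map(with_logical_timestamp) | ...
```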
Processing would then be configured to begin once a logically complete set
of IndexedPositions was cached; for the first processing group in the
stream, that would be the same set as would otherwise be generated by
pattern.items(). Subsequent triggers could be for smaller append-able
units.
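A sketch of how that completeness trigger might look as a stateful DoFn (illustrative only; `EXPECTED` stands in for the size of the first complete group, e.g. `len(list(pattern.items()))`, and all names are made up):

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.userstate import BagStateSpec

EXPECTED = 24  # hypothetical: number of IndexedPositions in the first group


class EmitWhenComplete(beam.DoFn):
    """Buffer cached-file events per key; emit once the group is complete."""

    CACHED = BagStateSpec("cached", StrUtf8Coder())

    def process(self, element, cached=beam.DoFn.StateParam(CACHED)):
        key, object_name = element
        cached.add(object_name)
        so_far = list(cached.read())
        if len(so_far) >= EXPECTED:
            cached.clear()
            yield (key, sorted(so_far))  # a logically complete, processable set


# usage (stateful DoFns require keyed input):
# events | beam.Map(lambda name: ("initial-group", name)) | beam.ParDo(EmitWhenComplete())
```

Later, smaller append-able groups could reuse the same machinery with a smaller expected count per key.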
xref #447 for appending and #570 for caching
💯 we should do this. In which case this may be more a matter of documenting best practices rather than adding (much, any?) code here.