Making appending work in the beam refactor #447
Comments
Darshan is working on an implementation of this feature. However, it occurs to us that this may be made more complicated by the existence of consolidated dimensions (like time): #556 (comment) @rabernat or @cisaacstern: Any pointers or ideas on how the two features could be compatible? |
@alxmrs & @DarshanSP19 thanks for looking into this! As a naive starting place, could we just say that appending is incompatible with consolidated dims? And then deal with the more complex case of appending + consolidated dims in a follow-on PR? |
@alxmrs & @DarshanSP19, before we get too deep into implementation here, @rabernat requested that we do a design review. Could I ask we all fill out the following when2meet with availability: https://www.when2meet.com/?21157486-cpxSV ? Note the poll covers:
@DarshanSP19 I'm not sure what timezone you are in, if the time ranges in the poll need to be adjusted please let me know. |
Quick re-ping for @rabernat and @DarshanSP19 to complete the when2meet. I was going to suggest we just meet about this during the Pangeo Forge Coordination Meeting at 11am ET on Sept 25, but based on your when2meet response, it looks like you are not available then, @alxmrs? |
Thanks for the reminder. I will fill this out soon. I think I only submitted a partial schedule to the when2meet. I am happy to meet in our coordination meeting. It’s worth noting that Darshan is on IST.
|
Charles, would you mind sending out another When2Meet with earlier times as options? I'd like to better accommodate Darshan's availability (IST time zone). |
Apologies for the delay on this. Here is a new poll, with times available in the range of:
@alxmrs & @DarshanSP19, does this look workable? |
Looks great to me! Thanks for your flexibility in times. I submitted my availability. |
Thanks all for filling out the poll. I've emailed an invite (to Alex and Ryan):
@DarshanSP19 could you share your email so I can add you to the invite? Either here or [email protected]. |
This issue is for @alxmrs, who indicated some willingness to work on it.
In order to append to existing Zarr datasets, we need to update the `StoreToZarr` PTransform so that it can handle an existing dataset at the `target_url`.
That PTransform first figures out the schema of the input data, then passes this schema to `PrepareZarrTarget`:
pangeo-forge-recipes/pangeo_forge_recipes/transforms.py
Lines 205 to 227 in 0e16f15
which ultimately calls the function `schema_to_zarr`:
pangeo-forge-recipes/pangeo_forge_recipes/aggregation.py
Lines 233 to 242 in 0e16f15
One way to implement this would be to add an `append` option to this function. If active, we could first try to open the target Zarr dataset and check whether it is compatible with the schema (e.g. same variables and dimensions, other than the concat dim). If so, we could skip initializing / overwriting it. Instead, we would simply have to resize the arrays (e.g. using the zarr resize function). The original Zarr dataset size would be used to determine an offset for future writes. (Note there is a tricky edge case if the offset is not evenly divisible by the target chunk size.)
The other change that would have to be made is in `store_dataset_fragment`:
pangeo-forge-recipes/pangeo_forge_recipes/writers.py
Lines 54 to 76 in 0e16f15
We would need some way to pass the offset (determined in `schema_to_zarr` as described above) through the pipeline so that the new data is written at the correct location.
xref:
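To make the append proposal concrete, here is a minimal sketch of the bookkeeping it implies: checking compatibility, deriving the offset and the resized length, flagging the chunk-alignment edge case, and shifting a fragment's write region. All function names here (`check_compatible`, `plan_append`, `shift_region`) are hypothetical and do not exist in pangeo-forge-recipes. Dimensions are modeled as plain dicts of sizes rather than the real schema object. An actual implementation would presumably resize the Zarr arrays and thread the offset through the Beam pipeline.

```python
from typing import Dict, Tuple


def check_compatible(existing: Dict[str, int], incoming: Dict[str, int], concat_dim: str) -> None:
    """Raise if the target and the new schema differ in anything but concat_dim.

    Hypothetical helper: dims are modeled as {name: size} dicts (an assumption).
    """
    if set(existing) != set(incoming):
        raise ValueError(f"dimension names differ: {sorted(existing)} vs {sorted(incoming)}")
    for name, size in existing.items():
        if name != concat_dim and incoming[name] != size:
            raise ValueError(f"dimension {name!r} has size {incoming[name]}, expected {size}")


def plan_append(
    existing: Dict[str, int],
    incoming: Dict[str, int],
    concat_dim: str,
    chunk_size: int,
) -> Tuple[int, int, bool]:
    """Return (offset, new_length, chunk_aligned) for an append along concat_dim."""
    check_compatible(existing, incoming, concat_dim)
    offset = existing[concat_dim]                # new data starts where old data ends
    new_length = offset + incoming[concat_dim]   # length to resize the arrays to
    chunk_aligned = offset % chunk_size == 0     # False marks the tricky edge case
    return offset, new_length, chunk_aligned


def shift_region(region: slice, offset: int) -> slice:
    """Shift a fragment's write region along the concat dim by the append offset."""
    return slice(region.start + offset, region.stop + offset)


# Usage: append 31 time steps to a target that already holds 365.
existing = {"time": 365, "lat": 180, "lon": 360}
incoming = {"time": 31, "lat": 180, "lon": 360}
offset, new_length, chunk_aligned = plan_append(existing, incoming, "time", chunk_size=30)
first_fragment = shift_region(slice(0, 10), offset)
```

In this example the offset (365) is not a multiple of the chunk size (30), so `chunk_aligned` is `False`: the first appended fragment would share a chunk with existing data, which is the edge case noted above.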