Append mode for StoreToZarr #721
Conversation
This now works! The best "documentation" of it so far is the end-to-end test. Tagged a bunch of you for review, for visibility. Also @thodson-usgs since I know you're interested in this feature. To state the (presumably) obvious, this minimal implementation requires that the user/deployer know what range of data should be appended to the existing dataset, and does not provide any idempotency guarantees or much in the way of logical checks. It will happily append time ranges that do not monotonically increase from the existing concat dim, or re-append the same data multiple times. But because resizing of the concat dimension is done with
Awesome work Charles! It's great to see this feature added. Do you plan to add a section to the docs?
```python
self._append_offset = 0
if self.append_dim:
    logger.warn(
        "When `append_dim` is given, StoreToZarr is NOT idempotent. Successive deployment "
```
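The `_append_offset` bookkeeping initialized above can be illustrated with plain arithmetic: indices of incoming fragments along the append dim are shifted by the current size of the existing target. (`shifted_region` is a hypothetical helper for illustration, not the actual implementation.)

```python
def shifted_region(fragment_start: int, fragment_stop: int, append_offset: int) -> slice:
    """Shift a fragment's index range along the append dim by the size
    of the existing target dataset (the append offset)."""
    return slice(fragment_start + append_offset, fragment_stop + append_offset)

# Existing store has 100 steps along `time`; a new fragment covering
# local indices [0, 10) lands at [100, 110) in the resized store.
region = shifted_region(0, 10, append_offset=100)
print(region)  # slice(100, 110, None)
```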
In future PRs to this, do you see a good path forward to making append safe/idempotent?
Good question! For future readers, we had a good discussion of this in today's Coordination meeting, minutes here.
```diff
@@ -362,6 +369,7 @@ def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
     attrs=self.attrs,
     encoding=self.encoding,
     consolidated_metadata=False,
+    append_dim=self.append_dim,
```
Amazing how easy it was to modify this step to allow appending.
@norlandrhagen how does aaa5a69 seem? Rendered version here. Edit: noticed a small rendering issue, fixed that just now.
```python
if append_dim:
    # if appending, only keep schema for coordinate to append. if we don't drop other
    # coords, we may end up overwriting existing data on the `ds.to_zarr` call below.
    schema["coords"] = {k: v for k, v in schema["coords"].items() if k == append_dim}
```
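The coord-dropping is just a dict filter; a toy sketch of its effect (illustrative schema values only, not the real schema format):

```python
# Toy schema mimicking the structure filtered above: after the filter,
# only the coordinate being appended along survives.
schema = {
    "coords": {
        "time": {"shape": (10,)},
        "lat": {"shape": (180,)},
        "lon": {"shape": (360,)},
    }
}
append_dim = "time"
schema["coords"] = {k: v for k, v in schema["coords"].items() if k == append_dim}
print(list(schema["coords"]))  # ['time']
```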
Just realizing that this, while seemingly necessary to avoid overwriting non-`append_dim` coords, may prevent us from benefiting from the dimension consistency checks offered by `ds.to_zarr`... we may need to re-implement such checks in `StoreToZarr.__post_init__` if we want them.
Thinking about this further... in `StoreToZarr.__post_init__` we actually don't know what the coordinates in the new data are (aside from the user-provided `append_dim`), because we haven't actually fetched and opened any source files yet. So if we want to check for consistency, we could actually do it right here: before dropping non-append coords from the schema, we would just need to open the target dataset and check that all coords match first. Could be a good idea.
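A consistency check of the sort described could look roughly like this. This is a hedged sketch, not the PR's code: `check_coords_match` is a hypothetical helper, and in practice the target would be opened with `xr.open_zarr` rather than constructed in memory.

```python
import numpy as np
import xarray as xr

def check_coords_match(target: xr.Dataset, new: xr.Dataset, append_dim: str) -> None:
    """Raise if any non-append coordinate in the new data disagrees with
    the existing target (hypothetical helper, not an actual API)."""
    for name, coord in new.coords.items():
        if name == append_dim:
            continue  # the append dim is expected to differ
        if name not in target.coords:
            raise ValueError(f"coord {name!r} missing from target")
        if not np.array_equal(target.coords[name].values, coord.values):
            raise ValueError(f"coord {name!r} does not match target")

target = xr.Dataset(coords={"time": [0, 1], "lat": [10.0, 20.0]})
new_ok = xr.Dataset(coords={"time": [2, 3], "lat": [10.0, 20.0]})
check_coords_match(target, new_ok, append_dim="time")  # passes silently
```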
Having just reviewed xarray's checks again, I actually don't see that this is done there, so it may be overly strict. I think we can move forward with what we have for now.
Towards #447
This first pass focuses just on `StoreToZarr` (not reference recipes). Will open for review once it looks like it's working.