Append mode for StoreToZarr #721
Conversation
This now works! The best "documentation" of it so far is the end-to-end test. Tagged a bunch of you for review, for visibility. Also @thodson-usgs since I know you're interested in this feature. To state the (presumably) obvious, this minimal implementation requires that the user/deployer know what range of data should be appended to the existing dataset, and does not provide any idempotency guarantees or much in the way of logical checks. It will happily append time ranges that do not monotonically increase from the existing concat dim, or re-append the same data multiple times. But because resizing of the concat dimension is done with
Awesome work Charles! It's great to see this feature added. Do you plan to add a section to the docs?
```python
self._append_offset = 0
if self.append_dim:
    logger.warn(
        "When `append_dim` is given, StoreToZarr is NOT idempotent. Successive deployment "
```
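The `_append_offset` bookkeeping initialized above can be illustrated with plain arithmetic: indices of incoming fragments along the append dim are shifted by the current size of the existing target. (`shifted_region` is a hypothetical helper for illustration, not the actual implementation.)

```python
def shifted_region(fragment_start: int, fragment_stop: int, append_offset: int) -> slice:
    """Shift a fragment's index range along the append dim by the size
    of the existing target dataset (the append offset)."""
    return slice(fragment_start + append_offset, fragment_stop + append_offset)

# Existing store has 100 steps along `time`; a new fragment covering
# local indices [0, 10) lands at [100, 110) in the resized store.
region = shifted_region(0, 10, append_offset=100)
print(region)  # slice(100, 110, None)
```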
In future PRs to this, do you see a good path forward to making append safe/idempotent?
Good question! For future readers, we had a good discussion of this in today's Coordination meeting, minutes here.
```diff
@@ -362,6 +369,7 @@ def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
     attrs=self.attrs,
     encoding=self.encoding,
     consolidated_metadata=False,
+    append_dim=self.append_dim,
```
Amazing how easy it was to modify this step to allow appending.
@norlandrhagen how does aaa5a69 seem? Rendered version here. Edit: noticed a small rendering issue, fixed that just now.
```python
if append_dim:
    # if appending, only keep schema for coordinate to append. if we don't drop other
    # coords, we may end up overwriting existing data on the `ds.to_zarr` call below.
    schema["coords"] = {k: v for k, v in schema["coords"].items() if k == append_dim}
```
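The coord-dropping is just a dict filter; a toy sketch of its effect (illustrative schema values only, not the real schema format):

```python
# Toy schema mimicking the structure filtered above: after the filter,
# only the coordinate being appended along survives.
schema = {
    "coords": {
        "time": {"shape": (10,)},
        "lat": {"shape": (180,)},
        "lon": {"shape": (360,)},
    }
}
append_dim = "time"
schema["coords"] = {k: v for k, v in schema["coords"].items() if k == append_dim}
print(list(schema["coords"]))  # ['time']
```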
Just realizing that this, while seemingly necessary to avoid overwriting non-`append_dim` coords, may prevent us from benefiting from the dimension consistency checks offered by `ds.to_zarr`... we may need to re-implement such checks in `StoreToZarr.__post_init__` if we want them.
Thinking about this further... in `StoreToZarr.__post_init__` we actually don't know what the coordinates in the new data are (aside from the user-provided `append_dim`), because we haven't actually fetched and opened any source files yet. So if we want to check for consistency, we could actually do it right here: before dropping non-append coords from the schema, we would just need to open the target dataset and check that all coords match first. Could be a good idea.
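A consistency check of the sort described could look roughly like this. This is a hedged sketch, not the PR's code: `check_coords_match` is a hypothetical helper, and in practice the target would be opened with `xr.open_zarr` rather than constructed in memory.

```python
import numpy as np
import xarray as xr

def check_coords_match(target: xr.Dataset, new: xr.Dataset, append_dim: str) -> None:
    """Raise if any non-append coordinate in the new data disagrees with
    the existing target (hypothetical helper, not an actual API)."""
    for name, coord in new.coords.items():
        if name == append_dim:
            continue  # the append dim is expected to differ
        if name not in target.coords:
            raise ValueError(f"coord {name!r} missing from target")
        if not np.array_equal(target.coords[name].values, coord.values):
            raise ValueError(f"coord {name!r} does not match target")

target = xr.Dataset(coords={"time": [0, 1], "lat": [10.0, 20.0]})
new_ok = xr.Dataset(coords={"time": [2, 3], "lat": [10.0, 20.0]})
check_coords_match(target, new_ok, append_dim="time")  # passes silently
```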
Having just reviewed xarray's checks again, I actually don't see that this is done there, so it may be overly strict. I think we can move forward with what we have for now.
Towards #447
This first pass focuses just on `StoreToZarr` (not reference recipes). Will open for review once it looks like it's working.