WIP: Add virtual-rechunk example #520
base: main
Conversation
This is super cool @thodson-usgs! Excited to see a rechunking <-> virtualizarr <-> cubed example.
@thodson-usgs have you tried a version of this with rechunker?
I haven't. Cubed uses the rechunker algorithm, but this could help to isolate the problem. I'll add that the workflow runs fine with Dask.
Looking at this a bit more closely @thodson-usgs... So you're doing three pretty unrelated things in this example: …
That is awesome (especially (2), which I haven't even tried to do myself yet), but I think we should make sure each of these steps works individually first. I still think that putting this rechunking example in the Cubed repo is the right call though, to show how Cubed is basically a superset of Rechunker. But perhaps it should be broken up - i.e. have an example of (1) and (2) together in the virtualizarr library, and the example of (3) lives here. Then if you want a full end-to-end notebook example, that's maybe more of a Pythia-level thing. But the important thing is to get it working first. Also note that you could test cubed's rechunk just by using … Speaking of which, when I try to use …
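For intuition, the contract being discussed here is small: a rechunk preserves the array's values while regridding its chunk boundaries. The toy sketch below is my own illustration, not cubed's or rechunker's implementation — real engines stream chunk-to-chunk through bounded memory rather than materializing the whole array — but the input/output contract is the same:

```python
import itertools
import numpy as np

def rechunk(chunks: dict, shape: tuple, old: tuple, new: tuple) -> dict:
    """Toy rechunk: reassemble the full array, then re-split on a new grid.

    Real engines (rechunker, cubed) never materialize the full array;
    this is only a sketch of the chunk-index -> block contract.
    """
    # reassemble the full array from the old chunk grid
    full = np.zeros(shape)
    for idx, block in chunks.items():
        sl = tuple(slice(i * c, i * c + b) for i, c, b in zip(idx, old, block.shape))
        full[sl] = block
    # re-split on the new chunk grid
    out = {}
    for starts in itertools.product(*(range(0, s, c) for s, c in zip(shape, new))):
        idx = tuple(s // c for s, c in zip(starts, new))
        out[idx] = full[tuple(slice(s, s + c) for s, c in zip(starts, new))]
    return out

# a 4x4 array stored as two (2, 4) chunks, rewritten as two (4, 2) chunks
data = np.arange(16.0).reshape(4, 4)
src = {(0, 0): data[0:2, :], (1, 0): data[2:4, :]}
dst = rechunk(src, (4, 4), (2, 4), (4, 2))
assert np.array_equal(dst[(0, 1)], data[:, 2:4])
```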
Debugging this has been tricky. Invariably, something is inconsistent with my environment and the behavior changes. All that is good justification for breaking up the problem, which I'm happy to do in the long run. In the meantime, I'll try all your good suggestions. On your last point, I'd reiterate that the workflow runs if I comment out one line, `chunked_array_type='cubed',`, which makes me think that this bug is not in lithops, virtualizarr, kerchunk, or dask...
Okay what. That's really surprising to me and breaks my mental model of what's going on here.
Yes, let's try to get …
Thanks for opening this @thodson-usgs! Using a local Lithops executor (or even just the default single-threaded Cubed executor on your local machine) might be helpful in isolating where the problem is.
Indeed, there was an issue with my cubed environment, and I needed to re-pass my … Following @tomwhite's suggestion, I tried:

```python
import xarray as xr
from cubed import Spec

spec = Spec(work_dir="tmp", allowed_mem="2GB")

combined_ds = xr.open_dataset('combined.json',
                              engine="kerchunk",
                              chunks={},
                              chunked_array_type='cubed',  # optional so long as the spec is passed to .chunk
                              from_array_kwargs={'spec': spec},  # optional
                              )

combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 5, 'south_north': 25, 'west_east': 32},
    chunked_array_type='cubed',
    from_array_kwargs={'spec': spec},
)

# this succeeds
rechunked_ds.compute()

# but this fails
rechunked_ds.to_zarr('rechunked.zarr',
                     mode='w',
                     encoding={},  # TODO
                     consolidated=True,
                     safe_chunks=False,
                     )
```

which complains:

```
ValueError: Arrays must have same spec in single computation. Specs: [cubed.Spec(work_dir=tmp, allowed_mem=2000000000, reserved_mem=0, executor=None, storage_options=None), cubed.Spec(work_dir=tmp, allowed_mem=2000000000, reserved_mem=0, executor=None, storage_options=None), cubed.Spec(work_dir=None, allowed_mem=200000000, reserved_mem=100000000, executor=None, storage_options=None)]
```

Apparently … If I comment every instance of … Anyway, I don't think this is the bug I'm looking for, so I'll keep at it.
🙏 for helping us surface these issues @thodson-usgs!
That makes a lot more sense, but also it sounds like potentially a bug with xarray's …
Again this could be another problem with the …
You should be able to pass cubed-related arguments via the …
After resolving the regression in the recent version of VirtualiZarr, we still hit an error during the final rechunk operation that runs within … The script succeeds for certain cases:
```yaml
spec:
  work_dir: "tmp"
  allowed_mem: "2GB"
```

But it fails for the Lambda executor, both for the virtual dataset AND … Furthermore, I don't have a great strategy for debugging a Lambda-only failure mode. In the meantime, I'll carefully rebuild my runtime environment.

Error messages

For the virtual dataset:

Traceback (most recent call last):
File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-only.py", line 29, in <module>
rechunked_ds.to_zarr("rechunked.zarr",
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
writes = writer.sync(
^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
delayed_store = chunkmanager.store(
^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
return store(
^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
plan.execute(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
executor.execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
for _, stats in map_unordered(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 119, in map_unordered
future.status(throw_except=True)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 83, in status
reraise(*self.response_future._exception)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/six.py", line 719, in reraise
raise value
File "/function/lithops/worker/jobrunner.py", line 210, in run
File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 33, in make_instance
File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 81, in __call__
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 713, in __init__
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 59, in __iter__
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 141, in __getattr__
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 148, in setup
File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 773, in cat_file
File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 1303, in open
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 191, in _open
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 355, in __init__
File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 360, in _open
FileNotFoundError: [Errno 2] No such file or directory: '/function/combined.json/.zmetadata'

For the `mf-rechunk` script:

Traceback (most recent call last):
File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/mf-rechunk.py", line 46, in <module>
rechunked_ds.to_zarr("rechunked.zarr",
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
writes = writer.sync(
^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
delayed_store = chunkmanager.store(
^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
return store(
^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
plan.execute(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
executor.execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
for _, stats in map_unordered(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 119, in map_unordered
future.status(throw_except=True)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 83, in status
reraise(*self.response_future._exception)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/six.py", line 718, in reraise
raise value.with_traceback(tb)
File "/function/lithops/worker/jobrunner.py", line 210, in run
File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 2570, in __setstate__
File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 170, in __init__
File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 193, in _load_metadata
File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 204, in _load_metadata_nosync
zarr.errors.ArrayNotFoundError: array not found at path %r' "array not found at path %r' 'SNOWH'"

Here the error was associated with the `SNOWH` variable, but the variable changes from run to run.
Does it work with Cubed running on local Lithops (https://lithops-cloud.github.io/docs/source/compute_config/localhost.html)?
Good suggestion. This fails, but it does reproduce the JSON serialization error. I suppose the next step is to try entering the debugger within the container runtime, or else litter print statements into the source until I've isolated the error. Should I dig into …?

Traceback (most recent call last):
File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-only.py", line 29, in <module>
rechunked_ds.to_zarr("rechunked.zarr",
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
writes = writer.sync(
^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
delayed_store = chunkmanager.store(
^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
return store(
^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
plan.execute(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
executor.execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
execute_dag(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
for _, stats in map_unordered(
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 93, in map_unordered
futures = lithops_function_executor.map(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 120, in map
futures_list = self.executor.map(
^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/executors.py", line 276, in map
futures = self.invoker.run_job(job)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 268, in run_job
futures = self._run_job(job)
^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 210, in _run_job
raise e
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 207, in _run_job
self._invoke_job(job)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 255, in _invoke_job
activation_id = self.compute_handler.invoke(payload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/localhost/v2/localhost.py", line 140, in invoke
self.env.run_job(job_payload)
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/localhost/v2/localhost.py", line 225, in run_job
self.work_queue.put(json.dumps(task_payload))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 200, in encode
chunks = self.iterencode(o, _one_shot=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 258, in iterencode
return _iterencode(o, 0)
^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 180, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type slice is not JSON serializable
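The final `TypeError` is reproducible in isolation: the stdlib `json` module has no encoding for `slice` objects, so any task payload containing one dies in `json.dumps`. The tag-encoding workaround below is only a hypothetical sketch, not what Lithops actually does:

```python
import json

# reproduce the failure: slices are not JSON serializable
try:
    json.dumps({"selection": slice(0, 10, 2)})
except TypeError as e:
    print(e)  # Object of type slice is not JSON serializable

class SliceEncoder(json.JSONEncoder):
    """Encode slice objects as a tagged triple so a decoder could restore them."""
    def default(self, o):
        if isinstance(o, slice):
            return {"__slice__": [o.start, o.stop, o.step]}
        return super().default(o)

payload = json.dumps({"selection": slice(0, 10, 2)}, cls=SliceEncoder)
print(payload)  # {"selection": {"__slice__": [0, 10, 2]}}
```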
Can you try with Lithops localhost version 1? That's what we use in the Cubed unit tests, as version 2 had some problems when we last tried it. Another thing to try would be the Cubed … Is the data to run this publicly available? I could try to look at it next week if there's a minimal reproducible example.
Thanks for the suggestion, but neither worked for me. To replicate the error, download the data:

```shell
aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc
```

then run this:

```python
import xarray as xr

combined_ds = xr.open_dataset("WRFDS_2060-01-01.nc", chunks={})
combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 1, 'south_north': 25, 'west_east': 32},
    chunked_array_type="cubed",
)

rechunked_ds.to_zarr("rechunked.zarr",
                     mode="w",
                     consolidated=True,
                     safe_chunks=False,
                     )
```
Thanks for the instructions @thodson-usgs! I have managed to reproduce this, getting the … With the processes executor I got a different error: … When I remove these kwargs in a local copy of cubed-xarray, the rechunk works using the local processes executor. I will look into writing a proper fix.
Yay progress! Thank you @tomwhite!
Ah my bad for apparently not properly testing that.
Is that cubed's fault for not understanding dask-relevant kwargs, xarray's for passing them, or cubed-xarray somehow? We should raise an issue specifically to track this bug.
Not sure. I have opened cubed-dev/cubed-xarray#14 to show a possible fix.
PEBKAC! This may have worked all along. The problem was I was passing local file paths off to Lambda. I just moved the kerchunk JSON and target Zarr to S3, and everything ran, albeit slowly. I should up the chunk sizes, but then this should be ready. I'll add a bit more explanation, but a minimal example is given below. We might reference the …

```python
import xarray as xr
import fsspec

combined_ds = xr.open_dataset("s3://wma-uncertainty/scratch/combined.json",
                              engine="kerchunk",
                              chunks={},
                              chunked_array_type="cubed",
                              )

combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 5, 'south_north': 25, 'west_east': 32},
    chunked_array_type="cubed",
)

target = fsspec.get_mapper("s3://wma-uncertainty/scratch/rechunked.zarr",
                           client_kwargs={'region_name': 'us-west-2'})

rechunked_ds.to_zarr(target,
                     mode="w",
                     consolidated=True,
                     safe_chunks=False,
                     )
```
@TomNicholas (cc @norlandrhagen), seems like a regression somewhere, but I haven't been able to track it down. Any ideas looking at this traceback?

Traceback (most recent call last):
File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-virtual-zarr.py", line 29, in <module>
rechunked_ds.to_zarr(
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/api.py", line 2239, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/api.py", line 1919, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 844, in store
variables_encoded, attributes = self.encode(
^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/common.py", line 312, in encode
variables = {k: self.encode_variable(v) for k, v in variables.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/common.py", line 312, in <dictcomp>
variables = {k: self.encode_variable(v) for k, v in variables.items()}
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 778, in encode_variable
variable = encode_zarr_variable(variable)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 449, in encode_zarr_variable
var = conventions.encode_cf_variable(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/conventions.py", line 195, in encode_cf_variable
var = coder.encode(var, name=name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/coding/variables.py", line 424, in encode
data = duck_array_ops.fillna(data, fill_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/duck_array_ops.py", line 367, in fillna
return where(notnull(data), data, other)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/duck_array_ops.py", line 354, in where
return xp.where(condition, *as_shared_dtype([x, y], xp=xp))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/array_api/searching_functions.py", line 42, in where
return elemwise(nxp.where, condition, x1, x2, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 414, in elemwise
return blockwise(
^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 237, in blockwise
chunkss, arrays = unify_chunks(*args)
^^^^^^^^^^^^^^^^^^^
File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 1413, in unify_chunks
nameinds.append((a.name, ind))
^^^^^^
AttributeError: 'numpy.float32' object has no attribute 'name'
```python
ds = futures.get_result()

# Save the virtual zarr manifest
ds.virtualize.to_kerchunk(f"combined.json", format="json")
```
I think we should be able to write the ref directly to s3, if not, we should figure out why!
I'll open the issue.
ty!
Also, might be worth trying parquet or icechunk if you're willing to venture into the bleeding edge.
Second using icechunk instead of kerchunk at this point.
Is the reference file from part 1 available somewhere? A few thoughts (bearing in mind I'm a total lithops/cubed noob):

```python
ds = open_dataset(<ref.json>, engine='kerchunk', chunked_array_type="cubed")
rechunked_ds = ds.chunk({'Time': 5, 'south_north': 25, 'west_east': 32}, chunked_array_type="cubed")
rechunked_ds.to_zarr(...)
```
Rechunk a virtual dataset
This example demonstrates how to rechunk a collection of netCDF files on S3 into a single Zarr store.
First, Lithops and VirtualiZarr construct a virtual dataset comprising the netCDF files on S3. Then, cubed-xarray rechunks the virtual dataset into a Zarr store. Inspired by the Pythia cookbook by @norlandrhagen.
STATUS
I'm pretty sure I got this workflow to work, albeit slowly; however, now I'm getting a new AttributeError. Details below.
PLANNING
Rechunking has been a thorn in the side for many of us, and I think there's general interest in a serverless workflow. It remains to be seen whether this example should live as part of `cubed` or as part of a Pangeo community of practice. Once this example is working again, the next two steps are: … as `rechunker` does.