WIP: Add virtual-rechunk example #520

Draft · thodson-usgs wants to merge 7 commits into main

Conversation

@thodson-usgs (Contributor) commented Jul 25, 2024

Rechunk a virtual dataset

This example demonstrates how to rechunk a collection of netCDF files on S3 into a single Zarr store.

First, lithops and VirtualiZarr construct a virtual dataset comprising the netCDF files on S3. Then, cubed-xarray rechunks the virtual dataset into a Zarr store. Inspired by the Pythia cookbook by @norlandrhagen.
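A minimal sketch of that first stage is below; the file URIs and date range are illustrative, and the exact script lives in this PR's files:

import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

def map_references(uri):
    # Build a reference-only (virtual) dataset for one netCDF file on S3.
    return open_virtual_dataset(uri, indexes={})

uris = [
    f"s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-{day:02d}.nc"
    for day in range(1, 32)
]

fexec = lithops.FunctionExecutor()  # one serverless worker per file
futures = fexec.map(map_references, uris)
virtual_datasets = futures.get_result()

# Concatenate the virtual datasets along time and save the reference manifest.
combined = xr.combine_nested(virtual_datasets, concat_dim="Time",
                             coords="minimal", compat="override")
combined.virtualize.to_kerchunk("combined.json", format="json")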

STATUS

I'm pretty sure I got this workflow to work, albeit slowly; however, now I'm getting a new AttributeError. Details below.

PLANNING

Rechunking has been a thorn in the side for many of us, and I think there's general interest in a serverless workflow. It remains to be seen whether this example should live as part of cubed or as part of a pangeo community of practice. Once this example is working again, the next two steps are:

  1. Increase the chunk size to ~100 MB, which might involve finding a better demo dataset; the current demo chunks are too small to be performant.
  2. Explore how difficult it would be to alter cubed's rechunk algorithm such that each worker writes multiple chunks, just as rechunker does.

@norlandrhagen (Contributor)

This is super cool @thodson-usgs! Excited to see a rechunking <-> virtualizarr <-> cubed example.

@norlandrhagen (Contributor)

@thodson-usgs have you tried a version of this with rechunker?

@thodson-usgs (Contributor, Author)

@thodson-usgs have you tried a version of this with rechunker?

I haven't. Cubed uses the rechunker algorithm, but this could help to isolate the problem. I'll add that the workflow runs fine with Dask.

@TomNicholas (Member)

Looking at this a bit more closely @thodson-usgs... you're doing three pretty unrelated things in this example:

  1. Using virtualizarr on your specific netCDF files
  2. Using lithops to do the concatenation of virtualizarr virtual datasets
  3. Using cubed to perform a rechunk

That is awesome (especially (2), which I haven't even tried to do myself yet), but I think we should make sure each of these steps works individually first.

The FileNotFoundError you're currently seeing is entirely a virtualizarr / kerchunk issue, nothing to do with Cubed.

I still think that putting this rechunking example in the Cubed repo is the right call, though, to show how Cubed is basically a superset of Rechunker. But perhaps it should be broken up, i.e. have an example of (1) and (2) together in the virtualizarr library, while example (3) lives here. Then if you want a full end-to-end notebook example, that's maybe more of a pythia-level thing. But the important thing is to get it working first.

Also note that you could test cubed's rechunk just by using xr.open_mfdataset to open your netcdf files. That would allow you to bypass any problems with virtualizarr until they are fixed upstream.
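For example, something like this (a sketch; the glob pattern is illustrative):

import xarray as xr

# Open the netCDF files directly, with no virtualizarr in the loop,
# to exercise cubed's rechunk in isolation.
ds = xr.open_mfdataset(
    "WRFDS_*.nc",
    combine="by_coords",
    chunks={},
    chunked_array_type="cubed",
)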

Speaking of which, when I try to use open_virtual_dataset on just one of your netcdf files, save it as kerchunk references, and open it with engine='kerchunk', I get a different error. I'll raise an issue on virtualizarr to track that (thanks for making your example so easily reproducible!)

@TomNicholas added the bug, xarray-integration, and upstream labels on Jul 25, 2024
@thodson-usgs (Contributor, Author)

The FileNotFoundError you're currently seeing is entirely a virtualizarr / kerchunk issue, nothing to do with Cubed.

Debugging this has been tricky. Invariably something is inconsistent with my environment and the behavior changes. All that is good justification for breaking up the problem, which I'm happy to do in the long run. In the meantime, I'll try all your good suggestions.

On your last point, I'd reiterate that the workflow runs if I comment out one line:

   # chunked_array_type='cubed',

which makes me think that this bug is not in lithops, virtualizarr, kerchunk, or dask...

@TomNicholas (Member)

On your last point, I'd reiterate that the workflow runs if I comment out one line:

Okay what. That's really surprising to me and breaks my mental model of what's going on here.

breaking up the problem

Yes, let's try to get xr.open_dataset/open_mfdataset + cubed working without virtualizarr, and get virtualizarr working without lithops or cubed (i.e. zarr-developers/VirtualiZarr#201), then combine them all back together at the end.

@tomwhite (Member)

Thanks for opening this @thodson-usgs!

Using a local Lithops executor (or even just the default single-threaded Cubed executor on your local machine) might be helpful in isolating where the problem is.
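For the local Lithops route, a minimal config sketch, using standard Lithops config keys (e.g. saved as ~/.lithops/config):

lithops:
  backend: localhost
  storage: localhost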

@thodson-usgs (Contributor, Author)

Indeed, there was an issue with my cubed environment, and I needed to pass my spec to .chunk again or else the array was converted back to dask, which led to the previous error. I suspect we might be hitting multiple bugs, so I'll isolate one here.

Following @tomwhite's suggestion, I tried:

import xarray as xr
from cubed import Spec

spec = Spec(work_dir="tmp", allowed_mem="2GB")

combined_ds = xr.open_dataset('combined.json',
                              engine="kerchunk",
                              chunks={},
                              chunked_array_type='cubed',  # optional so long as the spec is passed to .chunk
                              from_array_kwargs={'spec': spec},  # optional
                              )

combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 5, 'south_north': 25, 'west_east': 32},
    chunked_array_type='cubed',
    from_array_kwargs={'spec': spec},
)

# this succeeds
rechunked_ds.compute()

# but this fails
rechunked_ds.to_zarr('rechunked.zarr',
                     mode='w',
                     encoding={},  # TODO
                     consolidated=True,
                     safe_chunks=False,
                     )

which complains

ValueError: Arrays must have same spec in single computation. Specs: [cubed.Spec(work_dir=tmp, allowed_mem=2000000000, reserved_mem=0, executor=None, storage_options=None), cubed.Spec(work_dir=tmp, allowed_mem=2000000000, reserved_mem=0, executor=None, storage_options=None), cubed.Spec(work_dir=None, allowed_mem=200000000, reserved_mem=100000000, executor=None, storage_options=None)]

Apparently to_zarr defaults to another cubed spec, and I can't pass a spec to to_zarr.

If I comment out every instance of from_array_kwargs={'spec': spec} so that everything uses the default spec, the script runs to completion.

Anyway, I don't think this is the bug I'm looking for, so I'll keep at it.

@TomNicholas (Member)

🙏 for helping us surface these issues @thodson-usgs!

Indeed, there was an issue with my cubed environment, and I needed to pass my spec to .chunk again or else the array was converted back to dask, which led to the previous error.

That makes a lot more sense, but also it sounds like potentially a bug with xarray's ChunkManager system (which dispatches to dask or cubed as available).

Apparently to_zarr defaults to another cubed spec

Again this could be another problem with the ChunkManager, or maybe with cubed.

and I can't pass a spec to to_zarr

You should be able to pass cubed-related arguments via the chunkmanager_store_kwargs keyword argument to .to_zarr. Though I'm not sure you want to pass a spec in again.
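A sketch of what that could look like; whether cubed-xarray's store accepts an executor entry here is an assumption:

# Given some cubed executor object `executor` (hypothetical; cubed-xarray's
# store signature determines which keys are actually accepted):
rechunked_ds.to_zarr("rechunked.zarr",
                     mode="w",
                     consolidated=True,
                     safe_chunks=False,
                     chunkmanager_store_kwargs={"executor": executor},
                     )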

@thodson-usgs (Contributor, Author) commented Aug 1, 2024

After resolving the regression in the recent version of VirtualiZarr, we still hit an error during the final rechunk operation that runs within to_zarr(). The same error occurs with .chunk(...).compute().

The script succeeds for certain cases:

  1. substituting a local Dask cluster
  2. or even a local cubed executor like
spec:
  work_dir: "tmp"
  allowed_mem: "2GB"

But it fails on the Lambda executor, for both the virtual dataset AND open_mfdataset(), with different errors.
(At one point I was getting JSON serialization errors in both cases, but I can't reproduce that now.)

Furthermore, I don't have a great strategy for debugging a Lambda-only failure mode. In the meantime, I'll carefully rebuild my runtime environment.

Error messages

For the virtual dataset:

Traceback (most recent call last):
  File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-only.py", line 29, in <module>
    rechunked_ds.to_zarr("rechunked.zarr",
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
    writes = writer.sync(
             ^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
    return store(
           ^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
    compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
    plan.execute(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
    executor.execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
    execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
    for _, stats in map_unordered(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 119, in map_unordered
    future.status(throw_except=True)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 83, in status
    reraise(*self.response_future._exception)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/six.py", line 719, in reraise
    raise value
  File "/function/lithops/worker/jobrunner.py", line 210, in run
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 33, in make_instance
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 81, in __call__
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 713, in __init__
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 59, in __iter__
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 141, in __getattr__
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/reference.py", line 148, in setup
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 773, in cat_file
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 1303, in open
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 191, in _open
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 355, in __init__
  File "/usr/local/lib/python3.11/site-packages/fsspec/implementations/local.py", line 360, in _open
FileNotFoundError: [Errno 2] No such file or directory: '/function/combined.json/.zmetadata'

For the open_mfdataset() case:

Traceback (most recent call last):
  File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/mf-rechunk.py", line 46, in <module>
    rechunked_ds.to_zarr("rechunked.zarr",
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
    writes = writer.sync(
             ^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
    return store(
           ^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
    compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
    plan.execute(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
    executor.execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
    execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
    for _, stats in map_unordered(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 119, in map_unordered
    future.status(throw_except=True)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 83, in status
    reraise(*self.response_future._exception)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "/function/lithops/worker/jobrunner.py", line 210, in run
  File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 2570, in __setstate__
  File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 170, in __init__
  File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 193, in _load_metadata
  File "/usr/local/lib/python3.11/site-packages/zarr/core.py", line 204, in _load_metadata_nosync
zarr.errors.ArrayNotFoundError: array not found at path %r' "array not found at path %r' 'SNOWH'"

Here the error was associated with the SNOWH variable, but the variable changes from run to run.

@tomwhite (Member) commented Aug 1, 2024

Furthermore, I don't have a great strategy for debugging a Lambda-only failure mode.

Does it work with Cubed running on local Lithops (https://lithops-cloud.github.io/docs/source/compute_config/localhost.html)?

@thodson-usgs (Contributor, Author)

Does it work with Cubed running on local Lithops (https://lithops-cloud.github.io/docs/source/compute_config/localhost.html)?

Good suggestion. This fails, but it does reproduce the JSON serialization error. I suppose the next step is to try entering the debugger within the container runtime, or else litter print statements into the source until I've isolated the error. Should I dig into cubed or cubed-xarray first? Do you have a suspect?

Traceback (most recent call last):
  File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-only.py", line 29, in <module>
    rechunked_ds.to_zarr("rechunked.zarr",
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/dataset.py", line 2549, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 1698, in to_zarr
    writes = writer.sync(
             ^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 267, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed_xarray/cubedmanager.py", line 207, in store
    return store(
           ^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/ops.py", line 168, in store
    compute(*arrays, executor=executor, _return_in_memory_array=False, **kwargs)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/array.py", line 282, in compute
    plan.execute(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/core/plan.py", line 212, in execute
    executor.execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 265, in execute_dag
    execute_dag(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 190, in execute_dag
    for _, stats in map_unordered(
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops.py", line 93, in map_unordered
    futures = lithops_function_executor.map(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/cubed/runtime/executors/lithops_retries.py", line 120, in map
    futures_list = self.executor.map(
                   ^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/executors.py", line 276, in map
    futures = self.invoker.run_job(job)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 268, in run_job
    futures = self._run_job(job)
              ^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 210, in _run_job
    raise e
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 207, in _run_job
    self._invoke_job(job)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/invokers.py", line 255, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/localhost/v2/localhost.py", line 140, in invoke
    self.env.run_job(job_payload)
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/lithops/localhost/v2/localhost.py", line 225, in run_job
    self.work_queue.put(json.dumps(task_payload))
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type slice is not JSON serializable

@tomwhite (Member) commented Aug 2, 2024

Can you try Lithops localhost version 1? That's what we use in the Cubed unit tests, as version 2 had some problems when we last tried it.

Another thing to try would be the Cubed processes executor which uses a different serialization mechanism to Lithops. If that worked then it would show that it's something to do with how Lithops is doing serialization.
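For reference, a sketch of selecting that executor through the spec; executor_name as a Spec argument is an assumption about the cubed version in use:

from cubed import Spec

# Run on Cubed's local processes executor instead of Lithops.
spec = Spec(work_dir="tmp", allowed_mem="2GB", executor_name="processes")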

Is the data to run this publicly available? I could try to look at it next week if there's a minimal reproducible example.

@thodson-usgs (Contributor, Author)

Thanks for the suggestion, but neither worked for me.

To replicate the error, download the data:

aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc

then run this example.py:

import xarray as xr

combined_ds = xr.open_dataset("WRFDS_2060-01-01.nc", chunks={})

combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 1, 'south_north': 25, 'west_east': 32},
    chunked_array_type="cubed",
)

rechunked_ds.to_zarr("rechunked.zarr",
                     mode="w",
                     consolidated=True,
                     safe_chunks=False,
                     )

@tomwhite (Member) commented Aug 5, 2024

Thanks for the instructions @thodson-usgs! I have managed to reproduce this, getting the Object of type slice is not JSON serializable error with the local lithops executor.

With the processes executor I got a different error: TypeError: run_func() got an unexpected keyword argument 'lock'. This led me to cubed-xarray's store function, which is being passed various arguments from Xarray that it doesn't know how to handle (namely flush, lock, compute, and regions):
https://github.com/pydata/xarray/blob/1ac19c4231e3e649392add503ae5a13d8e968aef/xarray/backends/common.py#L267-L274

When I remove these kwargs in a local copy of cubed-xarray, the rechunk works using the local processes executor.
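The workaround looked roughly like this; a sketch only, not the actual patch:

import cubed

class CubedManager:
    # ...other ChunkManager methods elided...
    def store(self, sources, targets, **kwargs):
        # Drop the dask-oriented kwargs that cubed's store does not accept.
        for dask_only in ("flush", "lock", "compute", "regions"):
            kwargs.pop(dask_only, None)
        return cubed.store(sources, targets, **kwargs)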

I will look into writing a proper fix.

@TomNicholas (Member)

the rechunk works using the local processes executor.

Yay progress! Thank you @tomwhite !

This led me to cubed-xarray's store function, which is being passed various arguments from Xarray that it doesn't know how to handle (namely flush, lock, compute, and regions):

Ah my bad for apparently not properly testing that.

I will look into writing a proper fix.

Is that cubed's fault for not understanding dask-relevant kwargs, xarray's for passing them, or cubed-xarray somehow? We should raise an issue specifically to track this bug.

@tomwhite (Member) commented Aug 5, 2024

Is that cubed's fault for not understanding dask-relevant kwargs, xarray's for passing them, or cubed-xarray somehow? We should raise an issue specifically to track this bug.

Not sure. I have opened cubed-dev/cubed-xarray#14 to show a possible fix.

@thodson-usgs (Contributor, Author) commented Aug 30, 2024

PEBKAC! This may have worked all along. The problem was that I was passing local file paths off to Lambda. I just moved the kerchunk JSON and the target Zarr to S3, and everything ran, albeit slowly. I should increase the chunk sizes, but then this should be ready.
The question now is whether to contribute this to cubed or cubed-xarray...

I'll add a bit more explanation, but a minimal example is given below. We might reference the VirtualiZarr example for instructions on how to create combined.json rather than doing it here.

import xarray as xr
import fsspec

combined_ds = xr.open_dataset("s3://wma-uncertainty/scratch/combined.json",
                              engine="kerchunk",
                              chunks={},
                              chunked_array_type="cubed",
                              )

combined_ds['Time'].attrs = {}  # to_zarr complains about attrs

rechunked_ds = combined_ds.chunk(
    chunks={'Time': 5, 'south_north': 25, 'west_east': 32},
    chunked_array_type="cubed",
)

target = fsspec.get_mapper("s3://wma-uncertainty/scratch/rechunked.zarr",
                           client_kwargs={'region_name': 'us-west-2'})

rechunked_ds.to_zarr(target,
                     mode="w",
                     consolidated=True,
                     safe_chunks=False,
                     )

@thodson-usgs marked this pull request as draft on October 24, 2024
@thodson-usgs (Contributor, Author)

@TomNicholas (cc @norlandrhagen),
The first script, which stages the virtual Zarr, runs fine, but the second, rechunking script hits a new error.

Seems like a regression somewhere, but I haven't been able to track it down.

Any ideas looking at this traceback?

Traceback (most recent call last):
  File "/Users/thodson/Desktop/dev/software/cubed/examples/virtual-rechunk/rechunk-virtual-zarr.py", line 29, in <module>
    rechunked_ds.to_zarr(
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/dataset.py", line 2595, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/api.py", line 2239, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/api.py", line 1919, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 844, in store
    variables_encoded, attributes = self.encode(
                                    ^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/common.py", line 312, in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/common.py", line 312, in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 778, in encode_variable
    variable = encode_zarr_variable(variable)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/backends/zarr.py", line 449, in encode_zarr_variable
    var = conventions.encode_cf_variable(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/conventions.py", line 195, in encode_cf_variable
    var = coder.encode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/coding/variables.py", line 424, in encode
    data = duck_array_ops.fillna(data, fill_value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/duck_array_ops.py", line 367, in fillna
    return where(notnull(data), data, other)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/xarray/core/duck_array_ops.py", line 354, in where
    return xp.where(condition, *as_shared_dtype([x, y], xp=xp))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/array_api/searching_functions.py", line 42, in where
    return elemwise(nxp.where, condition, x1, x2, dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 414, in elemwise
    return blockwise(
           ^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 237, in blockwise
    chunkss, arrays = unify_chunks(*args)
                      ^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-test2/lib/python3.11/site-packages/cubed/core/ops.py", line 1413, in unify_chunks
    nameinds.append((a.name, ind))
                     ^^^^^^
AttributeError: 'numpy.float32' object has no attribute 'name'

ds = futures.get_result()

# Save the virtual zarr manifest
ds.virtualize.to_kerchunk("combined.json", format="json")
Contributor
I think we should be able to write the ref directly to S3; if not, we should figure out why!

Contributor Author
I'll open the issue.

Contributor
ty!

ds = futures.get_result()

# Save the virtual zarr manifest
ds.virtualize.to_kerchunk("combined.json", format="json")
Contributor
Also, it might be worth trying parquet or icechunk if you're willing to venture into the bleeding edge.

Member
Second using icechunk instead of kerchunk at this point.

@norlandrhagen (Contributor) commented Oct 25, 2024

Is the reference file from part 1 available somewhere?

A few thoughts (bearing in mind I'm a total lithops/cubed noob):

  • Does your [rechunk-virtual-zarr.py](https://github.com/cubed-dev/cubed/pull/520/files#diff-a25ef0d52cb02414b94e4c174627891d2ba4bec487dcd94464b4d0afd0eebb3c) work with a normal Zarr?
  • Does calling a .load before to_zarr cause the same error?
  • Outside of a lithops context, does this work?
ds = xr.open_dataset("<ref.json>", engine="kerchunk", chunked_array_type="cubed")
rechunked_ds = ds.chunk({'Time': 5, 'south_north': 25, 'west_east': 32}, chunked_array_type="cubed")
rechunked_ds.to_zarr(...)
