
Failure to inline #26

Open
rabernat opened this issue Sep 27, 2024 · 5 comments

@rabernat

This is as minimal as I could make it. I can't reproduce it by just creating arrays from scratch; it has to be loaded from real data, AFAICT.

Just cubed, without xarray

import cubed
spec = cubed.Spec(
    allowed_mem=2000000000,
)

url1 = "s3://cmip6-pds/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/6hrLev/ua/gr/v20190119/"

raw_array = cubed.from_zarr(url1 + "ua", spec=spec)
subset = raw_array[:, 10, 20, 30]
subset.visualize()

[image: task graph showing from_zarr and getitem fused into a single task]

As you can see, the from_zarr and getitem operations get inlined into a single task. Good.
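The inlining above can be sketched in plain Python. This is only an illustration of what fusion buys, not cubed's actual internals; the function names here are made up:

```python
def load_chunk(chunk_id):
    """Stand-in for reading one chunk from Zarr storage."""
    return list(range(chunk_id * 4, (chunk_id + 1) * 4))

def select(chunk, i):
    """Stand-in for the getitem step."""
    return chunk[i]

# Unfused plan: two tasks per chunk (one load task, one select task),
# with the intermediate chunk materialized between them.
# Fused/inlined plan: a single task that loads and selects in one go,
# so the intermediate chunk never needs to be written out.
def fused_task(chunk_id, i):
    return select(load_chunk(chunk_id), i)

print(fused_task(1, 2))  # -> 6
```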

Now with Xarray

import xarray as xr

ds = xr.open_dataset(url1, engine="zarr", chunked_array_type="cubed", from_array_kwargs={"spec": spec}, chunks={})
ds_subset = ds.isel(lat=20, lon=30, presnivs=10)
ds_subset.ua.data.visualize()

[image: task graph showing separate from_zarr and getitem tasks]

No inlining, and twice the number of tasks.

@TomNicholas
Member

TomNicholas commented Sep 27, 2024

Thanks for reporting this.

Not totally sure, but I suspect the behavior difference is because xr.open_dataset(engine='zarr') currently ends up calling cubed.from_array on the lazily indexed xarray variable to create the cubed array, not cubed.from_zarr (the .from_array() call is inside Variable.chunk()). So these two code snippets don't end up calling the same cubed functions.

Interestingly, dask.array.from_zarr also exists, but, like cubed.from_zarr, it is never called anywhere inside xarray! Maybe we could improve xarray's zarr backend by using it? You might know better than me about that, though, @rabernat.

@tomwhite
Member

Thanks for the reproducible example - I was able to run it and see what is going on.

As Tom says, Cubed is not seeing the raw Zarr file. If Cubed's from_array was given a zarr.Array instance then it would take a fast path, but on this line

https://github.com/cubed-dev/cubed/blob/d24f83bbab7032e5c54072fdb7af4430d933dd99/cubed/core/ops.py#L57

x is Xarray's ImplicitToExplicitIndexingAdapter, so it falls through to using map_blocks.
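That dispatch can be sketched as follows. This is a hedged illustration of the fast-path idea, not cubed's real code; the class and function names here are stand-ins:

```python
class FakeZarrArray:
    """Stand-in for zarr.Array (the type the fast path recognizes)."""

class FakeAdapter:
    """Stand-in for xarray's ImplicitToExplicitIndexingAdapter wrapper."""

def from_array_sketch(x):
    # When the raw zarr array is visible, a from_array-style function can
    # take a fast path that downstream ops are able to fuse with.
    if isinstance(x, FakeZarrArray):
        return "fast-path: fused read"
    # When xarray hands over its lazy-indexing adapter instead, the type
    # check fails and the code falls through to a generic map_blocks copy.
    return "generic: map_blocks copy"

print(from_array_sketch(FakeZarrArray()))  # -> fast-path: fused read
print(from_array_sketch(FakeAdapter()))    # -> generic: map_blocks copy
```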

This in itself wouldn't be a problem, but the selection (indexing) is implemented using Cubed's map_direct primitive, which can't be fused (see cubed-dev/cubed#464), which is why you aren't seeing the inlining you would expect.

For simple slices like these we could implement indexing using a simple map_blocks (the general case is harder since slices can cross chunk boundaries). This would be worth doing as it's such a common thing to do.
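The reason integer indexing fits map_blocks can be sketched with a little chunk arithmetic (assuming regular chunking; this is an illustration, not cubed's implementation): a global integer index lands in exactly one block at one offset, so no chunk boundaries are crossed.

```python
def locate(index, chunk_size):
    """Map a global integer index on one axis to
    (block_id, offset_within_block), assuming regular chunks."""
    return divmod(index, chunk_size)

# e.g. selecting element 10 along an axis chunked in blocks of 4:
# it falls entirely inside block 2, at offset 2, so a per-block
# function can pick it out without reading any neighboring chunk.
print(locate(10, 4))  # -> (2, 2)
```

A general slice is harder because its output can span several input chunks, which is why the general case still needs something more than a per-block map.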

@rabernat
Author

Thanks Toms! Thoughts on the best place in the stack to fix this?

For context, this is a killer application for cubed: fast, serverless subsetting of big Zarr datasets.

@TomNicholas
Member

For simple slices like these we could implement indexing using a simple map_blocks

Sounds like this would fix your use case Ryan, and it's an optimization in cubed that should be done anyway.

If Cubed's from_array was given a zarr.Array instance

x is Xarray's ImplicitToExplicitIndexingAdapter

Presumably we can't just bypass the ImplicitToExplicitIndexingAdapter wrapping in xarray, or else some other lazy-indexing feature would be lost? That would also be a divergence from how it currently works for dask.

@tomwhite
Member

I've created a fix in cubed-dev/cubed#586, which I ran against your example @rabernat and it fuses as expected now:

[image: task graph showing the operations fused after the fix]

I'll merge it later, and I could do a release too if that's helpful.
