
Failure to inline #26

Open
rabernat opened this issue Sep 27, 2024 · 5 comments

@rabernat

This is as minimal as I could make it. I can't reproduce it by just creating arrays from scratch; it has to be loaded from real data, AFAICT.

Just cubed, without xarray

import cubed
spec = cubed.Spec(
    allowed_mem=2000000000,
)

url1 = "s3://cmip6-pds/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/6hrLev/ua/gr/v20190119/"

raw_array = cubed.from_zarr(url1 + "ua", spec=spec)
subset = raw_array[:, 10, 20, 30]
subset.visualize()

[image: task graph showing from_zarr and getitem fused into a single task]

As you can see, the from_zarr and getitem operations get inlined into a single task. Good.
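The inlining above can be sketched in plain Python. This is only an illustration of what fusion buys, not cubed's actual internals; the function names here are made up:

```python
def load_chunk(chunk_id):
    """Stand-in for reading one chunk from Zarr storage."""
    return list(range(chunk_id * 4, (chunk_id + 1) * 4))

def select(chunk, i):
    """Stand-in for the getitem step."""
    return chunk[i]

# Unfused plan: two tasks per chunk (one load task, one select task),
# with the intermediate chunk materialized between them.
# Fused/inlined plan: a single task that loads and selects in one go,
# so the intermediate chunk never needs to be written out.
def fused_task(chunk_id, i):
    return select(load_chunk(chunk_id), i)

print(fused_task(1, 2))  # -> 6
```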

Now with Xarray

import xarray as xr

ds = xr.open_dataset(url1, engine="zarr", chunked_array_type="cubed", from_array_kwargs={"spec": spec}, chunks={})
ds_subset = ds.isel(lat=20, lon=30, presnivs=10)
ds_subset.ua.data.visualize()

[image: task graph showing separate from_zarr and getitem tasks]

No inlining, and twice the number of tasks.

@TomNicholas
Member

TomNicholas commented Sep 27, 2024

Thanks for reporting this.

Not totally sure, but I suspect the behavior difference is because xr.open_dataset(engine='zarr') currently ends up calling cubed.from_array on the lazily indexed xarray variable to create the cubed array, not cubed.from_zarr (the .from_array() call is inside Variable.chunk()). So these two code snippets don't end up calling the same cubed functions.

Interestingly, dask.array.from_zarr also exists, but, like cubed.from_zarr, it is never called anywhere inside xarray! Maybe we could improve xarray's zarr backend by using it? You might know better than me about that, though, @rabernat.

@tomwhite
Member

Thanks for the reproducible example - I was able to run it and see what is going on.

As Tom says, Cubed is not seeing the raw Zarr file. If Cubed's from_array was given a zarr.Array instance then it would take a fast path, but on this line

https://github.com/cubed-dev/cubed/blob/d24f83bbab7032e5c54072fdb7af4430d933dd99/cubed/core/ops.py#L57

x is Xarray's ImplicitToExplicitIndexingAdapter, so it falls through to using map_blocks.
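That dispatch can be sketched as follows. This is a hedged illustration of the fast-path idea, not cubed's real code; the class and function names here are stand-ins:

```python
class FakeZarrArray:
    """Stand-in for zarr.Array (the type the fast path recognizes)."""

class FakeAdapter:
    """Stand-in for xarray's ImplicitToExplicitIndexingAdapter wrapper."""

def from_array_sketch(x):
    # When the raw zarr array is visible, a from_array-style function can
    # take a fast path that downstream ops are able to fuse with.
    if isinstance(x, FakeZarrArray):
        return "fast-path: fused read"
    # When xarray hands over its lazy-indexing adapter instead, the type
    # check fails and the code falls through to a generic map_blocks copy.
    return "generic: map_blocks copy"

print(from_array_sketch(FakeZarrArray()))  # -> fast-path: fused read
print(from_array_sketch(FakeAdapter()))    # -> generic: map_blocks copy
```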

This in itself wouldn't be a problem, but the selection (indexing) is implemented using Cubed's map_direct primitive, which can't be fused (see cubed-dev/cubed#464), which is why you aren't seeing the inlining you would expect.

For simple slices like these we could implement indexing using a simple map_blocks (the general case is harder since slices can cross chunk boundaries). This would be worth doing as it's such a common thing to do.
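The reason integer indexing fits map_blocks can be sketched with a little chunk arithmetic (assuming regular chunking; this is an illustration, not cubed's implementation): a global integer index lands in exactly one block at one offset, so no chunk boundaries are crossed.

```python
def locate(index, chunk_size):
    """Map a global integer index on one axis to
    (block_id, offset_within_block), assuming regular chunks."""
    return divmod(index, chunk_size)

# e.g. selecting element 10 along an axis chunked in blocks of 4:
# it falls entirely inside block 2, at offset 2, so a per-block
# function can pick it out without reading any neighboring chunk.
print(locate(10, 4))  # -> (2, 2)
```

A general slice is harder because its output can span several input chunks, which is why the general case still needs something more than a per-block map.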

@rabernat
Author

Thanks Toms! Thoughts on the best place in the stack to fix this?

For context, this is a killer application for cubed: fast, serverless subsetting of big Zarr datasets.

@TomNicholas
Member

For simple slices like these we could implement indexing using a simple map_blocks

Sounds like this would fix your use case Ryan, and it's an optimization in cubed that should be done anyway.

If Cubed's from_array was given a zarr.Array instance

x is Xarray's ImplicitToExplicitIndexingAdapter

Presumably we can't just bypass the ImplicitToExplicitIndexingAdapter wrapping in xarray, or else some other lazy-indexing feature would be lost? That would also be a divergence from how it currently works for dask.

@tomwhite
Member

I've created a fix in cubed-dev/cubed#586, which I ran against your example @rabernat and it fuses as expected now:

[image: task graph showing the operations fused after the fix]

I'll merge it later, and I could do a release too if that's helpful.
