Issue: Repeated Dask Computation When Calling write_image or write_labels with Downsampling Level > 0 #392

Open
Karl5766 opened this issue Jul 18, 2024 · 3 comments

Comments


Karl5766 commented Jul 18, 2024

Situation

I'm writing a GPU image segmentation job whose end result is a Dask array written to a local multiscale OME-Zarr file. While testing, I found that the map_blocks() call performing the GPU computation is invoked twice for each block of the array. After some debugging, I found this happens whenever any level of downsampling is applied during saving.

I've tried the following things:

  1. Setting nlayers = 1 does not trigger the repeated computation, while setting it to any number > 1 results in exactly two map_blocks calls per block, even when nlayers is greater than 2.
  2. As a workaround, saving the dask array to disk first, then loading it back and writing it as a multiscale image with writer.write_image, does eliminate the repeated map_blocks calls.

Minimum Reproducible Example

import os
import dask.array as da
from dask.distributed import Client
import zarr
from ome_zarr.writer import write_image
from ome_zarr.scale import Scaler
from ome_zarr.io import parse_url
import shutil
import threading


TMP_PATH = "D://progtools/RobartsResearch/data/scratch/tmp"


def write_da_as_ome_zarr(ome_zarr_path, da_arr):
    store = parse_url(ome_zarr_path, mode='w').store
    zarr_group = zarr.group(store)
    nlayers = 2
    scaler = Scaler(max_layer=nlayers - 1, method='nearest')
    coordinate_transformations = []
    for layer in range(nlayers):
        coordinate_transformations.append(
            [{'scale': [(2 ** layer) * 1., (2 ** layer) * 1.],  # image-pyramids in XY only
              'type': 'scale'}])
    write_image(image=da_arr,
                group=zarr_group,
                scaler=scaler,
                coordinate_transformations=coordinate_transformations,
                storage_options={'dimension_separator': '/'},
                axes=['y', 'x'])


def map_fn(block, block_info=None):
    if block_info is not None:
        tid = threading.get_ident()
        dim_ranges = block_info[0]['array-location']
        print(tid, dim_ranges)  # thread id, assigned block ranges
    return block


def main():
    import dask
    with dask.config.set({'temporary_directory': TMP_PATH}):
        client = Client(threads_per_worker=1, n_workers=1)
        da_arr = da.zeros((32, 64), chunks=(32, 32))  # there are two chunks in total
        da_arr = da_arr.map_blocks(map_fn)

        path = f'{TMP_PATH}/im_result'
        if os.path.exists(path):
            shutil.rmtree(path)
        write_da_as_ome_zarr(path, da_arr=da_arr)


if __name__ == '__main__':
    main()

Expected result (when nlayers=1 this is indeed what I get):

9556 [(0, 32), (32, 64)]
9556 [(0, 32), (0, 32)]

What I actually get:

4216 [(0, 32), (32, 64)]
4216 [(0, 32), (0, 32)]
4216 [(0, 32), (32, 64)]
4216 [(0, 32), (0, 32)]

Summary

Intermediate computations are executed twice when saving Dask computation results into an OME-Zarr array with downsampling enabled. This first happened when I called writer.write_labels, so both the write_image and write_labels functions have this issue. Is this a bug in the ome-zarr writer?

Edit: I made some errors in the example above, but they have been fixed.

Karl5766 (Author)

The workaround above is limited in that it requires writing the large array to scratch space one extra time, which is a bit of an issue. Currently the scratch space of the Compute Canada server I'm working with has been down for a few weeks, so hopefully this issue can be fixed soon.

joshmoore (Member)

Thanks for all the sleuthing, @Karl5766. It definitely could be a bug in the writer itself. What version of ome-zarr-py are you using?


Karl5766 commented Jul 19, 2024

The package version is ome-zarr==0.9.1.dev0, installed directly from the main branch. Probably unrelated, but the Python version I'm using is 3.10.14.

I think the issue is that Dask chooses not to cache the level-0 image but instead recomputes the graph once per pyramid level? Similar to what is mentioned in this post. In the current implementation, the scaling doesn't apply to the first level ('0') of the OME-Zarr array, so maybe the simplest solution is to write level 0 first, read it back from disk, and then write the remaining downsampling levels from that copy.
