Situation

I'm writing a GPU image segmentation job whose end result is a dask array written to a local multi-scale ome-zarr file. In testing, I found that the map_blocks() function performing the GPU computation is invoked twice for each block in the array. After some debugging, I found this happens whenever any level of downsampling is applied during saving.
I've tried the following things:
- Setting nlayers = 1 does not trigger the repeated computation, while setting it to any number > 1 results in two map_blocks calls per block, even if nlayers is greater than 2.
- As a workaround, saving the dask array to disk first, then loading it back and writing it as a multi-scale image using writer.write_image; this indeed eliminates the repeated map_blocks calls.
Minimum Reproducible Example
import os
import shutil
import threading

import dask
import dask.array as da
from dask.distributed import Client
import zarr
from ome_zarr.writer import write_image
from ome_zarr.scale import Scaler
from ome_zarr.io import parse_url

TMP_PATH = "D://progtools/RobartsResearch/data/scratch/tmp"


def write_da_as_ome_zarr(ome_zarr_path, da_arr):
    store = parse_url(ome_zarr_path, mode='w').store
    zarr_group = zarr.group(store)
    nlayers = 2
    scaler = Scaler(max_layer=nlayers - 1, method='nearest')
    coordinate_transformations = []
    for layer in range(nlayers):
        coordinate_transformations.append(
            [{'scale': [(2 ** layer) * 1., (2 ** layer) * 1.],  # image pyramid in XY only
              'type': 'scale'}])
    write_image(image=da_arr,
                group=zarr_group,
                scaler=scaler,
                coordinate_transformations=coordinate_transformations,
                storage_options={'dimension_separator': '/'},
                axes=['y', 'x'])


def map_fn(block, block_info=None):
    if block_info is not None:
        tid = threading.get_ident()
        dim_ranges = block_info[0]['array-location']
        print(tid, dim_ranges)  # thread id, assigned block ranges
    return block


def main():
    with dask.config.set({'temporary_directory': TMP_PATH}):
        client = Client(threads_per_worker=1, n_workers=1)
        da_arr = da.zeros((32, 64), chunks=(32, 32))  # there are two chunks in total
        da_arr = da_arr.map_blocks(map_fn)
        path = f'{TMP_PATH}/im_result'
        if os.path.exists(path):
            shutil.rmtree(path)
        write_da_as_ome_zarr(path, da_arr=da_arr)


if __name__ == '__main__':
    main()
Expected result (when nlayers=1, this is indeed what I get):

What I actually get:
Summary

Intermediate computations are executed twice when saving Dask computation results into an ome-zarr array with the downsampling option. This first happened when I called writer.write_labels, so both the write_image and write_labels functions have this issue. Is this a bug in the ome-zarr writer?
Edit: I made some errors in writing the above example, but it should be fixed now.
The workaround above is limited in that it requires writing the large array to scratch space one extra time, which is a bit of an issue. The scratch space of the Compute Canada server I'm working with has been down for a few weeks now, so hopefully this issue can be fixed soon.
The package version is ome-zarr==0.9.1.dev0, installed directly from the main branch. Probably unrelated, but the Python version I'm using is 3.10.14.
I think the issue is that dask chooses not to save the intermediate image to disk but instead performs the computation multiple times, similar to what is mentioned in this post. In the current implementation, the scaling isn't applied to the first level ('0') of the ome-zarr array, so maybe the simplest solution is to write level 0 first, read it back from disk, and then write the remaining downsampled levels?