Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sub-frame read/chunking #85

Open
tlambert03 opened this issue Aug 15, 2022 · 12 comments · May be fixed by #242
Open

Implement sub-frame read/chunking #85

tlambert03 opened this issue Aug 15, 2022 · 12 comments · May be fixed by #242
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tlambert03
Copy link
Owner

in the reader, at around

cpdef np.ndarray _read_image_from_memmap(self, index: int):

we should implement subframe cropping.

@gatoniel
Copy link

Hi, I think I really would like this enhancement to be available. I have several huge nd2 files (> 100 GB) consisting of large overview stitched images (7117 x 21044 pixels). Often I only need to extract small subvolumes of them. Looking at the files with NIS Elements is fairly fast for the size. But when I use xarray with

crop = xa_nd2[:, slicex, slicey].compute()

it definitely takes longer to load the data into RAM than NIS requires. The chunk size of the xarray is (1, 21044, 7117). Do you think implementing the sub-frame read/chunking would help with that, eg, by reducing the chunk size?

What skills are required to help on this issue? Unfortunately, I have zero experience with C/C++ ... But I could try my best with specific info on the task...

@tlambert03
Copy link
Owner Author

tlambert03 commented Nov 24, 2023

Hey @gatoniel,

Actually, it looks like this is already possible in the current nd2 library, but you need to do it in a specific way:

There is a method in ND2File called .read_frame(). You call it with a single integer representing the frame number in the file. Here, frame number means the index of the 2D plane (possibly multichannel) acquired over the course of the experiment (let me know if you need help determining the frame numbers in your files). But lets take the simple case of a single 2d tiled image:

import nd2

f = nd2.NDFile(...)
array = f.read_frame(0)

that returned np.ndarray object is just a view onto a memmap object that knows the location on disk where the data resides, but nothing has been read from disk and loaded into memory yet). So, you can slice that object before reading which should help you save memory.

crop = array[:, slicex, slicey]
# do what you want with the crop.

f.close()  # reminder

Currently, that won't work in the multi-dimensional xarray and dask wrappers (since the data underlying those is non-contiguous on disk, so it's harder to use a simple numpy view like this) ... so I guess that is the current "state" of this feature request. It can be done already on a single-plane level, using read_frame, but if you want to do more complex slicing that includes X/Y and other dimensions, more work is needed. Let me know if that's your situation or if the read_frame approach would work for you for now


Unfortunately, I have zero experience with C/C++

btw, there's no more C code in this library! (since #135 🎉 )

@gatoniel
Copy link

I would like to try out the .read_frame(). Could you help me in determining the correct frame numbers? nd2_file.sizes gives {'T': 30, 'P': 5, 'C': 2, 'Y': 21044, 'X': 7117}.

@tlambert03
Copy link
Owner Author

tlambert03 commented Nov 27, 2023

sure, you can use something like np.ravel_multi_index. There is a private method that does this here:

nd2/src/nd2/nd2file.py

Lines 935 to 938 in eb4bf8f

def _seq_index_from_coords(self, coords: Sequence) -> Sequence[int] | SupportsInt:
if not self._coord_shape:
return self._NO_IDX
return np.ravel_multi_index(coords, self._coord_shape) # type: ignore

so, for your example size {'T': 30, 'P': 5, 'C': 2, 'Y': 21044, 'X': 7117}, your coordinate shape would be (30, 5) (i.e. all coordinates that are not C Y or X). So, if you wanted to get the sequence index for timepoint=5 position=3 (both zero indexed), then you would use:

coord_shape = (30, 5)
frame_coordinate = (5, 3)  # t=5, p=3 
sequence_index = np.ravel_multi_index(frame_coordinate,  coord_shape)  # would be 28 in this case
frame_array = f.read_frame(sequence_index)

you can also use the private f._seq_index_from_coords if you like, but it may break in the future.

@gatoniel
Copy link

@tlambert03 thanks a lot. A preliminary test in my code shows that it is way faster that way!!!

@tlambert03
Copy link
Owner Author

great!

@tlambert03
Copy link
Owner Author

i'll update this issue when the dask interface also has chunking

@gatoniel
Copy link

What exactly would be needed to make the dask interface able to use the chunking? Wouldn't it just need an intermediate function that relays to the .read_frame() function?

@tlambert03
Copy link
Owner Author

that intermediate function already exists, it's called _dask_block:

nd2/src/nd2/nd2file.py

Lines 940 to 962 in eb4bf8f

def _dask_block(self, copy: bool, block_id: tuple[int]) -> np.ndarray:
if isinstance(block_id, np.ndarray):
return None
with self._lock:
was_closed = self.closed
if self.closed:
self.open()
try:
ncoords = len(self._coord_shape)
idx = self._seq_index_from_coords(block_id[:ncoords])
if idx == self._NO_IDX:
if any(block_id): # pragma: no cover
raise ValueError(
f"Cannot get chunk {block_id} for single frame image."
)
idx = 0
data = self.read_frame(int(idx)) # type: ignore
data = data.copy() if copy else data
return data[(np.newaxis,) * ncoords]
finally:
if was_closed:
self.close()

so, what additionally needs to happen is:

  • the to_dask method needs to accept a frame_chunks parameter that lets the user define how they would like each CYX plane to be chunked. It would modify this line
  • the _dask_block function needs to add a block_info keyword argument as described in the dask map_blocks docs.
  • the array-location key in the block_info dict would then need to be used to slice the buffer returned from read_frame here, before the copy operation:

nd2/src/nd2/nd2file.py

Lines 957 to 959 in eb4bf8f

data = self.read_frame(int(idx)) # type: ignore
data = data.copy() if copy else data
return data[(np.newaxis,) * ncoords]

@gatoniel
Copy link

gatoniel commented Dec 1, 2023

Hey @tlambert03 I have one more question when using .read_frame() for timeseries. I want to extract the same xy crop for all timepoints, so I use the following:

timestack_ = []
for i in range(t_len):
    frame_coordinate = (i, pos)
    sequence_index = np.ravel_multi_index(frame_coordinate, coord_shape)
    buffer = nd2_file.read_frame(sequence_index)
    timestack_.append(buffer[1, yslice, xslice])
timestack = np.stack(timestack_, axis=0)

Do you have ideas on how to further speed that up? The first thing would be to allocate an empty array, instead of stacking everything afterwards. But is it better to directly read from the buffer with np.copy or just assign the view to the empty array like so

timestack = np.empty((t_len, ylen, xlen))
for i in range(t_len):
    frame_coordinate = (i, pos)
    sequence_index = np.ravel_multi_index(frame_coordinate, coord_shape)
    buffer = nd2_file.read_frame(sequence_index)
    timestack[i, ...] = buffer[1, yslice, xslice]

?
Or is there another trick that could speed this up?
best,
Niklas

@tlambert03
Copy link
Owner Author

tlambert03 commented Dec 1, 2023 via email

@gatoniel
Copy link

gatoniel commented Dec 1, 2023

Not excessively slow. I was curious, mainly.

@gatoniel gatoniel linked a pull request Sep 27, 2024 that will close this issue
@tlambert03 tlambert03 linked a pull request Sep 27, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants