Extendable zarr arrays #802
Conversation
Notes from discussion with @dylanmcreynolds

Background

Routes

This route updates the … The above works if you can fit the full axis' breadth of data in one request body. This satisfies ALS' current use case. For other use cases, we may need to upload multiple chunks in separate requests. For example, to do an update on … we want two routes: one to declare the change in shape, and then one (existing) route to upload the chunks, the same way we upload chunks in a new array.

where …
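To make the two-step flow concrete, here is a hypothetical client-side sketch using httpx; the route paths, query parameters, and payload encoding are invented for illustration and are not Tiled's actual API.

```python
import numpy
import httpx

# Hypothetical two-step flow: declare the new shape, then upload chunks.
client = httpx.Client(base_url="http://localhost:8000")

# 1. Declare the new overall shape along the growing axis (invented route).
client.patch("/array/shape/my_array", json={"shape": [8, 2, 2]})

# 2. Upload the new chunks through the (existing) block-upload style route.
#    The block coordinates and byte encoding shown here are placeholders.
chunk = numpy.zeros((1, 2, 2))
client.put(
    "/array/block/my_array",
    params={"block": "3,0,0"},
    content=chunk.tobytes(),
)
```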
More notes from conversation:
This is ready for review, provided that review can start before unit tests and user documentation are in place.
Fixed issues causing unit tests to fail.
After talking with @danielballan for a while: there is always going to be an issue here where multiple clients appending to an array could write blocks out of order. Should the append_block method take a parameter that says where to put the block in the array, resize if the array needs the extra size, and simply write to the correct place if it has already been resized?
In our chat I think we identified two problems:
For (1): Instead of …

For (2): Rely on the database as a semaphore. Maybe transactions can get us what we need, or maybe we need to involve explicit row-level locking. Now, if the chunks are stored on a filesystem, there is a fundamental limit to how much we can guarantee. Use cases that are pushing the limits of streaming can upgrade to a Zarr store with stronger locking semantics.

Edit: I would think that, if the …
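As a rough illustration of the row-level-locking idea in (2), here is a minimal sketch using SQLAlchemy Core against an invented nodes table; this is not Tiled's actual catalog schema or code.

```python
import sqlalchemy as sa

# Illustrative table only; the real catalog schema is different.
nodes = sa.table("nodes", sa.column("id"), sa.column("shape"))

def extend_shape(conn, node_id, new_shape):
    """Record a new shape while holding a row lock on the node."""
    with conn.begin():
        # SELECT ... FOR UPDATE: concurrent extenders block here until this
        # transaction commits, so two clients cannot race on the resize.
        conn.execute(
            sa.select(nodes.c.shape)
            .where(nodes.c.id == node_id)
            .with_for_update()
        ).one()
        conn.execute(
            sa.update(nodes)
            .where(nodes.c.id == node_id)
            .values(shape=list(new_shape))
        )
```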
Demo:

In [1]: from tiled.client import from_profile
In [2]: import numpy
In [3]: c = from_profile('local', api_key='secret')
In [4]: ac = c.write_array(numpy.ones((3, 2, 2)), key='y')
In [5]: ac
Out[5]: <ArrayClient shape=(3, 2, 2) chunks=((3,), (2,), (2,)) dtype=float64>
In [6]: ac.read()
Out[6]:
array([[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]]])
In [7]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(3, 4), extend=True)
In [8]: ac.refresh()
Out[8]: <ArrayClient shape=(4, 2, 2) chunks=((3, 1), (2,), (2,)) dtype=float64>
In [9]: ac.read()
Out[9]:
array([[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[0., 0.],
[0., 0.]]])
Awesome! I left a couple minor comments...
Force-pushed from a5d18dd to 19a69ed.
Review comments addressed. Test updated. Also, some usability improvements:
Demo of both:

In [4]: ac = c.write_array(numpy.ones((3, 2, 2)), key='x')
In [5]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(4, 5))
<snipped>
ValueError: Slice slice(4, 5, None) does not fit within current array shape. Pass keyword argument extend=True to extend the array dimensions to fit.
In [6]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(4, 5), extend=True)
In [7]: ac
Out[7]: <ArrayClient shape=(5, 2, 2) chunks=((3, 2), (2,), (2,)) dtype=float64>
Needs more testing and user docs, but otherwise I think this is in coherent shape and ready for the tires to be kicked.
tiled/adapters/zarr.py

    f"Slice {slice} does not fit into array shape {current_shape}. "
    f"Use ?extend=true to extend array dimension to fit."
)
self._array[slice] = data
If the array was resized but an exception happens from here on (like someone passing a bad slicing parameter), the file has been resized but the new data was not written. Would we consider a try block that resizes back to the original size?
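A minimal sketch of that rollback idea, reusing the names from the snippet above (self._array, slice, data) plus an assumed new_shape; this is not the PR's actual code.

```python
# Sketch only: undo the resize if writing the patch fails partway through.
original_shape = self._array.shape
self._array.resize(new_shape)            # grow to fit the incoming slice
try:
    self._array[slice] = data            # write the patch
except Exception:
    self._array.resize(original_shape)   # roll back the resize on failure
    raise
```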
I like that by specifying the slice indices, requests can safely arrive out of order. It would be useful to add a unit test for out-of-order updates.
From an earlier iteration I liked the implications of "append" coupled with passing the expected "before" size, which for multiple writers would be a conflict-resolution mechanism that favors the first request that arrives, rejecting the others. However, this is probably an obscure edge case. Much more likely usage would be a single writer emitting multiple updates with order known a priori. The current solution handles this nicely.
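A rough sketch of that expected-"before"-size idea, written as a standalone function against a zarr array; the function name and signature are invented for illustration and are not part of this PR.

```python
import zarr

class Conflict(RuntimeError):
    """Raised when another writer appended first."""

def conditional_append(array: zarr.Array, block, expected_length: int):
    """Append block along axis 0 only if the array still has expected_length rows."""
    current_length = array.shape[0]
    if current_length != expected_length:
        # A competing writer already appended; the first request wins.
        raise Conflict(f"expected length {expected_length}, found {current_length}")
    array.resize((current_length + block.shape[0],) + array.shape[1:])
    array[current_length:] = block
```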
I skimmed, but did not look too closely at:
- tiled/client/array.py
- tiled/adapters/zarr.py
- tiled/catalog/adapter.py
array : array-like
    The data to write.
slice : NDSlice
    Where to place this data in the array.
I think it would be nice to give more information on how to format the slice parameter.
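For example, something like the following could go in the docs, building on the demos above; whether the tuple-of-slices form is accepted is an assumption on my part.

```python
import numpy

new_data = numpy.zeros((1, 2, 2))
ac.patch(new_data, slice=slice(3, 4))           # rows 3:4 along the first axis
ac.patch(new_data, slice=numpy.s_[3:4, :, :])   # same region via numpy's slice helper (assumed accepted)
```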
Lifting a comment here from private chats: I think that concurrent appending to arrays is a broken concept. Appending makes sense on tables, but not on arrays. The atomic unit of a table is a row, and the row can contain metadata (e.g. sequence number, timestamp) that we can use to locate a given value later. The only metadata that arrays have are the coordinates themselves, and so when placing data we must address coordinates explicitly. We can extend arrays, enlarging the canvas of the coordinate system, but when data is added it must be placed at explicit coordinates. If I'm wrong about that, we can revisit appending later with an additional …
I think this is a great addition to Tiled; thank you, @dylanmcreynolds, for putting all this work together! I've gone through the code and have just been trying to break it. Here are some of my findings; most likely they are just edge cases that fall outside the scope of this PR, or maybe I've missed some assumptions, but I think they are still worth mentioning.
import zarr
arr = zarr.ones((3,4,5), chunks=(1,2,3))
ac = c.write_array(arr, key='y')
# <ArrayClient shape=(3, 4, 5) chunks=((1, 1, 1), (2, 2), (3, 2)) dtype=float64>
ac.patch(numpy.zeros((1, 4, 5)), slice=slice(0, 1))

The above fails with a 500 error and the following traceback on the server:

…
ac = c.write_array(numpy.ones((3, 2, 2), dtype='int'), key='w')
ac.patch(numpy.zeros((1, 2, 2), dtype='float'), slice=slice(0, 1))

The above code works and converts the new data to …

I'll keep digging into it. Really like this new functionality!
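A standalone zarr snippet (my own illustration, not from the PR) showing the silent cast described above, assuming the stored dtype wins:

```python
import numpy
import zarr

z = zarr.ones((3, 2, 2), dtype="int64")
z[0:1] = numpy.full((1, 2, 2), 0.9)   # float data assigned into an int array
print(z[0, 0, 0])                     # prints 0: the 0.9 was silently truncated to int64
```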
ac.patch(ones * 7, slice=slice(7, 8), extend=True)
ac.patch(ones * 5, slice=slice(5, 6), extend=True)
ac.patch(ones * 6, slice=slice(6, 7), extend=True)
numpy.testing.assert_equal(ac[5:6], ones * 5)
Wouldn't it be better to use something like rng.random instead of ones here?
Good point, that would avoid rotational confusion.
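A sketch of what that change to the test might look like, assuming the same ac fixture as the snippet above:

```python
import numpy

rng = numpy.random.default_rng(42)
block5, block6, block7 = (rng.random((1, 2, 2)) for _ in range(3))
ac.patch(block7, slice=slice(7, 8), extend=True)
ac.patch(block5, slice=slice(5, 6), extend=True)
ac.patch(block6, slice=slice(6, 7), extend=True)
numpy.testing.assert_equal(ac[5:6], block5)
```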
The only generality you lose with …
Force-pushed from 149f4a0 to 0f053c7.
Rebased on …
I added text about the new patch method to the …
Force-pushed from c2ea055 to 98ba2bf.
Force-pushed from 98ba2bf to 323be10.
This is extraordinarily early work towards append-able arrays, which should be supportable with zarr. For use cases where we are doing processing off of an instrument and sending the results to Tiled, we don't always know the eventual size of the array. We can do this by writing the array directly outside of Tiled and then sending a message to Tiled to reindex. But it would be super useful to send the new block to Tiled directly.
Changes:

- Client method: an append_block method that sends the block to append and the axis to append it to.
- routes.py: a PATCH method for the /array/block endpoint.

Questions and Issues:

- /arrays/append is a little unsatisfying as it uses a verb, which I don't love.
- Chose PATCH, as it's a change to the array, without a complete replacement of the array.
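Purely for illustration, a hypothetical use of the client method described above, with an assumed (block, axis) signature; later iterations of this PR moved to the patch(...) API shown earlier in the thread.

```python
import numpy
from tiled.client import from_profile

c = from_profile("local", api_key="secret")
ac = c.write_array(numpy.ones((3, 2, 2)), key="z")
new_block = numpy.zeros((1, 2, 2))
ac.append_block(new_block, axis=0)   # assumed signature; appends along the first axis
```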