Extendable zarr arrays #802
Conversation
Notes from discussion with @dylanmcreynolds

Background

Routes

This route updates the … The above works if you can fit the full axis' breadth of data in one request body. This satisfies ALS' current use case. For other use cases, we may need to upload multiple chunks in separate requests. For example, to do an update on … we want two routes: one to declare the change in shape, and then one (existing) route to upload the chunks, the same way we upload chunks in a new array.

where …
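To make the two-step flow concrete, here is a hypothetical client-side sketch using httpx; the route paths, query parameters, and payload encoding are invented for illustration and are not Tiled's actual API.

```python
import numpy
import httpx

# Hypothetical two-step flow: declare the new shape, then upload chunks.
client = httpx.Client(base_url="http://localhost:8000")

# 1. Declare the new overall shape along the growing axis (invented route).
client.patch("/array/shape/my_array", json={"shape": [8, 2, 2]})

# 2. Upload the new chunks through the (existing) block-upload style route.
#    The block coordinates and byte encoding shown here are placeholders.
chunk = numpy.zeros((1, 2, 2))
client.put(
    "/array/block/my_array",
    params={"block": "3,0,0"},
    content=chunk.tobytes(),
)
```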
More notes from conversation:
This is ready for review, provided that review can start before unit tests and user documentation are in place.
Fixed issues causing unit tests to fail.
After talking with @danielballan for a while: there is always going to be an issue here where multiple clients appending to an array could write blocks out of order. Should the append_block method take a parameter that says where to put the block in the array, resize if the array needs the extra size, and simply write to the correct place if it has already been resized?
In our chat I think we identified two problems:
For (1): Instead of …

For (2): Rely on the database as a semaphore. Maybe transactions can get us what we need, or maybe we need to involve explicit row-level locking. Now, if the chunks are stored on a filesystem, there is a fundamental limit to how much we can guarantee. Use cases that are pushing the limits of streaming can upgrade to a Zarr store with stronger locking semantics.

Edit: I would think that, if the …
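As a rough illustration of the row-level-locking idea in (2), here is a minimal sketch using SQLAlchemy Core against an invented nodes table; this is not Tiled's actual catalog schema or code.

```python
import sqlalchemy as sa

# Illustrative table only; the real catalog schema is different.
nodes = sa.table("nodes", sa.column("id"), sa.column("shape"))

def extend_shape(conn, node_id, new_shape):
    """Record a new shape while holding a row lock on the node."""
    with conn.begin():
        # SELECT ... FOR UPDATE: concurrent extenders block here until this
        # transaction commits, so two clients cannot race on the resize.
        conn.execute(
            sa.select(nodes.c.shape)
            .where(nodes.c.id == node_id)
            .with_for_update()
        ).one()
        conn.execute(
            sa.update(nodes)
            .where(nodes.c.id == node_id)
            .values(shape=list(new_shape))
        )
```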
Demo:

In [1]: from tiled.client import from_profile
In [2]: import numpy
In [3]: c = from_profile('local', api_key='secret')
In [4]: ac = c.write_array(numpy.ones((3, 2, 2)), key='y')
In [5]: ac
Out[5]: <ArrayClient shape=(3, 2, 2) chunks=((3,), (2,), (2,)) dtype=float64>
In [6]: ac.read()
Out[6]:
array([[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]]])
In [7]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(3, 4), extend=True)
In [8]: ac.refresh()
Out[8]: <ArrayClient shape=(4, 2, 2) chunks=((3, 1), (2,), (2,)) dtype=float64>
In [9]: ac.read()
Out[9]:
array([[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[1., 1.],
[1., 1.]],
[[0., 0.],
[0., 0.]]])
Awesome! I left a couple minor comments...
Force-pushed from a5d18dd to 19a69ed.
Review comments addressed. Test updated. Also, some usability improvements:
Demo of both:

In [4]: ac = c.write_array(numpy.ones((3, 2, 2)), key='x')
In [5]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(4, 5))
<snipped>
ValueError: Slice slice(4, 5, None) does not fit within current array shape. Pass keyword argument extend=True to extend the array dimensions to fit.
In [6]: ac.patch(numpy.zeros((1, 2, 2)), slice=slice(4, 5), extend=True)
In [7]: ac
Out[7]: <ArrayClient shape=(5, 2, 2) chunks=((3, 2), (2,), (2,)) dtype=float64>
Needs more testing and user docs, but otherwise I think this is in coherent shape and ready for the tires to be kicked.
tiled/adapters/zarr.py

    f"Slice {slice} does not fit into array shape {current_shape}. "
    f"Use ?extend=true to extend array dimension to fit."
)
self._array[slice] = data
If the array was resized but an exception happens from here on (like someone passing a bad slicing parameter), the file has been resized but the new data was not written. Would we consider a try block that resizes back to the original size?
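A minimal sketch of that rollback idea, reusing the names from the snippet above (self._array, slice, data) plus an assumed new_shape; this is not the PR's actual code.

```python
# Sketch only: undo the resize if writing the patch fails partway through.
original_shape = self._array.shape
self._array.resize(new_shape)            # grow to fit the incoming slice
try:
    self._array[slice] = data            # write the patch
except Exception:
    self._array.resize(original_shape)   # roll back the resize on failure
    raise
```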
I like that by specifying the slice indices, requests can safely arrive out of order. It would be useful to add a unit test for out-of-order updates.
From an earlier iteration I liked the implications of "append" coupled with passing the expected "before" size, which for multiple writers would be a conflict-resolution mechanism that favors the first request that arrives, rejecting the others. However, this is probably an obscure edge case. Much more likely usage would be a single writer emitting multiple updates with order known a priori. The current solution handles this nicely.
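A rough sketch of that expected-"before"-size idea, written as a standalone function against a zarr array; the function name and signature are invented for illustration and are not part of this PR.

```python
import zarr

class Conflict(RuntimeError):
    """Raised when another writer appended first."""

def conditional_append(array: zarr.Array, block, expected_length: int):
    """Append block along axis 0 only if the array still has expected_length rows."""
    current_length = array.shape[0]
    if current_length != expected_length:
        # A competing writer already appended; the first request wins.
        raise Conflict(f"expected length {expected_length}, found {current_length}")
    array.resize((current_length + block.shape[0],) + array.shape[1:])
    array[current_length:] = block
```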
I skimmed, but did not look too closely at:
- tiled/client/array.py
- tiled/adapters/zarr.py
- tiled/catalog/adapter.py
array : array-like
    The data to write.
slice : NDSlice
    Where to place this data in the array.
I think it would be nice to give more information on how to format the slice parameter.
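For example, something like the following could go in the docs, building on the demos above; whether the tuple-of-slices form is accepted is an assumption on my part.

```python
import numpy

new_data = numpy.zeros((1, 2, 2))
ac.patch(new_data, slice=slice(3, 4))           # rows 3:4 along the first axis
ac.patch(new_data, slice=numpy.s_[3:4, :, :])   # same region via numpy's slice helper (assumed accepted)
```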
Lifting a comment here from private chats: I think that concurrent appending to arrays is a broken concept. Appending makes sense on tables, but not on arrays. The atomic unit of a table is a row, and the row can contain metadata (e.g. sequence number, timestamp) that we can use to locate a given value later. The only metadata that arrays have are the coordinates themselves, and so when placing data we must address coordinates explicitly. We can extend arrays, enlarging the canvas of the coordinate system, but when data is added it must be placed at explicit coordinates. If I'm wrong about that, we can revisit appending later with an additional …
I think this is a great addition to Tiled; thank you, @dylanmcreynolds, for putting all this work together! I've gone through the code and have just been trying to break it. Here are some of my findings; most likely they are just edge cases that fall outside the scope of this PR, or maybe I've missed some assumptions, but I think they are still worth mentioning.
import zarr
arr = zarr.ones((3,4,5), chunks=(1,2,3))
ac = c.write_array(arr, key='y')
# <ArrayClient shape=(3, 4, 5) chunks=((1, 1, 1), (2, 2), (3, 2)) dtype=float64>
ac.patch(numpy.zeros((1, 4, 5)), slice=slice(0, 1))

The above fails with a 500 error and the following traceback on the server:

…
ac = c.write_array(numpy.ones((3, 2, 2), dtype='int'), key='w')
ac.patch(numpy.zeros((1, 2, 2), dtype='float'), slice=slice(0, 1))

The above code works and converts the new data to …

I'll keep digging into it. Really like this new functionality!
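A standalone zarr snippet (my own illustration, not from the PR) showing the silent cast described above, assuming the stored dtype wins:

```python
import numpy
import zarr

z = zarr.ones((3, 2, 2), dtype="int64")
z[0:1] = numpy.full((1, 2, 2), 0.9)   # float data assigned into an int array
print(z[0, 0, 0])                     # prints 0: the 0.9 was silently truncated to int64
```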
ac.patch(ones * 7, slice=slice(7, 8), extend=True)
ac.patch(ones * 5, slice=slice(5, 6), extend=True)
ac.patch(ones * 6, slice=slice(6, 7), extend=True)
numpy.testing.assert_equal(ac[5:6], ones * 5)
Wouldn't it be better to use something like rng.random instead of ones here?
Good point, that would avoid rotational confusion.
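A sketch of what that change to the test might look like, assuming the same ac fixture as the snippet above:

```python
import numpy

rng = numpy.random.default_rng(42)
block5, block6, block7 = (rng.random((1, 2, 2)) for _ in range(3))
ac.patch(block7, slice=slice(7, 8), extend=True)
ac.patch(block5, slice=slice(5, 6), extend=True)
ac.patch(block6, slice=slice(6, 7), extend=True)
numpy.testing.assert_equal(ac[5:6], block5)
```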
The only generality you lose with …
Force-pushed from 149f4a0 to 0f053c7.
Rebased on …
I added text about the new patch method to the …
Force-pushed from c2ea055 to 98ba2bf.
Force-pushed from 98ba2bf to 323be10.
This is extraordinarily early work towards append-able arrays, which should be supportable with zarr. For use cases where we are doing processing off of an instrument and sending the results to Tiled, we don't always know the eventual size of the array. We can do this by writing the array directly outside of Tiled and then sending a message to Tiled to reindex. But it would be super useful to send the new block to Tiled directly.
Changes:

- Client method: an append_block method that sends the block to append and the axis to append it to.
- routes.py: a PATCH method for the /array/block endpoint.

Questions and Issues:

- /arrays/append is a little unsatisfying as it uses a verb, which I don't love.
- Chose PATCH, as it's a change to the array, without a complete replacement of the array.
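Purely for illustration, a hypothetical use of the client method described above, with an assumed (block, axis) signature; later iterations of this PR moved to the patch(...) API shown earlier in the thread.

```python
import numpy
from tiled.client import from_profile

c = from_profile("local", api_key="secret")
ac = c.write_array(numpy.ones((3, 2, 2)), key="z")
new_block = numpy.zeros((1, 2, 2))
ac.append_block(new_block, axis=0)   # assumed signature; appends along the first axis
```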