Rust API: Store lacks a method for querying the size of values #277

Open
LDeakin opened this issue Oct 15, 2024 · 13 comments

@LDeakin
Contributor

LDeakin commented Oct 15, 2024

As far as I can see, retrieving trailing bytes (e.g. CRC32C checksum, shard index) from a chunk with Store::get or Store::get_partial_values is not possible (with the ByteRange abstraction) without knowing the size of a value.

@rabernat
Contributor

Thanks for the feedback @LDeakin! I assume you're interested in sharding with this. Are there other use cases?

Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it. This is also potentially more flexible, because the store can decide at runtime how to pack the chunks, or they can be repacked retroactively. I'm curious about the tradeoffs between this (currently unimplemented) approach to sharding and the current Zarr spec one.

TBH I have never really understood the whole "sharding as a codec" concept. I think it makes sense for sharding to be an implementation detail of the store.

As for chunk-level metadata like checksums, with Icechunk we have the option of putting that in the chunk manifest rather than the chunk itself! This could be a lot more efficient to query.

@LDeakin
Contributor Author

LDeakin commented Oct 15, 2024

> Icechunk has the ability to do sharding in a different way--by packing multiple chunks into the same object, but without Zarr really even knowing about it

When I first scanned over Icechunk, I wondered how it would work with a shard written incrementally (chunk-by-chunk). But that sounds much better. Delegating sharding-like functionality to Icechunk could give history at chunk granularity, and array producers/consumers would not need to concern themselves with shards 👍.

> I assume you're interested in sharding with this. Are there other use cases?

Not currently. But, a Zarr store either needs to support reading from the end of a value or querying its size (or ideally both) to support partial decoding with all current Zarr V3 codecs.
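To make the size-based option concrete, here is a minimal sketch of how a client could turn a known value size into the byte range of a trailing CRC32C checksum or shard index. It assumes the common Zarr V3 sharding layout (index stored at the end of the shard, 16 bytes per inner chunk for a `u64` offset and `u64` length, optionally followed by a 4-byte CRC32C). The function and constant names are hypothetical and not part of icechunk or zarrs.

```rust
use std::ops::Range;

/// Length of a CRC32C checksum appended by the `crc32c` codec.
pub const CRC32C_LEN: u64 = 4;
/// Assumed size of one shard index entry: a u64 offset plus a u64 length.
pub const INDEX_ENTRY_LEN: u64 = 16;

/// Size in bytes of an uncompressed shard index (hypothetical helper).
pub fn shard_index_len(chunks_per_shard: u64, with_crc32c: bool) -> u64 {
    chunks_per_shard * INDEX_ENTRY_LEN + if with_crc32c { CRC32C_LEN } else { 0 }
}

/// Byte range covering the last `n` bytes of a value of `value_len` bytes.
/// Returns `None` if the value is shorter than `n` bytes.
pub fn trailing_range(value_len: u64, n: u64) -> Option<Range<u64>> {
    value_len.checked_sub(n).map(|start| start..value_len)
}
```

Note that `trailing_range` cannot be evaluated without `value_len`, which is exactly the gap described above: the store must either report the size or accept a "last N bytes" request directly.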

@rabernat
Contributor

Size definitely can and should be implemented! It's already in the chunk manifest.
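As a rough illustration of why the manifest makes this cheap, the sketch below models a manifest entry that records where a chunk lives inside a packed object. Every type and method name here is hypothetical; it is not the icechunk manifest format, only an assumed shape with an object identifier, an offset, and a length.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical manifest entry: location of one chunk inside a packed object.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkRef {
    pub object: String,
    pub offset: u64,
    pub length: u64,
}

/// Hypothetical manifest mapping chunk keys to their locations.
#[derive(Debug, Default)]
pub struct Manifest {
    chunks: HashMap<String, ChunkRef>,
}

impl Manifest {
    pub fn insert(&mut self, key: &str, chunk: ChunkRef) {
        self.chunks.insert(key.to_string(), chunk);
    }

    /// Chunk size, answered from the manifest without reading any chunk bytes.
    pub fn size(&self, key: &str) -> Option<u64> {
        self.chunks.get(key).map(|c| c.length)
    }

    /// Translate "last `n` bytes of this chunk" into an absolute byte range
    /// inside the packed object, clamped to the chunk's own extent.
    pub fn suffix_in_object(&self, key: &str, n: u64) -> Option<(&str, Range<u64>)> {
        let c = self.chunks.get(key)?;
        let end = c.offset + c.length;
        let start = end - n.min(c.length);
        Some((c.object.as_str(), start..end))
    }
}
```

One consequence of packing several chunks into one object is that a suffix request cannot simply be forwarded to object storage: the last bytes of the object belong to whichever chunk was packed last, so the store has to translate the request using the recorded offset and length.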

@paraseba
Contributor

@LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)

@LDeakin
Contributor Author

LDeakin commented Oct 16, 2024

> @LDeakin what's the issue with ByteRange(Bound::Included(42), Bound::Unbounded)

That represents everything from byte offset 42 onwards, right? What I am after is the last 42 bytes, for example. I think I would need to know the size of the value to construct such a ByteRange with the current implementation.

Note that many stores support requesting the last N bytes from an object. object_store supports it: object_store::GetRange::Suffix.
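The difference between the two requests can be sketched with a small, self-contained range type that has the same three shapes as `object_store::GetRange` (bounded, from an offset, and suffix). The `RangeRequest` enum and `read` function below are hypothetical illustrations, not the icechunk `ByteRange` definition or the change that was later made to it.

```rust
use std::ops::Range;

/// Hypothetical range request with three shapes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RangeRequest {
    /// Bytes `start..end` (end exclusive).
    Bounded { start: u64, end: u64 },
    /// Everything from byte offset `start` to the end of the value.
    From(u64),
    /// The last `n` bytes of the value.
    Last(u64),
}

impl RangeRequest {
    /// Resolve against a value of `len` bytes, clamping to the value's extent.
    pub fn resolve(self, len: u64) -> Range<u64> {
        match self {
            RangeRequest::Bounded { start, end } => {
                let s = start.min(len);
                s..end.clamp(s, len)
            }
            RangeRequest::From(start) => start.min(len)..len,
            RangeRequest::Last(n) => len.saturating_sub(n)..len,
        }
    }
}

/// Stand-in for a store read: only the store side needs to know the length.
pub fn read(value: &[u8], request: RangeRequest) -> &[u8] {
    let r = request.resolve(value.len() as u64);
    &value[r.start as usize..r.end as usize]
}
```

With a `Last` variant the caller never has to learn the value's size. For a 100-byte value, `From(42)` yields 58 bytes while `Last(42)` yields the final 42 bytes.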

@paraseba
Contributor

@LDeakin I'll change ByteRange to allow this type of query. Thank you for flagging this!

paraseba added a commit that referenced this issue Oct 16, 2024
This provides an approach to deal with #277
@paraseba
Contributor

@LDeakin please take a look at #285. Hopefully you can use that, and we'll introduce access to the chunk size in a separate PR.

@LDeakin
Contributor Author

LDeakin commented Oct 16, 2024

Looks good!

paraseba added a commit that referenced this issue Oct 16, 2024
This provides an approach to deal with #277
@paraseba
Contributor

We have given Lachlan a way around this, but I'll keep the ticket open until we offer a way to retrieve the size of a chunk using the Store interface.

@paraseba
Contributor

@LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.

@LDeakin
Contributor Author

LDeakin commented Oct 17, 2024

> @LDeakin we have released 0.1.0-alpha.3 with this change and the new list_dir_items method. Hope it helps.

Sure did! zarrs now supports icechunk stores: https://crates.io/crates/zarrs_icechunk

@paraseba
Contributor

Unbelievable @LDeakin !

@paraseba
Contributor

Related conversation in zarr-python: zarr-developers/zarr-python#2420
