toward sharding #122

Open
bogovicj opened this issue May 1, 2024 · 2 comments

Comments

@bogovicj
Contributor

bogovicj commented May 1, 2024

  • Loading a block / chunk currently opens a file handle per block, which is expensive on some backends (e.g. s3)
    • for sharded backends this is bad because several chunks live in the same file
    • n5 does not support arbitrary regions because it's done in imglib2; we won't add it here
  • solution: add methods to load + write a collection of blocks together
    • this allows sharding implementations to sort block coordinates into shards and load + write all blocks from the same shard using the same handle
/**
* Reads a collection of {@link DataBlock}s from a {@link Set}
* of grid positions. The output maps from grid positions to
* output data blocks. Values in the output map may be null.
*
* @param pathName
*            dataset path
* @param datasetAttributes
*            the dataset attributes
* @param gridPositions
*            a set of grid positions
* @return a map from grid positions to DataBlocks
* @throws N5Exception
*             the exception
*/
Map<long[], DataBlock<?>> readBlocks(
	final String pathName,
	final DatasetAttributes datasetAttributes,
	final Set<long[]> gridPositions) throws N5Exception;

// TODO could be nice to use a chunkKey object that enforces equality appropriately
// and is a little more general
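
The TODO above matters because Java arrays use identity-based `equals` and `hashCode`, so a `Map<long[], ...>` lookup with a freshly allocated `long[]` holding the same coordinates will not find the entry. A minimal sketch of a key type that compares by value might look like the following. The name `GridPosition` is hypothetical and is not part of the n5 API.

```java
import java.util.Arrays;

/** Hypothetical value-based key for a block's grid position; not part of n5. */
final class GridPosition {

	private final long[] position;

	GridPosition(final long... position) {
		// defensive copy so later changes to the caller's array cannot alter the key
		this.position = position.clone();
	}

	/** Returns a copy of the coordinates. */
	long[] get() {
		return position.clone();
	}

	@Override
	public boolean equals(final Object other) {
		// compare coordinates element-wise instead of by array identity
		return other instanceof GridPosition
				&& Arrays.equals(position, ((GridPosition) other).position);
	}

	@Override
	public int hashCode() {
		return Arrays.hashCode(position);
	}

	@Override
	public String toString() {
		return Arrays.toString(position);
	}
}
```

With a key like this, the signature could use `Map<GridPosition, DataBlock<?>>` and `Set<GridPosition>`, and two keys built from equal coordinates would address the same entry.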

n5-imglib2 triggers block loading during random access, and so can't immediately benefit from this. Rather, it would need to know ahead of time that it needs to load many blocks at once. One option is to implement an explicit pre-fetching operation; this might belong in imglib2-cache.

@bogovicj
Contributor Author

bogovicj commented Jul 15, 2024

More related considerations

  • KeyValueAccess implementations should have new methods seek, partialRead, partialWrite
  • Consider adding flush-like capabilities?
    • A method that triggers any IO requests to be executed.
    • Would enable optimization for multiple blocks in the same shard
  • Perhaps by default, IO requests are aggregated over some fixed period of time, then executed
    • This would enable some optimization if subsequent calls would read blocks in the same shard
  • Consider leaving file handles open when possible
  • Consider caching block offsets for sharded arrays
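
The flush-like idea above could be sketched roughly as follows: read requests are queued, and an explicit `flush()` groups them by shard so each shard is touched once. Everything here is an assumption made for illustration. `BatchedBlockReads`, `ShardReader`, and `blocksPerShard` are hypothetical names, not n5 API, and the sketch assumes shards form a regular grid with a fixed number of blocks per dimension.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of request aggregation; none of these types exist in n5. */
final class BatchedBlockReads {

	/** Stand-in for code that reads several blocks of one shard with a single handle. */
	interface ShardReader {
		void readFromShard(long[] shardPosition, List<long[]> blockPositions);
	}

	private final int[] blocksPerShard;
	private final ShardReader reader;
	private final List<long[]> pending = new ArrayList<>();

	BatchedBlockReads(final int[] blocksPerShard, final ShardReader reader) {
		this.blocksPerShard = blocksPerShard.clone();
		this.reader = reader;
	}

	/** Queues a block read; no IO happens until flush() is called. */
	void request(final long... blockPosition) {
		pending.add(blockPosition.clone());
	}

	/** Grid position of the shard containing the given block (regular shard grid assumed). */
	long[] shardOf(final long[] blockPosition) {
		final long[] shard = new long[blockPosition.length];
		for (int d = 0; d < shard.length; d++)
			shard[d] = Math.floorDiv(blockPosition[d], (long) blocksPerShard[d]);
		return shard;
	}

	/** Groups pending requests by shard, issues one read per shard, returns the number of shards read. */
	int flush() {
		// key by List<Long> because long[] compares by identity, not by value
		final Map<List<Long>, List<long[]>> byShard = new LinkedHashMap<>();
		for (final long[] block : pending) {
			final List<Long> key = Arrays.stream(shardOf(block)).boxed().collect(Collectors.toList());
			byShard.computeIfAbsent(key, k -> new ArrayList<>()).add(block);
		}
		pending.clear();
		for (final List<long[]> blocks : byShard.values())
			reader.readFromShard(shardOf(blocks.get(0)), blocks);
		return byShard.size();
	}
}
```

A time-based variant, as suggested in the list, would call `flush()` from a scheduled task rather than from the caller. That trades latency for the chance to merge requests, so the interval would need tuning per backend.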

@axtimwalde @cmhulbert

@bogovicj
Contributor Author

work in progress implementation: https://github.com/saalfeldlab/n5/tree/wip/codecsShards
