RFC: Dataset storage information in HDF5/JSON #75

ajelenak · 2022-06-07T11:28:33Z

This is a proposal to add dataset storage information to HDF5/JSON. The JSON key for this is named byteBlocks. The word "block" is hopefully still technically accurate while not being too similar to "chunk".

Below is an example for one dataset. Block JSON keys are in the same format as in the HSDS schema, e.g. 0_0 and 0_1. The example has two blocks; the 0_1 block has an optional url key for the case of remote blocks. The url key can also apply to the entire dataset, in which case it cannot appear in the individual blocks.

{
    "datasets": {
        "7f335a2e-7ab1-11e4-87a5-3c15c2da029e": {
            "attributes": [], 
            "dcpl": {
                "fillValue": 0,
                "layout": {
                    "class": "H5D_CHUNKED",
                    "dims": [8]
                }
            },
            "shape": {
                "class": "H5S_SIMPLE",
                "dims": [10, 10], 
                "maxdims": [10, 10]
            }, 
            "type": {
                "base": "H5T_STD_I32BE", 
                "class": "H5T_INTEGER"
            },
            "byteBlocks": {
                "0_0": {
                    "offset": 1234,
                    "size": 2567
                },
                "0_1": {
                    "offset": 56789,
                    "size": 1967,
                    "url": "s3://mybucket/path/to/object"
                }
            }
        }
    }
}
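
As a rough illustration of how a reader might consume byteBlocks, here is a minimal Python sketch that turns a block key into chunk indices and reads a block's raw bytes by offset and size. The helper names (`parse_block_key`, `read_block`) and the in-memory stand-in for the storage file are hypothetical and not part of the proposal; the bytes returned are whatever is stored, with no filter decoding applied.

```python
import io


def parse_block_key(key, separator="_"):
    """Split a block key such as "0_1" into a tuple of integer indices."""
    return tuple(int(part) for part in key.split(separator))


def read_block(fileobj, block):
    """Read the raw bytes of one block from a seekable binary file object."""
    fileobj.seek(block["offset"])
    data = fileobj.read(block["size"])
    if len(data) != block["size"]:
        raise ValueError("block extends past the end of the storage object")
    return data


# Hypothetical byteBlocks entry with tiny offsets so it fits the fake file below.
byte_blocks = {
    "0_0": {"offset": 4, "size": 3},
    "0_1": {"offset": 7, "size": 2},
}

# Stand-in for the file that holds the dataset's blocks.
storage = io.BytesIO(b"HEADaaabbTAIL")

blocks = {
    parse_block_key(key): read_block(storage, block)
    for key, block in byte_blocks.items()
}
```

The same `read_block` logic would map onto an HTTP or S3 range request for blocks that carry a url key.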

cc: @derobins @jreadey @gheber


jreadey · 2022-06-16T18:26:48Z

Why does 0_1 have a url key but 0_0 does not?

How about using the same storage methods (e.g. CHUNK_REF_INDIRECT, etc.) that HSDS uses?

Another idea would be to have an option to store the chunk data directly in the file (hex encoded). The offset would then be a byte offset in the file itself. This is what ASDF does.

gheber · 2022-06-20T12:00:27Z

I think byteBlocks should have a key whose value indicates the blocking scheme, which would also explain how to interpret the block keys. Right now, the assumption appears to be that there's one block per chunk. That's not very flexible. Remember Francesc's sub-blocking scheme for Blosc for better selectivity, etc.? There also might be sparse chunks at some point.

ajelenak · 2022-06-23T13:05:10Z

> Why does 0_1 have a url key but 0_0 does not?

Just to illustrate both possible options. In a real case, only one would be used.

> How about using the same storage methods (e.g. CHUNK_REF_INDIRECT, etc.) that HSDS uses?

Does this require an additional anonymous dataset?

jreadey · 2022-06-23T13:38:53Z

H5D_CHUNKED_REF and H5D_CHUNKED_REF_INDIRECT are documented here: https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md

ajelenak · 2022-07-30T13:21:33Z

H5D_CHUNKED_REF could work, although "chunk" is a very important term and some may find it confusing in this context.

ajelenak · 2022-07-30T13:34:13Z

Here's the updated example. It includes H5D_CHUNKED_REF as the layout class and a new index_schema key inside byteBlocks.

The block index schema is identified by a URI, and its index separator is included as well for convenience. There are also url keys to illustrate both cases: a single resource holding all the blocks, or a different resource for each block.

{
    "datasets": {
        "7f335a2e-7ab1-11e4-87a5-3c15c2da029e": {
            "attributes": [], 
            "dcpl": {
                "fillValue": 0,
                "layout": {
                    "class": "H5D_CHUNKED_REF",
                    "dims": [8]
                }
            },
            "shape": {
                "class": "H5S_SIMPLE",
                "dims": [10, 10], 
                "maxdims": [10, 10]
            }, 
            "type": {
                "base": "H5T_STD_I32BE", 
                "class": "H5T_INTEGER"
            }, 
            "url": "s3://mybucket/path/to/object/where/all/blocks/are",
            "byteBlocks": {
                "index_schema": {
                    "uri": "https://schema.hdfgroup.org/hdf5-json/block/index/regular",
                    "separator": "_"
                },
                "0_0": {
                    "offset": 1234,
                    "size": 2567,
                    "url": "s3://mybucket/path/to/block/object1"
                },
                "0_1": {
                    "offset": 56789,
                    "size": 1967,
                    "url": "s3://mybucket/path/to/block/object2"
                }
            }
        }
    }
}
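
A minimal Python sketch of how a client might walk this structure is shown below. It skips the index_schema entry, uses its separator to parse block keys, and picks a URL for each block. The function name `iter_blocks` is hypothetical, and the rule that a block-level url takes precedence over the dataset-level url is an assumption made for illustration; the proposal does not define precedence.

```python
def iter_blocks(dataset):
    """Yield (index, url, offset, size) for every block of one dataset."""
    byte_blocks = dataset["byteBlocks"]
    # Fall back to "_" when no index_schema is given (assumption).
    separator = byte_blocks.get("index_schema", {}).get("separator", "_")
    dataset_url = dataset.get("url")
    for key, block in byte_blocks.items():
        if key == "index_schema":
            continue  # metadata entry, not a block
        index = tuple(int(part) for part in key.split(separator))
        # Assumption: a url on the block overrides the dataset-level url.
        url = block.get("url", dataset_url)
        yield index, url, block["offset"], block["size"]


# Trimmed-down dataset entry; the dataset-level URL is a made-up placeholder.
dataset = {
    "url": "s3://mybucket/all",
    "byteBlocks": {
        "index_schema": {
            "uri": "https://schema.hdfgroup.org/hdf5-json/block/index/regular",
            "separator": "_",
        },
        "0_0": {"offset": 1234, "size": 2567},
        "0_1": {
            "offset": 56789,
            "size": 1967,
            "url": "s3://mybucket/path/to/block/object2",
        },
    },
}

resolved = list(iter_blocks(dataset))
```

Keeping index_schema as a sibling of the block keys means every reader has to special-case that one key, which is the reason for the explicit skip in the loop.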
