Develop a cookiecutter template for virtualization #319

maxrjones · 2024-11-25T18:25:33Z

Context

I think there would be value in building and sharing a cookiecutter template for virtualizing datasets, to incentivize open and accessible VirtualiZarr workflows. We could also use cruft to allow updating workflows for upstream changes.

There are shared steps between most virtualization workflows:

Generate a list of input files
Generate virtual datasets for each input file, with optional pre- or post-processing for this step
Concatenate virtual datasets into a single virtual dataset
Write the virtual dataset to a virtual Icechunk store or Kerchunk reference file
(Optional) apply the above workflow to multiple datasets
(Optional) generate a catalog (e.g., STAC) for multiple virtual datasets
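To make the shared steps above concrete, here is a minimal sketch of how a template might lay out the stages as separate functions. Every name in it (`VirtualDataset`, `list_input_files`, `virtualize_file`, `concatenate`, `write_references`, `run_pipeline`, and the example bucket path) is a hypothetical placeholder rather than a VirtualiZarr, Icechunk, or Kerchunk API; a real template would presumably swap the stand-in bodies for calls into those libraries.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class VirtualDataset:
    """Hypothetical stand-in: only records which files are referenced."""

    sources: list[str] = field(default_factory=list)


def list_input_files(prefix: str, count: int) -> list[str]:
    """Step 1: generate a list of input files (fabricated names for illustration)."""
    return [f"{prefix}/file_{i:03d}.nc" for i in range(count)]


def virtualize_file(
    path: str,
    preprocess: Optional[Callable[[str], str]] = None,
    postprocess: Optional[Callable[[VirtualDataset], VirtualDataset]] = None,
) -> VirtualDataset:
    """Step 2: one virtual dataset per file, with optional pre/post hooks."""
    if preprocess is not None:
        path = preprocess(path)
    vds = VirtualDataset(sources=[path])
    if postprocess is not None:
        vds = postprocess(vds)
    return vds


def concatenate(datasets: Iterable[VirtualDataset]) -> VirtualDataset:
    """Step 3: combine the per-file virtual datasets into one."""
    combined = VirtualDataset()
    for ds in datasets:
        combined.sources.extend(ds.sources)
    return combined


def write_references(vds: VirtualDataset, target: str) -> dict:
    """Step 4: placeholder for writing to an Icechunk store or Kerchunk file."""
    return {"target": target, "n_sources": len(vds.sources)}


def run_pipeline(prefix: str, count: int, target: str) -> dict:
    """Chain steps 1 to 4 for a single dataset."""
    files = list_input_files(prefix, count)
    per_file = [virtualize_file(p) for p in files]
    return write_references(concatenate(per_file), target)
```

Keeping each stage as its own function is one way to let the optional steps (multiple datasets, catalog generation) wrap `run_pipeline` without touching the core stages.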

There are many other boilerplate components:

Typing
Documentation
Licensing
CI/CD
Environment management

Lastly, there are parallelization, orchestration, and execution tools which could enhance virtualization workflows, with options including:

Dask
Flyte
Lithops
Modal
Coiled
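One way a template could expose these options is a small backend registry, so the per-file step is written once and mapped over the inputs by whichever executor is selected. The sketch below is only an illustration of that idea: `BACKENDS`, `run_serial`, `run_threads`, and `map_with_backend` are hypothetical names, and the standard library's `ThreadPoolExecutor` stands in for a real backend such as Dask or Lithops, none of which are called here.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_serial(func: Callable[[T], R], items: Iterable[T]) -> list[R]:
    """Baseline backend: process inputs one at a time."""
    return [func(item) for item in items]


def run_threads(func: Callable[[T], R], items: Iterable[T]) -> list[R]:
    """Stand-in parallel backend using local threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(func, items))


# Hypothetical registry; a template could add entries for Dask, Lithops, etc.
BACKENDS: dict[str, Callable] = {
    "serial": run_serial,
    "threads": run_threads,
}


def map_with_backend(name: str, func: Callable[[T], R], items: Iterable[T]) -> list[R]:
    """Apply func to every item using the backend registered under name."""
    try:
        runner = BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown backend {name!r}; choose from {sorted(BACKENDS)}")
    return runner(func, items)
```

With this shape, switching execution tools would be a configuration choice (for example, a template prompt) rather than a change to the workflow code.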

This template would enable people to use best practices and avoid spending time on boilerplate components.

Suggested task components

Build out an example of a well-structured virtualization pipeline (https://github.com/developmentseed/virtualize-nex-gddp-cmip6 could grow into this but is currently insufficient)
Fork https://github.com/fpgmaas/cookiecutter-uv to build a cookiecutter template for virtualizing data as Zarr
Add different execution backends, following the design in https://github.com/earth-mover/serverless-datacube-demo
Add additional components to the template over time (e.g., appending, validation)

maxrjones mentioned this issue Nov 25, 2024

Create a community structure for sharing VirtualiZarr workflows and Icechunk virtual stores / Kerchunk references #320


maxrjones added the "usage example" label (Real world use case examples) Nov 25, 2024
