Skip to content

Latest commit

 

History

History
53 lines (34 loc) · 1.28 KB

README.md

File metadata and controls

53 lines (34 loc) · 1.28 KB

bio2zarr

Convert bioinformatics file formats to Zarr

Initially supports converting VCF to the sgkit vcf-zarr specification

This is early alpha-status code: everything is subject to change, a and it has not been thoroughly tested

Usage

Convert a VCF to zarr format:

python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>

Converts the VCF to zarr format.

Do not use this for anything but the smallest files

The recommended approach is to use a multi-stage conversion

First, convert the VCF into an intermediate columnar format:

python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded

Then, (optionally) inspect this representation to get a feel for your dataset

python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded

Then, (optionally) generate a conversion schema to describe the corresponding Zarr arrays:

python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json

View and edit the schema, deleting any columns you don't want.

Finally, convert to Zarr

python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json

Use the -p, --worker-processes argument to control the number of workers used to do zarr encoding.