# Support other file formats for data files #89

Open
twanvl opened this issue Oct 22, 2024 · 0 comments
Labels
enhancement New feature or request

twanvl commented Oct 22, 2024

## Motivation

In my opinion there is no good reason not to use a standard dataframe file format. These formats can store columns with different data types, removing the need to split data over multiple files.

Adding support for this would make it easier for other people to use our analysis pipelines, and for ourselves to work with other people's data. It also means that we can more easily use tools that other people have made to work on our data.

## Performance

I ran some benchmarks (using IMU + time data). While the raw file format used by tsdf is faster for reading numpy arrays, the difference disappears as soon as you work with dataframes:

### Numpy array IO

| Format | Compression | File size | Writing | Reading |
|--------|-------------|-----------|---------|---------|
| TSDF   | no          | 33.6 MB   | 55 ms   | 4.0 ms  |
| numpy  | no          | 33.6 MB   | 229 ms  | 14.5 ms |
| numpy  | yes         | 15.8 MB   | 1150 ms | 111 ms  |

### Dataframe IO (pandas and polars)

| Format  | Compression | File size | Writing (pandas) | Reading (pandas) | Writing (polars) | Reading (polars) |
|---------|-------------|-----------|------------------|------------------|------------------|------------------|
| TSDF    | no          | 33.6 MB   | 298 ms           | 16.2 ms          |                  |                  |
| Parquet | no          | 21.1 MB   | 329 ms           | 27.7 ms          | 232 ms           | 5.8 ms           |
| Parquet | snappy      | 19.1 MB   | 361 ms           | 30.0 ms          | 194 ms           | 8.9 ms           |
| Parquet | zstd        | 17.1 MB   | 360 ms           | 30.8 ms          | 285 ms           | 13.4 ms          |
| Parquet | gzip        | 16.1 MB   | 1490 ms          | 42.0 ms          | 536 ms           | 18.7 ms          |
| Feather | no          | 33.6 MB   | 223 ms           | 14.3 ms          | 229 ms           | 18.7 ms          |
| Feather | lz4         | 19.8 MB   | 156 ms           | 20.0 ms          | 201 ms           | 28.4 ms          |

I suspect that file IO will not be the bottleneck in practice. And if the data is stored on a slower filesystem, a smaller file size and fewer files might well outweigh the parsing and compression overhead: compression saves disk space, and combining columns into a single file saves inodes.

So I think we might as well use Feather v2 (Arrow IPC) or Parquet.

## Metadata

The other reason for using tsdf is that it stores metadata alongside the data itself. In principle, Parquet files can also carry metadata (as a list of key/value pairs), but pandas does not expose that, so with pandas we would still need a separate metadata file. Alternatively, we could write the files with pyarrow, which does support embedding metadata in Parquet files.

## Proposal

It would be nice if tsdf supported other file formats for data storage.

- We could add a `format` field to the metadata.
- The current format is a `"binary"` data file.
- When `format` is `"parquet"`, the `file_name` points to a parquet file. The metadata should then not contain `bits`, `data_type`, `endianness`, `rows`, `channels`.
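A metadata entry for a parquet-backed file might then look something like this (the file name is made up, and any other required tsdf metadata fields are omitted here):

```json
{
  "file_name": "IMU_segment1.parquet",
  "format": "parquet"
}
```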

On the API side we would just need to add parameters to specify the storage type:

```python
write_dataframe(path, df, format="binary", compression=None)
```