# Support other file formats for data files #89

Open
twanvl opened this issue Oct 22, 2024 · 0 comments
Labels
enhancement New feature or request

twanvl commented Oct 22, 2024

## Motivation

In my opinion there is no good reason not to use a standard dataframe file format. These formats can store columns with different data types, removing the need to split data over multiple files.

Adding support for this would make it easier for other people to use our analysis pipelines, and for ourselves to work with other people's data. It also means that we can more easily use tools that other people have made to work on our data.

## Performance

I ran some benchmarks (using IMU + time data). While the raw file format used by tsdf is faster for reading numpy arrays, the difference disappears as soon as you work with dataframes:

### Numpy array IO

| Format | Compression | File size | Writing | Reading |
|--------|-------------|-----------|---------|---------|
| TSDF   | no          | 33.6 MB   | 55 ms   | 4.0 ms  |
| numpy  | no          | 33.6 MB   | 229 ms  | 14.5 ms |
| numpy  | yes         | 15.8 MB   | 1150 ms | 111 ms  |

### Dataframe IO (pandas and polars)

| Format  | Compression | File size | Writing (pandas) | Reading (pandas) | Writing (polars) | Reading (polars) |
|---------|-------------|-----------|------------------|------------------|------------------|------------------|
| TSDF    | no          | 33.6 MB   | 298 ms           | 16.2 ms          |                  |                  |
| Parquet | no          | 21.1 MB   | 329 ms           | 27.7 ms          | 232 ms           | 5.8 ms           |
| Parquet | snappy      | 19.1 MB   | 361 ms           | 30.0 ms          | 194 ms           | 8.9 ms           |
| Parquet | zstd        | 17.1 MB   | 360 ms           | 30.8 ms          | 285 ms           | 13.4 ms          |
| Parquet | gzip        | 16.1 MB   | 1490 ms          | 42.0 ms          | 536 ms           | 18.7 ms          |
| Feather | no          | 33.6 MB   | 223 ms           | 14.3 ms          | 229 ms           | 18.7 ms          |
| Feather | lz4         | 19.8 MB   | 156 ms           | 20.0 ms          | 201 ms           | 28.4 ms          |

I suspect that file IO will not be the bottleneck in practice. And if the data is stored on a slower filesystem, a smaller file size and fewer files might well outweigh the parsing and compression overhead: compression saves disk space, and combining columns into a single file saves inodes.

So I think we might as well use Feather v2 (Arrow IPC) or Parquet.

## Metadata

The other reason for using tsdf is that it stores metadata alongside the data itself. In principle, Parquet files can also carry metadata (as a list of key/value pairs), but pandas does not expose that, so with pandas we would still need a separate metadata file. Alternatively, we could write the files with pyarrow, which does support embedding metadata in Parquet files.

## Proposal

It would be nice if tsdf supported other file formats for data storage.

- We could add a `format` field to the metadata.
- The current format is a `"binary"` data file.
- When `format` is `"parquet"`, the `file_name` points to a parquet file. The metadata should then not contain `bits`, `data_type`, `endianness`, `rows`, `channels`.
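A metadata entry for a parquet-backed file might then look something like this (the file name is made up, and any other required tsdf metadata fields are omitted here):

```json
{
  "file_name": "IMU_segment1.parquet",
  "format": "parquet"
}
```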

On the API side we would just need to add parameters to specify the storage type:

```python
write_dataframe(path, df, format="binary", compression=None)
```