Motivation

In my opinion there is no good reason why we could not use a standard data frame file format. These formats can store columns with different data types, removing the need to split data over multiple files.
Adding support for this would make it easier for other people to use our analysis pipelines, and for us to work with other people's data. It also means that we can more easily use tools that other people have built on top of these formats.
Performance
I did some benchmarks (using IMU+time data), and while the raw file format used by tsdf is faster when reading numpy files, this difference disappears as soon as you use dataframes:
Numpy array IO

| Format | Compression | File size | Writing | Reading |
|--------|-------------|-----------|---------|---------|
| TSDF   | no          | 33.6 MB   | 55 ms   | 4.0 ms  |
| numpy  | no          | 33.6 MB   | 229 ms  | 14.5 ms |
| numpy  | yes         | 15.8 MB   | 1150 ms | 111 ms  |
Dataframe IO (pandas and polars)

| Format  | Compression | File size | Writing | Reading | Writing (polars) | Reading (polars) |
|---------|-------------|-----------|---------|---------|------------------|------------------|
| TSDF    | no          | 33.6 MB   | 298 ms  | 16.2 ms | –                | –                |
| Parquet | no          | 21.1 MB   | 329 ms  | 27.7 ms | 232 ms           | 5.8 ms           |
| Parquet | snappy      | 19.1 MB   | 361 ms  | 30.0 ms | 194 ms           | 8.9 ms           |
| Parquet | zstd        | 17.1 MB   | 360 ms  | 30.8 ms | 285 ms           | 13.4 ms          |
| Parquet | gzip        | 16.1 MB   | 1490 ms | 42.0 ms | 536 ms           | 18.7 ms          |
| Feather | no          | 33.6 MB   | 223 ms  | 14.3 ms | 229 ms           | 18.7 ms          |
| Feather | lz4         | 19.8 MB   | 156 ms  | 20.0 ms | 201 ms           | 28.4 ms          |
I suspect that file IO is not going to be the bottleneck anyway. And if data is stored on a slower filesystem, a smaller file size or fewer files might well outweigh the parsing and compression overhead: compression saves on file size, and combining files saves on inodes.
So I think we might as well use feather-v2/ipc or parquet.
Metadata
The other reason for using tsdf is that it stores metadata along with the data itself. In theory, parquet files can also contain metadata (as a list of key/value pairs), but unfortunately pandas does not expose that, so we would still need a separate metadata file. Alternatively, we could write the files with pyarrow, which does support storing metadata in parquet files.
Proposal
It would be nice if tsdf supported other file formats for data storage.

- We could add a `format` field to the metadata.
- The current format is a `"binary"` data file.
- When `format` is `"parquet"`, the `file_name` points to a parquet file. The metadata should then not contain `bits`, `data_type`, `endianness`, `rows`, `channels`.
- On the API side we would just need to add parameters to specify the storage type.