You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In my opinion the tsdf interface is currently a mess in terms of coupling that shouldn't exist, making easy things hard, and exposing things that should be private. The fact that paradigma has to have helper functions for loading a tsdf file means that there is something wrong with the design of tsdf.
Some issues:
TSDFMetadata mixes actual metadata (study_id, subject_id, device_id, start, end)
with details about the data storage (data_type, endianness, bits).
The latter should not be exposed at all, or at the very least not up to the user of the library to set.
Details like the serialization format of timestamps is exposed (start_iso8601).
Opening a channel from a file requires that the user knows the internal filename, but this should not be semantically relevant information. You should be able to change the file_name in the metadata to any string as long as you rename the corresponding file.
A tsdf metadata file holds metadata for multiple data files. But it is not clear what the relation between these files is. In practice, the only way that we use these is as multiple columns with different data types. But this is not enforced.
I also have some gripes with the tsdf file format, but I'll save those for another issue.
TSDF files for a single recording
In TSDF a metadata file contains (metadata for) multiple data files, and these datafiles can contain data about the same time frame and the same subject, or it can be completely unrelated. You could have a single tsdf file that contains both the IMU sensor readings for subject 1, as well as the price of bread from 1980 to 1990.
A lot of problems with the interface come from the flexibility that tsdf allows. I don't see the advantage of bundling multiple recordings into a single metadata file over having separate metadata files for each recording.
And in fact, in the way we use TSDF in paradigma, a single metadata file always contains a single "recording". But because a single data file can only store a single data type, these recordings are still spread over multiple data files. So using a single data file is not enough.
There should be support for single-recording tsdf files, because I suspect that will be 99% of the uses of tsdf.
These could be easy to detect when opening a tsdf file:
all files have the same start and end time
the column names don't overlap
This works nicely with the next point
Split the TSDFMetadata class
The metadata about files (endianness, bits, filename, etc.) should not be left up to the library user. The only thing that the user sees should be actual metadata about the recording.
Proposal: Split metadata into two classes:
RecordingMetadata for metadata actually about a recording.
While we are at it, the python object should also keeps datetimes as datetime objects (addressing point 2).
FileStorageMetadata for the metadata about the storage.
For each file, this has a list of channels in that file, as well as endianness, data type, etc.
When opening a single-recording tsdf metadata file, you will get one RecordingMetadata, and one or more FileStorageMetadatas.
These could be bundled into a TSDFRecording class, which represents an opened tsdf file.
(Note: when opening a multi-recording tsdf metadata file, there would be multiple RecordingMetadatas, and so multiple TSDFRecordings).
In summary:
class TSDFRecording:
meta : RecordingMetadata
_files : List[FileStorageMetadata]
def read_multi_recording_tsdf_metadata(path) -> List[TSDFRecording]
def read_single_recording_tsdf_metadata(path) -> TSDFRecording
Reading data without knowing filenames
When opening a file, the library user should not have to specify or even know the filenames of data files. Instead, they should just ask for certain channels for a certain recording.
With the TSDFRecording we have covered selecting the cording. So, we could have an interface like:
class TSDFRecording:
...
def read_numpy(self, channels) -> np.ndarray
def read_dataframe(self, channels=None) -> pd.DataFrame
"""
Read the chosen channels into a dataframe.
If channels is None, then read all channels.
"""
To keep it simple, we could require that in read_numpy all channels come from the same data file. Since data files would be an internal detail, we can say they should come from the same "channel group", where a channel group is a group of channels that are stored in a single data file.
There is now a one-to-one correspondence between the data stored in files covered by a TSDFRecording and the data stored in memory in a dataframe. The only difference is that the former holds metadata.
Ideally, the metadata would also by part of the dataframe, to make the correspondence complete.
Pandas has a feature for this in the form of dataframe attributes (df.attrs).
We can use these to store the tsdf metadata.
One small issue is that we need to come up with filenames for the data file. But since these don't matter, I propose to just use "{path}.1.bin", "{path}.2.bin", etc., where the path is the metadata file without the ".json" suffix. If we want to be more fancy, we could attach a name to channel groups, and use that as a suffix. Then you get "{path}.imu.bin" and "{path}.time.bin".
Migration
It is possible to implement this proposed interface alongside the current one. That way there is no breaking change.
The text was updated successfully, but these errors were encountered:
Motivation
In my opinion the tsdf interface is currently a mess in terms of coupling that shouldn't exist, making easy things hard, and exposing things that should be private. The fact that paradigma has to have helper functions for loading a tsdf file means that there is something wrong with the design of tsdf.
Some issues:
with details about the data storage (data_type, endianness, bits).
The latter should not be exposed at all, or at the very least not up to the user of the library to set.
start_iso8601
).file_name
in the metadata to any string as long as you rename the corresponding file.I also have some gripes with the tsdf file format, but I'll save those for another issue.
TSDF files for a single recording
In TSDF a metadata file contains (metadata for) multiple data files, and these datafiles can contain data about the same time frame and the same subject, or it can be completely unrelated. You could have a single tsdf file that contains both the IMU sensor readings for subject 1, as well as the price of bread from 1980 to 1990.
A lot of problems with the interface come from the flexibility that tsdf allows. I don't see the advantage of bundling multiple recordings into a single metadata file over having separate metadata files for each recording.
And in fact, in the way we use TSDF in paradigma, a single metadata file always contains a single "recording". But because a single data file can only store a single data type, these recordings are still spread over multiple data files. So using a single data file is not enough.
There should be support for single-recording tsdf files, because I suspect that will be 99% of the uses of tsdf.
These could be easy to detect when opening a tsdf file:
This works nicely with the next point
Split the TSDFMetadata class
The metadata about files (endianness, bits, filename, etc.) should not be left up to the library user. The only thing that the user sees should be actual metadata about the recording.
Proposal: Split metadata into two classes:
RecordingMetadata
for metadata actually about a recording.While we are at it, the python object should also keeps datetimes as
datetime
objects (addressing point 2).FileStorageMetadata
for the metadata about the storage.For each file, this has a list of channels in that file, as well as endianness, data type, etc.
When opening a single-recording tsdf metadata file, you will get one
RecordingMetadata
, and one or moreFileStorageMetadata
s.These could be bundled into a
TSDFRecording
class, which represents an opened tsdf file.(Note: when opening a multi-recording tsdf metadata file, there would be multiple
RecordingMetadata
s, and so multipleTSDFRecording
s).In summary:
Reading data without knowing filenames
When opening a file, the library user should not have to specify or even know the filenames of data files. Instead, they should just ask for certain channels for a certain recording.
With the
TSDFRecording
we have covered selecting the cording. So, we could have an interface like:To keep it simple, we could require that in
read_numpy
all channels come from the same data file. Since data files would be an internal detail, we can say they should come from the same "channel group", where a channel group is a group of channels that are stored in a single data file.Tracking metadata when working with dataframes
There is now a one-to-one correspondence between the data stored in files covered by a
TSDFRecording
and the data stored in memory in a dataframe. The only difference is that the former holds metadata.Ideally, the metadata would also by part of the dataframe, to make the correspondence complete.
Pandas has a feature for this in the form of dataframe attributes (
df.attrs
).We can use these to store the tsdf metadata.
This means that we can make functions:
With perhaps some additional functions to handle the case of adding/removing channels with their
ChannelMetadata
.All together, the reading, modifying and saving of a tsdf file could look like:
Or the writing could be split as
Interface for writing files
For completeness, the interface for writing tsdf files could look like this:
Low level:
High level:
One small issue is that we need to come up with filenames for the data file. But since these don't matter, I propose to just use
"{path}.1.bin"
,"{path}.2.bin"
, etc., where the path is the metadata file without the".json"
suffix. If we want to be more fancy, we could attach a name to channel groups, and use that as a suffix. Then you get"{path}.imu.bin"
and "{path}.time.bin"
.Migration
It is possible to implement this proposed interface alongside the current one. That way there is no breaking change.
The text was updated successfully, but these errors were encountered: