Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the tsdf library interface #87

Open
twanvl opened this issue Oct 22, 2024 · 0 comments
Open

Improving the tsdf library interface #87

twanvl opened this issue Oct 22, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@twanvl
Copy link
Contributor

twanvl commented Oct 22, 2024

Motivation

In my opinion the tsdf interface is currently a mess in terms of coupling that shouldn't exist, making easy things hard, and exposing things that should be private. The fact that paradigma has to have helper functions for loading a tsdf file means that there is something wrong with the design of tsdf.

Some issues:

  1. TSDFMetadata mixes actual metadata (study_id, subject_id, device_id, start, end)
    with details about the data storage (data_type, endianness, bits).
    The latter should not be exposed at all, or at the very least not up to the user of the library to set.
  2. Details like the serialization format of timestamps is exposed (start_iso8601).
  3. Opening a channel from a file requires that the user knows the internal filename, but this should not be semantically relevant information. You should be able to change the file_name in the metadata to any string as long as you rename the corresponding file.
  4. A tsdf metadata file holds metadata for multiple data files. But it is not clear what the relation between these files is. In practice, the only way that we use these is as multiple columns with different data types. But this is not enforced.

I also have some gripes with the tsdf file format, but I'll save those for another issue.

TSDF files for a single recording

In TSDF a metadata file contains (metadata for) multiple data files, and these datafiles can contain data about the same time frame and the same subject, or it can be completely unrelated. You could have a single tsdf file that contains both the IMU sensor readings for subject 1, as well as the price of bread from 1980 to 1990.

A lot of problems with the interface come from the flexibility that tsdf allows. I don't see the advantage of bundling multiple recordings into a single metadata file over having separate metadata files for each recording.

And in fact, in the way we use TSDF in paradigma, a single metadata file always contains a single "recording". But because a single data file can only store a single data type, these recordings are still spread over multiple data files. So using a single data file is not enough.

There should be support for single-recording tsdf files, because I suspect that will be 99% of the uses of tsdf.
These could be easy to detect when opening a tsdf file:

  • all files have the same start and end time
  • the column names don't overlap

This works nicely with the next point

Split the TSDFMetadata class

The metadata about files (endianness, bits, filename, etc.) should not be left up to the library user. The only thing that the user sees should be actual metadata about the recording.

Proposal: Split metadata into two classes:

  • RecordingMetadata for metadata actually about a recording.
    While we are at it, the python object should also keeps datetimes as datetime objects (addressing point 2).
  • FileStorageMetadata for the metadata about the storage.
    For each file, this has a list of channels in that file, as well as endianness, data type, etc.

When opening a single-recording tsdf metadata file, you will get one RecordingMetadata, and one or more FileStorageMetadatas.
These could be bundled into a TSDFRecording class, which represents an opened tsdf file.

(Note: when opening a multi-recording tsdf metadata file, there would be multiple RecordingMetadatas, and so multiple TSDFRecordings).

In summary:

class TSDFRecording:
    meta : RecordingMetadata
    _files : List[FileStorageMetadata]

def read_multi_recording_tsdf_metadata(path) -> List[TSDFRecording]
def read_single_recording_tsdf_metadata(path) -> TSDFRecording

Reading data without knowing filenames

When opening a file, the library user should not have to specify or even know the filenames of data files. Instead, they should just ask for certain channels for a certain recording.

With the TSDFRecording we have covered selecting the cording. So, we could have an interface like:

class TSDFRecording:
    ...
    def read_numpy(self, channels) -> np.ndarray
    def read_dataframe(self, channels=None) -> pd.DataFrame
        """
        Read the chosen channels into a dataframe.
        If channels is None, then read all channels.
        """

To keep it simple, we could require that in read_numpy all channels come from the same data file. Since data files would be an internal detail, we can say they should come from the same "channel group", where a channel group is a group of channels that are stored in a single data file.

class TSDFRecording:
    ...
    @property channels : List[String]
    @property channel_groups : List[List[String]]

Tracking metadata when working with dataframes

There is now a one-to-one correspondence between the data stored in files covered by a TSDFRecording and the data stored in memory in a dataframe. The only difference is that the former holds metadata.

Ideally, the metadata would also by part of the dataframe, to make the correspondence complete.
Pandas has a feature for this in the form of dataframe attributes (df.attrs).
We can use these to store the tsdf metadata.

This means that we can make functions:

def get_metadata(df: Dataframe) -> RecordingMetadata
def set_metadata(df: Dataframe, meta: RecordingMetadata)

With perhaps some additional functions to handle the case of adding/removing channels with their ChannelMetadata.

All together, the reading, modifying and saving of a tsdf file could look like:

tsdf = read_tsdf_metadata(path) # TSDFRecording
df = tsdf.read_dataframe() # pandas.Dataframe with metadata
df.reverse() # do some operations
write_dataframe(out_path, df)

Or the writing could be split as

tsdf_new = write_dataframe_as_tsdf_recording(out_path, df) # TSDFRecording
tsdf_new.write_metadata(out_path)

Interface for writing files

For completeness, the interface for writing tsdf files could look like this:

Low level:

def write_numpy_without_metadata(path, data) -> FileStorageMetadata
def write_dataframe_without_metadata(path, df: pandas.DataFrame) -> List[FileStorageMetadata]
def make_tsdf_recording(meta: RecordingMetadata, files: List[FileStorageMetadata]) -> TSDFRecording
def write_tsdf_metadata(path, recording: TSDFRecording | List[TSDFRecording])

High level:

def write_dataframe(path, df) -> TSDFRecording:
    files = write_dataframe_without_metadata(path, df)
    recording = make_tsdf_recording(get_metadata(df), files)
    write_tsdf_metadata(path, recording)
    return recording

One small issue is that we need to come up with filenames for the data file. But since these don't matter, I propose to just use "{path}.1.bin", "{path}.2.bin", etc., where the path is the metadata file without the ".json" suffix. If we want to be more fancy, we could attach a name to channel groups, and use that as a suffix. Then you get "{path}.imu.bin" and "{path}.time.bin".

Migration

It is possible to implement this proposed interface alongside the current one. That way there is no breaking change.

@twanvl twanvl added the enhancement New feature or request label Oct 22, 2024
twanvl added a commit that referenced this issue Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant