read_raw_xdf(): loading assumes evenly sampling #385

DominiqueMakowski · 2023-08-12T13:07:42Z

I was adding a wrapper around pyxdf to load and tidy up xdf data in Neurokit when I stumbled upon these issues:

xdf-modules/pyxdf#79
xdf-modules/pyxdf#1

So I decided to give mnelab a try to see if what I was doing was correct, but I found an issue that happens when there was interruptions in the streaming:

import neurokit2 as nk

df, info = nk.read_xdf(
    "raw/physio/sub-01/ses-S001/eeg/sub-01_ses-S001_task-HCT_run-001_eeg.xdf"
)

from mnelab.io.xdf import read_raw_xdf

df2 = read_raw_xdf(
    "raw/physio/sub-01/ses-S001/eeg/sub-01_ses-S001_task-HCT_run-001_eeg.xdf",
    stream_ids=[1, 2, 3, 4, 5],
    fs_new=2000,
).to_data_frame()
df2.index = df.index

plt.close()
fig, ax = plt.subplots(nrows=3)
for c in ["AF7"]:
    ax[0].plot(info["data"][3][c], label="original", color="blue")
    ax[1].plot(df[c], label="NK", color="green")
    ax[2].plot(df2[c], label="mnelab", color="orange")
plt.legend()
plt.show()

Essentially I plotted:

in blue, the original signal from Muse EEG sampled at ~250He
in orange, the resampling at 2000 Hz by NK based on pandas that is timestamp-based
in blue the one by mnelab

There is an interruption at the beginning at the stream, but mnelab probably interpolates linearly which distorts the whole thing.

I can send you that xdf file by email if you need :)

(also tagging my friend @pmavros as this might be relevant to our eeg processing)

cbrnr · 2023-08-16T08:57:40Z

Thanks @DominiqueMakowski, you are correct that we currently assume regularly spaced samples per stream. Using pandas to handle interpolation is actually very clever, I wonder why I haven't thought of it before 😆 (the tradeoff is of course that it is a rather large dependency, but maybe worth it even for MNELAB).

Could you share the file with me so that I can play around with it to get a better grasp of the problem?

Regarding your function, how can I avoid linear interpolation for such a long interruption? By setting fillmissing=1/fs (where fs is the expected sampling frequency of the stream)?

Finally, I think it would be beneficial to get resampling directly into pyXDF. Did you check the implementation in xdf-modules/pyxdf#1 by any chance?

DominiqueMakowski · 2023-08-16T09:09:39Z

Could you share the file with me

Dropped you an email

Regarding your function, how can I avoid linear interpolation for such a long interruption?

I added a fillmissing argument that sets a limit in seconds on wether to prevent interpolations of long periods. It seem to work nicely and leaves NaNs, which is not the best for MNE but that's another issue

Finally, I think it would be beneficial to get resampling directly into pyXDF.

I agree...

cbrnr · 2023-08-16T09:16:43Z

I added a fillmissing argument that sets a limit in seconds on wether to prevent interpolations of long periods

So I set it to 1/fs? The default None seems to turn off NaNs completely if I'm not mistaken.

DominiqueMakowski · 2023-08-16T09:21:13Z

It does the transformation above so you just need to specify it in seconds, like 0.5 or 0.1. And yes None by default leaves all interruptions (normally)

DominiqueMakowski · 2023-08-16T09:53:13Z

If we are to have similar code in NK and mnelab, we might want to outsource some of it to pyxdf.

What I could see is a pyxdf.xdf_to_dataframe() function that tries to import pandas (and if not, errors saying that pandas is required for this function) - i.e., pandas is an optional dependency, and then converts and resamples the xdf to a dataframe (like what we do in NK now). Then, mnelab can extract the array from it and put that into mne.Raw and NK can use it as-is.

cbrnr · 2023-08-16T10:12:12Z

If we are to have similar code in NK and mnelab, we might want to outsource some of it to pyxdf.

Agreed. Resampling should really be handled by pyXDF, and a proposed solution already exists (although I'm not sure how easy it will be to rebase and if it is still working). But this should be discussed directly with the pyXDF people (maybe in xdf-modules/pyxdf#1).

DominiqueMakowski · 2023-08-16T10:20:05Z

For future reference: that method might suffer from some loss of precision, from my small experimentations using the union of existing and new indices was giving the best results

cbrnr · 2023-08-24T08:29:39Z

@DominiqueMakowski is the description of the signals in your top post correct? I think you might have mixed up the colors.

Just to be sure, the correct (expected) signal should contain a segment with missing data in the first second?

When loading just stream 4 (with or without resampling), I get this time series:

So I'm wondering if the import worked, and the problem is maybe in the to_data_frame() method? I can't investigate further for a couple of days (weeks?), so if you find anything in the meantime please LMK.

cbrnr · 2023-08-24T08:44:47Z

One more observation, the MNELAB GUI doesn't let you choose sampling frequencies greater than the highest sampling frequency in the file, i.e. 1000Hz in this example. Even then, the signals look exactly like in the screenshot, so maybe it's because you resample to 2000Hz (I doubt it, but still worth checking)?

cbrnr · 2023-08-24T08:47:10Z

Although I did spot another (probably unrelated) issue: the suggested resampling when selecting all streams is 52Hz, but it should be 1000Hz. Not sure what's going on here, but this seems like a separate issue.

DominiqueMakowski · 2023-08-24T08:57:33Z

Just to be sure, the correct (expected) signal should contain a segment with missing data in the first second?

No, I think the whole recording is like several minutes so it should be within the first minute or so (the time axis is messed up in my fig)

so maybe it's because you resample to 2000Hz (I doubt it, but still worth checking)?

The upsampling is done to avoid aliasing when merging signals with uneven sampling rates, but it should have fairly minimal impact

cbrnr · 2023-08-24T09:00:04Z

No, I think the whole recording is like several minutes so it should be within the first minute or so (the time axis is messed up in my fig)

So your three example plots do not actually show the problem? Sorry, I'm confused now, but now I don't understand what the problem with MNELAB is...

DominiqueMakowski · 2023-08-24T09:04:17Z

Can you zoom out in your fig to see all the signal horizontally?

cbrnr · 2023-08-24T09:19:57Z

Yes:

Looks the same whether I resample to 256Hz or not (in which case the sampling rate is 256.021Hz).

DominiqueMakowski · 2023-08-24T09:27:10Z

can you share the code to reproduce this fig?

cbrnr · 2023-08-24T09:49:10Z

This is all done in MNELAB with GUI commands, but here is the corresponding code (available in View – History). For example, here's the code for loading all streams and resampling to 1000Hz:

from copy import deepcopy
import mne
from mnelab.io import read_raw

datasets = []
data = read_raw(
    "/Users/clemens/Data/biosignal-test-data/XDF/sub-01_ses-S001_task-HCT_run-001_eeg.xdf",
    stream_ids=[1, 2, 3, 4, 5],
    fs_new=1000.0,
    preload=True
)
datasets.insert(0, data)
data.plot(events=events, n_channels=18)

DominiqueMakowski · 2023-08-24T09:54:13Z

Haha yes but that's the whole problem, mnelab assumes that samples are evenly sampled. That's the raw signal:

import pyxdf

streams, _ = pyxdf.load_xdf(
    "./raw/physio/sub-01/ses-S001/eeg/sub-01_ses-S001_task-HCT_run-001_eeg.xdf"
)
plt.plot(streams[3]["time_stamps"], 
         streams[3]["time_series"][:, 0])

cbrnr · 2023-08-24T10:16:39Z

OK, now I get it. Here's a screenshot showing the first channel over the entire duration, top (blue): original data as obtained with PyXDF, bottom (black): data imported with MNELAB:

cbrnr · 2023-08-24T10:19:54Z

Yes, we currently only look at the first and last timestamps when resampling. Even without resampling, we only look at the time of the first timestamp and then use the effective sampling rate for the remaining samples. I guess we need to consider all timestamps.

cbrnr · 2023-10-11T11:43:21Z

@DominiqueMakowski I wonder if interpolating missing data is the best solution. Would it not be better to use NaN values instead? Otherwise, it is difficult to determine if data collection (using a device with a given regular sampling frequency) worked, or if there was a gap where no data samples have been recorded. After all, you don't want to process the interpolated data, right?

DominiqueMakowski · 2023-10-11T12:21:53Z

I opted for a user-defined duration, that allows to keep interruptions longer than a given time

cbrnr · 2023-10-11T12:40:38Z

Ah, right! That's a good approach. So everything > than that duration is filled in as NaNs, right?

cbrnr · 2023-10-11T12:41:30Z

I should have read the thread again, you already mentioned this before! Sorry about the noise!

DominiqueMakowski · 2023-10-11T12:51:56Z

no worries haha I'm very often guilty of that as well

pmavros · 2023-10-11T13:01:56Z

Hi both, thanks for addressing this issue! Just wanted to share a thought. I wonder if the default should be 'none'. If I understand correct Dom's implementation, at the moment if the user does nothing, any interruption will be interpolated. It could lead to unwanted behaviour with users (perhaps less experienced, but I could see this happening in the other cases) who may not realised that there were interruptions in the signal in the first place (especially if they do automated processing downstream). For periods for more than a 'few' milliseconds, should we be interpolating non-stationary signals? Cheers, Panos

…

-----Original Message----- From: Dominique ***@***.***> To: cbrnr ***@***.***> Cc: Panos ***@***.***>; Mention ***@***.***> Date: Wednesday, 11 October 2023 2:52 PM CEST Subject: Re: [cbrnr/mnelab] read_raw_xdf(): resampling assumes evenly sampling (Issue #385) no worries haha I'm very often guilty of that as well — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

DominiqueMakowski · 2023-10-11T13:08:20Z

Yes, none is probably not the best, but then the right default depends on signals, like 1 second of EEG is probably too much, but for other signals like EDA it could be alright. Another option is not to set a default but to throw warnings if a break is detected.

The reader in neurokit is also made with neurokit in mind, which doesn't deal super well with nans

cbrnr · 2023-10-11T13:31:29Z

Good point. Every regularly sampled XDF stream has a nominal sampling frequency, so we could use it to define a default. Conservatively, we could choose everything > 1/fs to be filled with NaNs, but this is likely too small. Maybe > 2/fs is a better choice? It seems like a value depending on fs makes more sense than an absolute time interval.

cbrnr · 2023-10-11T13:34:35Z

I have another question @DominiqueMakowski. You are using df.interpolate() with method="index". This means linear interpolation between index values, right? I'm not sure if I understand the docs correctly.

pmavros · 2023-10-11T13:37:41Z

I like the idea in principle, but going back to Dom's example, sometimes we measure skin conductance at 15Hz or more, but would probably tolerate 1 second of interpolation as (parts of) the signal are slow; whereas for EEG sampling rate can vary (128, 500, 2000Hz) but would not change much our tolerance. What if we used the tag from the xdf? so if there are any discontinuities, stop the reading and ask them tomake signal specific thresholds e.g. {EEG:250, EDA:1000}?

…

-----Original Message----- From: Clemens ***@***.***> To: cbrnr ***@***.***> Cc: Panos ***@***.***>; Mention ***@***.***> Date: Wednesday, 11 October 2023 3:31 PM CEST Subject: Re: [cbrnr/mnelab] read_raw_xdf(): resampling assumes evenly sampling (Issue #385) Good point. Every regularly sampled XDF stream has a nominal sampling frequency, so we could use it to define a default. Conservatively, we could choose everything > 1/fs to be filled with NaNs, but this is likely too small. Maybe > 2/fs is a better choice? It seems like a value depending on fs makes more sense than an absolute time interval. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

cbrnr · 2023-10-11T14:19:58Z

But the question still remains: how do you define a discontinuity based on the signal type? You'd have to use type-specific durations to determine it, or no? Technically, I think it's easiest to take the nominal fs to decide if there are gaps in the signal and then emit a warning. This relies only on the fs and not on the type and domain-specific interpretation of a signal (i.e. which gap is still acceptable).

DominiqueMakowski · 2023-10-11T18:34:19Z

You are using df.interpolate() with method="index". This means linear interpolation between index values, right?

tbh I wouldn't be able to exactly explain how pandas work here, indeed their docs are a bit mysterious. All I can say is that from my trial and errors attempts that was the way that worked the best in preserving the original signal 🤷

I think it's easiest to take the nominal fs to decide if there are gaps

I think that's fine, yeah. in general slower signals will tend to have a lower nominal frequency (at least for some devices). I think we can be fairly conservative with warnings, so users can then explicitly specify more liberal rules

cbrnr · 2023-10-25T12:32:07Z

Quick comment, this problem also occurs without resampling, i.e. when loading just one stream. MNELAB currently does not handle gaps. It assumes that all data points are available at all time points defined by the nominal sampling frequency. In fact, MNELAB just looks at the first timestamp, but completely ignores all other timestamps.

So to fix this problem, I think we will need to resample (interpolate) all XDF streams, even if it's just one stream. Then we can take a look at how resampling two (or more) streams to a common sampling frequency behaves.

sappelhoff · 2024-07-25T11:28:56Z

Just FYI, as I think this bug may be related.

I tried reading an XDF file with 2 streams created by a tobii eyetracker. This eyetracker has channels that code for each sample the validity of that sample (1. = valid, 0 = invalid).

After reading the XDF via mnelab and "merging" the 2 tobii eyetracker streams (both by the same device and sfreq, btw, just one for gaze data, and one for so-called eye-openness-data) through the mnelab resampling approach, I ended up with values -1 and 2 in the data, next to the expected 1 and 0.

The raw data in XDF are fine, so it must be something that happens when going from XDF to raw via mnelab (I suspect the resampling).

Unfortunately I don't have time to dig into this more deeply.

cbrnr · 2024-07-25T11:44:04Z

Thanks @sappelhoff! I don't know if it is even possible to correctly sync these two streams without introducing some kind of (resampling) artifacts. To me, it looks like there should be a way to let the reader know that these streams are already synced, similar to how multiple EEG channels are also not treated as separate streams.

Besides, I think using a numeric stream to carry labels (categories) is not ideal, because as soon as you resample, these labels won't make much sense, since resampling necessarily involves an anti-aliasing filter.

@chkothe do you maybe have a suggestion how to handle this situation?

sappelhoff · 2024-07-25T11:49:21Z

To me, it looks like there should be a way to let the reader know that these streams are already synced, similar to how multiple EEG channels are also not treated as separate streams.

Indeed, I am now circumventing this problem by simply pretending that they are data from the same stream: Creating two MNE raw arrays, and appending them into one raw array.

Besides, I think using a numeric stream to carry labels (categories) is not ideal, because as soon as you resample, these labels won't make much sense, since resampling necessarily involves an anti-aliasing filter.

True, although this is perhaps something to bring up with the tobii company.

cbrnr · 2024-07-25T11:57:44Z

I'm glad that you already found a solution, I think this is the correct way to go forward. Regarding the label stream, I don't have a solution, since this will only work if you use the raw samples (i.e., without resampling) as is. As soon as you resample, I'm not sure how to make this stream work, because even if it was a categorical (string) stream, how would you assign the valid/invalid labels to resampled data?

cbrnr · 2024-09-26T09:47:07Z

tbh I wouldn't be able to exactly explain how pandas work here, indeed their docs are a bit mysterious. All I can say is that from my trial and errors attempts that was the way that worked the best in preserving the original signal 🤷

@DominiqueMakowski I double-checked, method="index" performs linear spline interpolation, which I don't think is ideal. You could check out method="cubic", which I think should be preferred (because it produces a smoother signal, which is much less likely to introduce artifacts when further processing the interpolated signal).

Furthermore, I don't think it is necessary to upsample to twice the highest sampling frequency in the data. Aliasing is only a problem when downsampling, and for upsampling you can just use the highest sampling frequency. But since this is configurable anyway, it's not a big deal.

I'll start working on my own interpolation to fix this problem without pandas.

cbrnr · 2024-09-26T10:50:15Z

My plan is to use cubic spline interpolation as demonstrated in this example:

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng
from scipy.interpolate import interp1d

rng = default_rng(42)

# generate non-uniformly spaced timestamps
start, stop, n = 0, 5, 100
ts = np.linspace(start, stop, n)
jitter = rng.normal(scale=1 / n, size=n - 2)
ts[1:-1] += jitter
ts = np.concatenate((ts[:30], ts[50:]))

# example signal (1 Hz sine)
data = np.sin(2 * np.pi * 1 * ts)

# define regular time grid for resampling
fs_new = 50  # new sampling frequency
ts_new = np.arange(ts[0], ts[-1], 1 / fs_new)  # regular time grid

# interpolate data to the new time grid
interpolator = interp1d(ts, data, kind="cubic", fill_value="extrapolate")
data_resampled = interpolator(ts_new)

# plot original and resampled data
fig, ax = plt.subplots(figsize=(12, 4))
ax.scatter(ts, data, label="Original data")
ax.plot(ts_new, data_resampled, color="orange")
ax.scatter(ts_new, data_resampled, s=5, color="red", zorder=10, label="Resampled data")
ax.legend()
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.spines[["right", "top"]].set_visible(False)
fig.set_tight_layout(True)
plt.show()

Of course, longer periods without any data should not be interpolated as shown in the example (but instead NaNs should be inserted). And of course all of this needs to work for any number of time series with different start and stop times.

I'd be grateful for feedback (@chkothe and @DominiqueMakowski maybe?). Note that I don't think this approach suffers from the problems discussed in xdf-modules/pyxdf#1 (comment), because I'm using cubic splines interpolation instead of linear interpolation. This approach also doesn't need a low-pass filter as a final step (as mentioned in xdf-modules/pyxdf#1 (comment)), because the method does not just insert zeroes (which it then would have to smooth afterwards of course). So unless I'm totally wrong, I think this should be a valid approach for time series that need to be aligned and brought onto a common regularly sampled time grid, which is something that we absolutely have to do in MNE.

DominiqueMakowski · 2024-09-26T11:59:32Z

I am not a fan of cubic interpolation: it introduces phantom oscillations and can lead to severe distortions (see some elements of discussion here). Linear interpolation is the "lesser evil" (the least bad assumption in the presence of unknown) in my opinion, it is indeed implausible but at least it doesn't introduce potential issues: + it doesn't hide the interpolated part which is useful when visualizing signals.

If you want to go with cubic interpolation, I'd recommend at least using a monotonic cubic interpolation

cbrnr · 2024-09-26T12:19:02Z

Good point, thanks for the input! Then back to linear interpolation I guess? For good measure, I think a lowpass filter after interpolation would be necessary then, or no?

cbrnr mentioned this issue Oct 11, 2023

Sync timestamps xdf-modules/pyxdf#1

Open

cbrnr changed the title ~~read_raw_xdf(): resampling assumes evenly sampling~~ read_raw_xdf(): loading assumes evenly sampling Oct 25, 2023

cbrnr added bug help wanted labels Dec 29, 2023

cbrnr mentioned this issue Sep 30, 2024

read_raw_xdf() can I use the nominal sampling rate? #436

Open

read_raw_xdf(): loading assumes evenly sampling #385

read_raw_xdf(): loading assumes evenly sampling #385

Comments

DominiqueMakowski commented Aug 12, 2023

cbrnr commented Aug 16, 2023

DominiqueMakowski commented Aug 16, 2023

cbrnr commented Aug 16, 2023

DominiqueMakowski commented Aug 16, 2023

DominiqueMakowski commented Aug 16, 2023

cbrnr commented Aug 16, 2023

DominiqueMakowski commented Aug 16, 2023

cbrnr commented Aug 24, 2023

cbrnr commented Aug 24, 2023

cbrnr commented Aug 24, 2023

DominiqueMakowski commented Aug 24, 2023

cbrnr commented Aug 24, 2023

DominiqueMakowski commented Aug 24, 2023

cbrnr commented Aug 24, 2023

DominiqueMakowski commented Aug 24, 2023

cbrnr commented Aug 24, 2023

DominiqueMakowski commented Aug 24, 2023 • edited Loading

cbrnr commented Aug 24, 2023

cbrnr commented Aug 24, 2023

cbrnr commented Oct 11, 2023

DominiqueMakowski commented Oct 11, 2023

cbrnr commented Oct 11, 2023

cbrnr commented Oct 11, 2023

DominiqueMakowski commented Oct 11, 2023

pmavros commented Oct 11, 2023 via email

DominiqueMakowski commented Oct 11, 2023

cbrnr commented Oct 11, 2023

cbrnr commented Oct 11, 2023

pmavros commented Oct 11, 2023 via email

cbrnr commented Oct 11, 2023

DominiqueMakowski commented Oct 11, 2023

cbrnr commented Oct 25, 2023

sappelhoff commented Jul 25, 2024

cbrnr commented Jul 25, 2024

sappelhoff commented Jul 25, 2024

cbrnr commented Jul 25, 2024

cbrnr commented Sep 26, 2024

cbrnr commented Sep 26, 2024

DominiqueMakowski commented Sep 26, 2024

cbrnr commented Sep 26, 2024

DominiqueMakowski commented Aug 24, 2023 •

edited

Loading