Utilities for working with DataFrames of `Intervals.jl` or `TimeSpans.jl` objects.

beacon-biosignals/DataFrameIntervals.jl

DataFrameIntervals

DataFrameIntervals provides two functions for computing joins over intervals of time, interval_join and groupby_interval_join, along with a helper function, quantile_windows. See their docstrings for details.

Rows match in this join if their time spans overlap. The time spans can be represented as objects from `Intervals.jl` or `TimeSpans.jl`.
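To make the overlap rule concrete, here is a minimal sketch in plain Julia. The `Span` struct, `spans_overlap`, and `naive_overlap_pairs` are hypothetical names used only for illustration; they are not part of DataFrameIntervals. The sketch assumes half-open spans (start included, stop excluded) and uses a deliberately naive nested loop, which is not how the package computes its joins.

```julia
using Dates

# Hypothetical stand-in for a time span, assumed half-open: [start, stop).
# This is an illustration only, not a type from DataFrameIntervals.
struct Span
    start::Nanosecond
    stop::Nanosecond
end

# Two half-open spans overlap when each one starts before the other stops.
spans_overlap(a::Span, b::Span) = a.start < b.stop && b.start < a.stop

# Naive O(n*m) matching: collect every (left row, right row) index pair
# whose spans overlap.
function naive_overlap_pairs(left::Vector{Span}, right::Vector{Span})
    return [(i, j) for i in eachindex(left) for j in eachindex(right)
            if spans_overlap(left[i], right[j])]
end

left = [Span(Nanosecond(0), Nanosecond(10)), Span(Nanosecond(10), Nanosecond(20))]
right = [Span(Nanosecond(5), Nanosecond(15))]
matches = naive_overlap_pairs(left, right)
```

Here both left rows match the single right row, while the two left spans do not overlap each other, because under the half-open assumption a span ending at 10 ns and one starting at 10 ns share no time.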

There are several options to support additional types, such as AlignedSpans. One option is to add interface methods to support automatic conversions to intervals; see e.g. #13. Another option is to manually convert to a supported type; this can provide additional control over how the conversion takes place. For example, one can simply convert to TimeSpans:

timespanify = :span => ByRow(TimeSpan) => :span
interval_join(transform(df1, timespanify), transform(df2, timespanify); on=:span)

For AlignedSpans, we can convert to integer indices, after checking the sample rates are all equal:

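using Compat # for allequal
using Intervals # for Interval and Closed
# Integer indices are only comparable when every span uses the same sample rate.
if !allequal(Iterators.flatten(((as.sample_rate for as in df1.span), (as.sample_rate for as in df2.span))))
  throw(ArgumentError("Sampling rates do not all match!"))
end
# Replace each AlignedSpan with a closed integer interval over its sample indices.
integer_spanify = :span => ByRow(as -> Interval{Int, Closed, Closed}(as.first_index, as.last_index)) => :span
interval_join(transform(df1, integer_spanify), transform(df2, integer_spanify); on=:span)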

Example

using TimeSpans
using DataFrames
using DataFrameIntervals
using Distributions
using Random
using Dates

n = 100
tovalue(x) = Nanosecond(round(Int, x * 1e9))
times = cumsum(rand(MersenneTwister(hash((:dataframe_intervals, 2022_06_01))), Gamma(3, 2), n+1))
spans = TimeSpan.(tovalue.(times[1:(end-1)]), tovalue.(times[2:end]))
df = DataFrame(label = rand(('a':'d'), n), x = rand(n), span = spans)
100×3 DataFrame
 Row │ label  x          span
     │ Char   Float64    TimeSpan
─────┼─────────────────────────────────────────────────────
   1 │ b      0.0606309  TimeSpan(00:00:05.164631882, 00:…
   2 │ a      0.961599   TimeSpan(00:00:08.853504418, 00:…
   3 │ c      0.55525    TimeSpan(00:00:13.431519652, 00:…
   4 │ d      0.058248   TimeSpan(00:00:25.929078264, 00:…
  ⋮  │   ⋮        ⋮                      ⋮
  98 │ a      0.995222   TimeSpan(00:08:51.512608520, 00:…
  99 │ d      0.188141   TimeSpan(00:08:56.662988067, 00:…
 100 │ a      0.338053   TimeSpan(00:08:58.445446762, 00:…
quarters = quantile_windows(4, df, label=:quarter)

interval_join(df, quarters, on=:span)
103×6 DataFrame
 Row │ quarter  label  x          span_left                          span_right                         span                              
     │ Int64    Char   Float64    TimeSpan                           TimeSpan                           TimeSpan                          
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       1  b      0.0606309  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:05.164631882, 00:…
   2 │       1  a      0.961599   TimeSpan(00:00:08.853504418, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:08.853504418, 00:…
   3 │       1  c      0.55525    TimeSpan(00:00:13.431519652, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:13.431519652, 00:…
   4 │       1  d      0.058248   TimeSpan(00:00:25.929078264, 00:…  TimeSpan(00:00:05.164631882, 00:…  TimeSpan(00:00:25.929078264, 00:…
  ⋮  │    ⋮       ⋮        ⋮                      ⋮                                  ⋮                                  ⋮
 101 │       4  a      0.995222   TimeSpan(00:08:51.512608520, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:51.512608520, 00:…
 102 │       4  d      0.188141   TimeSpan(00:08:56.662988067, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:56.662988067, 00:…
 103 │       4  a      0.338053   TimeSpan(00:08:58.445446762, 00:…  TimeSpan(00:06:51.442142229, 00:…  TimeSpan(00:08:58.445446762, 00:…

Related Packages

Below is a list of related packages and a brief indication of their differences from DataFrameIntervals.

  • TSx: various operations on time series data. It includes many features DataFrameIntervals does not aim to implement, but it does not implement joins over intervals of time.
  • FlexiJoins: generic join operations, including joins by interval predicates (∈, ⊆, ⊊, ⊋, ⊇, !isdisjoint). Its algorithms are more general purpose and are bound by the complexity of more general-purpose data structures (e.g. KD-trees). DataFrameIntervals is (currently) bound by a lower complexity class for its specific use case.
  • InMemoryDatasets.jl: includes inequality-like joins over intervals of time (where the interval is represented as two columns). This cannot yet achieve the behavior implemented in DataFrameIntervals, where multiple inequalities must be checked to determine overlap.
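As a rough illustration of why a dedicated overlap join can be cheaper than a general-purpose one, the sketch below finds all overlapping pairs with a sort followed by a single sweep, which costs about O(n log n + k) for n spans and k matches. This is a generic textbook technique and is not claimed to be the algorithm DataFrameIntervals uses. The function name `sweep_overlap_pairs` is hypothetical, and spans are assumed to be half-open `(start, stop)` tuples of integers.

```julia
# Hypothetical sketch: sweep-based overlap matching on (start, stop) tuples,
# assumed half-open. Not taken from DataFrameIntervals.
function sweep_overlap_pairs(left, right)
    lorder = sortperm(left; by=first)    # left indices ordered by start
    rorder = sortperm(right; by=first)   # right indices ordered by start
    active_l, active_r = Int[], Int[]    # spans that started and may still be open
    out = Tuple{Int,Int}[]
    i, j = 1, 1
    while i <= length(lorder) || j <= length(rorder)
        # Visit whichever side has the next-earliest start.
        take_left = j > length(rorder) ||
                    (i <= length(lorder) &&
                     first(left[lorder[i]]) <= first(right[rorder[j]]))
        if take_left
            li = lorder[i]; i += 1
            start = first(left[li])
            # Drop right spans that ended at or before this start; every
            # remaining one began earlier and is still open, so it overlaps.
            filter!(r -> last(right[r]) > start, active_r)
            append!(out, ((li, r) for r in active_r))
            push!(active_l, li)
        else
            ri = rorder[j]; j += 1
            start = first(right[ri])
            filter!(l -> last(left[l]) > start, active_l)
            append!(out, ((l, ri) for l in active_l))
            push!(active_r, ri)
        end
    end
    return sort!(out)
end

sweep_result = sweep_overlap_pairs([(0, 10), (10, 20)], [(5, 15)])
```

Each span is checked against both of its endpoints (it must start before the other stops and stop after the other starts), which is the "multiple inequalities" requirement mentioned above; the sort order takes care of one inequality and the `filter!` step takes care of the other.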