Add a specialized function for calculating rolling averages #400

nmdefries · 2024-01-23T20:12:52Z

Create a variant of epi_slide that only calculates rolling averages and is much faster doing so. It is backed by data.table::frollmean (RcppRoll::roll_mean is somewhat faster, but we are not using it due to licensing issues). Users can modify frollmean behavior by passing args to it via `...`.

Because frollmean performs calculations strictly by index, i.e. it does not check that the last n obs correspond to the last n dates, epi_slide_mean cannot be used to aggregate across groups. Any group containing duplicate time values will raise an error. So, e.g., trying to compute an average across all geos in a dataset that contains multiple geos but is ungrouped will fail. This is in line with common use cases.
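To make the "strictly by index" point concrete, here is a minimal base-R sketch. It does not call frollmean or epi_slide_mean; `roll_mean_by_index` and `roll_mean_by_date` are hypothetical helpers written only for this illustration. It shows how a window counted in rows and a window counted in days diverge once a date is missing.

```r
# Trailing mean over the last `n` rows. Dates are never consulted, which
# mimics an index-based roller (hypothetical helper, not the PR's code).
roll_mean_by_index <- function(x, n) {
  vapply(seq_along(x), function(i) {
    if (i < n) NA_real_ else mean(x[(i - n + 1):i])
  }, numeric(1))
}

# Trailing mean over the last `n` days, selecting rows by their dates
# (hypothetical helper).
roll_mean_by_date <- function(x, dates, n) {
  vapply(seq_along(x), function(i) {
    in_window <- dates > dates[i] - n & dates <= dates[i]
    mean(x[in_window])
  }, numeric(1))
}

# Jan 1, 2, 3, then a gap until Jan 6.
dates <- as.Date("2024-01-01") + c(0, 1, 2, 5)
value <- c(1, 2, 3, 10)

by_index <- roll_mean_by_index(value, n = 3)
by_date <- roll_mean_by_date(value, dates, n = 3)

# On Jan 6 the row-based window still reaches back to Jan 2 and Jan 3,
# while the date-based window contains only Jan 6.
by_index[4] # 5
by_date[4] # 10
```

The same mismatch appears when several geos share a time value in an ungrouped table: a 3-row window no longer corresponds to 3 dates, which is presumably why erroring on duplicate time values is the safe choice.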

nmdefries · 2024-01-30T23:35:03Z

I'll be adding unit tests tomorrow. Results from epi_slide_mean match epi_slide(mean) on a variety of manual tests (time types, missing dates, etc) -- tests will codify those comparisons.

The new function is 4-20 times faster than epi_slide (and may be even faster on datasets larger than the ones I tested). epi_slide_mean scales much better as the input data gets bigger. It looks like all those time_types in epi_df actually work in epi_slide, so I've implemented them here, too.

R/slide.R

nmdefries · 2024-02-06T22:01:13Z

thought: to make this more general, we could let the user pass a data.table rolling fn (frollmean, frollsum, frollapply). Or if we want to use slider's slide_sum, etc, those too. (Intend to add this in a separate PR.)

Edit 04/01/2024: Added in #433

dsweber2 · 2024-03-14T00:13:47Z

So I was trying to actually test this to see if I had anything useful to say, and the tests won't run. For whatever reason my environment can't find a function called Start. What confuses me is that it seems to run fine in the check?

Edit: substituting Start -> min and End -> max allowed the tests to pass, at least. Looking at the context, that seems like the intended effect.

nmdefries · 2024-03-14T15:36:47Z

Looks like Start and End were removed in #418. Maybe they weren't being used anywhere else. I replaced them, so this should work again.

dsweber2

So this is just going to be my commentary on the tests for now (I'll do a closer readthrough of the actual code eventually). Generally covers most of the bases I can think of. I'm not sure I saw any examples with NA's instead of gaps, which would be worth checking.

minor thing: there are a number of test functions that all share the same name (generate_special_date_data and test_time_type_mean), so it's hard to tell whether they are actually identical.

dsweber2 · 2024-03-14T00:51:11Z

tests/testthat/test-epi_slide.R

+    as_epi_df(as_of = d + 6)
+
+  result1 <- epi_slide_mean(small_x, "value", before = 50, names_sep = NULL, na.rm = TRUE)
+  expect_identical(result1, expected_output)


comparing this with the same calculation on the ungrouped version (epi_slide_mean(small_x %>% ungroup(), "value", before = 50, names_sep = NULL, na.rm = TRUE)), it looks like epi_slide_mean doesn't work for ungrouped tables?

This seems inconsistent with epi_slide though? E.g. epi_slide(small_x, cases_7dav = mean(value), before = 50) auto-groups by geo_value.

What was the motivation for the change?

brookslogan · 2024-03-18

> E.g. epi_slide(small_x, cases_7dav = mean(value), before = 50) auto-groups by geo_value.

(Pretending you used before = 6 here.) It actually does something else... it calculates averages across usually-7-day x "number of geos" row windows, then broadcasts (repeats) those slide+aggregation values back to the ref_time_value and every geo appearing at that ref_time_value. (Technically, "number of geos" isn't quite right. If a new geo or geos start reporting midway through, then we might have 3 x {old number of geos} + 4 x {new number of geos} going into the computation.)
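As a rough illustration of the pooled-then-broadcast behavior described above, the following base-R sketch compares a window that pools every geo's rows with one restricted to the current geo. It does not call epi_slide; the toy data, `before = 1`, and the `window_mean` helper are assumptions made up for this example.

```r
# Toy data: two geos reporting on the same three days (made-up values).
df <- data.frame(
  geo_value = rep(c("a", "b"), each = 3),
  time_value = rep(as.Date("2024-01-01") + 0:2, times = 2),
  value = c(1, 2, 3, 10, 20, 30)
)
before <- 1 # window covers the reference day and the day before

# Mean over the time window ending at row i's time_value, optionally
# restricted to rows from the same geo (hypothetical helper).
window_mean <- function(i, same_geo_only) {
  ref_time <- df$time_value[i]
  in_window <- df$time_value >= ref_time - before & df$time_value <= ref_time
  if (same_geo_only) {
    in_window <- in_window & df$geo_value == df$geo_value[i]
  }
  mean(df$value[in_window])
}

rows <- seq_len(nrow(df))
# Ungrouped: every geo at a given time_value gets the same pooled value.
pooled <- vapply(rows, window_mean, numeric(1), same_geo_only = FALSE)
# Grouped by geo: each geo gets its own rolling mean.
by_geo <- vapply(rows, window_mean, numeric(1), same_geo_only = TRUE)

pooled # 5.5 8.25 13.75 5.5 8.25 13.75
by_geo # 1 1.5 2.5 10 15 25
```

On the first day only two rows fall in the pooled window, and on later days four do, which mirrors the "days x number of geos" row count mentioned above.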

Rant: this simple average is probably not what we want almost ever. With count data, we've divided by "number of geos"; we probably want either the national 7dav of counts, or maybe national 7dav * current geo pop / total pop, "distributing" across states proportionally to population. With rate data, we almost surely want to get a 7dav of the weighted mean of the rates by population (equivalent to converting to cases, getting national 7dav, converting to national rate, broadcasting). --- So the plan to split off the aggregating+broadcasting slide + add geo/epigroup aggregation helpers might save us from thinking we're getting by-geo or getting good cross-geo results here.
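The equivalence claimed for rate data can be checked with a small base-R sketch at a single time point. The rates and populations below are made-up numbers, and the per-100k scaling is an assumption of the example.

```r
# Made-up rates (per 100k) and populations for two geos.
rate_per_100k <- c(a = 10, b = 50)
population <- c(a = 1e6, b = 1e5)

# Simple mean across geos: ignores that geo "a" is ten times larger.
simple_mean <- mean(rate_per_100k)

# Population-weighted mean of the rates.
weighted <- weighted.mean(rate_per_100k, w = population)

# Same quantity via counts: convert rates to counts, sum nationally,
# then convert back to a national rate.
counts <- rate_per_100k * population / 1e5
national_rate <- sum(counts) / sum(population) * 1e5

simple_mean # 30
all.equal(weighted, national_rate) # TRUE
```

With populations held fixed over the window, averaging over time is linear, so taking the 7-day average before or after this weighting step should give the same result.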

dsweber2 · 2024-03-18

Ok, I misunderstood what was going on here. The epi_slide call should've been epi_slide(small_x %>% ungroup, cases_7dav = mean(value), before = 50); leaving out the ungroup misled me about epi_slide's base behavior.

Depending on how the rework goes we may want to switch epi_slide to just the grouped behavior and leave the ungrouped versions to whatever we call the aggregating versions (like Logan is getting at).

dsweber2

I went through the actual code this time too, generally lgtm.

nmdefries added 9 commits January 23, 2024 14:53

initial version of mean-specific epi_slide

949bf17

update examples

adding leading/lagging pad dates

4ea9bf7

handle after; reformat to slide_one_grp format

521268b

date and name cleanup

param checks

c133614

filter to ref_time_values passed by user before and after computation

73926f3

replace results with NA if all_rows; make sure output is epi_df

42757f5

don't need to pre-filter for user-provided ref time values

71a5ac3

warn that as_list_col not supported

1a0741d

Keeping `as_list_col` for now so args match those of `epi_slide` as closely as possible.

support list col output

a51e7ee

nmdefries force-pushed the ndefries/specialized-slide-mean branch from aac781b to 2d42c4f January 25, 2024 16:44

nmdefries mentioned this pull request Jan 25, 2024

slide profiling compared to various other backends #392

Open

support different time_step types

93830c4

However, date sequence completion is slow when time_step provided

nmdefries force-pushed the ndefries/specialized-slide-mean branch from 2d42c4f to 93830c4 January 30, 2024 21:38

nmdefries added 2 commits January 30, 2024 17:51

use more precise way to generate all_dates; comment cleanup

6fd21ee

fix epi_slide_mean examples

58b8163

Can't use column names like vars

nmdefries marked this pull request as ready for review January 30, 2024 23:25

nmdefries requested a review from brookslogan January 30, 2024 23:25

nmdefries added 6 commits January 31, 2024 10:25

pkgdown site

173ea58

leave epi_slide_mean result grouped

2e49a95

error if any group has duplicate time values

2cc3227

tests

a3efaeb

test unmappable types

03a1577

compare different before/after results

a243879

Merge branch 'dev' into ndefries/specialized-slide-mean

02db0ed

use reclass fn

007438d

nmdefries added 11 commits March 5, 2024 20:32

rename missing dates to times

12c32f0

rename col_name -> names in tests

fe81d2f

Revert "support list col output"

862f112

This reverts commit a51e7ee.

expect error when as_list_col is used in epi_slide_mean

396df53

clean up examples for epi_slide and _mean

8370a1b

comment use of non-time_step-transformed before/after in full_date_seq

4c9e632

fail if time_values are unevenly spaced

5f3af61

test cleanup

4081485

add epi_slide_mean example to intro slide vignette

ed231a9

deprecate more as_list_col behavior, clarify naming error

fe35331

Merge branch 'dev' into ndefries/specialized-slide-mean

75a2847

dsweber2 reviewed Mar 14, 2024

nmdefries added 6 commits March 23, 2024 10:46

replace deprecated Start and End

bf98cfd

linting

298be50

style: styler (GHA)

more linting

style: styler (GHA)

rearrange test vars for clarity

b447484

add actionable example to duplicate time value error

f5d7a8f

check other before/after values for full_date_seq helper

c04b2ad

check 0-row input data in epi_slide_mean

c6ee7f9

nmdefries force-pushed the ndefries/specialized-slide-mean branch 2 times, most recently from fed49f5 to 539260a March 23, 2024 15:12

support col_names as tidyselect

56bed8c

nmdefries force-pushed the ndefries/specialized-slide-mean branch from a5c429a to 56bed8c March 23, 2024 15:42

dsweber2 approved these changes Mar 25, 2024

more descriptive frollmean window size name

0125bee

nmdefries force-pushed the ndefries/specialized-slide-mean branch from 237fc38 to 0125bee March 26, 2024 17:27

style: styler (GHA)

8ae9fba

nmdefries merged commit 1087ca0 into dev Mar 26, 2024

nmdefries deleted the ndefries/specialized-slide-mean branch March 26, 2024 17:34
