Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a specialized function for calculating rolling averages #400

Merged
merged 52 commits into from
Mar 26, 2024

Conversation

nmdefries
Copy link
Contributor

@nmdefries nmdefries commented Jan 23, 2024

Create a variant of epi_slide that only calculates rolling averages and is much faster doing so. It is backed by data.table::frollmean (RcppRoll::roll_mean is somewhat faster but due to licensing issues, we are not using it). Users can modify frollmean behavior by passing args to it via ....

Because frollmean performs calculations strictly by index, i.e. it does not check that the last n obs correspond to the last n dates, epi_slide_mean cannot be used to aggregate across groups. Any group containing duplicate time values will raise an error. So, e.g., trying to compute an average across all geos in a dataset that contains multiple geos but is ungrouped will fail. This is in line with common use cases.

However, date sequence completion is slow when time_step provided
@nmdefries nmdefries force-pushed the ndefries/specialized-slide-mean branch from 2d42c4f to 93830c4 Compare January 30, 2024 21:38
@nmdefries nmdefries marked this pull request as ready for review January 30, 2024 23:25
@nmdefries nmdefries requested a review from brookslogan January 30, 2024 23:25
@nmdefries
Copy link
Contributor Author

I'll be adding unit tests tomorrow. Results from epi_slide_mean match epi_slide(mean) on a variety of manual tests (time types, missing dates, etc) -- tests will codify those comparisons.

The new function is 4-20 times faster than epi_slide (may be faster on larger datasets than I tested). epi_slide_mean scales much better as the input data gets bigger. It looks like all those time_types in epi_df actually work in epi_slide, so I've implemented them here, too.

R/slide.R Outdated Show resolved Hide resolved
R/slide.R Outdated Show resolved Hide resolved
@nmdefries
Copy link
Contributor Author

nmdefries commented Feb 6, 2024

thought: to make this more general, we could let the user pass a data.table rolling fn (frollmean, frollsum, frollapply). Or if we want to use slider's slide_sum, etc, those too. (Intend to add this in a separate PR.)

Edit 04/01/2024: Added in #433

R/slide.R Outdated Show resolved Hide resolved
@dsweber2
Copy link
Contributor

dsweber2 commented Mar 14, 2024

So I was trying to actually test this to see if I had anything useful to say, and the tests won't run; for whatever reason my environment can't find a function called Start; what confuses me is that it seems to run fine in the check?

Edit: substituting Start -> min and End -> max allowed the tests to pass, at least. looking at the context it seems like the intended effect

@nmdefries
Copy link
Contributor Author

nmdefries commented Mar 14, 2024

Looks like Start and End were removed in #418. Maybe they weren't being used anywhere else. I replaced them, so this should work again.

Copy link
Contributor

@dsweber2 dsweber2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is just going to be my commentary on the tests for now (I'll do a closer readthrough of the actual code eventually). Generally covers most of the bases I can think of. I'm not sure I saw any examples with NA's instead of gaps, which would be worth checking.

minor thing: there's a number of test functions that are all named the same thing (generate_special_date_data and test_time_type_mean) that its hard to tell if they actually are.

as_epi_df(as_of = d + 6)

result1 <- epi_slide_mean(small_x, "value", before = 50, names_sep = NULL, na.rm = TRUE)
expect_identical(result1, expected_output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparing this with the same calculation on the ungrouped version (epi_slide_mean(small_x %>% ungroup(), "value", before = 50, names_sep = NULL, na.rm = TRUE)), it looks like epi_slide_mean doesn't work for ungrouped tables?

This seems inconsistent with epi_slide though? E.g. epi_slide(small_x, cases_7dav = mean(value), before = 50) auto-groups by geo_value.

What was the motivation for the change?

Copy link
Contributor

@brookslogan brookslogan Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. epi_slide(small_x, cases_7dav = mean(value), before = 50) auto-groups by geo_value.

(Pretending you used before = 6 here.) It actually does something else... it calculates averages across usually-7-day x "number of geos" row windows, then broadcasts (repeats) those slide+aggregation values back to the ref_time_value and every geo appearing at that ref_time_value. (Technically, "number of geos" isn't quite right. If a new geo or geos start reporting midway through, then we might have 3 x {old number of geos} + 4 x {new number of geos} going into the computation.)

Rant: this simple average is probably not what we want almost ever. With count data, we've divided by "number of geos"; we probably want either the national 7dav of counts, or maybe national 7dav * current geo pop / total pop, "distributing" across states proportionally to population. With rate data, we almost surely want to get a 7dav of the weighted mean of the rates by population (equivalent to converting to cases, getting national 7dav, converting to national rate, broadcasting). --- So the plan to split off the aggregating+broadcasting slide + add geo/epigroup aggregation helpers might save us from thinking we're getting by-geo or getting good cross-geo results here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I misunderstood what was going on here; the epi_slide call should've been epi_slide(small_x %>% ungroup, cases_7dav = mean(value), before = 50), which misled me about epi_slide's base behavior.

Depending on how the rework goes we may want to switch epi_slide to just the grouped behavior and leave the ungrouped versions to whatever we call the aggregating versions (like Logan is getting at).

tests/testthat/test-epi_slide.R Show resolved Hide resolved
R/slide.R Outdated Show resolved Hide resolved
tests/testthat/test-epi_slide.R Outdated Show resolved Hide resolved
tests/testthat/test-epi_slide.R Show resolved Hide resolved
@nmdefries nmdefries force-pushed the ndefries/specialized-slide-mean branch 2 times, most recently from fed49f5 to 539260a Compare March 23, 2024 15:12
@nmdefries nmdefries force-pushed the ndefries/specialized-slide-mean branch from a5c429a to 56bed8c Compare March 23, 2024 15:42
Copy link
Contributor

@dsweber2 dsweber2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the actual code this time too, generally lgtm.

R/slide.R Show resolved Hide resolved
R/slide.R Show resolved Hide resolved
R/slide.R Outdated Show resolved Hide resolved
@nmdefries nmdefries force-pushed the ndefries/specialized-slide-mean branch from 237fc38 to 0125bee Compare March 26, 2024 17:27
@nmdefries nmdefries merged commit 1087ca0 into dev Mar 26, 2024
@nmdefries nmdefries deleted the ndefries/specialized-slide-mean branch March 26, 2024 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants