Add a specialized function for calculating rolling averages #400
update examples
date and name cleanup
Keeping `as_list_col` for now so args match those of `epi_slide` as closely as possible.
aac781b to 2d42c4f
However, date sequence completion is slow when time_step provided
2d42c4f to 93830c4
Can't use column names like vars
I'll be adding unit tests tomorrow. Results from The new function is 4-20 times faster than
thought: to make this more general, we could let the user pass a Edit 04/01/2024: Added in #433
This reverts commit a51e7ee.
So I was trying to actually test this to see if I had anything useful to say, and the tests won't run; for whatever reason my environment can't find a function called Edit: substituting
Looks like
So this is just going to be my commentary on the tests for now (I'll do a closer readthrough of the actual code eventually). Generally covers most of the bases I can think of. I'm not sure I saw any examples with `NA`s instead of gaps, which would be worth checking.
Minor thing: there are a number of test functions that are all named the same thing (`generate_special_date_data` and `test_time_type_mean`), so it's hard to tell whether they actually are the same.
  as_epi_df(as_of = d + 6)

result1 <- epi_slide_mean(small_x, "value", before = 50, names_sep = NULL, na.rm = TRUE)
expect_identical(result1, expected_output)
Comparing this with the same calculation on the ungrouped version (`epi_slide_mean(small_x %>% ungroup(), "value", before = 50, names_sep = NULL, na.rm = TRUE)`), it looks like `epi_slide_mean` doesn't work for ungrouped tables?
This seems inconsistent with `epi_slide`, though. E.g. `epi_slide(small_x, cases_7dav = mean(value), before = 50)` auto-groups by `geo_value`.
What was the motivation for the change?
> E.g. `epi_slide(small_x, cases_7dav = mean(value), before = 50)` auto-groups by `geo_value`.

(Pretending you used `before = 6` here.) It actually does something else: it calculates averages across usually-7-day x "number of geos" row windows, then broadcasts (repeats) those slide+aggregation values back to the `ref_time_value` and every geo appearing at that `ref_time_value`. (Technically, "number of geos" isn't quite right. If a new geo or geos start reporting midway through, then we might have 3 x {old number of geos} + 4 x {new number of geos} going into the computation.)
Rant: this simple average is almost never what we want. With count data, we've divided by "number of geos"; we probably want either the national 7dav of counts, or maybe national 7dav * current geo pop / total pop, "distributing" across states proportionally to population. With rate data, we almost surely want a 7dav of the population-weighted mean of the rates (equivalent to converting to cases, getting the national 7dav, converting to a national rate, and broadcasting). So the plan to split off the aggregating+broadcasting slide and add geo/epigroup aggregation helpers might save us from thinking we're getting by-geo or good cross-geo results here.
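To make the rate-data point concrete, here is a minimal base-R sketch for a single day (no 7dav, no `epi_slide`). The two geos, their populations, and their case counts are made up for illustration, and the variable names are hypothetical.

```r
# Toy data (made up): two geos with very different populations.
pop   <- c(a = 1e6, b = 9e6)   # population per geo
cases <- c(a = 100, b = 90)    # case counts per geo on one day
rate  <- cases / pop * 1e5     # cases per 100k in each geo

# Simple average of rates: every geo counts equally, regardless of size.
simple_mean_rate <- mean(rate)

# Population-weighted mean of the rates.
weighted_mean_rate <- weighted.mean(rate, w = pop)

# Same quantity via counts: total cases over total population.
national_rate <- sum(cases) / sum(pop) * 1e5
```

In this toy case the simple mean is pulled toward the small geo's rate, while the weighted mean agrees with the rate computed from total counts. The same reasoning would apply to each window of a 7-day average.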
Ok, I misunderstood what was going on here; the `epi_slide` call should've been `epi_slide(small_x %>% ungroup, cases_7dav = mean(value), before = 50)`, which misled me about `epi_slide`'s base behavior.
Depending on how the rework goes we may want to switch `epi_slide` to just the grouped behavior and leave the ungrouped versions to whatever we call the aggregating versions (like Logan is getting at).
fed49f5 to 539260a
a5c429a to 56bed8c
I went through the actual code this time too, generally lgtm.
237fc38 to 0125bee
Create a variant of `epi_slide` that only calculates rolling averages and is much faster doing so. It is backed by `data.table::frollmean` (`RcppRoll::roll_mean` is somewhat faster but due to licensing issues, we are not using it). Users can modify `frollmean` behavior by passing args to it via `...`.
Because `frollmean` performs calculations strictly by index, i.e. it does not check that the last n obs correspond to the last n dates, `epi_slide_mean` cannot be used to aggregate across groups. Any group containing duplicate time values will raise an error. So, e.g., trying to compute an average across all geos in a dataset that contains multiple geos but is ungrouped will fail. This is in line with common use cases.
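The index-versus-date distinction can be illustrated with a small base-R sketch. It deliberately does not call `data.table::frollmean` or `epi_slide_mean`; the helpers `trailing_mean_by_index` and `trailing_mean_by_date` are hypothetical stand-ins written for this example, and the toy series is made up.

```r
# Hypothetical helper: trailing mean over the last n rows, ignoring dates.
# This mimics a strictly index-based rolling computation.
trailing_mean_by_index <- function(x, n) {
  vapply(seq_along(x), function(i) {
    if (i < n) NA_real_ else mean(x[(i - n + 1):i])
  }, numeric(1))
}

# Hypothetical helper: trailing mean over the last n days, using the dates.
# Incomplete windows are averaged over whatever rows are available.
trailing_mean_by_date <- function(dates, x, n) {
  vapply(seq_along(x), function(i) {
    in_window <- dates > dates[i] - n & dates <= dates[i]
    mean(x[in_window])
  }, numeric(1))
}

# Toy series with a gap: 2024-01-03 and 2024-01-04 are missing.
dates <- as.Date("2024-01-01") + c(0, 1, 4, 5)
value <- c(10, 20, 30, 40)

by_index <- trailing_mean_by_index(value, n = 2)
by_date  <- trailing_mean_by_date(dates, value, n = 2)
```

At the third row (2024-01-05) the index-based version averages the values from 2024-01-02 and 2024-01-05, which are three days apart, while the date-based version uses only the 2024-01-05 value. This is the kind of mismatch that gaps or duplicate time values within a group would introduce into an index-based roll.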