-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
275 step epi slide #364
275 step epi slide #364
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some comments. I hope we can make use of epi_slide or epi_slide_opt here, since those are the bread and butter of epiprocess.
R/step_epi_slide.R
Outdated
gr <- new_data %>% | ||
dplyr::select(dplyr::all_of(c(ok, object$columns))) %>% | ||
group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>% | ||
dplyr::arrange(time_value) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment (non-blocking): is this grouping and then sorting done often in the steps? if yes and it's easy, might be worth grouping and arranging once at bake start and then assuming that in later steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many steps have a tendency to alter grouping and ordering. My convention is to always ensure that it happens inside the step to avoid bugs. But it would be nice to have an epiprocess
function that uses canonical ordering cmu-delphi/epiprocess#482.
R/step_epi_slide.R
Outdated
dplyr::mutate( | ||
dplyr::across( | ||
dplyr::all_of(object$columns), | ||
~ slider::slide_index_vec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why not use epiprocess::epi_slide or epiprocess::epi_slide_opt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had hoped to just use epi_slide()
, but the interface for functions and column selections is very different. I'm happy to take suggestions, but step_*()
need to have named args (in this case the function) and columns to apply the function to in ...
. I couldn't figure out how to convert that to epi_slide()
syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so messing around a bit, I think we need to add a way of specifying columns to the generic epi_slide
. if .f=frollmean
, then we can use epi_slide_opt
in a natural way:
col_names <- paste("slide_value", col_names, sep = "_")
names(col_names) <- newnames
new_data %>%
dplyr::select(dplyr::all_of(c(ok, object$columns))) %>%
group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>%
epiprocess::epi_slide_opt(
!!col_name_orig,
object$.fb,
before = object$before,
after = object$after) %>%
rename(col_names)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Potential solutions down below. Probably should test & benchmark against more realistic data. Though I'd suggest only allowing something backed by epi_slide_opt
because of the forgetting-epikey-grouping, incomplete-window (gaps, initial values), and speed issues of epi_slide
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a bit of a better benchmark on slides1 and slides2, using jhu_csse_daily_subset
, summing cases, and indeed, things do change. Would still prefer _opt
's output and speed.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 slides1(edf, before, varnames, … 6.96s 6.96s 0.144 NA 4.74 1 33 6.96s <epi_df> <NULL> <bench_tm> <tibble>
2 slides2(edf, before, varnames, … 179.04ms 188.02ms 5.37 NA 7.15 3 4 559.16ms <epi_df> <NULL> <bench_tm> <tibble>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[this is very much a partial review, it just wouldn't let me comment on conversations mid-review]]
Ideally, we'd also have steps for epi_mean, _sum, and opt since those are optimized cases for common uses, and will thus run much faster. Unfortunate amount of boilerplate to get there though.
Heard some suppressPackageStartupMessages({
library(dplyr)
library(epiprocess)
library(tidyr)
})
edf <- tibble::tibble(
geo_value = "az",
time_value = 1:10,
x = rep(1, 10),
y = 1:10
) %>%
as_epi_df()
varnames <- c("x", "y")
before <- 6L
fns <- list("7d_sum" = sum) # probably not the best idea with not-necessarily-complete-d data and not epi_slide_opt
slides1 <- function(edf, before, varnames, fns) {
edf %>%
group_by(geo_value) %>%
epi_slide(before = before, ~ summarize(.x, across(all_of(varnames), fns)), names_sep = NULL) %>%
ungroup()
}
# Faster variant when we use more realistic data:
slides2 <- function(edf, before, varnames, fns) {
comps_info <- crossing(varname = varnames, fnname = names(fns), fn = fns)
if (nrow(comps_info) == 0L) {
edf # or maybe arrange it
} else {
bind_cols(lapply(
seq_len(nrow(comps_info)),
function(comp_i) {
varname <- comps_info[[comp_i, "varname"]]
fnname <- comps_info[[comp_i, "fnname"]]
fn <- comps_info[[comp_i, "fn"]][[1L]]
result_name <- paste0(varname, "_", fnname)
basic_result <- edf %>%
group_by(geo_value) %>%
epi_slide(before = before, function(slice, gk, rtv) fn(slice[[varname]]), new_col_name = result_name) %>%
ungroup()
if (comp_i == 1L) {
basic_result
} else {
basic_result[result_name]
}
}
))
}
}
# note: this is NOT the same as you'd get out of epi_slide_opt
slides1(edf, before, varnames, fns)
#> An `epi_df` object, 10 x 6 with metadata:
#> * geo_type = state
#> * time_type = integer
#> * as_of = 2024-07-25 13:18:34.090191
#>
#> # A tibble: 10 × 6
#> geo_value time_value x y x_7d_sum y_7d_sum
#> * <chr> <int> <dbl> <int> <dbl> <int>
#> 1 az 1 1 1 1 1
#> 2 az 2 1 2 2 3
#> 3 az 3 1 3 3 6
#> 4 az 4 1 4 4 10
#> 5 az 5 1 5 5 15
#> 6 az 6 1 6 6 21
#> 7 az 7 1 7 7 28
#> 8 az 8 1 8 7 35
#> 9 az 9 1 9 7 42
#> 10 az 10 1 10 7 49
all.equal(
slides1(edf, before, varnames, fns),
slides2(edf, before, varnames, fns)
)
#> [1] TRUE Created on 2024-07-25 with reprex v2.1.1 |
@brookslogan perhaps you can add your preferred version here? Note that this step is intended to be completely general for arbitrary functions. |
Nothing's going to work properly for arbitrary functions at this point. |
Seemingly |
I need to be able to do |
(@nmdefries am I missing other error cases above?) |
@dajmcdon do any of these use cases require >1 epikey at a time in the computation? |
Size of window (# of datapoints) used in a computation can vary |
@brookslogan potentially. No a priori reason to exclude. My preference is for "something that works most of the time". I'm happy to have a version with potential bugs as long as it isn't producing garbage. |
I'd like to say the approach should be use epi_slide and then try to make some breaking changes in epiprocess that address the above issues, rather than patching around them here. However, I'm not sure in the short term that you wouldn't get garbage especially with short training windows and/or with
|
461d607
to
6ebb904
Compare
I'm pushing an updated version that uses I had to drop support for formulas for right now. We can add that back if you want. I'm about to add |
6ebb904
to
acc40bd
Compare
61d6e20
to
9c97485
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a few changes:
- validate_slide_fun now rejects formula f
- remove warning about optimized slide functions until that PR
- fix tests
- remove try_period and replace with epiprocess internal
- remove slider dependency
- update documentation
This should be good to go.
* validate_slide_fun now rejects formula f * remove warning about optimized slide functions until that PR * fix tests * remove try_period and replace with epiprocess internal * remove slider dependency * update documentation
b9bf025
to
740d438
Compare
Checklist
Please:
dajmcdon.
DESCRIPTION
andNEWS.md
.Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).
Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch