275 step epi slide #364

dajmcdon · 2024-07-24T22:27:00Z

Checklist

Please:

Make sure this PR is against "dev", not "main".
Request a review from one of the current epipredict main reviewers:
dajmcdon.
Make sure to bump the version number in DESCRIPTION and NEWS.md.
Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
Describe changes made in NEWS.md, making sure breaking changes
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Resolves add step_average #275

dshemetov

Made some comments. I hope we can make use of epi_slide or epi_slide_opt here, since those are the bread and butter of epiprocess.

R/step_epi_slide.R

dshemetov · 2024-07-24T23:56:24Z

R/step_epi_slide.R

+  gr <- new_data %>%
+    dplyr::select(dplyr::all_of(c(ok, object$columns))) %>%
+    group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>%
+    dplyr::arrange(time_value) %>%


comment (non-blocking): is this grouping and then sorting done often in the steps? if yes and it's easy, might be worth grouping and arranging once at bake start and then assuming that in later steps.

Many steps have a tendency to alter grouping and ordering. My convention is to always ensure that it happens inside the step to avoid bugs. But it would be nice to have an epiprocess function that uses canonical ordering cmu-delphi/epiprocess#482.

dshemetov · 2024-07-24T23:57:42Z

R/step_epi_slide.R

+    dplyr::mutate(
+      dplyr::across(
+        dplyr::all_of(object$columns),
+        ~ slider::slide_index_vec(


question: why not use epiprocess::epi_slide or epiprocess::epi_slide_opt?

I had hoped to just use epi_slide(), but the interface for functions and column selections is very different. I'm happy to take suggestions, but step_*() need to have named args (in this case the function) and columns to apply the function to in .... I couldn't figure out how to convert that to epi_slide() syntax.

so messing around a bit, I think we need to add a way of specifying columns to the generic epi_slide. if .f=frollmean, then we can use epi_slide_opt in a natural way:

col_names <- paste("slide_value", col_names, sep = "_") names(col_names) <- newnames new_data %>% dplyr::select(dplyr::all_of(c(ok, object$columns))) %>% group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>% epiprocess::epi_slide_opt( !!col_name_orig, object$.fb, before = object$before, after = object$after) %>% rename(col_names)

@nmdefries @lcbrooks

(Potential solutions down below. Probably should test & benchmark against more realistic data. Though I'd suggest only allowing something backed by epi_slide_opt because of the forgetting-epikey-grouping, incomplete-window (gaps, initial values), and speed issues of epi_slide.)

I did a bit of a better benchmark on slides1 and slides2, using jhu_csse_daily_subset, summing cases, and indeed, things do change. Would still prefer _opt's output and speed.

expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list> 1 slides1(edf, before, varnames, … 6.96s 6.96s 0.144 NA 4.74 1 33 6.96s <epi_df> <NULL> <bench_tm> <tibble> 2 slides2(edf, before, varnames, … 179.04ms 188.02ms 5.37 NA 7.15 3 4 559.16ms <epi_df> <NULL> <bench_tm> <tibble> Warning message: Some expressions had a GC in every iteration; so filtering is disabled.

R/step_epi_slide.R

dsweber2

[[this is very much a partial review, it just wouldn't let me comment on conversations mid-review]]

Ideally, we'd also have steps for epi_mean, _sum, and opt since those are optimized cases for common uses, and will thus run much faster. Unfortunate amount of boilerplate to get there though.

R/step_epi_slide.R

brookslogan · 2024-07-25T20:16:34Z

Heard some epi_slide_opt/across-like things might be useful for reference here. (Though perhaps instead/additionally we can do something better upstream though, or limit things here to step_epi_slide_opt.)

suppressPackageStartupMessages({
  library(dplyr)
  library(epiprocess)
  library(tidyr)
})

edf <- tibble::tibble(
  geo_value = "az",
  time_value = 1:10,
  x = rep(1, 10),
  y = 1:10
) %>%
  as_epi_df()

varnames <- c("x", "y")
before <- 6L
fns <- list("7d_sum" = sum) # probably not the best idea with not-necessarily-complete-d data and not epi_slide_opt

slides1 <- function(edf, before, varnames, fns) {
  edf %>%
    group_by(geo_value) %>%
    epi_slide(before = before, ~ summarize(.x, across(all_of(varnames), fns)), names_sep = NULL) %>%
    ungroup()
}

# Faster variant when we use more realistic data:
slides2 <- function(edf, before, varnames, fns) {
  comps_info <- crossing(varname = varnames, fnname = names(fns), fn = fns)
  if (nrow(comps_info) == 0L) {
    edf # or maybe arrange it
  } else {
    bind_cols(lapply(
      seq_len(nrow(comps_info)),
      function(comp_i) {
        varname <- comps_info[[comp_i, "varname"]]
        fnname <- comps_info[[comp_i, "fnname"]]
        fn <- comps_info[[comp_i, "fn"]][[1L]]
        result_name <- paste0(varname, "_", fnname)
        basic_result <- edf %>%
          group_by(geo_value) %>%
          epi_slide(before = before, function(slice, gk, rtv) fn(slice[[varname]]), new_col_name = result_name) %>%
          ungroup()
        if (comp_i == 1L) {
          basic_result
        } else {
          basic_result[result_name]
        }
      }
    ))
  }
}

# note: this is NOT the same as you'd get out of epi_slide_opt
slides1(edf, before, varnames, fns)
#> An `epi_df` object, 10 x 6 with metadata:
#> * geo_type  = state
#> * time_type = integer
#> * as_of     = 2024-07-25 13:18:34.090191
#> 
#> # A tibble: 10 × 6
#>    geo_value time_value     x     y x_7d_sum y_7d_sum
#>  * <chr>          <int> <dbl> <int>    <dbl>    <int>
#>  1 az                 1     1     1        1        1
#>  2 az                 2     1     2        2        3
#>  3 az                 3     1     3        3        6
#>  4 az                 4     1     4        4       10
#>  5 az                 5     1     5        5       15
#>  6 az                 6     1     6        6       21
#>  7 az                 7     1     7        7       28
#>  8 az                 8     1     8        7       35
#>  9 az                 9     1     9        7       42
#> 10 az                10     1    10        7       49

all.equal(
  slides1(edf, before, varnames, fns),
  slides2(edf, before, varnames, fns)
)
#> [1] TRUE

^{Created on 2024-07-25 with reprex v2.1.1}

dajmcdon · 2024-07-29T21:44:19Z

@brookslogan perhaps you can add your preferred version here? Note that this step is intended to be completely general for arbitrary functions.

brookslogan · 2024-07-29T21:48:06Z

Nothing's going to work properly for arbitrary functions at this point. mean and sum only work fully properly in epi_slide_opt, while functions outside mean/sum/min/max/(others?) are going to require epi_slide plus maybe some manual working around its deficiencies. Which do we want?

dajmcdon · 2024-07-29T21:49:33Z

Seemingly epi_slide() then. What are the deficiencies?

dajmcdon · 2024-07-29T21:50:51Z

I need to be able to do sd(), and something with coefficients from lm().

brookslogan · 2024-07-29T21:54:28Z

Lacks NAs at edges of time series.
Lacks NAs when there are gaps.
(Convenient footgun: lets you forget to group by all epikeys when that's what you want probably 95% of the time, and probably does the wrong thing the other 5% unless you also did some careful completion and have a convenient data set.)

(@nmdefries am I missing other error cases above?)

brookslogan · 2024-07-29T22:01:46Z

@dajmcdon do any of these use cases require >1 epikey at a time in the computation?

nmdefries · 2024-07-29T22:02:34Z

Size of window (# of datapoints) used in a computation can vary

dajmcdon · 2024-07-29T22:05:45Z

@brookslogan potentially. No a priori reason to exclude.

My preference is for "something that works most of the time". I'm happy to have a version with potential bugs as long as it isn't producing garbage.

brookslogan · 2024-07-29T22:23:59Z

I'd like to say the approach should be use epi_slide and then try to make some breaking changes in epiprocess that address the above issues, rather than patching around them here. However, I'm not sure in the short term that you wouldn't get garbage especially with short training windows and/or with get_test_data(). So I'm thinking it's probably better to either

prioritize and wait for these fixes in epiprocess,
[prioritize and wait for generalizations of epi_slide_opt in epiprocess,]
do some patching up / extra checks here, or
do something custom not based epi_slide/epi_slide_opt

Co-authored-by: David Weber <[email protected]>

dsweber2 · 2024-07-31T19:13:50Z

I'm pushing an updated version that uses epi_slide. I had to force push because I rebased on dev.

I had to drop support for formulas for right now. We can add that back if you want. I'm about to add epi_slide_mean, epi_slide_sum and epi_slide_opt (I'll do those via PR on this I think).

dshemetov

I made a few changes:

validate_slide_fun now rejects formula f
remove warning about optimized slide functions until that PR
fix tests
remove try_period and replace with epiprocess internal
remove slider dependency
update documentation

This should be good to go.

* validate_slide_fun now rejects formula f * remove warning about optimized slide functions until that PR * fix tests * remove try_period and replace with epiprocess internal * remove slider dependency * update documentation

dajmcdon requested review from dshemetov and dsweber2 July 24, 2024 22:27

dshemetov reviewed Jul 25, 2024

View reviewed changes

dsweber2 reviewed Jul 25, 2024

View reviewed changes

R/step_epi_slide.R Outdated Show resolved Hide resolved

R/step_epi_slide.R Outdated Show resolved Hide resolved

R/step_epi_slide.R Outdated Show resolved Hide resolved

dajmcdon commented Jul 25, 2024

View reviewed changes

R/step_epi_slide.R Outdated Show resolved Hide resolved

dajmcdon and others added 8 commits July 31, 2024 13:55

basic implementation

546fc70

add tests, redocument

6e3472f

bump version

a878ff9

add news

233e0a9

styler

d8c08d8

remove some extraneous code

8408b25

Apply suggestions from code review

3fec57f

Co-authored-by: David Weber <[email protected]>

first draft of epi_slide in step_epi_slide

7d22ef8

dsweber2 force-pushed the 275-step_epi_slide branch from 461d607 to 6ebb904 Compare July 31, 2024 19:11

dsweber2 assigned dsweber2 and dajmcdon Jul 31, 2024

new version passes the tests

acc40bd

dsweber2 force-pushed the 275-step_epi_slide branch from 6ebb904 to acc40bd Compare July 31, 2024 19:18

doc links update, example to function not formula

a586ce1

need to build the doc fix...

0d7c001

dshemetov force-pushed the 275-step_epi_slide branch from 61d6e20 to 9c97485 Compare August 3, 2024 00:35

dshemetov approved these changes Aug 3, 2024

View reviewed changes

refactor: step_epi_slide

740d438

* validate_slide_fun now rejects formula f * remove warning about optimized slide functions until that PR * fix tests * remove try_period and replace with epiprocess internal * remove slider dependency * update documentation

dshemetov force-pushed the 275-step_epi_slide branch from b9bf025 to 740d438 Compare August 3, 2024 01:33

dshemetov merged commit 991d70b into dev Aug 3, 2024
2 checks passed

dshemetov deleted the 275-step_epi_slide branch August 3, 2024 01:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

275 step epi slide #364

275 step epi slide #364

dajmcdon commented Jul 24, 2024

dshemetov left a comment

dshemetov Jul 24, 2024

dajmcdon Jul 25, 2024

dshemetov Jul 24, 2024

dajmcdon Jul 25, 2024

dsweber2 Jul 25, 2024

dsweber2 Jul 25, 2024

brookslogan Jul 25, 2024

brookslogan Jul 29, 2024 •

edited

Loading

dsweber2 left a comment

brookslogan commented Jul 25, 2024 •

edited

Loading

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024

brookslogan commented Jul 29, 2024

nmdefries commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024 •

edited

Loading

dsweber2 commented Jul 31, 2024 •

edited

Loading

dshemetov left a comment •

edited

Loading

275 step epi slide #364

275 step epi slide #364

Conversation

dajmcdon commented Jul 24, 2024

Checklist

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

dshemetov left a comment

Choose a reason for hiding this comment

dshemetov Jul 24, 2024

Choose a reason for hiding this comment

dajmcdon Jul 25, 2024

Choose a reason for hiding this comment

dshemetov Jul 24, 2024

Choose a reason for hiding this comment

dajmcdon Jul 25, 2024

Choose a reason for hiding this comment

dsweber2 Jul 25, 2024

Choose a reason for hiding this comment

dsweber2 Jul 25, 2024

Choose a reason for hiding this comment

brookslogan Jul 25, 2024

Choose a reason for hiding this comment

brookslogan Jul 29, 2024 • edited Loading

Choose a reason for hiding this comment

dsweber2 left a comment

Choose a reason for hiding this comment

brookslogan commented Jul 25, 2024 • edited Loading

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024

brookslogan commented Jul 29, 2024

nmdefries commented Jul 29, 2024

dajmcdon commented Jul 29, 2024

brookslogan commented Jul 29, 2024 • edited Loading

dsweber2 commented Jul 31, 2024 • edited Loading

dshemetov left a comment • edited Loading

Choose a reason for hiding this comment

brookslogan Jul 29, 2024 •

edited

Loading

brookslogan commented Jul 25, 2024 •

edited

Loading

brookslogan commented Jul 29, 2024 •

edited

Loading

dsweber2 commented Jul 31, 2024 •

edited

Loading

dshemetov left a comment •

edited

Loading