Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

275 step epi slide #364

Merged
merged 12 commits into from
Aug 3, 2024
Merged

275 step epi slide #364

merged 12 commits into from
Aug 3, 2024

Conversation

dajmcdon
Copy link
Contributor

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@dajmcdon dajmcdon requested review from dshemetov and dsweber2 July 24, 2024 22:27
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some comments. I hope we can make use of epi_slide or epi_slide_opt here, since those are the bread and butter of epiprocess.

R/step_epi_slide.R Outdated Show resolved Hide resolved
R/step_epi_slide.R Outdated Show resolved Hide resolved
gr <- new_data %>%
dplyr::select(dplyr::all_of(c(ok, object$columns))) %>%
group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>%
dplyr::arrange(time_value) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment (non-blocking): is this grouping and then sorting done often in the steps? if yes and it's easy, might be worth grouping and arranging once at bake start and then assuming that in later steps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many steps have a tendency to alter grouping and ordering. My convention is to always ensure that it happens inside the step to avoid bugs. But it would be nice to have an epiprocess function that uses canonical ordering cmu-delphi/epiprocess#482.

dplyr::mutate(
dplyr::across(
dplyr::all_of(object$columns),
~ slider::slide_index_vec(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why not use epiprocess::epi_slide or epiprocess::epi_slide_opt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had hoped to just use epi_slide(), but the interface for functions and column selections is very different. I'm happy to take suggestions, but step_*() need to have named args (in this case the function) and columns to apply the function to in .... I couldn't figure out how to convert that to epi_slide() syntax.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so messing around a bit, I think we need to add a way of specifying columns to the generic epi_slide. if .f=frollmean, then we can use epi_slide_opt in a natural way:

  col_names <- paste("slide_value", col_names, sep = "_")
  names(col_names) <- newnames
  new_data %>%
    dplyr::select(dplyr::all_of(c(ok, object$columns))) %>%
    group_by(dplyr::across(dplyr::all_of(ok[-1]))) %>%
    epiprocess::epi_slide_opt(
      !!col_name_orig,
      object$.fb,
      before = object$before,
      after = object$after) %>%
    rename(col_names)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Potential solutions down below. Probably should test & benchmark against more realistic data. Though I'd suggest only allowing something backed by epi_slide_opt because of the forgetting-epikey-grouping, incomplete-window (gaps, initial values), and speed issues of epi_slide.)

Copy link
Contributor

@brookslogan brookslogan Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a bit of a better benchmark on slides1 and slides2, using jhu_csse_daily_subset, summing cases, and indeed, things do change. Would still prefer _opt's output and speed.

  expression                            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory time       gc      
  <bch:expr>                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list> <list>     <list>  
1 slides1(edf, before, varnames, …    6.96s    6.96s     0.144        NA     4.74     1    33      6.96s <epi_df> <NULL> <bench_tm> <tibble>
2 slides2(edf, before, varnames, … 179.04ms 188.02ms     5.37         NA     7.15     3     4   559.16ms <epi_df> <NULL> <bench_tm> <tibble>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

R/step_epi_slide.R Show resolved Hide resolved
Copy link
Contributor

@dsweber2 dsweber2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[this is very much a partial review, it just wouldn't let me comment on conversations mid-review]]

Ideally, we'd also have steps for epi_mean, _sum, and opt since those are optimized cases for common uses, and will thus run much faster. Unfortunate amount of boilerplate to get there though.

R/step_epi_slide.R Outdated Show resolved Hide resolved
R/step_epi_slide.R Outdated Show resolved Hide resolved
R/step_epi_slide.R Outdated Show resolved Hide resolved
R/step_epi_slide.R Outdated Show resolved Hide resolved
@brookslogan
Copy link
Contributor

brookslogan commented Jul 25, 2024

Heard some epi_slide_opt/across-like things might be useful for reference here. (Though perhaps instead/additionally we can do something better upstream though, or limit things here to step_epi_slide_opt.)

suppressPackageStartupMessages({
  library(dplyr)
  library(epiprocess)
  library(tidyr)
})

edf <- tibble::tibble(
  geo_value = "az",
  time_value = 1:10,
  x = rep(1, 10),
  y = 1:10
) %>%
  as_epi_df()

varnames <- c("x", "y")
before <- 6L
fns <- list("7d_sum" = sum) # probably not the best idea with not-necessarily-complete-d data and not epi_slide_opt

slides1 <- function(edf, before, varnames, fns) {
  edf %>%
    group_by(geo_value) %>%
    epi_slide(before = before, ~ summarize(.x, across(all_of(varnames), fns)), names_sep = NULL) %>%
    ungroup()
}

# Faster variant when we use more realistic data:
slides2 <- function(edf, before, varnames, fns) {
  comps_info <- crossing(varname = varnames, fnname = names(fns), fn = fns)
  if (nrow(comps_info) == 0L) {
    edf # or maybe arrange it
  } else {
    bind_cols(lapply(
      seq_len(nrow(comps_info)),
      function(comp_i) {
        varname <- comps_info[[comp_i, "varname"]]
        fnname <- comps_info[[comp_i, "fnname"]]
        fn <- comps_info[[comp_i, "fn"]][[1L]]
        result_name <- paste0(varname, "_", fnname)
        basic_result <- edf %>%
          group_by(geo_value) %>%
          epi_slide(before = before, function(slice, gk, rtv) fn(slice[[varname]]), new_col_name = result_name) %>%
          ungroup()
        if (comp_i == 1L) {
          basic_result
        } else {
          basic_result[result_name]
        }
      }
    ))
  }
}

# note: this is NOT the same as you'd get out of epi_slide_opt
slides1(edf, before, varnames, fns)
#> An `epi_df` object, 10 x 6 with metadata:
#> * geo_type  = state
#> * time_type = integer
#> * as_of     = 2024-07-25 13:18:34.090191
#> 
#> # A tibble: 10 × 6
#>    geo_value time_value     x     y x_7d_sum y_7d_sum
#>  * <chr>          <int> <dbl> <int>    <dbl>    <int>
#>  1 az                 1     1     1        1        1
#>  2 az                 2     1     2        2        3
#>  3 az                 3     1     3        3        6
#>  4 az                 4     1     4        4       10
#>  5 az                 5     1     5        5       15
#>  6 az                 6     1     6        6       21
#>  7 az                 7     1     7        7       28
#>  8 az                 8     1     8        7       35
#>  9 az                 9     1     9        7       42
#> 10 az                10     1    10        7       49

all.equal(
  slides1(edf, before, varnames, fns),
  slides2(edf, before, varnames, fns)
)
#> [1] TRUE

Created on 2024-07-25 with reprex v2.1.1

@dajmcdon
Copy link
Contributor Author

@brookslogan perhaps you can add your preferred version here? Note that this step is intended to be completely general for arbitrary functions.

@brookslogan
Copy link
Contributor

Nothing's going to work properly for arbitrary functions at this point. mean and sum only work fully properly in epi_slide_opt, while functions outside mean/sum/min/max/(others?) are going to require epi_slide plus maybe some manual working around its deficiencies. Which do we want?

@dajmcdon
Copy link
Contributor Author

Seemingly epi_slide() then. What are the deficiencies?

@dajmcdon
Copy link
Contributor Author

I need to be able to do sd(), and something with coefficients from lm().

@brookslogan
Copy link
Contributor

  1. Lacks NAs at edges of time series.
  2. Lacks NAs when there are gaps.
  3. (Convenient footgun: lets you forget to group by all epikeys when that's what you want probably 95% of the time, and probably does the wrong thing the other 5% unless you also did some careful completion and have a convenient data set.)

(@nmdefries am I missing other error cases above?)

@brookslogan
Copy link
Contributor

@dajmcdon do any of these use cases require >1 epikey at a time in the computation?

@nmdefries
Copy link
Contributor

Size of window (# of datapoints) used in a computation can vary

@dajmcdon
Copy link
Contributor Author

@brookslogan potentially. No a priori reason to exclude.

My preference is for "something that works most of the time". I'm happy to have a version with potential bugs as long as it isn't producing garbage.

@brookslogan
Copy link
Contributor

brookslogan commented Jul 29, 2024

I'd like to say the approach should be use epi_slide and then try to make some breaking changes in epiprocess that address the above issues, rather than patching around them here. However, I'm not sure in the short term that you wouldn't get garbage especially with short training windows and/or with get_test_data(). So I'm thinking it's probably better to either

  • prioritize and wait for these fixes in epiprocess,
  • [prioritize and wait for generalizations of epi_slide_opt in epiprocess,]
  • do some patching up / extra checks here, or
  • do something custom not based epi_slide/epi_slide_opt

@dsweber2
Copy link
Contributor

dsweber2 commented Jul 31, 2024

I'm pushing an updated version that uses epi_slide. I had to force push because I rebased on dev.

I had to drop support for formulas for right now. We can add that back if you want. I'm about to add epi_slide_mean, epi_slide_sum and epi_slide_opt (I'll do those via PR on this I think).

@dsweber2 dsweber2 force-pushed the 275-step_epi_slide branch from 6ebb904 to acc40bd Compare July 31, 2024 19:18
@dshemetov dshemetov force-pushed the 275-step_epi_slide branch from 61d6e20 to 9c97485 Compare August 3, 2024 00:35
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few changes:

  • validate_slide_fun now rejects formula f
  • remove warning about optimized slide functions until that PR
  • fix tests
  • remove try_period and replace with epiprocess internal
  • remove slider dependency
  • update documentation

This should be good to go.

* validate_slide_fun now rejects formula f
* remove warning about optimized slide functions until that PR
* fix tests
* remove try_period and replace with epiprocess internal
* remove slider dependency
* update documentation
@dshemetov dshemetov force-pushed the 275-step_epi_slide branch from b9bf025 to 740d438 Compare August 3, 2024 01:33
@dshemetov dshemetov merged commit 991d70b into dev Aug 3, 2024
2 checks passed
@dshemetov dshemetov deleted the 275-step_epi_slide branch August 3, 2024 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add step_average
5 participants