Sync dev -> main #556

Merged 2 commits, Oct 22, 2024
18 changes: 11 additions & 7 deletions DEVELOPMENT.md
@@ -26,16 +26,20 @@ Our CI builds two versions of the documentation:
- https://cmu-delphi.github.io/epiprocess/ from the `main` branch and
- https://cmu-delphi.github.io/epiprocess/dev from the `dev` branch.

We include the script `pkgdown-watch.R` that will automatically rebuild the
documentation locally and preview it. It can be used with:
Commands for developing the documentation site:

```sh
# Make sure you have servr installed
R -e 'renv::install("servr")'
# Will start a local server
Rscript pkgdown-watch.R
# You may need to first build the site with
# Basic build and preview
R -e 'pkgdown::clean_site()'
R -e 'devtools::document()'
R -e 'pkgdown::build_site()'

# A smart rebuild workflow for non-RStudio users.
# You may need to first build the site.
R -e 'pkgdown::build_site(".", examples = FALSE, devel = TRUE, preview = FALSE)'
R -e 'renv::install("servr")'
# Will start a local docs server and monitor for changes.
Rscript inst/pkgdown-watch.R
```

## Versioning
7 changes: 3 additions & 4 deletions R/methods-epi_archive.R
@@ -627,10 +627,9 @@ epix_detailed_restricted_mutate <- function(.data, ...) {
#' Slides a given function over variables in an `epi_archive` object. This
#' behaves similarly to `epi_slide()`, with the key exception that it is
#' version-aware: the sliding computation at any given reference time t is
#' performed on **data that would have been available as of t**. See the
#' [archive
#' vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html) for
#' examples.
#' performed on **data that would have been available as of t**. This function
#' is intended for use in accurate backtesting of models; see
#' `vignette("backtesting", package="epipredict")` for a walkthrough.
#'
#' @param .x An [`epi_archive`] or [`grouped_epi_archive`] object. If ungrouped,
#' all data in `x` will be treated as part of a single data group.
11 changes: 7 additions & 4 deletions R/slide.R
@@ -6,12 +6,15 @@
#' as follows:
#'
#' ```
#' # To compute the 7-day trailing average of cases
#' epi_slide(edf, cases_7dav = mean(cases), .window_size = 7)
#' # Create new column `cases_7dm` that contains a 7-day trailing median of cases
#' epi_slide(edf, cases_7dm = median(cases), .window_size = 7)
#' ```
#'
#' This will create the new column `cases_7dav` that contains a 7-day rolling
#' average of values in "cases". See `vignette("epi_df")` for more examples.
#' For two very common use cases, we provide optimized functions that are much
#' faster than `epi_slide`: `epi_slide_mean()` and `epi_slide_sum()`. We
#' recommend using these functions when possible.
#'
#' See `vignette("epi_df")` for more examples.
#'
#' @template basic-slide-params
#' @param .f Function, formula, or missing; together with `...` specifies the
23 changes: 14 additions & 9 deletions README.Rmd
@@ -5,7 +5,7 @@ output: github_document
<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source(file.path("vignettes", "_common.R"))
knitr::opts_chunk$set(
fig.path = "man/figures/README-"
)
@@ -16,7 +16,7 @@ knitr::opts_chunk$set(
The `{epiprocess}` package works with epidemiological time series data and
provides tools to manage, analyze, and process the data in preparation for
modeling. It is designed to work in tandem with
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/), which provides
[epipredict](https://cmu-delphi.github.io/epipredict/), which provides
pre-built epiforecasting models as well as tools to build custom models.
Both packages are designed to lower the barrier to entry and implementation cost
for epidemiological time series analysis and forecasting.
@@ -25,13 +25,18 @@ for epidemiological time series analysis and forecasting.

- `epi_df()` and `epi_archive()`, two data frame classes (that work like a
`{tibble}` with `{dplyr}` verbs) for working with epidemiological time
series data;
series data
- `epi_df` is for working with a snapshot of data at a single point in time
- `epi_archive` is for working with histories of data that changes over time
- one of the most common uses of `epi_archive` is for accurate backtesting of
forecasting models; see `vignette("backtesting", package="epipredict")`
- signal processing tools building on these data structures such as
- `epi_slide()` for sliding window operations;
- `epix_slide()` for sliding window operations on archives;
- `growth_rate()` for computing growth rates;
- `detect_outlr()` for outlier detection;
- `epi_cor()` for computing correlations;
- `epi_slide()` for sliding window operations (aids with feature creation)
- `epix_slide()` for sliding window operations on archives (aids with
backtesting)
- `growth_rate()` for computing growth rates
- `detect_outlr()` for outlier detection
- `epi_cor()` for computing correlations

If you are new to this set of tools, you may be interested in learning through a
book format: [Introduction to Epidemiological
@@ -41,7 +46,7 @@ You may also be interested in:

- `{epidatr}`, for accessing a wide range
of epidemiological data sets, including COVID-19 data, flu data, and more.
- `{rtestim}`, a package for estimating
- [rtestim](https://github.com/dajmcdon/rtestim), a package for estimating
the time-varying reproduction number of an epidemic.

This package is provided by the [Delphi group](https://delphi.cmu.edu/) at
33 changes: 21 additions & 12 deletions README.md
@@ -6,23 +6,32 @@
The `{epiprocess}` package works with epidemiological time series data
and provides tools to manage, analyze, and process the data in
preparation for modeling. It is designed to work in tandem with
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/), which
provides pre-built epiforecasting models and as well as tools to build
custom models. Both packages are designed to lower the barrier to entry
and implementation cost for epidemiological time series analysis and
[epipredict](https://cmu-delphi.github.io/epipredict/), which provides
pre-built epiforecasting models as well as tools to build custom
models. Both packages are designed to lower the barrier to entry and
implementation cost for epidemiological time series analysis and
forecasting.

`{epiprocess}` contains:

- `epi_df()` and `epi_archive()`, two data frame classes (that work
like a `{tibble}` with `{dplyr}` verbs) for working with
epidemiological time series data;
epidemiological time series data
- `epi_df` is for working with a snapshot of data at a single
point in time
- `epi_archive` is for working with histories of data that changes
over time
- one of the most common uses of `epi_archive` is for accurate
backtesting of forecasting models; see `vignette("backtesting",
package="epipredict")`
- signal processing tools building on these data structures such as
- `epi_slide()` for sliding window operations;
- `epix_slide()` for sliding window operations on archives;
- `growth_rate()` for computing growth rates;
- `detect_outlr()` for outlier detection;
- `epi_cor()` for computing correlations;
- `epi_slide()` for sliding window operations (aids with feature
creation)
- `epix_slide()` for sliding window operations on archives (aids
with backtesting)
- `growth_rate()` for computing growth rates
- `detect_outlr()` for outlier detection
- `epi_cor()` for computing correlations

If you are new to this set of tools, you may be interested in learning
through a book format: [Introduction to Epidemiological
@@ -32,8 +41,8 @@ You may also be interested in:

- `{epidatr}`, for accessing a wide range of epidemiological data sets,
including COVID-19 data, flu data, and more.
- `{rtestim}`, a package for estimating the time-varying reproduction
number of an epidemic.
- [rtestim](https://github.com/dajmcdon/rtestim), a package for
estimating the time-varying reproduction number of an epidemic.

This package is provided by the [Delphi group](https://delphi.cmu.edu/)
at Carnegie Mellon University.
2 changes: 1 addition & 1 deletion pkgdown-watch.R → inst/pkgdown-watch.R
@@ -36,7 +36,7 @@ servr::httw(
refs <- grep("man.+R(m?d)?$", files, value = TRUE)
if (length(refs)) {
# Doesn't work for me, so I run it manually.
# pkgdown::build_reference(pkg, preview = FALSE, examples = FALSE, lazy = FALSE)
# pkgdown::build_reference(pkg, preview = FALSE, examples = FALSE, lazy = FALSE) # nolint: commented_code_linter
}

pkgdown <- grep("pkgdown", files, value = TRUE)
11 changes: 7 additions & 4 deletions man/epi_slide.Rd


6 changes: 3 additions & 3 deletions man/epix_slide.Rd


Binary file modified man/figures/README-unnamed-chunk-7-1.png
2 changes: 1 addition & 1 deletion vignettes/compactify.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

## Removing redundant update data to save space
2 changes: 1 addition & 1 deletion vignettes/correlation.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epiprocess` package provides some simple functionality for computing lagged
225 changes: 6 additions & 219 deletions vignettes/epi_archive.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epi_archive` data structure provided by `epiprocess` provides convenient
@@ -25,10 +25,7 @@ In this vignette we will:
- summarize revision behavior in the archive,
- produce snapshots of the data in `epi_df` form,
- merge `epi_archive` objects together,
- run a simple autoregressive forecaster (version-unaware) on a single date,
- slide a simple autoregressive forecaster (version-unaware),
- slide a simple autoregressive forecaster (version-aware),
- compare version-aware and -unaware forecasts.
- provide a link to a vignette on backtesting forecasting models.

## Getting data into `epi_archive` format

@@ -234,221 +231,11 @@ Note that we have used the `sync = "locf"` argument to specify that we want to
synchronize the two datasets on their disjoint revisions by using the last
observation carried forward (LOCF). For more information, see `epix_merge()`.

## Backtesting a simple autoregressive forecaster

One of the most common use cases of the `epi_archive` object is for accurate
model backtesting. In this section we will:

- develop a simple autoregressive forecaster that predicts the next value of the
signal based on the current and past values of the signal itself, and
- demonstrate how to slide this forecaster over the `epi_archive` object to
produce forecasts at a few dates, using version-unaware and -aware
computations,
- compare the two approaches.

Before we get started, we should mention that all the work of constructing the
forecaster that we're about to do is just a simple demo. The `epipredict`
package, which is a companion package to `epiprocess`, offers a suite of
predictive modeling tools that can improve on some of the shortcomings of the
simple AR model we will use here, while also allowing the forecasters to be
built from building blocks. A better version of the function below exists as
`epipredict::arx_forecaster()`. See `vignette("backtesting",
package="epipredict")` for a similar demo, but using the forecasters in
that package.

First, let's define the forecaster. While AR models can be fit in numerous ways
(using base R or external packages), here we define it "by hand" because we
would like to demonstrate a *probabilistic* forecaster: one that outputs not
just a point prediction, but a notion of uncertainty around this. In particular,
our forecaster will output a point prediction along with a 90\% uncertainty
band, represented by predictive quantiles at the 5\% and 95\% levels (lower
and upper endpoints of the uncertainty band).

The function defined below, `prob_ar()`, is our probabilistic AR forecaster. Our
forecasting target will be the `percent_cli` signal. The function is as follows:
## Backtesting forecasting models

```{r}
#' `ahead` - the number of time units ahead to forecast (in this case days),
#' `lags` - the autoregressive lags to use in the model (in this case, the
#' current value and the values from 7 and 14 days ago),
#' `min_train_window` - the minimum number of observations required to fit the
#' forecaster (used to control for edge cases),
#' `lower_level` and `upper_level` - the quantiles to use for the uncertainty
#' bands
#' `symmetrize` - whether to symmetrize the residuals when computing the
#' uncertainty bands,
#' `intercept` - whether to include an intercept in the model,
#' `nonneg` - whether to clip the forecasts at zero.
prob_ar <- function(y, ahead = 7, lags = c(0, 7, 14), min_train_window = 90,
lower_level = 0.05, upper_level = 0.95, symmetrize = TRUE,
intercept = FALSE, nonneg = TRUE) {
# Return NA if insufficient training data
if (length(y) < min_train_window + max(lags) + ahead) {
return(data.frame(point = NA, lower = NA, upper = NA))
}
# Filter down the edge-NAs
y <- y[!is.na(y)]
# Build features and response for the AR model
dat <- do.call(
data.frame,
purrr::map(lags, function(j) lag(y, n = j))
)
names(dat) <- paste0("x", seq_len(ncol(dat)))
if (intercept) dat$x0 <- rep(1, nrow(dat))
dat$y <- lead(y, n = ahead)
# Now fit the AR model and make a prediction
obj <- lm(y ~ . + 0, data = dat)
point <- predict(obj, newdata = tail(dat, 1))
# Compute a band
r <- residuals(obj)
s <- ifelse(symmetrize, -1, NA) # Should the residuals be symmetrized?
q <- quantile(c(r, s * r), probs = c(lower_level, upper_level), na.rm = TRUE)
lower <- point + q[1]
upper <- point + q[2]
# Clip at zero if we need to, then return
if (nonneg) {
point <- max(point, 0)
lower <- max(lower, 0)
upper <- max(upper, 0)
}
return(data.frame(point = point, lower = lower, upper = upper))
}
```
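The symmetrized uncertainty band above (pooling residuals with their negations before taking quantiles, via `s <- -1`) is easy to miss in the R code. As an editor's aside, here is a minimal, dependency-free Python sketch of just that band computation; the function names are hypothetical and not part of either package, and `quantile` mimics R's default type-7 linear interpolation:

```python
def quantile(xs, p):
    # Linear-interpolation quantile of a list (R's default, type 7).
    xs = sorted(xs)
    h = (len(xs) - 1) * p
    i = int(h)
    return xs[i] + (h - i) * (xs[min(i + 1, len(xs) - 1)] - xs[i])

def symmetric_band(point, residuals, lower_level=0.05, upper_level=0.95):
    # Pool residuals with their negations (the s = -1 trick in prob_ar),
    # so the resulting band is symmetric about the point forecast.
    pooled = list(residuals) + [-r for r in residuals]
    return (point + quantile(pooled, lower_level),
            point + quantile(pooled, upper_level))

lo, hi = symmetric_band(10.0, [-1.0, 0.5, 2.0])
# lo and hi sit at equal distances below and above the point forecast
```

Without symmetrization, a skewed residual distribution would yield a band that is lopsided around the point forecast; the pooling step trades that asymmetry for a simpler, centered band.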

To start, let's run this forecaster on a single date (say, the last date in the
archive) to see how it performs. We will use the `epix_as_of()` method to
generate a snapshot of the archive at the last date, and then run the forecaster.

```{r}
edf_latest <- epix_as_of(dv_archive, dv_archive$versions_end) %>%
tidyr::drop_na() %>%
as_tibble()
edf_latest %>%
group_by(geo_value) %>%
summarize(fc = prob_ar(percent_cli), time_value = last(time_value), percent_cli = last(percent_cli))
```

The resulting epi_df now contains three new columns: `fc$point`, `fc$lower`, and
`fc$upper` corresponding to the point forecast, and the lower and upper
endpoints of the 90\% prediction band, respectively. The forecasts look
reasonable and in line with the data. The point forecast is close to the last
observed value, and the uncertainty bands are wide enough to capture the
variability in the data.

Note that the same can be achieved with `epix_slide()` using the following code:

```{r}
dv_archive %>%
group_by(geo_value) %>%
epix_slide(fc = prob_ar(percent_cli), .versions = dv_archive$versions_end)
```

Here we used `epix_slide()` to slide the forecaster over the `epi_archive`, but
by specifying the `.versions` argument to be the last version in the archive, we
effectively ran the forecaster on the last date in the archive.

Now let's go ahead and slide this forecaster in a version-unaware way. First, we
need to snapshot the latest version of the data, and then make a faux archive by
setting `version = time_value`. This has the effect of simulating a data set
that receives the final version updates every day.

```{r}
dv_archive_faux <- edf_latest %>%
mutate(version = time_value) %>%
as_epi_archive()
```
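The "faux archive" trick — treating every observation as if it had been published on its own date — is worth pausing on, since it is what makes the version-unaware comparison possible. A minimal Python illustration of the same idea (plain dicts; the column names mirror the R code above but the data values are made up for illustration):

```python
# A snapshot of finalized data: one record per observation date.
snapshot = [
    {"geo_value": "tx", "time_value": "2021-06-01", "percent_cli": 3.1},
    {"geo_value": "tx", "time_value": "2021-06-02", "percent_cli": 3.4},
]

# Faux archive: pretend each value became available the day it was observed
# (version = time_value), so any "as of" query returns finalized values.
faux_archive = [dict(row, version=row["time_value"]) for row in snapshot]
```

Because every record's `version` equals its `time_value`, sliding over this archive always hands the forecaster finalized data, which is exactly the version-unaware setting we want to contrast with the true archive.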

Now we can slide the forecaster over the faux archive to produce forecasts at a
number of dates in the past, spaced a month apart. Note that we will use the
`percent_cli` signal from the merged archive, a smoothed estimate of COVID-like
doctors' visits. We will forecast 1 through 28 days ahead. We also produce
forecasts in a version-aware way, which simply requires us to use the true
`epi_archive` object instead of the faux one.

```{r}
# Generate a sequence of forecast dates. Starting 3 months into the data, so we have
# enough data to train the AR model.
forecast_dates <- seq(as.Date("2020-10-01"), as.Date("2021-11-01"), by = "1 months")
k_week_ahead <- function(archive, ahead = 7) {
archive %>%
group_by(geo_value) %>%
epix_slide(fc = prob_ar(.data$percent_cli, ahead = ahead), .versions = forecast_dates) %>%
ungroup() %>%
mutate(target_date = version + ahead)
}
aheads <- 1:28
forecasts <- bind_rows(
map(aheads, ~ k_week_ahead(dv_archive_faux, ahead = .x) %>% mutate(version_aware = FALSE)),
map(aheads, ~ k_week_ahead(dv_archive, ahead = .x) %>% mutate(version_aware = TRUE))
)
```

Now let's plot the forecasts at each forecast date at multiple horizons. The
grey line represents the finalized `percent_cli` values, and the colored lines
represent the forecasts. One facet row shows the version-aware forecasts, and
the other shows the version-unaware forecasts. Dashed vertical lines mark the
forecast dates.

```{r}
edf_latest <- epix_as_of(dv_archive, dv_archive$versions_end)
max_version <- max(dv_archive$DT$version)
geo_choose <- "tx"
forecasts_filtered <- forecasts %>%
filter(geo_value == geo_choose) %>%
mutate(time_value = version)
percent_cli_data <- bind_rows(
# Snapshotted data for the version-aware forecasts
map(
forecast_dates,
~ epix_as_of(dv_archive, .x) %>%
mutate(version = .x)
) %>%
bind_rows() %>%
mutate(version_aware = TRUE),
# Latest data for the version-unaware forecasts
edf_latest %>% mutate(version_aware = FALSE)
) %>%
filter(geo_value == geo_choose)
ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
geom_ribbon(aes(ymin = fc$lower, ymax = fc$upper, fill = factor(time_value)), alpha = 0.4) +
geom_line(aes(y = fc$point, color = factor(time_value)), linetype = 2L) +
geom_point(aes(y = fc$point, color = factor(time_value)), size = 0.75) +
geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) +
geom_line(
data = percent_cli_data,
aes(x = time_value, y = percent_cli, color = factor(version)),
inherit.aes = FALSE, na.rm = TRUE
) +
facet_grid(version_aware ~ geo_value, scales = "free") +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
scale_y_continuous(expand = expansion(c(0, 0.05))) +
labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
theme(legend.position = "none")
```

A few points are worth making. First, it's clear that training and making
predictions on finalized data can lead to an overly optimistic sense of accuracy
(see, for example, [McDonald et al.
(2021)](https://www.pnas.org/content/118/51/e2111453118/), and references
therein). Second, we can see that the properly-versioned forecaster is, at some
points in time, more problematic; for example, it massively overpredicts the
peak in both locations in the winter wave of 2020. However, performance is pretty
poor across the board here, whether or not properly-versioned data is used, with
volatile and overconfident forecasts in various places.

As mentioned previously, this forecaster is meant only as a demo of the slide
functionality with some of the most basic forecasting methodology possible. The
[`epipredict`](https://cmu-delphi.github.io/epipredict/) package, which builds
off the data structures and functionality in the current package, is the place
to look for more robust forecasting methodology.
One of the most common use cases of the `epiprocess::epi_archive()` object is for
accurate model backtesting. See `vignette("backtesting", package="epipredict")`
for an in-depth demo, using a pre-built forecaster in that package.

## Attribution

2 changes: 1 addition & 1 deletion vignettes/epi_df.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epi_df` data structure provided by `epiprocess` provides convenient ways to
4 changes: 2 additions & 2 deletions vignettes/epiprocess.Rmd
@@ -10,7 +10,7 @@ editor_options:
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

## Overview
@@ -187,7 +187,7 @@ dv <- archive_cases_dv_subset$DT %>%
dv
```

See `vignette("epi_arcive")` for a more in-depth guide to `epi_archive` objects.
See `vignette("epi_archive")` for a more in-depth guide to `epi_archive` objects.

## Data attribution

2 changes: 1 addition & 1 deletion vignettes/growth_rate.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

A basic way of assessing growth in a signal is to look at its relative change
2 changes: 1 addition & 1 deletion vignettes/outliers.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

This vignette describes functionality for detecting and correcting outliers in