Sync dev -> main #556

Merged 2 commits, Oct 22, 2024
18 changes: 11 additions & 7 deletions DEVELOPMENT.md
@@ -26,16 +26,20 @@ Our CI builds two versions of the documentation:
- https://cmu-delphi.github.io/epiprocess/ from the `main` branch and
- https://cmu-delphi.github.io/epiprocess/dev from the `dev` branch.

We include the script `pkgdown-watch.R` that will automatically rebuild the
documentation locally and preview it. It can be used with:
Commands for developing the documentation site:

```sh
# Make sure you have servr installed
R -e 'renv::install("servr")'
# Will start a local server
Rscript pkgdown-watch.R
# You may need to first build the site with
# Basic build and preview
R -e 'pkgdown::clean_site()'
R -e 'devtools::document()'
R -e 'pkgdown::build_site()'

# A smart rebuild workflow for non-RStudio users.
# You may need to first build the site.
R -e 'pkgdown::build_site(".", examples = FALSE, devel = TRUE, preview = FALSE)'
R -e 'renv::install("servr")'
# Will start a local docs server and monitor for changes.
Rscript inst/pkgdown-watch.R
```

## Versioning
7 changes: 3 additions & 4 deletions R/methods-epi_archive.R
@@ -627,10 +627,9 @@ epix_detailed_restricted_mutate <- function(.data, ...) {
#' Slides a given function over variables in an `epi_archive` object. This
#' behaves similarly to `epi_slide()`, with the key exception that it is
#' version-aware: the sliding computation at any given reference time t is
#' performed on **data that would have been available as of t**. See the
#' [archive
#' vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html) for
#' examples.
#' performed on **data that would have been available as of t**. This function
#' is intended for use in accurate backtesting of models; see
#' `vignette("backtesting", package="epipredict")` for a walkthrough.
#'
#' @param .x An [`epi_archive`] or [`grouped_epi_archive`] object. If ungrouped,
#' all data in `x` will be treated as part of a single data group.
11 changes: 7 additions & 4 deletions R/slide.R
@@ -6,12 +6,15 @@
#' as follows:
#'
#' ```
#' # To compute the 7-day trailing average of cases
#' epi_slide(edf, cases_7dav = mean(cases), .window_size = 7)
#' # Create new column `cases_7dm` that contains a 7-day trailing median of cases
#' epi_slide(edf, cases_7dm = median(cases), .window_size = 7)
#' ```
#'
#' This will create the new column `cases_7dav` that contains a 7-day rolling
#' average of values in "cases". See `vignette("epi_df")` for more examples.
#' For two very common use cases, we provide optimized functions that are much
#' faster than `epi_slide`: `epi_slide_mean()` and `epi_slide_sum()`. We
#' recommend using these functions when possible.
#'
#' See `vignette("epi_df")` for more examples.
#'
#' @template basic-slide-params
#' @param .f Function, formula, or missing; together with `...` specifies the
23 changes: 14 additions & 9 deletions README.Rmd
@@ -5,7 +5,7 @@ output: github_document
<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source(file.path("vignettes", "_common.R"))
knitr::opts_chunk$set(
fig.path = "man/figures/README-"
)
@@ -16,7 +16,7 @@ knitr::opts_chunk$set(
The `{epiprocess}` package works with epidemiological time series data and
provides tools to manage, analyze, and process the data in preparation for
modeling. It is designed to work in tandem with
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/), which provides
[epipredict](https://cmu-delphi.github.io/epipredict/), which provides
pre-built epiforecasting models as well as tools to build custom models.
Both packages are designed to lower the barrier to entry and implementation cost
for epidemiological time series analysis and forecasting.
@@ -25,13 +25,18 @@ for epidemiological time series analysis and forecasting.

- `epi_df()` and `epi_archive()`, two data frame classes (that work like a
`{tibble}` with `{dplyr}` verbs) for working with epidemiological time
series data;
series data
- `epi_df` is for working with a snapshot of data at a single point in time
- `epi_archive` is for working with histories of data that changes over time
- one of the most common uses of `epi_archive` is for accurate backtesting of
forecasting models; see `vignette("backtesting", package="epipredict")`
- signal processing tools building on these data structures such as
- `epi_slide()` for sliding window operations;
- `epix_slide()` for sliding window operations on archives;
- `growth_rate()` for computing growth rates;
- `detect_outlr()` for outlier detection;
- `epi_cor()` for computing correlations;
- `epi_slide()` for sliding window operations (aids with feature creation)
- `epix_slide()` for sliding window operations on archives (aids with
backtesting)
- `growth_rate()` for computing growth rates
- `detect_outlr()` for outlier detection
- `epi_cor()` for computing correlations

If you are new to this set of tools, you may be interested in learning through a
book format: [Introduction to Epidemiological
@@ -41,7 +46,7 @@ You may also be interested in:

- `{epidatr}`, for accessing a wide range
of epidemiological data sets, including COVID-19 data, flu data, and more.
- `{rtestim}`, a package for estimating
- [rtestim](https://github.com/dajmcdon/rtestim), a package for estimating
the time-varying reproduction number of an epidemic.

This package is provided by the [Delphi group](https://delphi.cmu.edu/) at
33 changes: 21 additions & 12 deletions README.md
@@ -6,23 +6,32 @@
The `{epiprocess}` package works with epidemiological time series data
and provides tools to manage, analyze, and process the data in
preparation for modeling. It is designed to work in tandem with
[`{epipredict}`](https://cmu-delphi.github.io/epipredict/), which
provides pre-built epiforecasting models and as well as tools to build
custom models. Both packages are designed to lower the barrier to entry
and implementation cost for epidemiological time series analysis and
[epipredict](https://cmu-delphi.github.io/epipredict/), which provides
pre-built epiforecasting models as well as tools to build custom
models. Both packages are designed to lower the barrier to entry and
implementation cost for epidemiological time series analysis and
forecasting.

`{epiprocess}` contains:

- `epi_df()` and `epi_archive()`, two data frame classes (that work
like a `{tibble}` with `{dplyr}` verbs) for working with
epidemiological time series data;
epidemiological time series data
- `epi_df` is for working with a snapshot of data at a single
point in time
- `epi_archive` is for working with histories of data that changes
over time
- one of the most common uses of `epi_archive` is for accurate
backtesting of forecasting models; see `vignette("backtesting",
package="epipredict")`
- signal processing tools building on these data structures such as
- `epi_slide()` for sliding window operations;
- `epix_slide()` for sliding window operations on archives;
- `growth_rate()` for computing growth rates;
- `detect_outlr()` for outlier detection;
- `epi_cor()` for computing correlations;
- `epi_slide()` for sliding window operations (aids with feature
creation)
- `epix_slide()` for sliding window operations on archives (aids
with backtesting)
- `growth_rate()` for computing growth rates
- `detect_outlr()` for outlier detection
- `epi_cor()` for computing correlations

If you are new to this set of tools, you may be interested in learning
through a book format: [Introduction to Epidemiological
@@ -32,8 +41,8 @@ You may also be interested in:

- `{epidatr}`, for accessing a wide range of epidemiological data sets,
including COVID-19 data, flu data, and more.
- `{rtestim}`, a package for estimating the time-varying reproduction
number of an epidemic.
- [rtestim](https://github.com/dajmcdon/rtestim), a package for
estimating the time-varying reproduction number of an epidemic.

This package is provided by the [Delphi group](https://delphi.cmu.edu/)
at Carnegie Mellon University.
2 changes: 1 addition & 1 deletion pkgdown-watch.R → inst/pkgdown-watch.R
@@ -36,7 +36,7 @@ servr::httw(
refs <- grep("man.+R(m?d)?$", files, value = TRUE)
if (length(refs)) {
# Doesn't work for me, so I run it manually.
# pkgdown::build_reference(pkg, preview = FALSE, examples = FALSE, lazy = FALSE)
# pkgdown::build_reference(pkg, preview = FALSE, examples = FALSE, lazy = FALSE) # nolint: commented_code_linter
}

pkgdown <- grep("pkgdown", files, value = TRUE)
11 changes: 7 additions & 4 deletions man/epi_slide.Rd


6 changes: 3 additions & 3 deletions man/epix_slide.Rd


Binary file modified man/figures/README-unnamed-chunk-7-1.png
2 changes: 1 addition & 1 deletion vignettes/compactify.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

## Removing redundant update data to save space
2 changes: 1 addition & 1 deletion vignettes/correlation.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epiprocess` package provides some simple functionality for computing lagged
225 changes: 6 additions & 219 deletions vignettes/epi_archive.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epi_archive` data structure provided by `epiprocess` provides convenient
@@ -25,10 +25,7 @@ In this vignette we will:
- summarize revision behavior in the archive,
- produce snapshots of the data in `epi_df` form,
- merge `epi_archive` objects together,
- run a simple autoregressive forecaster (version-unaware) on a single date,
- slide a simple autoregressive forecaster (version-unaware),
- slide a simple autoregressive forecaster (version-aware),
- compare version-aware and -unaware forecasts.
- provide a link to a vignette on backtesting forecasting models.

## Getting data into `epi_archive` format

@@ -234,221 +231,11 @@ Note that we have used the `sync = "locf"` argument to specify that we want to
synchronize the two datasets on their disjoint revisions by using the last
observation carried forward (LOCF). For more information, see `epix_merge()`.

## Backtesting a simple autoregressive forecaster

One of the most common use cases of the `epi_archive` object is for accurate
model backtesting. In this section we will:

- develop a simple autoregressive forecaster that predicts the next value of the
signal based on the current and past values of the signal itself, and
- demonstrate how to slide this forecaster over the `epi_archive` object to
produce forecasts at a few dates, using version-unaware and -aware
computations,
- compare the two approaches.

Before we get started, we should mention that all the work of constructing the
forecaster that we're about to do is just a simple demo. The `epipredict`
package, which is a companion package to `epiprocess`, offers a suite of
predictive modeling tools that can improve on some of the shortcomings of the
simple AR model we will use here, while also allowing the forecasters to be
built from building blocks. A better version of the function below exists as
`epipredict::arx_forecaster()`. See `vignette("backtesting",
package="epipredict")` for a similar demo, but using the forecasters in
that package.

First, let's define the forecaster. While AR models can be fit in numerous ways
(using base R or external packages), here we define it "by hand" because we
would like to demonstrate a *probabilistic* forecaster: one that outputs not
just a point prediction, but a notion of uncertainty around this. In particular,
our forecaster will output a point prediction along with a 90\% uncertainty
band, represented by predictive quantiles at the 5\% and 95\% levels (lower
and upper endpoints of the uncertainty band).

The function defined below, `prob_ar()`, is our probabilistic AR forecaster. Our
forecasting target will be the `percent_cli` signal. The function is as follows:
## Backtesting forecasting models

```{r}
#' `ahead` - the number of time units ahead to forecast (in this case days),
#' `lags` - the autoregressive lags to use in the model (in this case, the
#' current value and the values from 7 and 14 days ago),
#' `min_train_window` - the minimum number of observations required to fit the
#' forecaster (used to control for edge cases),
#' `lower_level` and `upper_level` - the quantiles to use for the uncertainty
#' bands
#' `symmetrize` - whether to symmetrize the residuals when computing the
#' uncertainty bands,
#' `intercept` - whether to include an intercept in the model,
#' `nonneg` - whether to clip the forecasts at zero.
prob_ar <- function(y, ahead = 7, lags = c(0, 7, 14), min_train_window = 90,
lower_level = 0.05, upper_level = 0.95, symmetrize = TRUE,
intercept = FALSE, nonneg = TRUE) {
# Return NA if insufficient training data
if (length(y) < min_train_window + max(lags) + ahead) {
return(data.frame(point = NA, lower = NA, upper = NA))
}
# Filter down the edge-NAs
y <- y[!is.na(y)]
# Build features and response for the AR model
dat <- do.call(
data.frame,
purrr::map(lags, function(j) lag(y, n = j))
)
names(dat) <- paste0("x", seq_len(ncol(dat)))
if (intercept) dat$x0 <- rep(1, nrow(dat))
dat$y <- lead(y, n = ahead)
# Now fit the AR model and make a prediction
obj <- lm(y ~ . + 0, data = dat)
point <- predict(obj, newdata = tail(dat, 1))
# Compute a band
r <- residuals(obj)
s <- ifelse(symmetrize, -1, NA) # Should the residuals be symmetrized?
q <- quantile(c(r, s * r), probs = c(lower_level, upper_level), na.rm = TRUE)
lower <- point + q[1]
upper <- point + q[2]
# Clip at zero if we need to, then return
if (nonneg) {
point <- max(point, 0)
lower <- max(lower, 0)
upper <- max(upper, 0)
}
return(data.frame(point = point, lower = lower, upper = upper))
}
```
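The symmetrized uncertainty band above (pooling residuals with their negations before taking quantiles, via `s <- -1`) is easy to miss in the R code. As an editor's aside, here is a minimal, dependency-free Python sketch of just that band computation; the function names are hypothetical and not part of either package, and `quantile` mimics R's default type-7 linear interpolation:

```python
def quantile(xs, p):
    # Linear-interpolation quantile of a list (R's default, type 7).
    xs = sorted(xs)
    h = (len(xs) - 1) * p
    i = int(h)
    return xs[i] + (h - i) * (xs[min(i + 1, len(xs) - 1)] - xs[i])

def symmetric_band(point, residuals, lower_level=0.05, upper_level=0.95):
    # Pool residuals with their negations (the s = -1 trick in prob_ar),
    # so the resulting band is symmetric about the point forecast.
    pooled = list(residuals) + [-r for r in residuals]
    return (point + quantile(pooled, lower_level),
            point + quantile(pooled, upper_level))

lo, hi = symmetric_band(10.0, [-1.0, 0.5, 2.0])
# lo and hi sit at equal distances below and above the point forecast
```

Without symmetrization, a skewed residual distribution would yield a band that is lopsided around the point forecast; the pooling step trades that asymmetry for a simpler, centered band.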

To start, let's run this forecaster on a single date (say, the last date in the
archive) to see how it performs. We will use the `epix_as_of()` method to
generate a snapshot of the archive at the last date, and then run the forecaster.

```{r}
edf_latest <- epix_as_of(dv_archive, dv_archive$versions_end) %>%
tidyr::drop_na() %>%
as_tibble()
edf_latest %>%
group_by(geo_value) %>%
summarize(fc = prob_ar(percent_cli), time_value = last(time_value), percent_cli = last(percent_cli))
```

The resulting epi_df now contains three new columns: `fc$point`, `fc$lower`, and
`fc$upper` corresponding to the point forecast, and the lower and upper
endpoints of the 90\% prediction band, respectively. The forecasts look
reasonable and in line with the data. The point forecast is close to the last
observed value, and the uncertainty bands are wide enough to capture the
variability in the data.

Note that the same can be achieved with `epix_slide()` using the following code:

```{r}
dv_archive %>%
group_by(geo_value) %>%
epix_slide(fc = prob_ar(percent_cli), .versions = dv_archive$versions_end)
```

Here we used `epix_slide()` to slide the forecaster over the `epi_archive`, but
by specifying the `.versions` argument to be the last version in the archive, we
effectively ran the forecaster on the last date in the archive.

Now let's go ahead and slide this forecaster in a version-unaware way. First, we
need to snapshot the latest version of the data, and then make a faux archive by
setting `version = time_value`. This has the effect of simulating a data set
that receives the final version updates every day.

```{r}
dv_archive_faux <- edf_latest %>%
mutate(version = time_value) %>%
as_epi_archive()
```
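The "faux archive" trick — treating every observation as if it had been published on its own date — is worth pausing on, since it is what makes the version-unaware comparison possible. A minimal Python illustration of the same idea (plain dicts; the column names mirror the R code above but the data values are made up for illustration):

```python
# A snapshot of finalized data: one record per observation date.
snapshot = [
    {"geo_value": "tx", "time_value": "2021-06-01", "percent_cli": 3.1},
    {"geo_value": "tx", "time_value": "2021-06-02", "percent_cli": 3.4},
]

# Faux archive: pretend each value became available the day it was observed
# (version = time_value), so any "as of" query returns finalized values.
faux_archive = [dict(row, version=row["time_value"]) for row in snapshot]
```

Because every record's `version` equals its `time_value`, sliding over this archive always hands the forecaster finalized data, which is exactly the version-unaware setting we want to contrast with the true archive.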

Now we can slide the forecaster over the faux archive to produce forecasts at a
number of dates in the past, spaced a month apart. Note that we will use the
`percent_cli` signal from the merged archive, a smoothed estimate of COVID-like
doctors' visits. We will forecast 1 through 28 days ahead. We also produce
forecasts in a version-aware way, which simply requires us to use the true
`epi_archive` object instead of the faux one.

```{r}
# Generate a sequence of forecast dates. Starting 3 months into the data, so we have
# enough data to train the AR model.
forecast_dates <- seq(as.Date("2020-10-01"), as.Date("2021-11-01"), by = "1 months")
k_week_ahead <- function(archive, ahead = 7) {
archive %>%
group_by(geo_value) %>%
epix_slide(fc = prob_ar(.data$percent_cli, ahead = ahead), .versions = forecast_dates) %>%
ungroup() %>%
mutate(target_date = version + ahead)
}
aheads <- 1:28
forecasts <- bind_rows(
map(aheads, ~ k_week_ahead(dv_archive_faux, ahead = .x) %>% mutate(version_aware = FALSE)),
map(aheads, ~ k_week_ahead(dv_archive, ahead = .x) %>% mutate(version_aware = TRUE))
)
```

Now let's plot the forecasts at each forecast date at multiple horizons. The
grey line represents the finalized `percent_cli` values, and the colored lines
represent the forecasts. One facet row shows the version-aware forecasts, and
the other shows the version-unaware forecasts. Dashed vertical lines mark the
forecast dates.

```{r}
edf_latest <- epix_as_of(dv_archive, dv_archive$versions_end)
max_version <- max(dv_archive$DT$version)
geo_choose <- "tx"
forecasts_filtered <- forecasts %>%
filter(geo_value == geo_choose) %>%
mutate(time_value = version)
percent_cli_data <- bind_rows(
# Snapshotted data for the version-aware forecasts
map(
forecast_dates,
~ epix_as_of(dv_archive, .x) %>%
mutate(version = .x)
) %>%
bind_rows() %>%
mutate(version_aware = TRUE),
# Latest data for the version-unaware forecasts
edf_latest %>% mutate(version_aware = FALSE)
) %>%
filter(geo_value == geo_choose)
ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
geom_ribbon(aes(ymin = fc$lower, ymax = fc$upper, fill = factor(time_value)), alpha = 0.4) +
geom_line(aes(y = fc$point, color = factor(time_value)), linetype = 2L) +
geom_point(aes(y = fc$point, color = factor(time_value)), size = 0.75) +
geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) +
geom_line(
data = percent_cli_data,
aes(x = time_value, y = percent_cli, color = factor(version)),
inherit.aes = FALSE, na.rm = TRUE
) +
facet_grid(version_aware ~ geo_value, scales = "free") +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
scale_y_continuous(expand = expansion(c(0, 0.05))) +
labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
theme(legend.position = "none")
```

A few points are worth making. First, it's clear that training and making
predictions on finalized data can lead to an overly optimistic sense of accuracy
(see, for example, [McDonald et al.
(2021)](https://www.pnas.org/content/118/51/e2111453118/), and references
therein). Second, we can see that the properly-versioned forecaster is, at some
points in time, more problematic; for example, it massively overpredicts the
peak in both locations in the winter wave of 2020. However, performance is pretty
poor across the board here, whether or not properly-versioned data is used, with
volatile and overconfident forecasts in various places.

As mentioned previously, this forecaster is meant only as a demo of the slide
functionality with some of the most basic forecasting methodology possible. The
[`epipredict`](https://cmu-delphi.github.io/epipredict/) package, which builds
off the data structures and functionality in the current package, is the place
to look for more robust forecasting methodology.
One of the most common use cases of the `epiprocess::epi_archive()` object is for
accurate model backtesting. See `vignette("backtesting", package="epipredict")`
for an in-depth demo, using a pre-built forecaster in that package.

## Attribution

2 changes: 1 addition & 1 deletion vignettes/epi_df.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

The `epi_df` data structure provided by `epiprocess` provides convenient ways to
4 changes: 2 additions & 2 deletions vignettes/epiprocess.Rmd
@@ -10,7 +10,7 @@ editor_options:
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

## Overview
@@ -187,7 +187,7 @@ dv <- archive_cases_dv_subset$DT %>%
dv
```

See `vignette("epi_arcive")` for a more in-depth guide to `epi_archive` objects.
See `vignette("epi_archive")` for a more in-depth guide to `epi_archive` objects.

## Data attribution

2 changes: 1 addition & 1 deletion vignettes/growth_rate.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

A basic way of assessing growth in a signal is to look at its relative change
2 changes: 1 addition & 1 deletion vignettes/outliers.Rmd
@@ -8,7 +8,7 @@ vignette: >
---

```{r, include = FALSE}
source(here::here("vignettes", "_common.R"))
source("_common.R")
```

This vignette describes functionality for detecting and correcting outliers in