
Commit

rewrap long vignette lines that annoy me (no need to check these)
dajmcdon committed Jun 14, 2024
1 parent cb63333 commit 38bf81d
Showing 7 changed files with 739 additions and 198 deletions.
15 changes: 9 additions & 6 deletions vignettes/articles/sliding.Rmd
@@ -27,12 +27,12 @@
library(purrr)
A key function from the epiprocess package is `epi_slide()`, which allows the
user to apply a function or formula-based computation to variables in an
`epi_df` over a running window of `n` time steps (see the following `epiprocess`
vignette to go over the basics of the function:
["Slide a computation over signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)).
The equivalent sliding method for an `epi_archive` object can be called by using
the wrapper function `epix_slide()` (refer to the following vignette for the
basics of the function: ["Work with archive objects and data
revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The
basics of the function:
["Work with archive objects and data revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The
key difference from `epi_slide()` is that it performs version-aware
computations. That is, for each reference time t, the computation uses only the
data that would have been available as of t.
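
To make the version-aware distinction concrete, here is a minimal sketch;
`edf` (an `epi_df`) and `arch` (an `epi_archive`) are placeholder objects, and
the argument names follow the `{epiprocess}` documentation at the time of
writing:

```{r, eval = FALSE}
library(epiprocess)
library(dplyr)

# 7-day trailing mean on an `epi_df` (`edf` is a placeholder object)
edf %>%
  group_by(geo_value) %>%
  epi_slide(cases_7dav = mean(cases), before = 6)

# the version-aware analogue on an `epi_archive`: each computation sees
# only the data that was available as of that reference time
arch %>%
  group_by(geo_value) %>%
  epix_slide(cases_7dav = mean(cases), before = 6)
```
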
@@ -171,7 +171,9 @@
in such forecasting. Including such factors as well as making enhancements such
as correcting for outliers are some improvements one could make to this simple
model.[^1]

[^1]: Note that, despite the above caveats, simple models like this tend to
out-perform many far more complicated models in online Covid forecasting, owing
to those models' high-variance predictions.

### Example using case data from Canada

@@ -327,7 +329,8 @@
fc <- bind_rows(
) %>% pivot_quantiles_wider(fc_.pred_distn)
```

Now we can plot the results on top of the latest case rates. As before, we will
only display the results for FL and CA for simplicity.

```{r plot-ar-asof, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 6}
fc_cafl <- fc %>% filter(geo_value %in% c("ca", "fl"))
258 changes: 198 additions & 60 deletions vignettes/articles/smooth-qr.Rmd

Large diffs are not rendered by default.

302 changes: 246 additions & 56 deletions vignettes/articles/symptom-surveys.Rmd

Large diffs are not rendered by default.

180 changes: 151 additions & 29 deletions vignettes/arx-classifier.Rmd

Large diffs are not rendered by default.

145 changes: 110 additions & 35 deletions vignettes/epipredict.Rmd
@@ -28,51 +28,87 @@
library(epipredict)

# Goals for the package

At a high level, our goal with `{epipredict}` is to make running simple machine
learning / statistical forecasters for epidemiology easy. However, this package
is extremely extensible, and that is part of its utility. Our hope is that it is
easy for users with epi training and some statistics to fit baseline models
while still allowing those with more nuanced statistical understanding to create
complicated specializations using the same framework.

Serving both populations is the main motivation for our efforts, but at the same
time, we have tried hard to make it useful.


## Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box. You
should be able to do a reasonably limited amount of customization on them. Any
serious customization happens with the framework discussed below.

For the basic forecasters, we provide the following (sketched in code just after
the list):

* Baseline flat-line forecaster
* Autoregressive forecaster
* Autoregressive classifier
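
In code, all three are one-liners; a minimal sketch, using the `jhu` `epi_df`
introduced below and eliding the argument defaults:

```{r, eval = FALSE}
flatline_forecaster(jhu, outcome = "death_rate")
arx_forecaster(jhu, outcome = "death_rate",
               predictors = c("case_rate", "death_rate"))
arx_classifier(jhu, outcome = "death_rate",
               predictors = c("case_rate", "death_rate"))
```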

All the forecasters we provide are built on our framework. So we will use these
basic models to illustrate its flexibility.

## Forecasting framework

Our framework for creating custom forecasters views the prediction task as a set
of modular components. There are four types of components (see the sketch after
the list):

1. Preprocessor: make transformations to the data before model training
2. Trainer: train a model on data, resulting in a fitted model object
3. Predictor: make predictions, using a fitted model object and processed test data
4. Postprocessor: manipulate or transform the predictions before returning
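
To preview how the four pieces fit together in `{epipredict}` — a sketch only;
the full version appears in the "Inner workings" section below:

```{r, eval = FALSE}
r <- epi_recipe(jhu) %>%                    # 1. preprocessor
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()

f <- frosting() %>%                         # 4. postprocessor
  layer_predict() %>%
  layer_naomit(.pred)

wf <- epi_workflow(r, parsnip::linear_reg(), f) %>%  # 2. trainer
  fit(jhu)

predict(wf, new_data = get_test_data(r, jhu))        # 3. predictor
```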

Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially
the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot
of overlap. This is by design, and is in fact a feature. The truth is that
`{epipredict}` is a wrapper around much that is contained in these packages.
Therefore, if you want something from this -verse, it should "just work" (we
hope).

The reason for the overlap is that `{workflows}` *already implements* the first
three steps. And it does this very well. However, it is missing the
postprocessing stage and currently has no plans for such an implementation. And
this feature is important. The baseline forecaster we provide *requires*
postprocessing. Anything more complicated needs this as well.

The second omission from `{tidymodels}` is support for panel data. Besides
epidemiological data, economics, psychology, sociology, and many other areas
frequently deal with data of this type. So the framework behind
`{epipredict}` implements this. In principle, this has nothing to do with
epidemiology, and one could simply use this package as a solution for the
missing functionality in `{tidymodels}`. Again, this should "just work".

All of the *panel data* functionality is implemented through the `epi_df` data
type in the companion [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/)
package. There is much more to see there, but for the moment, it's enough to
look at a simple one:

```{r epidf}
jhu <- case_death_rate_subset
jhu
```

This data is built into the package and contains the measured variables
`case_rate` and `death_rate` for COVID-19 at the daily level for each US state
for the year 2021. The "panel" part is because we have repeated measurements
across a number of locations.

The `epi_df` encodes the time stamp as `time_value` and the `key` as
`geo_value`. While these two names are required, the values don't actually need
to represent such objects. Additional keys are also supported (like age group,
ethnicity, taxonomy, etc.).

The `epi_df` also contains some metadata that describes the keys as well as the
vintage of the data. It's possible that data collected at different times for
the *same set* of `geo_value`'s and `time_value`'s could actually be different.
For more details, see
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/articles/epiprocess.html).
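
For instance, a regular tibble with `geo_value` and `time_value` columns can be
converted with `as_epi_df()`; a minimal sketch with toy data (the `as_of`
vintage tag here is illustrative):

```{r, eval = FALSE}
library(epiprocess)

toy <- tibble::tibble(
  geo_value = rep(c("ca", "fl"), each = 3),
  time_value = rep(as.Date("2021-01-01") + 0:2, times = 2),
  cases = c(100, 110, 120, 80, 85, 90)
)

# tag the vintage: "this is what the data looked like on 2021-06-01"
as_epi_df(toy, as_of = as.Date("2021-06-01"))
```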



@@ -85,21 +121,32 @@
preprocessing, training, and prediction, bound together, through a package called
`{workflows}`. We built `{epipredict}` on top of that setup. In this way, you CAN
use almost everything they provide.

* However, `{workflows}` doesn't do postprocessing. And nothing in the -verse
handles _panel data_.

* The tidy-team doesn't have plans to do either of these things. (We checked).

* There are two packages that do _time series_ built on `{tidymodels}`, but it's
"basic" time series: 1-step AR models, exponential smoothing, STL decomposition, etc.[^2] Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.
"basic" time series: 1-step AR models, exponential smoothing, STL decomposition,
etc.[^2] Our group has not prioritized these sorts of models for epidemic
forecasting, but one could also integrate these methods into our framework.

[^2]: These are [`{timetk}`](https://business-science.github.io/timetk/index.html)
and [`{modeltime}`](https://business-science.github.io/modeltime/). There
are *lots* of useful methods there that can be used to do fairly complex machine
learning methodology, though not directly for panel data and not for direct
prediction of future targets.

# Show me the basics

We start with the `jhu` data displayed above. One of the "canned" forecasters we
provide is an autoregressive forecaster with (or without) covariates that
*directly* trains on the response. This is in contrast to a typical "iterative"
AR model that trains to predict one-step-ahead, and then plugs in the
predictions to "leverage up" to longer horizons.
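
Schematically (our notation, not the package's), with horizon $h$ and lags of 0
and 7 days, the direct strategy estimates

\[
\hat{y}_{t+h} = \hat\beta_0 + \hat\beta_1 y_{t} + \hat\beta_2 y_{t-7},
\]

by regressing $y_{s+h}$ on $(y_s, y_{s-7})$ over historical times $s$, whereas
the iterative strategy fits a one-step model $\hat{y}_{t+1} = f(y_t, y_{t-7})$
and applies it $h$ times, substituting each prediction back in as if it were
data.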

We'll estimate the model jointly across all locations using only the most
recent 30 days.

```{r demo-workflow}
jhu <- jhu %>% filter(time_value >= max(time_value) - 30)
@@ -112,16 +159,25 @@
out <- arx_forecaster(

The `out` object has two components:

1. The predictions, which are just another `epi_df`. This contains the prediction
for each location along with additional columns. By default, these are a 90%
predictive interval, the `forecast_date` (the date on which the forecast was
putatively made) and the `target_date` (the date for which the forecast is being
made).
```{r}
out$predictions
```
2. A list object of class `epi_workflow`. This object encapsulates all the
instructions necessary to create the prediction. More details on this below.
```{r}
out$epi_workflow
```

By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead,
using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), 1
week back and 2 weeks back. The predictors and outcome can be changed directly.
The rest of the defaults are encapsulated into a list of arguments. This list is
produced by `arx_args_list()`.
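
For example, a sketch of recreating those defaults explicitly (the argument
names and default values here follow the documentation; see `?arx_args_list`):

```{r, eval = FALSE}
arx_args_list(
  lags = c(0L, 7L, 14L),           # today, 1 week back, 2 weeks back
  ahead = 7L,                      # predict 1 week ahead
  quantile_levels = c(0.05, 0.95)  # 90% predictive interval
)
```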

## Simple adjustments

@@ -143,11 +199,19 @@
out2week <- arx_forecaster(
)
```
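
The chunk above is truncated in this diff view; a hedged reconstruction of the
kind of call it makes (the exact lag values are illustrative):

```{r, eval = FALSE}
out2week <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),  # per-predictor lags
    ahead = 14                                       # 2 weeks ahead
  )
)
```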

Here, we've used different lags on the `case_rate` and are now predicting 2
weeks ahead. This example also illustrates a major difficulty with the
"iterative" versions of AR models. This model doesn't produce forecasts for
`case_rate`, and so, would not have data to "plug in" for the necessary
lags.[^1]

[^1]: An obvious fix is to instead use a VAR and predict both, but this would
likely increase the variance of the model, and therefore, may lead to less
accurate forecasts for the variable of interest.

Another property of the basic model is the predictive interval. We describe this
in more detail in a different vignette, but it is easy to request multiple
quantiles.

```{r differential-levels}
out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
@@ -157,7 +221,11 @@
)
```
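
This chunk is also truncated here; a sketch of a call that requests 23 quantile
levels (the exact set is illustrative):

```{r, eval = FALSE}
out_q <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
  )
)
```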

The column `.pred_distn` in the `predictions` object is actually a
"distribution", here parameterized by its quantiles. For this default
forecaster, these are created using the quantiles of the residuals of the
predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can
grab a particular quantile,

```{r q1}
head(quantile(out_q$predictions$.pred_distn, p = .4))
@@ -173,7 +241,8 @@
out_q$predictions %>%
unnest(.pred_distn) # then unnest it
```

Additional simple adjustments to the basic forecaster can be made using the
function:

```{r, eval = FALSE}
arx_args_list(
@@ -186,9 +255,12 @@

## Changing the engine

So far, our forecasts have been produced using simple linear regression. But
this is not the only way to estimate such a model. The `trainer` argument
determines the type of model we want. This takes a
[`{parsnip}`](https://parsnip.tidymodels.org) model. The default is linear
regression, but we could instead use a random forest with the `{ranger}`
package:

```{r ranger, warning = FALSE}
out_rf <- arx_forecaster(
@@ -215,7 +287,9 @@
out_qr <- arx_forecaster(
)
```
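
Since the chunk above is truncated as well, here is a hedged sketch of both
engine swaps (`rand_forest()` is from `{parsnip}` and needs the `{ranger}`
package installed; `quantile_reg()` is provided by `{epipredict}`):

```{r, eval = FALSE}
out_rf <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  trainer = parsnip::rand_forest(mode = "regression")
)

out_qr <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  trainer = quantile_reg()
)
```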

FWIW, this last case (using quantile regression) is not far from what the
Delphi production forecast team used for its Covid forecasts over the past few
years.

## Inner workings

Expand Down Expand Up @@ -367,7 +441,8 @@ But ideally, a user could create their own forecasters by building up the
components we provide. In other vignettes, we try to walk through some of these
customizations.

To illustrate everything above, here is (roughly) the code for the
`flatline_forecaster()` applied to the `case_rate`.

```{r}
r <- epi_recipe(jhu) %>%
  # … (remainder of the chunk not shown in this diff)
```
7 changes: 5 additions & 2 deletions vignettes/panel-data.Rmd
@@ -239,7 +239,9 @@
summary(extract_fit_engine(wf_linreg))
```

This output tells us the coefficients of the fitted model; for instance,
the estimated intercept is $\widehat{\alpha}_0 =$
`r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for
$y_{tijk}$ is
$\widehat\alpha_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`.
The summary also tells us that all estimated coefficients are significantly
different from zero. Extracting the 95% confidence intervals for the
@@ -315,7 +317,8 @@
defined as follows:
\end{aligned}
\]

where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in
location $i$, age group $j$ with education quality $k$),
$x_{tijk}$ is the 2-year median income (proportion) at time $t$, and
$z_{tijk}$ is the number of graduates (proportion) at time $t$.

