
Commit

rewrap long vignette lines that annoy me (no need to check these)
dajmcdon committed Jun 14, 2024
1 parent cb63333 commit 38bf81d
Showing 7 changed files with 739 additions and 198 deletions.
15 changes: 9 additions & 6 deletions vignettes/articles/sliding.Rmd
@@ -27,12 +27,12 @@
library(purrr)
A key function from the epiprocess package is `epi_slide()`, which allows the
user to apply a function or formula-based computation to variables in an
`epi_df` over a running window of `n` time steps (see the following `epiprocess`
vignette to go over the basics of the function:
["Slide a computation over signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)).
The equivalent sliding method for an `epi_archive` object can be called by using
the wrapper function `epix_slide()` (refer to the following vignette for the
basics of the function: ["Work with archive objects and data
revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The
basics of the function:
["Work with archive objects and data revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The
key difference from `epi_slide()` is that it performs version-aware
computations. That is, for each reference time t, the computation uses only the
data that would have been available as of t.
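
To make the version-aware distinction concrete, here is a minimal sketch;
`edf` (an `epi_df`) and `arch` (an `epi_archive`) are placeholder objects, and
the argument names follow the `{epiprocess}` documentation at the time of
writing:

```{r, eval = FALSE}
library(epiprocess)
library(dplyr)

# 7-day trailing mean on an `epi_df` (`edf` is a placeholder object)
edf %>%
  group_by(geo_value) %>%
  epi_slide(cases_7dav = mean(cases), before = 6)

# the version-aware analogue on an `epi_archive`: each computation sees
# only the data that was available as of that reference time
arch %>%
  group_by(geo_value) %>%
  epix_slide(cases_7dav = mean(cases), before = 6)
```
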
@@ -171,7 +171,9 @@
in such forecasting. Including such factors as well as making enhancements such
as correcting for outliers are some improvements one could make to this simple
model.[^1]

[^1]: Note that, despite the above caveats, simple models like this tend to
out-perform many far more complicated models in online Covid forecasting, owing
to those models' high-variance predictions.

### Example using case data from Canada

@@ -327,7 +329,8 @@
fc <- bind_rows(
) %>% pivot_quantiles_wider(fc_.pred_distn)
```

Now we can plot the results on top of the latest case rates. As before, we will
only display the results for FL and CA for simplicity.

```{r plot-ar-asof, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 6}
fc_cafl <- fc %>% filter(geo_value %in% c("ca", "fl"))
258 changes: 198 additions & 60 deletions vignettes/articles/smooth-qr.Rmd

Large diffs are not rendered by default.

302 changes: 246 additions & 56 deletions vignettes/articles/symptom-surveys.Rmd

Large diffs are not rendered by default.

180 changes: 151 additions & 29 deletions vignettes/arx-classifier.Rmd

Large diffs are not rendered by default.

145 changes: 110 additions & 35 deletions vignettes/epipredict.Rmd
@@ -28,51 +28,87 @@
library(epipredict)

# Goals for the package

At a high level, our goal with `{epipredict}` is to make running simple machine
learning / statistical forecasters for epidemiology easy. However, this package
is extremely extensible, and that is part of its utility. Our hope is that it is
easy for users with epi training and some statistics to fit baseline models
while still allowing those with more nuanced statistical understanding to create
complicated specializations using the same framework.

Serving both populations is the main motivation for our efforts, but at the same
time, we have tried hard to make it useful.


## Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box. You
should be able to do a reasonably limited amount of customization on them. Any
serious customization happens with the framework discussed below.

For the basic forecasters, we provide the following (sketched in code just after
the list):

* Baseline flat-line forecaster
* Autoregressive forecaster
* Autoregressive classifier
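
In code, all three are one-liners; a minimal sketch, using the `jhu` `epi_df`
introduced below and eliding the argument defaults:

```{r, eval = FALSE}
flatline_forecaster(jhu, outcome = "death_rate")
arx_forecaster(jhu, outcome = "death_rate",
               predictors = c("case_rate", "death_rate"))
arx_classifier(jhu, outcome = "death_rate",
               predictors = c("case_rate", "death_rate"))
```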

All the forecasters we provide are built on our framework. So we will use these
basic models to illustrate its flexibility.

## Forecasting framework

Our framework for creating custom forecasters views the prediction task as a set
of modular components. There are four types of components (see the sketch after
the list):

1. Preprocessor: make transformations to the data before model training
2. Trainer: train a model on data, resulting in a fitted model object
3. Predictor: make predictions, using a fitted model object and processed test data
4. Postprocessor: manipulate or transform the predictions before returning
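
To preview how the four pieces fit together in `{epipredict}` — a sketch only;
the full version appears in the "Inner workings" section below:

```{r, eval = FALSE}
r <- epi_recipe(jhu) %>%                    # 1. preprocessor
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()

f <- frosting() %>%                         # 4. postprocessor
  layer_predict() %>%
  layer_naomit(.pred)

wf <- epi_workflow(r, parsnip::linear_reg(), f) %>%  # 2. trainer
  fit(jhu)

predict(wf, new_data = get_test_data(r, jhu))        # 3. predictor
```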

Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially
the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot
of overlap. This is by design, and is in fact a feature. The truth is that
`{epipredict}` is a wrapper around much that is contained in these packages.
Therefore, if you want something from this -verse, it should "just work" (we
hope).

The reason for the overlap is that `{workflows}` *already implements* the first
three steps. And it does this very well. However, it is missing the
postprocessing stage and currently has no plans for such an implementation. And
this feature is important. The baseline forecaster we provide *requires*
postprocessing. Anything more complicated needs this as well.

The second omission from `{tidymodels}` is support for panel data. Besides
epidemiological data, economics, psychology, sociology, and many other areas
frequently deal with data of this type. So the framework behind
`{epipredict}` implements this. In principle, this has nothing to do with
epidemiology, and one could simply use this package as a solution for the
missing functionality in `{tidymodels}`. Again, this should "just work".

All of the *panel data* functionality is implemented through the `epi_df` data
type in the companion [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/)
package. There is much more to see there, but for the moment, it's enough to
look at a simple one:

```{r epidf}
jhu <- case_death_rate_subset
jhu
```

This data is built into the package and contains the measured variables
`case_rate` and `death_rate` for COVID-19 at the daily level for each US state
for the year 2021. The "panel" part is because we have repeated measurements
across a number of locations.

The `epi_df` encodes the time stamp as `time_value` and the `key` as
`geo_value`. While these two names are required, the values don't actually need
to represent such objects. Additional keys are also supported (like age group,
ethnicity, taxonomy, etc.).

The `epi_df` also contains some metadata that describes the keys as well as the
vintage of the data. It's possible that data collected at different times for
the *same set* of `geo_value`'s and `time_value`'s could actually be different.
For more details, see
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/articles/epiprocess.html).
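
For instance, a regular tibble with `geo_value` and `time_value` columns can be
converted with `as_epi_df()`; a minimal sketch with toy data (the `as_of`
vintage tag here is illustrative):

```{r, eval = FALSE}
library(epiprocess)

toy <- tibble::tibble(
  geo_value = rep(c("ca", "fl"), each = 3),
  time_value = rep(as.Date("2021-01-01") + 0:2, times = 2),
  cases = c(100, 110, 120, 80, 85, 90)
)

# tag the vintage: "this is what the data looked like on 2021-06-01"
as_epi_df(toy, as_of = as.Date("2021-06-01"))
```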



@@ -85,21 +121,32 @@
preprocessing, training, and prediction, bound together, through a package called
`{workflows}`. We built `{epipredict}` on top of that setup. In this way, you CAN
use almost everything they provide.

* However, `{workflows}` doesn't do postprocessing. And nothing in the -verse
handles _panel data_.

* The tidy-team doesn't have plans to do either of these things. (We checked).

* There are two packages that do _time series_ built on `{tidymodels}`, but it's
"basic" time series: 1-step AR models, exponential smoothing, STL decomposition, etc.[^2] Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.
"basic" time series: 1-step AR models, exponential smoothing, STL decomposition,
etc.[^2] Our group has not prioritized these sorts of models for epidemic
forecasting, but one could also integrate these methods into our framework.

[^2]: These are [`{timetk}`](https://business-science.github.io/timetk/index.html)
and [`{modeltime}`](https://business-science.github.io/modeltime/). There
are *lots* of useful methods there that can be used to do fairly complex machine
learning methodology, though not directly for panel data and not for direct
prediction of future targets.

# Show me the basics

We start with the `jhu` data displayed above. One of the "canned" forecasters we
provide is an autoregressive forecaster with (or without) covariates that
*directly* trains on the response. This is in contrast to a typical "iterative"
AR model that trains to predict one-step-ahead, and then plugs in the
predictions to "leverage up" to longer horizons.
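
Schematically (our notation, not the package's), with horizon $h$ and lags of 0
and 7 days, the direct strategy estimates

\[
\hat{y}_{t+h} = \hat\beta_0 + \hat\beta_1 y_{t} + \hat\beta_2 y_{t-7},
\]

by regressing $y_{s+h}$ on $(y_s, y_{s-7})$ over historical times $s$, whereas
the iterative strategy fits a one-step model $\hat{y}_{t+1} = f(y_t, y_{t-7})$
and applies it $h$ times, substituting each prediction back in as if it were
data.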

We'll estimate the model jointly across all locations using only the most
recent 30 days.

```{r demo-workflow}
jhu <- jhu %>% filter(time_value >= max(time_value) - 30)
@@ -112,16 +159,25 @@
out <- arx_forecaster(

The `out` object has two components:

1. The predictions, which are just another `epi_df`. This contains the prediction
for each location along with additional columns. By default, these are a 90%
predictive interval, the `forecast_date` (the date on which the forecast was
putatively made) and the `target_date` (the date for which the forecast is being
made).
```{r}
out$predictions
```
2. A list object of class `epi_workflow`. This object encapsulates all the
instructions necessary to create the prediction. More details on this below.
```{r}
out$epi_workflow
```

By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead,
using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), 1
week back and 2 weeks back. The predictors and outcome can be changed directly.
The rest of the defaults are encapsulated into a list of arguments. This list is
produced by `arx_args_list()`.
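
For example, a sketch of recreating those defaults explicitly (the argument
names and default values here follow the documentation; see `?arx_args_list`):

```{r, eval = FALSE}
arx_args_list(
  lags = c(0L, 7L, 14L),           # today, 1 week back, 2 weeks back
  ahead = 7L,                      # predict 1 week ahead
  quantile_levels = c(0.05, 0.95)  # 90% predictive interval
)
```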

## Simple adjustments

@@ -143,11 +199,19 @@
out2week <- arx_forecaster(
)
```
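
The chunk above is truncated in this diff view; a hedged reconstruction of the
kind of call it makes (the exact lag values are illustrative):

```{r, eval = FALSE}
out2week <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),  # per-predictor lags
    ahead = 14                                       # 2 weeks ahead
  )
)
```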

Here, we've used different lags on the `case_rate` and are now predicting 2
weeks ahead. This example also illustrates a major difficulty with the
"iterative" versions of AR models. This model doesn't produce forecasts for
`case_rate`, and so, would not have data to "plug in" for the necessary
lags.[^1]

[^1]: An obvious fix is to instead use a VAR and predict both, but this would
likely increase the variance of the model, and therefore, may lead to less
accurate forecasts for the variable of interest.

Another property of the basic model is the predictive interval. We describe this
in more detail in a different vignette, but it is easy to request multiple
quantiles.

```{r differential-levels}
out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
@@ -157,7 +221,11 @@
)
```
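
This chunk is also truncated here; a sketch of a call that requests 23 quantile
levels (the exact set is illustrative):

```{r, eval = FALSE}
out_q <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
  )
)
```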

The column `.pred_distn` in the `predictions` object is actually a
"distribution", here parameterized by its quantiles. For this default
forecaster, these are created using the quantiles of the residuals of the
predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can
grab a particular quantile,

```{r q1}
head(quantile(out_q$predictions$.pred_distn, p = .4))
@@ -173,7 +241,8 @@
out_q$predictions %>%
unnest(.pred_distn) # then unnest it
```

Additional simple adjustments to the basic forecaster can be made using the
function:

```{r, eval = FALSE}
arx_args_list(
@@ -186,9 +255,12 @@

## Changing the engine

So far, our forecasts have been produced using simple linear regression. But
this is not the only way to estimate such a model. The `trainer` argument
determines the type of model we want. This takes a
[`{parsnip}`](https://parsnip.tidymodels.org) model. The default is linear
regression, but we could instead use a random forest with the `{ranger}`
package:

```{r ranger, warning = FALSE}
out_rf <- arx_forecaster(
@@ -215,7 +287,9 @@
out_qr <- arx_forecaster(
)
```
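
Since the chunk above is truncated as well, here is a hedged sketch of both
engine swaps (`rand_forest()` is from `{parsnip}` and needs the `{ranger}`
package installed; `quantile_reg()` is provided by `{epipredict}`):

```{r, eval = FALSE}
out_rf <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  trainer = parsnip::rand_forest(mode = "regression")
)

out_qr <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  trainer = quantile_reg()
)
```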

FWIW, this last case (using quantile regression) is not far from what the
Delphi production forecast team used for its Covid forecasts over the past few
years.

## Inner workings

Expand Down Expand Up @@ -367,7 +441,8 @@ But ideally, a user could create their own forecasters by building up the
components we provide. In other vignettes, we try to walk through some of these
customizations.

To illustrate everything above, here is (roughly) the code for the
`flatline_forecaster()` applied to the `case_rate`.

```{r}
r <- epi_recipe(jhu) %>%
  # … (remainder of the chunk not shown in this diff)
```
7 changes: 5 additions & 2 deletions vignettes/panel-data.Rmd
@@ -239,7 +239,9 @@
summary(extract_fit_engine(wf_linreg))
```

This output tells us the coefficients of the fitted model; for instance,
the estimated intercept is $\widehat{\alpha}_0 =$
`r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for
$y_{tijk}$ is
$\widehat\alpha_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`.
The summary also tells us that all estimated coefficients are significantly
different from zero. Extracting the 95% confidence intervals for the
@@ -315,7 +317,8 @@
defined as follows:
\end{aligned}
\]

where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in
location $i$, age group $j$ with education quality $k$),
$x_{tijk}$ is the 2-year median income (proportion) at time $t$, and
$z_{tijk}$ is the number of graduates (proportion) at time $t$.

