diff --git a/vignettes/articles/sliding.Rmd b/vignettes/articles/sliding.Rmd index eaaa95d38..2d0ea3d86 100644 --- a/vignettes/articles/sliding.Rmd +++ b/vignettes/articles/sliding.Rmd @@ -27,12 +27,12 @@ library(purrr) A key function from the epiprocess package is `epi_slide()`, which allows the user to apply a function or formula-based computation over variables in an `epi_df` over a running window of `n` time steps (see the following `epiprocess` -vignette to go over the basics of the function: ["Slide a computation over -signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)). +vignette to go over the basics of the function: +["Slide a computation over signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)). The equivalent sliding method for an `epi_archive` object can be called by using the wrapper function `epix_slide()` (refer to the following vignette for the -basics of the function: ["Work with archive objects and data -revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The +basics of the function: +["Work with archive objects and data revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The key difference from `epi_slide()` is that it performs version-aware computations. That is, the function only uses data that would have been available as of time t for that reference time. @@ -171,7 +171,9 @@ in such forecasting. Including such factors as well as making enhancements such as correcting for outliers are some improvements one could make to this simple model.[^1] -[^1]: Note that, despite the above caveats, simple models like this tend to out-perform many far more complicated models in the online Covid forecasting due to those models high variance predictions. +[^1]: Note that, despite the above caveats, simple models like this tend to +out-perform many far more complicated models in online Covid forecasting due to +those models' high-variance predictions. ### Example using case data from Canada @@ -327,7 +329,8 @@ fc <- bind_rows( ) %>% pivot_quantiles_wider(fc_.pred_distn) ``` -Now we can plot the results on top of the latest case rates. As before, we will only display and focus on the results for FL and CA for simplicity. +Now we can plot the results on top of the latest case rates. As before, we will +only display and focus on the results for FL and CA for simplicity. ```{r plot-ar-asof, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 6} fc_cafl <- fc %>% filter(geo_value %in% c("ca", "fl")) diff --git a/vignettes/articles/smooth-qr.Rmd b/vignettes/articles/smooth-qr.Rmd index 95cfe083f..047f4f178 100644 --- a/vignettes/articles/smooth-qr.Rmd +++ b/vignettes/articles/smooth-qr.Rmd @@ -19,13 +19,23 @@ knitr::opts_chunk$set( # Introducing smooth quantile regression -Whereas other time-series forecasting examples in this package have used (direct) models for single horizons, in multi-period forecasting, the goal is to (directly) forecast several horizons simultaneously. This is useful in epidemiological applications where decisions are based on the trend of a signal. - -The idea underlying smooth quantile regression is that set forecast targets can be approximated by a smooth curve. This novel approach from [Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723) enforces smoothness across the horizons and can be applied to point estimation by regression or interval prediction by quantile regression. Our focus in this vignette is the latter.
+Whereas other time-series forecasting examples in this package have used +(direct) models for single horizons, in multi-period forecasting, the goal is to +(directly) forecast several horizons simultaneously. This is useful in +epidemiological applications where decisions are based on the trend of a signal. + +The idea underlying smooth quantile regression is that the set of forecast targets +can be approximated by a smooth curve. This novel approach from +[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723) +enforces smoothness across the +horizons and can be applied to point estimation by regression or interval +prediction by quantile regression. Our focus in this vignette is the latter. # Built-in function for smooth quantile regression and its parameters -The built-in smooth quantile regression function, `smooth_quantile_reg()` provides a model specification for smooth quantile regression that works under the tidymodels framework. It has the following parameters and default values: +The built-in smooth quantile regression function, `smooth_quantile_reg()`, +provides a model specification for smooth quantile regression that works under +the tidymodels framework. It has the following parameters and default values: ```{r, eval = FALSE} smooth_quantile_reg( @@ -39,19 +49,37 @@ smooth_quantile_reg( For smooth quantile regression, the type of model or `mode` is regression. -The only `engine` that is currently supported is `smooth_qr()` from the [`smoothqr` package](https://dajmcdon.github.io/smoothqr/). +The only `engine` that is currently supported is `smooth_qr()` from the +[`smoothqr` package](https://dajmcdon.github.io/smoothqr/). -The `outcome_locations` indicate the multiple horizon (ie. ahead) values. These should be specified by the user. +The `outcome_locations` indicate the multiple horizon (i.e., ahead) values. These +should be specified by the user. -The `quantile_levels` parameter is a vector of values that indicates the quantiles to be estimated. The default is the median (0.5 quantile). +The `quantile_levels` parameter is a vector of values that indicates the +quantiles to be estimated. The default is the median (0.5 quantile). -The `degree` parameter indicates the degree of the polynomials used for smoothing of the response. It should be no more than the number of aheads. If the degree is precisely equal to the number of aheads, then there is no smoothing. To better understand this parameter and how it works, we should look to its origins and how it is used in the model. +The `degree` parameter indicates the degree of the polynomials used for +smoothing of the response. It should be no more than the number of aheads. If +the degree is precisely equal to the number of aheads, then there is no +smoothing. To better understand this parameter and how it works, we should look +to its origins and how it is used in the model. # Model form -Smooth quantile regression is linear auto-regressive, with the key feature being a transformation that forces the coefficients to satisfy a smoothing constraint. The purpose of this is for each model coefficient to be a smooth function of ahead values, and so each such coefficient is set to be a linear combination of smooth basis functions (such as a spline or a polynomial). - -The `degree` parameter controls the number of these polynomials used. It should be no greater than the number of responses. This is a tuning parameter, and so it can be chosen by performing a grid search with cross-validation.
Intuitively, $d = 1$ corresponds to the constant model, $d = 2$ gives straight line forecasts, while $d = 3$ gives quadratic forecasts. Since a degree of 3 was found to work well in the tested applications (see Section 9 of [Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723)), it is the default value. +Smooth quantile regression is linear auto-regressive, with the key feature being +a transformation that forces the coefficients to satisfy a smoothing constraint. +The purpose of this is for each model coefficient to be a smooth function of +ahead values, and so each such coefficient is set to be a linear combination of +smooth basis functions (such as a spline or a polynomial). + +The `degree` parameter controls the number of these polynomials used. It should +be no greater than the number of responses. This is a tuning parameter, and so +it can be chosen by performing a grid search with cross-validation. Intuitively, +$d = 1$ corresponds to the constant model, $d = 2$ gives straight line +forecasts, while $d = 3$ gives quadratic forecasts. Since a degree of 3 was +found to work well in the tested applications (see Section 9 of +[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723)), +it is the default value. # Demonstration of smooth quantile regression @@ -63,13 +91,18 @@ library(ggplot2) theme_set(theme_bw()) ``` -We will now apply smooth quantile regression on the real data used for COVID-19 forecasting. The built-in dataset we will use is a subset of JHU daily data on state cases and deaths. This sample data ranges from Dec. 31, 2020 to Dec. 31, 2021. +We will now apply smooth quantile regression on the real data used for COVID-19 +forecasting. The built-in dataset we will use is a subset of JHU daily data on +state cases and deaths. This sample data ranges from Dec. 31, 2020 to +Dec. 31, 2021. ```{r} edf <- case_death_rate_subset ``` -We will set the forecast date to be November 30, 2021 so that we can produce forecasts for target dates of 1 to 28 days ahead. We construct our test data, `tedf` from the days beyond this. +We will set the forecast date to be November 30, 2021 so that we can produce +forecasts for target dates of 1 to 28 days ahead. We construct our test data, +`tedf` from the days beyond this. ```{r} fd <- as.Date("2021-11-30") @@ -77,40 +110,64 @@ fd <- as.Date("2021-11-30") tedf <- edf %>% filter(time_value >= fd) ``` -We will use the most recent 3 months worth of data up to the forecast date for training. +We will use the most recent 3 months worth of data up to the forecast date for +training. ```{r} edf <- edf %>% filter(time_value < fd, time_value >= fd - 90L) ``` -And for plotting our focus will be on a subset of two states - California and Utah. +And for plotting our focus will be on a subset of two states - California and +Utah. ```{r} geos <- c("ut", "ca") ``` -Suppose that our goal with this data is to predict COVID-19 death rates at several horizons for each state. On day $t$, we want to predict new deaths $y$ that are $a = 1,\dots, 28$ days ahead at locations $j$ using the death rates from today, 1 week ago, and 2 weeks ago. So for each location, we'll predict the median (0.5 quantile) for each of the target dates by using +Suppose that our goal with this data is to predict COVID-19 death rates at +several horizons for each state. On day $t$, we want to predict new deaths $y$ +that are $a = 1,\dots, 28$ days ahead at locations $j$ using the death rates +from today, 1 week ago, and 2 weeks ago. 
So for each location, we'll predict the +median (0.5 quantile) for each of the target dates by using $$ \hat{y}_{j}(t+a) = \alpha(a) + \sum_{l = 0}^2 \beta_{l}(a) y_{j}(t - 7l) $$ -where $\beta_{l}(a) = \sum_{i=1}^d \theta_{il} h_i(a)$ is the smoothing constraint where ${h_1(a), \dots, h_d(a)}$ are the set of smooth basis functions and $d$ is a hyperparameter that manages the flexibility of $\beta_{l}(a)$. Remember that the goal is to have each $\beta_{l}(a)$ to be a smooth function of the aheads and that is achieved through imposing the smoothing constraint. - -Note that this model is intended to be simple and straightforward. Our only modification to this model is to add case rates as another predictive feature (we will leave it to the reader to incorporate additional features beyond this and the historical response values). We can update the basic model incorporate the $k = 2$ predictive features of case and death rates for each location j, $x_j(t) = (x_{j1}(t), x_{j2}(t))$ as follows: +where $\beta_{l}(a) = \sum_{i=1}^d \theta_{il} h_i(a)$ is the smoothing +constraint, where ${h_1(a), \dots, h_d(a)}$ are the set of smooth basis functions +and $d$ is a hyperparameter that manages the flexibility of $\beta_{l}(a)$. +Remember that the goal is to have each $\beta_{l}(a)$ be a smooth function of +the aheads, and that is achieved through imposing the smoothing constraint. + +Note that this model is intended to be simple and straightforward. Our only +modification to this model is to add case rates as another predictive feature +(we will leave it to the reader to incorporate additional features beyond this +and the historical response values). We can update the basic model to incorporate +the $k = 2$ predictive features of case and death rates for each location $j$, +$x_j(t) = (x_{j1}(t), x_{j2}(t))$ as follows: $$ \hat{y}_{j}(t+a) = \alpha(a) + \sum_{k = 1}^2 \sum_{l = 0}^2 \beta_{kl}(a) x_{jk}(t - 7l) $$ where $\beta_{kl}(a) = \sum_{i=1}^d \theta_{ikl} h_i(a)$. -Now, we will create our own forecaster from scratch by building up an `epi_workflow` (there is no canned forecaster that is currently available). Building our own forecaster allows for customization and control over the pre-processing and post-processing actions we wish to take. +Now, we will create our own forecaster from scratch by building up an +`epi_workflow` (there is no canned forecaster that is currently available). +Building our own forecaster allows for customization and control over the +pre-processing and post-processing actions we wish to take. -The pre-processing steps we take in our `epi_recipe` are simply to lag the predictor (by 0, 7, and 14 days) and lead the response by the multiple aheads specified by the function user. +The pre-processing steps we take in our `epi_recipe` are simply to lag the +predictor (by 0, 7, and 14 days) and lead the response by the multiple aheads +specified by the function user. -The post-processing layers we add to our `frosting` are nearly as simple. We first predict, unnest the prediction list-cols, omit NAs from them, and enforce that they are greater than 0. +The post-processing layers we add to our `frosting` are nearly as simple. We +first predict, unnest the prediction list-cols, omit NAs from them, and enforce +that they are greater than 0. -The third component of an to an `epi_workflow`, the model, is smooth quantile regression, which has three main arguments - the quantiles, aheads, and degree.
+The third component of an `epi_workflow`, the model, is smooth quantile +regression, which has three main arguments: the quantiles, aheads, and degree. -After creating our `epi_workflow` with these components, we get our test data based on longest lag period and make the predictions. +After creating our `epi_workflow` with these components, we get our test data +based on the longest lag period and make the predictions. We input our forecaster into a function for ease of use. @@ -149,7 +206,9 @@ smooth_fc <- function(x, aheads = 1:28, degree = 3L, quantiles = 0.5, fd) { } ``` -Notice that we allow the function user to specify the aheads, degree, and quantile as they may want to change these parameter values. We also allow for input of the forecast date as we fixed that at the onset of this demonstration. +Notice that we allow the function user to specify the aheads, degree, and +quantile as they may want to change these parameter values. We also allow for +input of the forecast date as we fixed that at the onset of this demonstration. We now can produce smooth quantile regression predictions for our problem: @@ -158,7 +217,9 @@ smooth_preds <- smooth_fc(edf, fd = fd) smooth_preds ``` -Most often, we're not going to want to limit ourselves to just predicting the median value as there is uncertainty about the predictions, so let's try to predict several different quantiles in addition to the median: +Most often, we're not going to want to limit ourselves to just predicting the +median value as there is uncertainty about the predictions, so let's try to +predict several different quantiles in addition to the median: ```{r, warning = FALSE} several_quantiles <- c(.1, .25, .5, .75, .9) @@ -167,9 +228,13 @@ smooth_preds <- smooth_fc(edf, quantiles = several_quantiles, fd = fd) smooth_preds ``` -We can see that we have different columns for the different quantile predictions. +We can see that we have different columns for the different quantile +predictions. -Let's visualize these results for the sample of two states. We will create a simple plotting function, under which the median predictions are an orange line and the surrounding quantiles are blue bands around this. For comparison, we will include the actual values over time as a black line. +Let's visualize these results for the sample of two states. We will create a +simple plotting function, under which the median predictions are an orange line +and the surrounding quantiles are blue bands around this. For comparison, we +will include the actual values over time as a black line. ```{r} plot_preds <- function(preds, geos_to_plot = NULL, train_test_dat, fd) { @@ -194,17 +259,24 @@ plot_preds <- function(preds, geos_to_plot = NULL, train_test_dat, fd) { } ``` -Since we would like to plot the actual death rates for these states over time, we bind the training and testing data together and input this into our plotting function as follows: +Since we would like to plot the actual death rates for these states over time, +we bind the training and testing data together and input this into our plotting +function as follows: ```{r, warning = FALSE} plot_preds(smooth_preds, geos, bind_rows(tedf, edf), fd) ``` -We can see that the predictions are smooth curves for each state, as expected when using smooth quantile regression. In addition while the curvature of the forecasts matches that of the truth, the forecasts do not look remarkably accurate.
+We can see that the predictions are smooth curves for each state, as expected +when using smooth quantile regression. In addition, while the curvature of the +forecasts matches that of the truth, the forecasts do not look remarkably +accurate. ## Varying the degrees parameter -We can test the impact of different degrees by using the `map()` function. Noting that this may take some time to run, let's try out all degrees from 1 to 7: +We can test the impact of different degrees by using the `map()` function. +Noting that this may take some time to run, let's try out all degrees from 1 +to 7: ```{r, warning = FALSE} smooth_preds_list <- map(1:7, ~ smooth_fc(edf, @@ -215,9 +287,12 @@ smooth_preds_list <- map(1:7, ~ smooth_fc(edf, mutate(degree = .x)) %>% list_rbind() ``` -One way to quantify the impact of these on the forecasting is to look at the mean absolute error (MAE) or mean squared error (MSE) over the degrees. We can select the degree that results in the lowest MAE. +One way to quantify the impact of the degree on the forecasts is to look at the +mean absolute error (MAE) or mean squared error (MSE) over the degrees. We can +select the degree that results in the lowest MAE. -Since the MAE compares the predicted values to the actual values, we will first join the test data to the predicted data for our comparisons: +Since the MAE compares the predicted values to the actual values, we will first +join the test data to the predicted data for our comparisons: ```{r, message = FALSE} tedf_sub <- tedf %>% rename(target_date = time_value, actual = death_rate) %>% @@ -236,7 +311,8 @@ smooth_preds_df_deg <- smooth_preds_list %>% smooth_preds_df_deg %>% arrange(mean) ``` -Instead of just looking at the raw numbers, let's create a simple line plot to visualize how the MAE changes over degrees for this data: +Instead of just looking at the raw numbers, let's create a simple line plot to +visualize how the MAE changes over degrees for this data: ```{r} ggplot(smooth_preds_df_deg, aes(degree, mean)) + @@ -245,26 +321,42 @@ ggplot(smooth_preds_df_deg, aes(degree, mean)) + ylab("Mean MAE") ``` -We can see that the degree that results in the lowest MAE is 3. Hence, we could pick this degree for future forecasting work on this data. +We can see that the degree that results in the lowest MAE is 3. Hence, we could +pick this degree for future forecasting work on this data. ## A brief comparison between smoothing and no smoothing -Now, we will briefly compare the results from using smooth quantile regression to those obtained without smoothing. The latter approach amounts to ordinary quantile regression to get predictions for the intended target date. The main drawback is that it ignores the fact that the responses all represent the same signal, just for different ahead values. In contrast, the smooth quantile regression approach utilizes this information about the data structure - the fact that the aheads in are not be independent of each other, but that they are naturally related over time by a smooth curve. - -To get the basic quantile regression results we can utilize the forecaster that we've already built. We can simply set the degree to be the number of ahead values to re-run the code without smoothing. +Now, we will briefly compare the results from using smooth quantile regression +to those obtained without smoothing. The latter approach amounts to ordinary +quantile regression to get predictions for the intended target date.
The main +drawback is that it ignores the fact that the responses all represent the same +signal, just for different ahead values. In contrast, the smooth quantile +regression approach utilizes this information about the data structure - the +fact that the aheads are not independent of each other, but that they are +naturally related over time by a smooth curve. + +To get the basic quantile regression results, we can utilize the forecaster that +we've already built. We can simply set the degree to be the number of ahead +values to re-run the code without smoothing. ```{r, warning = FALSE} -baseline_preds <- smooth_fc(edf, degree = 28L, quantiles = several_quantiles, fd = fd) +baseline_preds <- smooth_fc( + edf, degree = 28L, quantiles = several_quantiles, fd = fd) ``` -And we can produce the corresponding plot to inspect the predictions obtained under the baseline model: +And we can produce the corresponding plot to inspect the predictions obtained +under the baseline model: ```{r, warning = FALSE} plot_preds(baseline_preds, geos, bind_rows(tedf, edf), fd) ``` -Unlike for smooth quantile regression, the resulting forecasts are not smooth curves, but rather jagged and irregular in shape. +Unlike for smooth quantile regression, the resulting forecasts are not smooth +curves, but rather jagged and irregular in shape. -For a more formal comparison between the two approaches, we could compare the test performance in terms of accuracy through calculating either the, MAE or MSE, where the performance measure of choice can be calculated over over all times and locations for each ahead value +For a more formal comparison between the two approaches, we could compare the +test performance in terms of accuracy by calculating either the MAE or +MSE, where the performance measure of choice can be calculated over all +times and locations for each ahead value: ```{r, message = FALSE} baseline_preds_mae_df <- baseline_preds %>% @@ -297,27 +389,54 @@ mean(baseline_preds_mae_df$mean) mean(smooth_preds_mae_df$mean) ``` -The former shows that forecasts for the immediate future and for the distant future are more inaccurate for both models under consideration. The latter shows that the smooth quantile regression model and baseline models perform very similarly overall, with the smooth quantile regression model only slightly beating the baseline model in terms of overall average MAE. +The former shows that forecasts for the immediate future and for the distant +future are more inaccurate for both models under consideration. The latter shows +that the smooth quantile regression model and baseline models perform very +similarly overall, with the smooth quantile regression model only slightly +beating the baseline model in terms of overall average MAE. -One other commonly used metric is the Weighted Interval Score (WIS, [Bracher et al., 2021](https://arxiv.org/pdf/2005.12881.pdf)), which a scoring rule that is based on the population quantiles. The point is to score the interval, whereas MAE only evaluates the accuracy of the point forecast. +One other commonly used metric is the Weighted Interval Score +(WIS, [Bracher et al., 2021](https://arxiv.org/pdf/2005.12881.pdf)), +which is a scoring rule that is based on the population quantiles. The point is to +score the interval, whereas MAE only evaluates the accuracy of the point +forecast. -Let $F$ be a forecast composed of predicted quantiles $q_{\tau}$ for the set of quantile levels $\tau$.
Then, in terms of the predicted quantiles, the WIS for target variable $Y$ is represented as follows ([McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)): +Let $F$ be a forecast composed of predicted quantiles $q_{\tau}$ for the set of +quantile levels $\tau$. Then, in terms of the predicted quantiles, the WIS for +target variable $Y$ is represented as follows +([McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)): $$ WIS(F, Y) = 2 \sum_{\tau} \phi_{\tau} (Y - q_{\tau}) $$ -where $\phi_{\tau}(x) = \tau |x|$ for $x \geq 0$ and$\phi_{\tau}(x) = (1 - \tau) |x|$ for $x < 0$. +where $\phi_{\tau}(x) = \tau |x|$ for $x \geq 0$ +and $\phi_{\tau}(x) = (1 - \tau) |x|$ for $x < 0$. -This form is general as it can accommodate both symmetric and asymmetric quantile levels. If the quantile levels are symmetric, then we can alternatively express the WIS as a collection of central prediction intervals ($\ell_{\alpha}, u_{\alpha}$) parametrized by the exclusion probability $\alpha$: +This form is general as it can accommodate both symmetric and asymmetric +quantile levels. If the quantile levels are symmetric, then we can alternatively +express the WIS as a collection of central prediction intervals +($\ell_{\alpha}, u_{\alpha}$) parametrized by the exclusion probability +$\alpha$: $$ WIS(F, Y) = \sum_{\alpha} \{ (u_{\alpha} - \ell_{\alpha}) + 2 \cdot \text{dist}(Y, [\ell_{\alpha}, u_{\alpha}]) \} $$ -where $\text{dist}(a,S)$ is the smallest distance between point $a$ and an element of set $S$. - -While we implement the former representation, we mention this form because it shows the that the score can be decomposed into the addition of a sharpness component (first term in the summand) and an under/overprediction component (second term in the summand). This alternative representation is useful because from it, we more easily see the major limitation to the WIS, which is that the score tends to prioritize sharpness (how wide the interval is) relative to coverage (if the interval contains the truth). - -Now, we write a simple function for the first representation of the score that is compatible with the latest version of `epipredict` (adapted from the corresponding function in [smoothmpf-epipredict](https://github.com/dajmcdon/smoothmpf-epipredict)). The inputs for it are the actual and predicted values and the quantile levels. +where $\text{dist}(a,S)$ is the smallest distance between point $a$ and an +element of set $S$. + +While we implement the former representation, we mention this form because it +shows that the score can be decomposed into the addition of a sharpness +component (first term in the summand) and an under/overprediction component +(second term in the summand). This alternative representation is useful because +from it, we more easily see the major limitation of the WIS, which is that the +score tends to prioritize sharpness (how wide the interval is) relative to +coverage (if the interval contains the truth). + +Now, we write a simple function for the first representation of the score that +is compatible with the latest version of `epipredict` (adapted from the +corresponding function in +[smoothmpf-epipredict](https://github.com/dajmcdon/smoothmpf-epipredict)). The +inputs for it are the actual and predicted values and the quantile levels.
```{r} wis_dist_quantile <- function(actual, values, quantile_levels) { @@ -329,13 +448,18 @@ wis_dist_quantile <- function(actual, values, quantile_levels) { } ``` -Next, we apply the `wis_dist_quantile` function to get a WIS score for each state on each target date. We then compute the mean WIS for each ahead value over all of the states. The results for each of the smooth and baseline forecasters are shown in a similar style line plot as we chose for MAE: +Next, we apply the `wis_dist_quantile` function to get a WIS score for each +state on each target date. We then compute the mean WIS for each ahead value +over all of the states. The results for each of the smooth and baseline +forecasters are shown in a similar style line plot as we chose for MAE: ```{r} smooth_preds_wis_df <- smooth_preds %>% left_join(tedf_sub, by = c("geo_value", "target_date")) %>% rowwise() %>% - mutate(wis = wis_dist_quantile(actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), several_quantiles)) %>% + mutate(wis = wis_dist_quantile(actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), + several_quantiles) + ) %>% group_by(ahead) %>% summarise(mean = mean(wis)) %>% mutate(type = "smooth") @@ -343,7 +467,9 @@ smooth_preds_wis_df <- smooth_preds %>% baseline_preds_wis_df <- baseline_preds %>% left_join(tedf_sub, by = c("geo_value", "target_date")) %>% rowwise() %>% - mutate(wis = wis_dist_quantile(actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), several_quantiles)) %>% + mutate(wis = wis_dist_quantile(actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), + several_quantiles) + ) %>% group_by(ahead) %>% summarise(mean = mean(wis)) %>% mutate(type = "baseline") @@ -357,21 +483,33 @@ ggplot(preds_wis_df, aes(ahead, mean, color = type)) + scale_color_manual(values = c("#A69943", "#063970")) ``` -The results are consistent with what we saw for MAE: The forecasts for the near and distant future tend to be inaccurate for both models. The smooth quantile regression model only slightly outperforms the baseline model. +The results are consistent with what we saw for MAE: The forecasts for the near +and distant future tend to be inaccurate for both models. The smooth quantile +regression model only slightly outperforms the baseline model. -Though averaging the WIS score over location and time tends to be the primary aggregation scheme used in evaluation and model comparisons (see, for example, [McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)), we can also obtain a single numerical summary by averaging over the aheads, times, and locations: +Though averaging the WIS score over location and time tends to be the primary +aggregation scheme used in evaluation and model comparisons (see, for example, +[McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)), +we can also obtain a single numerical summary by averaging over the aheads, +times, and locations: ```{r} mean(baseline_preds_wis_df$mean) mean(smooth_preds_wis_df$mean) ``` -Overall, both perspectives agree that the smooth quantile regression model tends to perform only slightly better than the baseline model in terms of average WIS, illustrating the difficulty of this forecasting problem. +Overall, both perspectives agree that the smooth quantile regression model tends +to perform only slightly better than the baseline model in terms of average WIS, +illustrating the difficulty of this forecasting problem. 
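+Before wrapping up, a quick hand check can help build intuition for the
+quantile-score form of the WIS used above. This is only an illustrative sketch
+with made-up numbers (not part of the original analysis): for an observed value
+of 4 and predicted quantiles (2, 3, 5) at levels (0.1, 0.5, 0.9), the formula
+gives $2 \cdot (0.1 \cdot 2 + 0.5 \cdot 1 + 0.1 \cdot 1) = 1.6$, and
+`wis_dist_quantile()` should reproduce this value if it implements the formula
+above.
+
+```{r, eval = FALSE}
+# Illustrative hand check of the quantile-score form of the WIS (toy numbers):
+# actual = 4, predicted quantiles (2, 3, 5) at levels (0.1, 0.5, 0.9).
+# By the formula, WIS = 2 * (0.1 * 2 + 0.5 * 1 + (1 - 0.9) * 1) = 1.6.
+wis_dist_quantile(
+  actual = 4, values = c(2, 3, 5), quantile_levels = c(0.1, 0.5, 0.9)
+)
+```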
# What we've learned in a nutshell -Smooth quantile regression is used in multi-period forecasting for predicting several horizons simultaneously with a single smooth curve. It operates under the key assumption that the future of the response can be approximated well by a smooth curve. +Smooth quantile regression is used in multi-period forecasting for predicting +several horizons simultaneously with a single smooth curve. It operates under +the key assumption that the future of the response can be approximated well by a +smooth curve. # Attribution -The information presented on smooth quantile regression is from [Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723). +The information presented on smooth quantile regression is from +[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723). diff --git a/vignettes/articles/symptom-surveys.Rmd b/vignettes/articles/symptom-surveys.Rmd index 9558b80c5..e8d4a8228 100644 --- a/vignettes/articles/symptom-surveys.Rmd +++ b/vignettes/articles/symptom-surveys.Rmd @@ -20,23 +20,48 @@ knitr::opts_chunk$set( # Introduction -During the COVID-19 pandemic, Delphi ran COVID-19 symptom surveys through Facebook and Google. In these surveys, millions of -people in the US were asked whether they or the people that they know are experiencing COVID-like symptoms. This enabled the -calculation of a "% CLI-in-community" signal for counties across the US. This is simply an estimate of the percentage of people who know someone who is presently sick with a COVID-like illness. - -These surveys were valuable tools for monitoring the pandemic because they reported daily and not subject to reporting delays that plague other sources of data. - -In this vignette, we will look at whether the % CLI-in-community indicators from the Facebook and Google surveys improve the accuracy of short-term forecasts of county-level COVID-19 case rates. The purpose here is to study and demonstrate the value of the Facebook and Google % CLI-in-community signals to add predictive power beyond what we can achieve with simple time series models trained on case rates alone. - -Note that this vignette was adapted from the following [Delphi blog post](https://delphi.cmu.edu/blog/2020/09/21/can-symptoms-surveys-improve-covid-19-forecasts/), with the necessary modifications to enable the use of `epipredict`. The results may be different from those on the blog post (one reason is that we are exploring the use of a different forecaster and another is that we're using the most recent versions of the datasets). - -Now, we will delve into the forecasting problem set-up and code followed by a discussion of the results. +During the COVID-19 pandemic, Delphi ran COVID-19 symptom surveys through +Facebook and Google. In these surveys, millions of people in the US were asked +whether they or the people that they know are experiencing COVID-like symptoms. +This enabled the calculation of a "% CLI-in-community" signal for counties +across the US. This is simply an estimate of the percentage of people who know +someone who is presently sick with a COVID-like illness. + +These surveys were valuable tools for monitoring the pandemic because they were +reported daily and were not subject to reporting delays that plague other +sources of data. + +In this vignette, we will look at whether the % CLI-in-community indicators from +the Facebook and Google surveys improve the accuracy of short-term forecasts of +county-level COVID-19 case rates.
The purpose here is to study and demonstrate +the value of the Facebook and Google % CLI-in-community signals to add +predictive power beyond what we can achieve with simple time series models +trained on case rates alone. + +Note that this vignette was adapted from the following [Delphi blog +post](https://delphi.cmu.edu/blog/2020/09/21/can-symptoms-surveys-improve-covid-19-forecasts/), +with the necessary modifications to enable the use of `epipredict`. The results +may be different from those on the blog post (one reason is that we are +exploring the use of a different forecaster and another is that we're using the +most recent versions of the datasets). + +Now, we will delve into the forecasting problem set-up and code followed by a +discussion of the results. ## Problem Setup -Our goal is to predict county-level COVID-19 case incidence rates for 1 and 2 weeks ahead. For this, we restrict our attention to the 442 counties that had at least 200 confirmed cases by May 14, 2020 (the end of the Google survey data) and in which both the Facebook and Google % CLI-in-community signals are available. +Our goal is to predict county-level COVID-19 case incidence rates for 1 and 2 +weeks ahead. For this, we restrict our attention to the 442 counties that had at +least 200 confirmed cases by May 14, 2020 (the end of the Google survey data) +and in which both the Facebook and Google % CLI-in-community signals are +available. -To set the notation, let $Y_{l,t}$ denote the smoothed COVID-19 case incidence rate for location (county) $l$ and day $t$. Let $F_{l,t}$ and $G_{l,t}$ denote the Facebook and Google % CLI-in-community signals, respectively, for location $l$ and time $t$. Note that we rescale all these signals from their given values in our API so that they are true proportions. We then evaluate the following four models: +To set the notation, let $Y_{l,t}$ denote the smoothed COVID-19 case incidence +rate for location (county) $l$ and day $t$. Let $F_{l,t}$ and $G_{l,t}$ denote +the Facebook and Google % CLI-in-community signals, respectively, for location +$l$ and time $t$. Note that we rescale all these signals from their given values +in our API so that they are true proportions. We then evaluate the following +four models: $$ \begin{align} @@ -46,31 +71,79 @@ h(Y_{l,t+d}) &\approx \alpha + \sum_{j = 0}^2 \beta_j h(Y_{l,t-7j}) + \sum_{j = h(Y_{l,t+d}) &\approx \alpha + \sum_{j = 0}^2 \beta_j h(Y_{l,t-7j}) + \sum_{j = 0}^2 \gamma_j h(F_{l, t-7j}) + \sum_{j = 0}^2 \tau_j h(G_{l, t-7j}) \end{align} $$ -Here $d = 7$ or $d = 14$ depending on the target value, and $h$ is a transformation to be specified later. - -We'll call the first model the "Cases" model because it bases its predictions of future case rates on COVID-19 case rates (from 0, 1 and 2 weeks back). The second model is called "Cases + Facebook" because it additionally incorporates the current Facebook signal, and the Facebook signal from 1 and 2 weeks back. The third model, "Cases + Google", is exactly the same as the second but substitutes the Google signal instead of the Facebook one. The fourth and final model model, "Cases + Facebook + Google", uses both Facebook and Google signals. For each model, we use our canned autoregressive forecaster with quantile regression to forecast at time $t_0$ (and predict case rates at $t_0 + d$). We train over training over all locations, $l$ (all 442 counties), and all time $t$ that are within the most recent 14 days of data available up to and including time $t_0$. 
In other words, we use 14 trailing days for the training set. - -The forecasts are denoted by $\hat{Y}_{l, t_0 + d}$. To see how accurate these forecasts are, we use the scaled absolute error: +Here $d = 7$ or $d = 14$ depending on the target value, and $h$ is a +transformation to be specified later. + +We'll call the first model the "Cases" model because it bases its predictions of +future case rates on COVID-19 case rates (from 0, 1 and 2 weeks back). The +second model is called "Cases + Facebook" because it additionally incorporates +the current Facebook signal, and the Facebook signal from 1 and 2 weeks back. +The third model, "Cases + Google", is exactly the same as the second but +substitutes the Google signal instead of the Facebook one. The fourth and final +model, "Cases + Facebook + Google", uses both Facebook and Google signals. +For each model, we use our canned autoregressive forecaster with quantile +regression to forecast at time $t_0$ (and predict case rates at $t_0 + d$). We +train over all locations, $l$ (all 442 counties), and all times $t$ +that are within the most recent 14 days of data available up to and including +time $t_0$. In other words, we use 14 trailing days for the training set. + +The forecasts are denoted by $\hat{Y}_{l, t_0 + d}$. To see how accurate these +forecasts are, we use the scaled absolute error: $$ \frac{| \hat{Y}_{l, t_0 + d} - Y_{l, t_0 + d} |} {| Y_{l, t_0} - Y_{l, t_0 + d} |} $$ -where the error in the denominator is the strawman model error. This model simply uses the most recent case rate for future predictions. You may recognize this as an application of the flatline forecaster from `epipredict`. +where the error in the denominator is the strawman model error. This model +simply uses the most recent case rate for future predictions. You may recognize +this as an application of the flatline forecaster from `epipredict`. -We normalize in this manner for two reasons. First, since the scaled error is the fraction improvement over the strawman’s error, we get an interpretable scale, where numebrs like 0.8 or 0.9 are favorable, and numbers like 2 or 5 are increasingly disastrous. Second, in such problems we should expect considerable county-to-county variability in the forecasting difficulty. Normalizing by the strawman's error helps to adjust for this so that the results on aggregate are not dominated by the county-to-county differences. +We normalize in this manner for two reasons. First, since the scaled error is +the fraction improvement over the strawman’s error, we get an interpretable +scale, where numbers like 0.8 or 0.9 are favorable, and numbers like 2 or 5 are +increasingly disastrous. Second, in such problems we should expect considerable +county-to-county variability in the forecasting difficulty. Normalizing by the +strawman's error helps to adjust for this so that the results on aggregate are +not dominated by the county-to-county differences. ## Transformations -To help stabilize the variance of the case, Facebook and Google data, we chose to use the logit transformation on their proportions. In actuality, we use a "padded" version $h(x) = \log (\frac{x+a}{1-x+a})$ such that the numerator and denominator are pushed away from zero by a small constant, $a = 0.01$. An alternative to the logit transform is using a log transform (as in $h(x) = \log (x+a)$ where $a$ is for padding). Note that such variance-stabilizing transformations are used in the model fitting.
When we calculate the errors, we back-transform the values for comparison using the inverse transform $h^{-1}$ so that we may calculate them on the original scale. +To help stabilize the variance of the case, Facebook, and Google data, we chose +to use the logit transformation on their proportions. In actuality, we use a +"padded" version $h(x) = \log (\frac{x+a}{1-x+a})$ such that the numerator and +denominator are pushed away from zero by a small constant, $a = 0.01$. An +alternative to the logit transform is using a log transform (as in +$h(x) = \log (x+a)$ where $a$ is for padding). Note that such +variance-stabilizing transformations are used in the model fitting. When we +calculate the errors, we back-transform the values for comparison using the +inverse transform $h^{-1}$ so that we may calculate them on the original scale. ## Forecasting Code -The code below marches the forecast date $t_0$ forward, one day at a time for the nine forecasting dates for which the four models can all be fit (May 6, 2020 to May 14, 2020). Then, it fits the models, makes predictions 7 and 14 days ahead (as permissible by the data), and records the errors. - -There are a number of benefits to using `epipredict` over writing the code from scratch to fit and predict under each of the models. First, we do not have to reformat the data for input into a model or concern ourselves with its unique interface. We instead work under unifying interface to streamline the modelling process. Second, we avoid having to write our own function to append shift values (leads or lags). This is done for us under-the-hood in the `arx_forecaster()` function. You can see this in the forecaster output by inspecting the `step_epi_lag()` and `step_epi_ahead()` pre-processing steps in the `epi_workflow`. Third, we only need one for loop for the forecast dates (and not a second loop for the different aheads) as we can easily use `map()` with the `arx_forecaster()` over the different ahead values, as we’ve done before. - -However, there are some trade-offs to bear in mind. For instance, since we are using a canned arx forecaster, we are not able to easily modify and add steps such as that for signal transformations to the pre-processing (this is pre-specified as part of using a canned forecaster). If we were to code-up our own forecaster under the `epipredict` framework, we could easily add steps to re-scale and transform the signals to our `epi_recipe`. This would make the code more succinct and self-contained. +The code below marches the forecast date $t_0$ forward, one day at a time for +the nine forecasting dates for which the four models can all be fit (May 6, 2020 +to May 14, 2020). Then, it fits the models, makes predictions 7 and 14 days +ahead (as permissible by the data), and records the errors. + +There are a number of benefits to using `epipredict` over writing the code from +scratch to fit and predict under each of the models. First, we do not have to +reformat the data for input into a model or concern ourselves with its unique +interface. We instead work under a unifying interface to streamline the modelling +process. Second, we avoid having to write our own function to append shift +values (leads or lags). This is done for us under the hood in the +`arx_forecaster()` function. You can see this in the forecaster output by +inspecting the `step_epi_lag()` and `step_epi_ahead()` pre-processing steps in +the `epi_workflow`.
Third, we only need one for loop for the forecast dates (and +not a second loop for the different aheads) as we can easily use `map()` with +the `arx_forecaster()` over the different ahead values, as we’ve done before. + +However, there are some trade-offs to bear in mind. For instance, since we are +using a canned arx forecaster, we are not able to easily modify and add steps +such as that for signal transformations to the pre-processing (this is +pre-specified as part of using a canned forecaster). If we were to code-up our +own forecaster under the `epipredict` framework, we could easily add steps to +re-scale and transform the signals to our `epi_recipe`. This would make the code +more succinct and self-contained. ```{r, message = FALSE, warning = FALSE} library(epidatr) @@ -305,9 +378,17 @@ out_df <- do.call(rbind, out_list) ## Results: All Four Models -Since there are only two common forecast dates available for the four models for the 14-day-ahead forecasts (May 13 and May 14, 2020), we skip studying the 14-day-ahead forecast results in this four-way model discussion. +Since there are only two common forecast dates available for the four models for +the 14-day-ahead forecasts (May 13 and May 14, 2020), we skip studying +the 14-day-ahead forecast results in this four-way model discussion. -Below we compute the median scaled errors for each of the four models over the 9-day test period. We can see that adding either or both of the survey signals improves on the median scaled error of the model that uses cases only, with the biggest gain achieved by the "Cases + Google" model. We can also see that the median scaled errors are all close to 1 (with all but that from the "Cases + Google" and "Cases + Facebook + Google" models exceeding 1), which speaks to the difficulty of the forecasting problem. +Below we compute the median scaled errors for each of the four models over +the 9-day test period. We can see that adding either or both of the survey +signals improves on the median scaled error of the model that uses cases only, +with the biggest gain achieved by the "Cases + Google" model. We can also see +that the median scaled errors are all close to 1 (with all but that from the +"Cases + Google" and "Cases + Facebook + Google" models exceeding 1), which +speaks to the difficulty of the forecasting problem. ```{r} library(dplyr) @@ -357,7 +438,16 @@ knitr::kable( ) ``` $$\\[0.01in]$$ -Are these differences in median scaled errors significant? Some basic hypothesis testing suggests that some probably are: Below we conduct a sign test for whether the difference in the "Cases" model’s scaled error and each other model’s scaled error is centered at zero. The sign test is run on the 9 test days x 442 counties = 3978 pairs of scaled errors. The p-value from the "Cases" versus "Cases + Google" test is tiny and well below a cutoff of 0.01. In contrast, the p-values from the "Cases" versus "Cases + Facebook" and the "Cases" versus "Cases + Facebook + Google" tests are much bigger and exceed this cutoff, suggesting that the Facebook survey is not adding as much for this situation (meaning the time and ahead considered, etc.) +Are these differences in median scaled errors significant? Some basic hypothesis +testing suggests that some probably are: Below we conduct a sign test for +whether the difference in the "Cases" model’s scaled error and each other +model’s scaled error is centered at zero. The sign test is run on the 9 test +days x 442 counties = 3978 pairs of scaled errors. 
The p-value from the "Cases" +versus "Cases + Google" test is tiny and well below a cutoff of 0.01. In +contrast, the p-values from the "Cases" versus "Cases + Facebook" and the +"Cases" versus "Cases + Facebook + Google" tests are much bigger and exceed this +cutoff, suggesting that the Facebook survey is not adding as much for this +situation (meaning the time and ahead considered, etc.) ```{r} # Compute p-values using the sign test against a one-sided alternative, for @@ -393,7 +483,17 @@ knitr::kable( ) ``` $$\\[0.01in]$$ -We should take these test results with a grain of salt because the sign test assumes independence of observations, which clearly cannot be true given the spatiotemporal structure of our forecasting problem. To mitigate the dependence across time (which intuitively seems to matter more than that across space), we recomputed these tests in a stratified way, where for each day we run a sign test on the scaled errors between two models over all 442 counties. The results are plotted as histograms below; the "Cases + Google" and (to a lesser extent) the "Cases + Facebook + Google" models appear to deliver some decently small p-values, but this is not very evident with the "Cases + Facebook" model. Taking a larger sample size (of more than nine test days) would be a natural next step to take to see if these results persist. +We should take these test results with a grain of salt because the sign test +assumes independence of observations, which clearly cannot be true given the +spatiotemporal structure of our forecasting problem. To mitigate the dependence +across time (which intuitively seems to matter more than that across space), we +recomputed these tests in a stratified way, where for each day we run a sign +test on the scaled errors between two models over all 442 counties. The results +are plotted as histograms below; the "Cases + Google" and (to a lesser extent) +the "Cases + Facebook + Google" models appear to deliver some decently small +p-values, but this is not very evident with the "Cases + Facebook" model. Taking +a larger sample size (of more than nine test days) would be a natural next step +to take to see if these results persist. ```{r} # Red, blue (similar to ggplot defaults), then yellow @@ -417,7 +517,14 @@ ggplot(res_dif4 %>% ## Results: First Two Models -One way to get a larger sample size with the current data is to compare a subset of the models. Therefore, next we focus on comparing results between the "Cases" and "Cases + Facebook" models only. Restricting to common forecast dates for these two models yields a much longer test period for the 7 and 14-day-ahead forecasts: May 20 through August 27, 2020. We make the code to compare these two models a simple function so that we have the option to use it over different dates or aheads (in particular, this function will be useful for the next section where we explore several ahead values): +One way to get a larger sample size with the current data is to compare a subset +of the models. Therefore, next we focus on comparing results between the "Cases" +and "Cases + Facebook" models only. Restricting to common forecast dates for +these two models yields a much longer test period for the 7 and 14-day-ahead +forecasts: May 20 through August 27, 2020. 
We make the code to compare these two +models a simple function so that we have the option to use it over different +dates or aheads (in particular, this function will be useful for the next +section where we explore several ahead values): ```{r} case_fb_mods <- function(forecast_dates, leads) { @@ -510,7 +617,9 @@ leads <- c(7, 14) res <- case_fb_mods(dates, leads) ``` -The median scaled errors over the test period are computed and reported below. Now we see a decent improvement in median scaled error for the "Cases + Facebook" model, which is true for both 7-day-ahead and 14-day-ahead forecasts. +The median scaled errors over the test period are computed and reported below. +Now we see a decent improvement in median scaled error for the "Cases + +Facebook" model, which is true for both 7-day-ahead and 14-day-ahead forecasts. ```{r} # For just models 1 and 2, then calculate the scaled @@ -555,7 +664,19 @@ knitr::kable( ``` $$\\[0.01in]$$ -Thanks to the extended length of the test period, we can also plot the trajectories of the median scaled errors over time, as we do below, with the left plot concerning 7-day-ahead forecasts, and the right 14-day-ahead forecasts. These plots reveal something at once interesting and bothersome: the median scaled errors are quite volatile over time, and for some periods in July, forecasting became much harder, with the scaled errors reaching above 1.5 for 7-day-ahead forecasts, and above 1.8 for 14-day-ahead forecasts. Furthermore, we can see a clear visual difference between the median scaled errors from the "Cases + Facebook" model in red and the "Cases" model in black. The former appears to be below the latter during periods with low median scaled errors and above during periods where forecasting becomes hard and the scaled errors shoot above 1. This suggests that the Facebook signal may be more useful to incorporate during periods of time where forecasting is easier. +Thanks to the extended length of the test period, we can also plot the +trajectories of the median scaled errors over time, as we do below, with the +left plot concerning 7-day-ahead forecasts, and the right 14-day-ahead +forecasts. These plots reveal something at once interesting and bothersome: the +median scaled errors are quite volatile over time, and for some periods in July, +forecasting became much harder, with the scaled errors reaching above 1.5 +for 7-day-ahead forecasts, and above 1.8 for 14-day-ahead forecasts. +Furthermore, we can see a clear visual difference between the median scaled +errors from the "Cases + Facebook" model in red and the "Cases" model in black. +The former appears to be below the latter during periods with low median scaled +errors and above during periods where forecasting becomes hard and the scaled +errors shoot above 1. This suggests that the Facebook signal may be more useful +to incorporate during periods of time where forecasting is easier. ```{r} # Plot median errors as a function of time, for models 1 and 2, and both 7 and @@ -575,7 +696,14 @@ ggplot( theme(legend.position = "bottom", legend.title = element_blank()) ``` -The fact that the lines are non-coincident suggests that the results we’re seeing here are likely to be significantly different, though it’s hard to say definitively given the complicated dependence structure present in the data. Below we perform a sign test for whether the difference in scaled errors from the "Cases" and "Cases + Facebook" models is centered at zero. 
The p-values are essentially zero, given the large sample sizes: 98 test days in total for the 7-day-ahead forecasts and 91 days for the 14-day-ahead forecasts (times 442 counties for each day). +The fact that the lines are non-coincident suggests that the results we’re +seeing here are likely to be significantly different, though it’s hard to say +definitively given the complicated dependence structure present in the data. +Below we perform a sign test for whether the difference in scaled errors from +the "Cases" and "Cases + Facebook" models is centered at zero. The p-values are +essentially zero, given the large sample sizes: 98 test days in total for +the 7-day-ahead forecasts and 91 days for the 14-day-ahead forecasts (times 442 +counties for each day). ```{r} # Compute p-values using the sign test against a one-sided alternative, just @@ -605,7 +733,8 @@ knitr::kable( ``` $$\\[0.01in]$$ -If we stratify and recompute p-values by forecast date, the bulk of p-values are quite small. +If we stratify and recompute p-values by forecast date, the bulk of p-values are +quite small. ```{r} ggplot(res_dif2 %>% @@ -624,13 +753,30 @@ ggplot(res_dif2 %>% theme(legend.position = "none") ``` -This exploration illustrates an important point: The test period should be chosen so that it is large enough in size to see differences (if there are any) between the models under comparison. While we did not observe significant differences between the "Cases" and "Cases + Facebook" models when the test period was small at 9 days, we did observe a significant difference over this extended test period of nearly 100 days. +This exploration illustrates an important point: The test period should be +chosen so that it is large enough in size to see differences (if there are any) +between the models under comparison. While we did not observe significant +differences between the "Cases" and "Cases + Facebook" models when the test +period was small at 9 days, we did observe a significant difference over this +extended test period of nearly 100 days. ## Varying the Number of Days Ahead -Statistical significance refers to whether an effect exists (as opposed to occurring by chance), while practical significance refers to the magnitude of the effect and whether it is meaningful in the real world. Hypothesis tests, such as the sign tests we conducted above, tell us whether the differences in errors are statistically significant, but not about their practical significance. For example, for 7-day-ahead forecasts, what does an improvement of 0.019 units on the scaled error scale really mean, when comparing the "Cases + Facebook" model to the "Cases" model? Is this a meaningful gain in practice? - -To answer questions such as these, we can look at the way that the median scaled errors behave as a function of the number of days ahead. Previously, we considered forecasting case rates just 7 and 14 days ahead. Now we will systematically examine 5 through 20 days ahead (the key difference in the code being that we use `leads = 5:20`). Note that running the code for this many leads may take a while. +Statistical significance refers to whether an effect exists (as opposed to +occurring by chance), while practical significance refers to the magnitude of +the effect and whether it is meaningful in the real world. Hypothesis tests, +such as the sign tests we conducted above, tell us whether the differences in +errors are statistically significant, but not about their practical +significance. 
For example, for 7-day-ahead forecasts, what does an improvement +of 0.019 units on the scaled error scale really mean, when comparing the "Cases ++ Facebook" model to the "Cases" model? Is this a meaningful gain in practice? + +To answer questions such as these, we can look at the way that the median scaled +errors behave as a function of the number of days ahead. Previously, we +considered forecasting case rates just 7 and 14 days ahead. Now we will +systematically examine 5 through 20 days ahead (the key difference in the code +being that we use `leads = 5:20`). Note that running the code for this many +leads may take a while. ```{r} # Consider a number of leads @@ -639,7 +785,10 @@ leads <- 5:20 res <- case_fb_mods(dates, leads) ``` -We obtain and plot the median scaled errors for the "Cases" and "Cases + Facebook" models for different number of days ahead for the forecast target. This is done over May 20 through August 27 for the forecast dates that are common to the two models. +We obtain and plot the median scaled errors for the "Cases" and "Cases + +Facebook" models for different number of days ahead for the forecast target. +This is done over May 20 through August 27 for the forecast dates that are +common to the two models. ```{r} err_by_lead <- res %>% @@ -676,22 +825,63 @@ ggplot(err_by_lead, aes(x = lead, y = err)) + theme_bw() # + theme(legend.position = "bottom", legend.title = element_blank()) ``` -A first glance shows that the "Cases + Facebook" model, in red, gives better median scaled errors at all ahead values. Furthermore, the vertical gap between the two curves is consistently in the range of what we were seeing before (for 7 and 14 days ahead), around 0.02 units on the scaled error scale. - -But if we look at this from a different angle, by considering the horizontal gap between the curves, then we can infer something quite a bit more interesting: For 7-day-ahead forecasts, the median scaled error of the "Cases" model (indicated by the horizontal gray line) is comparable to that of 12-day-ahead forecasts from the "Cases + Facebook" model. So using the % CLI-in-community signal from our Facebook survey buys us around 4 extra days of lead time for this forecasting problem, which is striking. As you might imagine, different forecast targets yield different lead times (for 14-day-ahead forecasts, it appears to be around 2 to 3 days of lead time), but the added value of the survey signal is clear throughout. +A first glance shows that the "Cases + Facebook" model, in red, gives better +median scaled errors at all ahead values. Furthermore, the vertical gap between +the two curves is consistently in the range of what we were seeing before (for 7 +and 14 days ahead), around 0.02 units on the scaled error scale. + +But if we look at this from a different angle, by considering the horizontal gap +between the curves, then we can infer something quite a bit more interesting: +For 7-day-ahead forecasts, the median scaled error of the "Cases" model +(indicated by the horizontal gray line) is comparable to that of 12-day-ahead +forecasts from the "Cases + Facebook" model. So using the % CLI-in-community +signal from our Facebook survey buys us around 4 extra days of lead time for +this forecasting problem, which is striking. As you might imagine, different +forecast targets yield different lead times (for 14-day-ahead forecasts, it +appears to be around 2 to 3 days of lead time), but the added value of the +survey signal is clear throughout. 
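
To put a number on that horizontal gap, a small helper along the lines of the
sketch below (not evaluated) can tabulate the implied lead-time gain directly
from the `err_by_lead` summary. The column names `model`, `lead`, and `err` are
assumptions here, not guaranteed by the chunk above, so rename them to match
whatever that chunk actually produces.

```{r, eval = FALSE}
# A rough sketch of the lead-time comparison described above: for each ahead
# value of the "Cases" model, find the largest ahead at which the
# "Cases + Facebook" model still attains an equal or lower median scaled error.
# Assumes err_by_lead has columns `model`, `lead`, and `err` (hypothetical
# names; adjust to match the summary computed earlier).
library(dplyr)

cases_only <- err_by_lead %>% filter(model == "Cases")
cases_fb <- err_by_lead %>% filter(model == "Cases + Facebook")

lead_gain <- cases_only %>%
  rowwise() %>%
  mutate(
    matching_lead = max(c(cases_fb$lead[cases_fb$err <= err], -Inf)),
    days_gained = matching_lead - lead
  ) %>%
  ungroup()

lead_gain
```

Reading off `days_gained` at `lead == 7` should land near the 4 to 5 days of
lead time eyeballed from the plot.
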
## Wrap-Up -In this vignette, we've shown that either of the Facebook or Google % CLI-in-community signals can improve the accuracy of short-term forecasts of county-level COVID-19 case rates. The significance of these improvements is more apparent with the Facebook signal, thanks to the much larger test period. With either signal, the magnitude of the improvement offered seems modest but nontrivial, especially because the forecasting problem is so difficult in the first place. - -We reiterate that this was just a demo. Our analysis was fairly simple and lacks a few qualities that we’d expect in a truly comprehensive, realistic forecasting analysis. For reflection, let's discuss three possible areas to improve: - -1. The models we considered are simple autoregressive structures from standard time series and could be improved in various ways (including, considering other relevant dimensions like mobility measures, county health metrics, etc.). - -2. The forecasts we produced are point rather than distributional forecasts. That is, we predict a single number, rather than an entire distribution for what happens 7 and 14 days ahead. Distributional forecasts portray uncertainty in a transparent way, which is important in practice. - -3. The way we trained our forecast models does not account for data latency and revisions, which are critical issues. For each (retrospective) forecast date, $t_0$, we constructed forecasts by training on data that we fetched from the API today, "as of" the day of writing this, and not "as of" the forecast date. This matters because nearly all signals are subject to latency and go through multiple revisions. - -On the flip side, our example here was not that far away from being realistic. The models we examined are actually not too different from Delphi’s forecasters in production. Also, the way we fit the quantile regression models in the code extends immediately to multiple quantile regression (this just requires changing the parameter `quantile_levels` in the call to `quantile_reg()`). And lastly, it’s fairly easy to change the data acquisition step in the code so that data gets pulled "as of" the forecast date (this requires specifying the parameter `as_of` in the call to `pub_covidcast()` and should change per forecast date). - -Hopefully these preliminary findings have gotten you excited about the possible uses of this symptom survey data. For further practice, try your hand at implementing the suggested improvements or develop your own novel analytic approach to extract insights from this data. +In this vignette, we've shown that either of the Facebook or Google % +CLI-in-community signals can improve the accuracy of short-term forecasts of +county-level COVID-19 case rates. The significance of these improvements is more +apparent with the Facebook signal, thanks to the much larger test period. With +either signal, the magnitude of the improvement offered seems modest but +nontrivial, especially because the forecasting problem is so difficult in the +first place. + +We reiterate that this was just a demo. Our analysis was fairly simple and lacks +a few qualities that we’d expect in a truly comprehensive, realistic forecasting +analysis. For reflection, let's discuss three possible areas to improve: + +1. The models we considered are simple autoregressive structures from standard + time series and could be improved in various ways (including, considering + other relevant dimensions like mobility measures, county health metrics, + etc.). + +2. 
The forecasts we produced are point rather than distributional forecasts. + That is, we predict a single number, rather than an entire distribution for + what happens 7 and 14 days ahead. Distributional forecasts portray + uncertainty in a transparent way, which is important in practice. + +3. The way we trained our forecast models does not account for data latency and + revisions, which are critical issues. For each (retrospective) forecast + date, $t_0$, we constructed forecasts by training on data that we fetched + from the API today, "as of" the day of writing this, and not "as of" the + forecast date. This matters because nearly all signals are subject to + latency and go through multiple revisions. + +On the flip side, our example here was not that far away from being realistic. +The models we examined are actually not too different from Delphi’s forecasters +in production. Also, the way we fit the quantile regression models in the code +extends immediately to multiple quantile regression (this just requires changing +the parameter `quantile_levels` in the call to `quantile_reg()`). And lastly, +it’s fairly easy to change the data acquisition step in the code so that data +gets pulled "as of" the forecast date (this requires specifying the parameter +`as_of` in the call to `pub_covidcast()` and should change per forecast date). + +Hopefully these preliminary findings have gotten you excited about the possible +uses of this symptom survey data. For further practice, try your hand at +implementing the suggested improvements or develop your own novel analytic +approach to extract insights from this data. diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 690844c40..089f000df 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -29,9 +29,18 @@ library(epipredict) ## Introducing the ARX classifier -The `arx_classifier()` is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. - -To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: +The `arx_classifier()` is an autoregressive classification model for `epi_df` +data that is used to predict a discrete class for each case under consideration. +It is a direct forecaster in that it estimates the classes at a specific horizon +or ahead value. + +To get a sense of how the `arx_classifier()` works, let's consider a simple +example with minimal inputs. For this, we will use the built-in +`case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from +JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take +a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our
+objective is to predict whether the case rates are increasing when considering
+the 0, 7 and 14 day case rates:

```{r}
jhu <- case_death_rate_subset %>%
@@ -49,17 +58,39 @@ out <- arx_classifier(jhu,
out$predictions
```

-The key takeaway from the predictions is that there are two prediction classes: (-Inf, 0.25] and (0.25, Inf). This is because for our goal of classification the classes must be discrete. The discretization of the real-valued outcome is controlled by the `breaks` argument, which defaults to 0.25. Such breaks will be automatically extended to cover the entire real line. For example, the default break of 0.25 is silently extended to breaks = c(-Inf, .25, Inf) and, therefore, results in two classes: [-Inf, 0.25] and (0.25, Inf). These two classes are used to discretize the outcome. The conversion of the outcome to such classes is handled internally. So if discrete classes already exist for the outcome in the `epi_df`, then we recommend to code a classifier from scratch using the `epi_workflow` framework for more control.
-
-The `trainer` is a `parsnip` model describing the type of estimation such that `mode = "classification"` is enforced. The two typical trainers that are used are `parsnip::logistic_reg()` for two classes or `parsnip::multinom_reg()` for more than two classes.
+The key takeaway from the predictions is that there are two prediction classes:
+(-Inf, 0.25] and (0.25, Inf). This is because for our goal of classification
+the classes must be discrete. The discretization of the real-valued outcome is
+controlled by the `breaks` argument, which defaults to 0.25. Such breaks will be
+automatically extended to cover the entire real line. For example, the default
+break of 0.25 is silently extended to `breaks = c(-Inf, .25, Inf)` and,
+therefore, results in two classes: (-Inf, 0.25] and (0.25, Inf). These two
+classes are used to discretize the outcome. The conversion of the outcome to
+such classes is handled internally. So if discrete classes already exist for the
+outcome in the `epi_df`, then we recommend coding a classifier from scratch
+using the `epi_workflow` framework for more control.
+
+The `trainer` is a `parsnip` model describing the type of estimation such that
+`mode = "classification"` is enforced. The two typical trainers that are used
+are `parsnip::logistic_reg()` for two classes and `parsnip::multinom_reg()` for
+more than two classes.

```{r}
workflows::extract_spec_parsnip(out$epi_workflow)
```

-From the parsnip model specification, we can see that the trainer used is logistic regression, which is expected for our binary outcome. More complicated trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be used (however, we will stick to the basics in this gentle introduction to the classifier).
+From the parsnip model specification, we can see that the trainer used is
+logistic regression, which is expected for our binary outcome. More complicated
+trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be
+used (however, we will stick to the basics in this gentle introduction to the
+classifier).

-If you use the default trainer of logistic regression for binary classification and you decide against using the default break of 0.25, then you should only input one break so that there are two classification bins to properly dichotomize the outcome. For example, let's set a break of 0.5 instead of relying on the default of 0.25.
We can do this by passing 0.5 to the `breaks` argument in `arx_class_args_list()` as follows: +If you use the default trainer of logistic regression for binary classification +and you decide against using the default break of 0.25, then you should only +input one break so that there are two classification bins to properly +dichotomize the outcome. For example, let's set a break of 0.5 instead of +relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` +argument in `arx_class_args_list()` as follows: ```{r} out_break_0.5 <- arx_classifier(jhu, @@ -72,18 +103,52 @@ out_break_0.5 <- arx_classifier(jhu, out_break_0.5$predictions ``` -Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, Inf). -See `help(arx_class_args_list)` for other available modifications. - -Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These have default values of 0, 7, and 14 days for the lags of the predictors and 7 days ahead of the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound for the number of training rows per key. If you would like some practice with using this, then remove the filtering command to obtain data within "2021-06-04" and "2021-12-31" and instead set `n_training` to be the number of days between these two dates, inclusive of the end points. The end results should be the same. In addition to `n_training`, there are `forecast_date` and `target_date` to specify the date that the forecast is created and intended, respectively. We will not dwell on such arguments here as they are not unique to this classifier or absolutely essential to understanding how it operates. The remaining arguments will be discussed organically, as they are needed to serve our purposes. For information on any remaining arguments that are not discussed here, please see the function documentation for a complete list and their definitions. +Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, +Inf). See `help(arx_class_args_list)` for other available modifications. + +Additional arguments that may be supplied to `arx_class_args_list()` include the +expected `lags` and `ahead` arguments for an autoregressive-type model. These +have default values of 0, 7, and 14 days for the lags of the predictors and 7 +days ahead of the forecast date for predicting the outcome. There is also +`n_training` to indicate the upper bound for the number of training rows per +key. If you would like some practice with using this, then remove the filtering +command to obtain data within "2021-06-04" and "2021-12-31" and instead set +`n_training` to be the number of days between these two dates, inclusive of the +end points. The end results should be the same. In addition to `n_training`, +there are `forecast_date` and `target_date` to specify the date that the +forecast is created and intended, respectively. We will not dwell on such +arguments here as they are not unique to this classifier or absolutely essential +to understanding how it operates. The remaining arguments will be discussed +organically, as they are needed to serve our purposes. For information on any +remaining arguments that are not discussed here, please see the function +documentation for a complete list and their definitions. 
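
As a concrete (and purely illustrative) sketch of these arguments in action,
the call below swaps the explicit date filter for `n_training`; it is not
evaluated here, and the value 211 is simply the number of days from 2021-06-04
through 2021-12-31, inclusive of both endpoints.

```{r, eval = FALSE}
# Illustrative sketch, not run. Instead of filtering time_value to the window
# 2021-06-04 through 2021-12-31, bound the training set with n_training.
# Restrict geo_value to the same five states as above to reproduce the earlier
# predictions exactly.
arx_classifier(
  case_death_rate_subset,
  outcome = "case_rate",
  predictors = "case_rate",
  args_list = arx_class_args_list(
    lags = c(0, 7, 14), # default lags of the predictors
    ahead = 7, # predict 7 days ahead of the forecast date
    n_training = 211 # days from 2021-06-04 to 2021-12-31, inclusive
  )
)
```
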
## Example of using the ARX classifier

-Now, to demonstrate the power and utility of this built-in arx classifier, we will loosely adapt the classification example that was written from scratch in `vignette("preprocessing-and-models")`. However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader.
-
-To motivate this example, a major use of autoregressive classification models is to predict upswings or downswings like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present?
-
-To answer this question, we can create a predictive model for upswings and downswings of case rates rather than the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes
+Now, to demonstrate the power and utility of this built-in ARX classifier, we
+will loosely adapt the classification example that was written from scratch in
+`vignette("preprocessing-and-models")`. However, to keep things simple and not
+merely a direct translation, we will only consider two prediction categories and
+leave the extension to three as an exercise for the reader.
+
+To motivate this example, a major use of autoregressive classification models is
+to predict upswings or downswings like in hotspot prediction models to
+anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al.
+(2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on
+these). In our case, one simple question that such models can help answer is...
+Do we expect that the future will have increased case rates or not relative to
+the present?
+
+To answer this question, we can create a predictive model for upswings and
+downswings of case rates rather than the raw case rates themselves. For this
+situation, our target is to predict whether there is an increase in case rates
+or not. Following
+[McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118),
+we look at the
+relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case
+rate at location $l$ at time $t$ and the latter is the rate for that location at
+time $t+a$. Using these variables, we define a categorical response variable
+with two classes

$$\begin{align}
Z_{l,t} = \left\{\begin{matrix}
@@ -93,7 +158,9 @@ Z_{l,t} = \left\{\begin{matrix}
\end{matrix}\right.
\end{align}$$

where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta$ > 0.25, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$.
On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a >25\% increase in the new cases over the past week.

-The logistic regression model we use to predict this binary response can be considered to be a simplification of the multinomial regression model presented in `vignette("preprocessing-and-models")`:
+The logistic regression model we use to predict this binary response can be
+considered to be a simplification of the multinomial regression model presented
+in `vignette("preprocessing-and-models")`:

$$\begin{align}
\pi_{\text{up}}(x) &= Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\
@@ -105,9 +172,15 @@ $$
g_{\text{up}}(x) = \log\left ( \frac{\Pr(Z_{l, t} = \text{up} \vert x)}{\Pr(Z_{l, t} = \text{not up} \vert x)} \right ) = \beta_{10} + \beta_{11}Y_{l,t}^\Delta + \beta_{12}Y_{l,t-7}^\Delta + \beta_{13}Y_{l,t-14}^\Delta.
$$

-Now then, we will operate on the same subset of the `case_death_rate_subset` that we used in our above example. This time, we will use it to investigate whether the number of newly reported cases over the past 7 days has increased by at least 25\% compared to the preceding week for our sample of states.
+Now then, we will operate on the same subset of the `case_death_rate_subset`
+that we used in our above example. This time, we will use it to investigate
+whether the number of newly reported cases over the past 7 days has increased by
+at least 25% compared to the preceding week for our sample of states.

-Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in `vignette("preprocessing-and-models")`.
+Notice that by using the `arx_classifier()` function we've completely eliminated
+the need to manually categorize the response variable and implement
+pre-processing steps, which was necessary in
+`vignette("preprocessing-and-models")`.

```{r}
log_res <- arx_classifier(
@@ -122,15 +195,47 @@ log_res$epi_workflow
```

-Comparing the pre-processing steps for this to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials of transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`), leading the response (`step_epi_ahead()`), which are both constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`).
-
-On this topic, it is important to understand that we are not actually concerned about the case values themselves. Rather we are concerned whether the quantity of cases in the future is a lot larger than that in the present. For this reason, the outcome does not remain as cases, but rather it is transformed by using either growth rates (as the predictors and outcome in our example are) or lagged differences. While the latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), and it is conceptually easy to understand because it is simply the change of the value for the horizon, it is not the default. The default is `growth_rate`.
One reason for this choice is because the growth rate is on a rate scale, not on the absolute scale, so it fosters comparability across locations without any conscious effort (on the other hand, when using the `lag_difference` one would need to take care to operate on rates per 100k and not raw counts). We utilize `epiprocess::growth_rate()` to create the outcome using some of the additional arguments. One important argument for the growth rate calculation is the `method`. Only `rel_change` for relative change should be used as the method because the test data is the only data that is accessible and the other methods require access to the training data. - -The other optional arguments for controlling the growth rate calculation (that can be inputted as `additional_gr_args`) can be found in the documentation for `epiprocess::growth_rate()` and the related `vignette("growth_rate", package = "epiprocess")`. +Comparing the pre-processing steps for this to those in the other vignette, we +can see that they are not precisely the same, but they cover the same essentials +of transforming `case_rate` to the growth rate scale (`step_growth_rate()`), +lagging the predictors (`step_epi_lag()`), leading the response +(`step_epi_ahead()`), which are both constructed from the growth rates, and +constructing the binary classification response variable (`step_mutate()`). + +On this topic, it is important to understand that we are not actually concerned +about the case values themselves. Rather we are concerned whether the quantity +of cases in the future is a lot larger than that in the present. For this +reason, the outcome does not remain as cases, but rather it is transformed by +using either growth rates (as the predictors and outcome in our example are) or +lagged differences. While the latter is closer to the requirements for the +[2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), +and it is conceptually easy to understand because it is simply the change of the +value for the horizon, it is not the default. The default is `growth_rate`. One +reason for this choice is because the growth rate is on a rate scale, not on the +absolute scale, so it fosters comparability across locations without any +conscious effort (on the other hand, when using the `lag_difference` one would +need to take care to operate on rates per 100k and not raw counts). We utilize +`epiprocess::growth_rate()` to create the outcome using some of the additional +arguments. One important argument for the growth rate calculation is the +`method`. Only `rel_change` for relative change should be used as the method +because the test data is the only data that is accessible and the other methods +require access to the training data. + +The other optional arguments for controlling the growth rate calculation (that +can be inputted as `additional_gr_args`) can be found in the documentation for +`epiprocess::growth_rate()` and the related +`vignette("growth_rate", package = "epiprocess")`. ### Visualizing the results -To visualize the prediction classes across the states for the target date, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, like our 7-day ahead predictions, then that would be a pretty sad heatmap (which would look more like a bar chart than a heatmap)... 
So instead of doing that, let's get predictions for several aheads and plot a heatmap across the target dates. To get the predictions across several ahead values, we will use the map function in the same way that we did in other vignettes:
+To visualize the prediction classes across the states for the target date, we
+can plot our results as a heatmap. However, if we were to plot the results for
+only one target date, like our 7-day ahead predictions, then that would be a
+pretty sad heatmap (which would look more like a bar chart than a heatmap)... So
+instead of doing that, let's get predictions for several aheads and plot a
+heatmap across the target dates. To get the predictions across several ahead
+values, we will use the map function in the same way that we did in other
+vignettes:

```{r}
multi_log_res <- map(1:40, ~ arx_classifier(
@@ -144,7 +249,8 @@ multi_log_res <- map(1:40, ~ arx_classifier(
)$predictions) %>% list_rbind()
```

-We can plot a the heatmap of the results over the aheads to see if there's anything novel or interesting to take away:
+We can plot the heatmap of the results over the aheads to see if there's
+anything novel or interesting to take away:

```{r}
ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
@@ -154,8 +260,24 @@ ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
scale_fill_brewer(palette = "Set1")
```

-While there is a bit of variability near to the end, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened at the beginning of January 2022.
+While there is a bit of variability near the end, we can clearly see that
+there are upswings for all states starting from the beginning of January 2022,
+which we can recall was when there was a massive spike in cases for many states.
+So our results seem to align well with what actually happened at the beginning
+of January 2022.

## A brief reflection

-The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose will depend on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider translating this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`.
+The most noticeable benefit of using the `arx_classifier()` function is the
+simplification and reduction of the manual implementation of the classifier from
+about 30 down to 3 lines. However, as we noted before, the trade-off for
+simplicity is control over the precise pre-processing, post-processing, and
+additional features embedded in the coding of a classifier. So the good thing is
+that `epipredict` provides both: a built-in `arx_classifier()` and the means to
+implement your own classifier from scratch by using the `epi_workflow`
+framework. Which you choose will depend on the circumstances. Our advice is
+to start with using the built-in classifier for ostensibly simple projects and
+begin to implement your own when the modelling project takes a complicated turn.
+To get some practice on coding up a classifier by hand, consider translating
+this binary classification model example to an `epi_workflow`, akin to that in
+`vignette("preprocessing-and-models")`.
diff --git a/vignettes/epipredict.Rmd b/vignettes/epipredict.Rmd
index 1c3808711..af83dc321 100644
--- a/vignettes/epipredict.Rmd
+++ b/vignettes/epipredict.Rmd
@@ -28,15 +28,22 @@ library(epipredict)

# Goals for the package

-At a high level, our goal with `{epipredict}` is to make running simple Machine Learning / Statistical forecasters for epidemiology easy. However, this package is extremely extensible, and that is part of its utility. Our hope is that it is easy for users with epi training and some statistics to fit baseline models while still allowing those with more nuanced statistical understanding to create complicated specializations using the same framework.
+At a high level, our goal with `{epipredict}` is to make running simple Machine
+Learning / Statistical forecasters for epidemiology easy. However, this package
+is extremely extensible, and that is part of its utility. Our hope is that it is
+easy for users with epi training and some statistics to fit baseline models
+while still allowing those with more nuanced statistical understanding to create
+complicated specializations using the same framework.

-Serving both populations is the main motivation for our efforts, but at the same time, we have tried hard to make it useful.
+Serving both populations is the main motivation for our efforts, but at the same
+time, we have tried hard to make it useful.

## Baseline models

-We provide a set of basic, easy-to-use forecasters that work out of the box.
-You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).
+We provide a set of basic, easy-to-use forecasters that work out of the box. You
+should be able to do a reasonably limited amount of customization on them. Any
+serious customization happens with the framework discussed below.

For the basic forecasters, we provide:

@@ -44,35 +51,64 @@ For the basic forecasters, we provide:
* Autoregressive forecaster
* Autoregressive classifier

-All the forcasters we provide are built on our framework. So we will use these basic models to illustrate its flexibility.
+All the forecasters we provide are built on our framework. So we will use these
+basic models to illustrate its flexibility.

## Forecasting framework

-Our framework for creating custom forecasters views the prediction task as a set of modular components. There are four types of components:
+Our framework for creating custom forecasters views the prediction task as a set
+of modular components. There are four types of components:

1. Preprocessor: make transformations to the data before model training
2. Trainer: train a model on data, resulting in a fitted model object
3. Predictor: make predictions, using a fitted model object and processed test data
4. Postprocessor: manipulate or transform the predictions before returning

-Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot of overlap. This is by design, and is in fact a feature. The truth is that `{epipredict}` is a wrapper around much that is contained in these packages. Therefore, if you want something from this -verse, it should "just work" (we hope).
-
-The reason for the overlap is that `{workflows}` _already implements_ the first three steps. And it does this very well. However, it is missing the postprocessing stage and currently has no plans for such an implementation. And this feature is important. The baseline forecaster we provide _requires_ postprocessing. Anything more complicated needs this as well.
-
-The second omission from `{tidymodels}` is support for panel data. Besides epidemiological data, economics, psychology, sociology, and many other areas frequently deal with data of this type. So the framework of behind `{epipredict}` implements this. In principle, this has nothing to do with epidemiology, and one could simply use this package as a solution for the missing functionality in `{tidymodels}`. Again, this should "just work".
-
-All of the _panel data_ functionality is implemented through the `epi_df` data type in the companion [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) package. There is much more to see there, but for the moment, it's enough to look at a simple one:
+Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially
+the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot
+of overlap. This is by design, and is in fact a feature. The truth is that
+`{epipredict}` is a wrapper around much that is contained in these packages.
+Therefore, if you want something from this -verse, it should "just work" (we
+hope).
+
+The reason for the overlap is that `{workflows}` *already implements* the first
+three steps. And it does this very well. However, it is missing the
+postprocessing stage and currently has no plans for such an implementation. And
+this feature is important. The baseline forecaster we provide *requires*
+postprocessing. Anything more complicated needs this as well.
+
+The second omission from `{tidymodels}` is support for panel data. Besides
+epidemiological data, economics, psychology, sociology, and many other areas
+frequently deal with data of this type. So the framework behind
+`{epipredict}` implements this. In principle, this has nothing to do with
+epidemiology, and one could simply use this package as a solution for the
+missing functionality in `{tidymodels}`. Again, this should "just work".
+
+All of the *panel data* functionality is implemented through the `epi_df` data
+type in the companion [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/)
+package. There is much more to see there, but for the moment, it's enough to
+look at a simple one:

```{r epidf}
jhu <- case_death_rate_subset
jhu
```

-This data is built into the package and contains the measured variables `case_rate` and `death_rate` for COVID-19 at the daily level for each US state for the year 2021. The "panel" part is because we have repeated measurements across a number of locations.
+This data is built into the package and contains the measured variables
+`case_rate` and `death_rate` for COVID-19 at the daily level for each US state
+for the year 2021.
The "panel" part is because we have repeated measurements +across a number of locations. -The `epi_df` encodes the time stamp as `time_value` and the `key` as `geo_value`. While these 2 names are required, the values don't need to actually represent such objects. Additional `key`'s are also supported (like age group, ethnicity, taxonomy, etc.). +The `epi_df` encodes the time stamp as `time_value` and the `key` as +`geo_value`. While these 2 names are required, the values don't need to actually +represent such objects. Additional `key`'s are also supported (like age group, +ethnicity, taxonomy, etc.). -The `epi_df` also contains some metadata that describes the keys as well as the vintage of the data. It's possible that data collected at different times for the _same set_ of `geo_value`'s and `time_value`'s could actually be different. For more details, see [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/articles/epiprocess.html). +The `epi_df` also contains some metadata that describes the keys as well as the +vintage of the data. It's possible that data collected at different times for +the *same set* of `geo_value`'s and `time_value`'s could actually be different. +For more details, see +[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/articles/epiprocess.html). @@ -85,21 +121,32 @@ preprocessing, training, and prediction, bound together, through a package calle `{workflows}`. We built `{epipredict}` on top of that setup. In this way, you CAN use almost everything they provide. -* However, `{workflows}` doesn't do postprocessing. And nothing in the -verse handles _panel data_. +* However, `{workflows}` doesn't do postprocessing. And nothing in the -verse +handles _panel data_. * The tidy-team doesn't have plans to do either of these things. (We checked). * There are two packages that do _time series_ built on `{tidymodels}`, but it's -"basic" time series: 1-step AR models, exponential smoothing, STL decomposition, etc.[^2] Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework. +"basic" time series: 1-step AR models, exponential smoothing, STL decomposition, +etc.[^2] Our group has not prioritized these sorts of models for epidemic +forecasting, but one could also integrate these methods into our framework. -[^2]: These are [`{timetk}`](https://business-science.github.io/timetk/index.html) and [`{modeltime}`](https://business-science.github.io/timetk/index.html). There are _lots_ of useful methods there than can be used to do fairly complex machine learning methodology, though not directly for panel data and not for direct prediction of future targets. +[^2]: These are [`{timetk}`](https://business-science.github.io/timetk/index.html) +and [`{modeltime}`](https://business-science.github.io/timetk/index.html). There +are *lots* of useful methods there than can be used to do fairly complex machine +learning methodology, though not directly for panel data and not for direct +prediction of future targets. # Show me the basics -We start with the `jhu` data displayed above. -One of the "canned" forecasters we provide is an autoregressive forecaster with (or without) covariates that _directly_ trains on the response. This is in contrast to a typical "iterative" AR model that trains to predict one-step-ahead, and then plugs in the predictions to "leverage up" to longer horizons. +We start with the `jhu` data displayed above. 
One of the "canned" forecasters we +provide is an autoregressive forecaster with (or without) covariates that +*directly* trains on the response. This is in contrast to a typical "iterative" +AR model that trains to predict one-step-ahead, and then plugs in the +predictions to "leverage up" to longer horizons. -We'll estimate the model jointly across all locations using only the most recent 30 days. +We'll estimate the model jointly across all locations using only the most +recent 30 days. ```{r demo-workflow} jhu <- jhu %>% filter(time_value >= max(time_value) - 30) @@ -112,16 +159,25 @@ out <- arx_forecaster( The `out` object has two components: - 1. The predictions which is just another `epi_df`. It contains the predictions for each location along with additional columns. By default, these are a 90% predictive interval, the `forecast_date` (the date on which the forecast was putatively made) and the `target_date` (the date for which the forecast is being made). + 1. The predictions which is just another `epi_df`. It contains the predictions for +each location along with additional columns. By default, these are a 90% +predictive interval, the `forecast_date` (the date on which the forecast was +putatively made) and the `target_date` (the date for which the forecast is being +made). ```{r} out$predictions ``` - 2. A list object of class `epi_workflow`. This object encapsulates all the instructions necessary to create the prediction. More details on this below. + 2. A list object of class `epi_workflow`. This object encapsulates all the +instructions necessary to create the prediction. More details on this below. ```{r} out$epi_workflow ``` -By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead, using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), 1 week back and 2 weeks back. The predictors and outcome can be changed directly. The rest of the defaults are encapsulated into a list of arguments. This list is produced by `arx_args_list()`. +By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead, +using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), 1 +week back and 2 weeks back. The predictors and outcome can be changed directly. +The rest of the defaults are encapsulated into a list of arguments. This list is +produced by `arx_args_list()`. ## Simple adjustments @@ -143,11 +199,19 @@ out2week <- arx_forecaster( ) ``` -Here, we've used different lags on the `case_rate` and are now predicting 2 weeks ahead. This example also illustrates a major difficulty with the "iterative" versions of AR models. This model doesn't produce forecasts for `case_rate`, and so, would not have data to "plug in" for the necessary lags.[^1] +Here, we've used different lags on the `case_rate` and are now predicting 2 +weeks ahead. This example also illustrates a major difficulty with the +"iterative" versions of AR models. This model doesn't produce forecasts for +`case_rate`, and so, would not have data to "plug in" for the necessary +lags.[^1] -[^1]: An obvious fix is to instead use a VAR and predict both, but this would likely increase the variance of the model, and therefore, may lead to less accurate forecasts for the variable of interest. +[^1]: An obvious fix is to instead use a VAR and predict both, but this would +likely increase the variance of the model, and therefore, may lead to less +accurate forecasts for the variable of interest. -Another property of the basic model is the predictive interval. 
We describe this in more detail in a different vignette, but it is easy to request multiple quantiles. +Another property of the basic model is the predictive interval. We describe this +in more detail in a different vignette, but it is easy to request multiple +quantiles. ```{r differential-levels} out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"), @@ -157,7 +221,11 @@ out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"), ) ``` -The column `.pred_dstn` in the `predictions` object is actually a "distribution" here parameterized by its quantiles. For this default forecaster, these are created using the quantiles of the residuals of the predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can grab a particular quantile, +The column `.pred_dstn` in the `predictions` object is actually a "distribution" +here parameterized by its quantiles. For this default forecaster, these are +created using the quantiles of the residuals of the predictive model (possibly +symmetrized). Here, we used 23 quantiles, but one can grab a particular +quantile, ```{r q1} head(quantile(out_q$predictions$.pred_distn, p = .4)) @@ -173,7 +241,8 @@ out_q$predictions %>% unnest(.pred_distn) # then unnest it ``` -Additional simple adjustments to the basic forecaster can be made using the function: +Additional simple adjustments to the basic forecaster can be made using the +function: ```{r, eval = FALSE} arx_args_list( @@ -186,9 +255,12 @@ arx_args_list( ## Changing the engine -So far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model. -The `trainer` argument determines the type of model we want. -This takes a [`{parsnip}`](https://parsnip.tidymodels.org) model. The default is linear regression, but we could instead use a random forest with the `{ranger}` package: +So far, our forecasts have been produced using simple linear regression. But +this is not the only way to estimate such a model. The `trainer` argument +determines the type of model we want. This takes a +[`{parsnip}`](https://parsnip.tidymodels.org) model. The default is linear +regression, but we could instead use a random forest with the `{ranger}` +package: ```{r ranger, warning = FALSE} out_rf <- arx_forecaster( @@ -215,7 +287,9 @@ out_qr <- arx_forecaster( ) ``` -FWIW, this last case (using quantile regression), is not far from what the Delphi production forecast team used for its Covid forecasts over the past few years. +FWIW, this last case (using quantile regression), is not far from what the +Delphi production forecast team used for its Covid forecasts over the past few +years. ## Inner workings @@ -367,7 +441,8 @@ But ideally, a user could create their own forecasters by building up the components we provide. In other vignettes, we try to walk through some of these customizations. -To illustrate everything above, here is (roughly) the code for the `flatline_forecaster()` applied to the `case_rate`. +To illustrate everything above, here is (roughly) the code for the +`flatline_forecaster()` applied to the `case_rate`. 
```{r} r <- epi_recipe(jhu) %>% diff --git a/vignettes/panel-data.Rmd b/vignettes/panel-data.Rmd index f2b4d8d09..0dea322f2 100644 --- a/vignettes/panel-data.Rmd +++ b/vignettes/panel-data.Rmd @@ -239,7 +239,9 @@ summary(extract_fit_engine(wf_linreg)) ``` This output tells us the coefficients of the fitted model; for instance, -the estimated intercept is $\widehat{\alpha}_0 =$ `r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for $y_{tijk}$ is +the estimated intercept is $\widehat{\alpha}_0 =$ +`r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for +$y_{tijk}$ is $\widehat\alpha_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`. The summary also tells us that all estimated coefficients are significantly different from zero. Extracting the 95% confidence intervals for the @@ -315,7 +317,8 @@ defined as follows: \end{aligned} \] -where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in location $i$, age group $j$ with education quality $k$), +where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in +location $i$, age group $j$ with education quality $k$), $x_{tijk}$ is the 2-year median income (proportion) at time $t$, and $z_{tijk}$ is the number of graduates (proportion) at time $t$. diff --git a/vignettes/preprocessing-and-models.Rmd b/vignettes/preprocessing-and-models.Rmd index 11171489b..d557ed1f7 100644 --- a/vignettes/preprocessing-and-models.Rmd +++ b/vignettes/preprocessing-and-models.Rmd @@ -58,7 +58,8 @@ and forecasts to characterize the state of an outbreak and its course. They use it to inform public health decision makers on potential consequences of deploying control measures. -One of the outcomes that the CDC forecasts is [death counts from COVID-19](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html). +One of the outcomes that the CDC forecasts is +[death counts from COVID-19](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html). Although there are many state-of-the-art models, we choose to use Poisson regression, the textbook example for modeling count data, as an illustration for using the `epipredict` package with other existing tidymodels packages. @@ -109,7 +110,8 @@ Poisson distribution with mean $\mu_{t+7}$; $s_{\text{state}}$ are dummy variables for each state and take values of either 0 or 1. Preprocessing steps will be performed to prepare the -data for model fitting. But before diving into them, it will be helpful to understand what `roles` are in the `recipes` framework. +data for model fitting. But before diving into them, it will be helpful to +understand what `roles` are in the `recipes` framework. --- @@ -235,9 +237,10 @@ using `layer_residual_quantiles()` should be used before population scaling or else the transformation will make the results uninterpretable. -We wish, now, to predict the 7-day ahead death counts with lagged case rates and death -rates, along with some extra behaviourial predictors. Namely, we will use survey data -from [COVID-19 Trends and Impact Survey](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/fb-survey.html#behavior-indicators). +We wish, now, to predict the 7-day ahead death counts with lagged case rates and +death rates, along with some extra behaviourial predictors. Namely, we will use +survey data from +[COVID-19 Trends and Impact Survey](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/fb-survey.html#behavior-indicators). 
The survey data provides the estimated percentage of people who wore a mask for most or all of the time while in public in the past 7 days and the estimated @@ -413,7 +416,8 @@ We say location $\ell$ is a hotspot at time $t$ when $Z_{\ell,t}$ is `up`, meaning the number of newly reported cases over the past 7 days has increased by at least 25% compared to the preceding week. When $Z_{\ell,t}$ is categorized as `down`, it suggests that there has been at least a 20% -decrease in newly reported cases over the past 7 days (a 20% decrease is the inverse of a 25% increase). Otherwise, we will +decrease in newly reported cases over the past 7 days (a 20% decrease is the +inverse of a 25% increase). Otherwise, we will consider the trend to be `flat`. The expression of the multinomial regression we will use is as follows: @@ -439,7 +443,8 @@ g_{\text{up}}(x) &= \log\left(\frac{Pr(Z_{\ell,t}=\text{up}\mid x)}{Pr(Z_{\ell,t \end{aligned} Preprocessing steps are similar to the previous models with an additional step -of categorizing the response variables. Again, we will use a subset of death rate and case rate data from our built-in dataset +of categorizing the response variables. Again, we will use a subset of death +rate and case rate data from our built-in dataset `case_death_rate_subset`. ```{r} @@ -585,7 +590,12 @@ Sciences 118.51 (2021): e2111453118. [doi:10.1073/pnas.2111453118](https://doi.o ## Attribution -This object contains a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API.](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) +This object contains a modified part of the +[COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) +as [republished in the COVIDcast Epidata API.](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) -This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University -on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020. +This data set is licensed under the terms of the +[Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) +by the Johns Hopkins +University on behalf of its Center for Systems Science in Engineering. Copyright +Johns Hopkins University 2020.