diff --git a/NEWS.md b/NEWS.md index 9a66713a..5ec2303d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,4 +1,3 @@ - # wwinference 0.1.0.99 (dev) ## User-visible changes @@ -7,7 +6,10 @@ hospital admissions to output of function and package data. ([#184](https://gith - `wwinference` now checks whether `site_pop` is fixed per site (see issue [#223](https://github.com/CDCgov/ww-inference-model/issues/226) reported by [@akeyel](https://github.com/akeyel)). ## Internal changes -- Updated the workflow for posting the pages artifact to PRs (issue [#229](https://github.com/CDCgov/ww-inference-model/issues/229)). +- Modified the priors on the infection feedback term and the step size of the weekly random walk in the effective reproductive number (issue [#227](https://github.com/CDCgov/ww-inference-model/issues/227)), based on benchmarking results from the evaluation pipeline described in the [PR](https://github.com/CDCgov/ww-inference-model/pull/236) corresponding to this change. +- Add package workflow diagram to readme ([#248](https://github.com/CDCgov/ww-inference-model/issues/248)) +- `get_plot_subpop_rt()` now uses a shared y-axis to facilitate comparison of R(t) estimates) ([#245](https://github.com/CDCgov/ww-inference-model/issues/245)) +- Updated the workflow for posting the pages artifact to PRs (issue [#229](https://github.com/CDCgov/ww-inference-model/issues/229)(https://github.com/CDCgov/ww-inference-model/issues/229)). - Modify `plot_forecasted_counts()` so that it does not require an evaluation dataset ([#218](https://github.com/CDCgov/ww-inference-model/pull/218)) # wwinference 0.1.0 @@ -22,7 +24,5 @@ As it's written, the package is intended to allow users to do the following: - Validate input data validation with informative error messaging ([#37](https://github.com/CDCgov/ww-inference-model/issues/37), [#54](https://github.com/CDCgov/ww-inference-model/issues/54)) - Provide a wrapper function to generate forward simulated data with user-specified variables. It calls a number of functions to perform specific model components ([#27](https://github.com/CDCgov/ww-inference-model/issues/27)) - Contains S3 class methods applied to the output of the main model wrapper function, the `wwinference_fit` class object ([#58](https://github.com/CDCgov/ww-inference-model/issues/58)). -<<<<<<< HEAD - Wastewater concentration data is expected to be in log scale ([#122](https://onetakeda.box.com/s/pju273g5khx3y3cwoae2zwv3e7vu03x3)). -======= - Wastewater concentration data is expected to be in log scale ([#122](https://github.com/CDCgov/ww-inference-model/pull/122)). diff --git a/R/data.R b/R/data.R index 42660979..35fb62da 100644 --- a/R/data.R +++ b/R/data.R @@ -39,6 +39,47 @@ #' @source vignette_data.R "ww_data" +#' Example evaluation wastewater dataset. +#' +#' A dataset containing the simulated retrospective wastewater concentrations +#' (labeled here as `log_genome_copies_per_ml_eval`) by sample collection date +#' (`date`), the site where the sample was collected (`site`) and the lab +#' where the samples were processed (`lab`). Additional columns that are +#' required attributes needed for the model are the limit of detection for +#' that lab on each day (labeled here as `log_lod`) and the population size of +#' the wastewater catchment area represented by the wastewater concentrations +#' in each `site`. +#' +#' This data is generated via the default values in the +#' `generate_simulated_data()` function. They represent the bare minumum +#' required fields needed to pass to the model, and we recommend that users +#' try to format their own data to match this format. +#' +#' The variables are as follows: +#' +#' @format ## ww_data_eval +#' A tibble with 126 rows and 6 columns +#' \describe{ +#' \item{date}{Sample collection date, formatted in ISO8601 standards as +#' YYYY-MM-DD} +#' \item{site}{The wastewater treatment plant where the sample was collected} +#' \item{lab}{The lab where the sample was processed} +#' \item{log_genome_copies_per_ml_eval}{The natural log of the wastewater +#' concentration measured on the date specified, collected in the site +#' specified, and processed in the lab specified. The package expects +#' this quantity in units of log estimated genome copies per mL.} +#' \item{log_lod}{The log of the limit of detection in the site and lab on a +#' particular day of the quantification device (e.g. PCR). This should be in +#' units of log estimated genome copies per mL.} +#' \item{site_pop}{The population size of the wastewater catchment area +#' represented by the site variable} +#' \item{location}{ A string indicating the location that all of the +#' data is coming from. This is not a necessary column, but instead is +#' included to more realistically mirror a typical workflow} +#' } +#' @source vignette_data.R +"ww_data_eval" + #' Example wastewater dataset with independent site correlations. @@ -100,9 +141,9 @@ #' to match this format. #' #' This data is generated via the default values in the -#' `generate_simulated_data()` function. They represent the bare minumum +#' `generate_simulated_data()` function. They represent the bare minimum #' required fields needed to pass to the model, and we recommend that users -#' try to format their own data to match this formate. +#' try to format their own data to match this format. #' #' The variables are as follows: #' \describe{ @@ -304,8 +345,6 @@ "rt_site_data" - - #' COVID-19 post-Omicron generation interval probability mass function #' #' \describe{ diff --git a/R/figures.R b/R/figures.R index 03ee740d..902e61f2 100644 --- a/R/figures.R +++ b/R/figures.R @@ -268,7 +268,7 @@ get_plot_subpop_rt <- function(draws, linetype = "dashed", show.legend = FALSE ) + - facet_wrap(~subpop_name, scales = "free") + + facet_wrap(~subpop_name) + geom_hline(aes(yintercept = 1), linetype = "dashed") + xlab("") + ylab("Subpopulation R(t)") + diff --git a/R/generate_simulated_data.R b/R/generate_simulated_data.R index 79d4b4d3..ec216e16 100644 --- a/R/generate_simulated_data.R +++ b/R/generate_simulated_data.R @@ -575,7 +575,6 @@ generate_simulated_data <- function(r_in_weeks = # nolint - # Global adjusted R(t) -------------------------------------------------- # I(t)/convolve(I(t), g(t)) #nolint # This is not used directly, but we want to have it for comparing to the @@ -599,6 +598,41 @@ generate_simulated_data <- function(r_in_weeks = # nolint lod_lab_site = lod_lab_site ) + ww_data_eval <- format_ww_data( + log_obs_conc_lab_site = log_obs_conc_lab_site_eval, + ot = ot + ht, + ht = 0, + date_df = date_df, + site_lab_map = site_lab_map, + lod_lab_site = lod_lab_site + ) |> + dplyr::rename( + "log_genome_copies_per_ml_eval" = "log_genome_copies_per_ml" + ) + + # Artificially add values below the LOD---------------------------------- + # Replace it with an NA, will be used as an example of how to format data + # properly. + min_ww_val <- min(ww_data$log_genome_copies_per_ml) + ww_data <- ww_data |> + dplyr::mutate( + "log_genome_copies_per_ml" = + dplyr::case_when( + .data$log_genome_copies_per_ml == + !!min_ww_val ~ 0.5 * .data$log_lod, + TRUE ~ .data$log_genome_copies_per_ml + ) + ) + ww_data_eval <- ww_data_eval |> + dplyr::mutate( + "log_genome_copies_per_ml_eval" = + dplyr::case_when( + .data$log_genome_copies_per_ml_eval == + !!min_ww_val ~ 0.5 * .data$log_lod, + TRUE ~ .data$log_genome_copies_per_ml_eval + ) + ) + ww_data_eval <- format_ww_data( log_obs_conc_lab_site = log_obs_conc_lab_site_eval, diff --git a/R/get_stan_data.R b/R/get_stan_data.R index 12a107d6..824c7afb 100644 --- a/R/get_stan_data.R +++ b/R/get_stan_data.R @@ -74,6 +74,7 @@ get_input_ww_data_for_stan <- function(preprocessed_ww_data, ) |> dplyr::arrange(.data$date, .data$lab_site_index) } + return(ww_data) } @@ -418,6 +419,7 @@ get_stan_data <- function(input_count_data, # Get the last date that there were observations of the epidemiological # indicator (aka cases or hospital admissions counts) last_count_data_date <- max(input_count_data$date, na.rm = TRUE) + # Validate input pmfs---------------------------------------------------- validate_pmf(generation_interval, calibration_time, @@ -651,6 +653,7 @@ get_stan_data <- function(input_count_data, sd_log_sigma_ww_site_prior_sd = params$sd_log_sigma_ww_site_prior_sd, eta_sd_sd = params$eta_sd_sd, + eta_sd_mean = params$eta_sd_mean, sigma_i_first_obs_prior_mode = params$sigma_i_first_obs_prior_mode, sigma_i_first_obs_prior_sd = params$sigma_i_first_obs_prior_sd, p_hosp_prior_mean = params$p_hosp_mean, @@ -682,6 +685,7 @@ get_stan_data <- function(input_count_data, offset_ref_initial_exp_growth_rate_prior_sd = params$offset_ref_initial_exp_growth_rate_prior_sd ) + return(stan_data_list) } @@ -797,6 +801,7 @@ get_ww_indices_and_values <- function(input_ww_data, "Length of censored vectors incorrect" = length(ww_censored) + length(ww_uncensored) == owt ) + ww_sampled_times <- ww_data_joined |> dplyr::pull("t") ww_sampled_subpops <- ww_data_joined |> dplyr::pull("subpop_index") lab_site_to_subpop_spine <- lab_site_site_spine |> diff --git a/R/initialization.R b/R/initialization.R index b8387ac1..bccee0bf 100644 --- a/R/initialization.R +++ b/R/initialization.R @@ -111,10 +111,12 @@ get_inits_for_one_chain <- function(stan_data, stdev = 0.01) { # unstructured correlation matrix L_Omega = as.matrix(diag(2)) ) + if (stan_data$corr_structure_switch == 2) { init_list$L_Omega <- diag((n_subpops - 1)) } + if (stan_data$n_subpops > 1) { init_list$error_rt_subpop <- matrix( stats::rnorm((n_subpops - 1) * n_weeks, diff --git a/R/wwinference.R b/R/wwinference.R index 734705fc..bbf56cff 100644 --- a/R/wwinference.R +++ b/R/wwinference.R @@ -247,6 +247,7 @@ wwinference <- function(ww_data, ) } + # Get site to subpop spine site_subpop_spine <- get_site_subpop_spine( input_ww_data = input_ww_data, diff --git a/README.md b/README.md index 6ce78bb5..0e8042eb 100644 --- a/README.md +++ b/README.md @@ -27,6 +27,10 @@ This will help make clear the data requirements and how to structure this data t - Sam Abbott (seabbs) - Damon Bayer (damonbayer) +# Package workflow +The following depicts the suggested workflow for fitting the wastewater-informed forecasting model. See the ["Getting Started" vignette](https://cdcgov.github.io/ww-inference-model/articles/wwinference.html) for a full example. +![](./man/figures/wwinference_workflow.png) + # Installing and running code ## Install R diff --git a/data-raw/vignette_data.R b/data-raw/vignette_data.R index 8a88867e..6fca63df 100644 --- a/data-raw/vignette_data.R +++ b/data-raw/vignette_data.R @@ -30,10 +30,10 @@ hosp_data_eval_ind <- simulated_data_ind$hosp_data_eval rt_site_data_ind <- simulated_data_ind$rt_site_data rt_global_data_ind <- simulated_data_ind$rt_global_data - usethis::use_data(hosp_data, overwrite = TRUE) usethis::use_data(hosp_data_eval, overwrite = TRUE) usethis::use_data(ww_data, overwrite = TRUE) + usethis::use_data(rt_site_data, overwrite = TRUE) usethis::use_data(rt_global_data, overwrite = TRUE) usethis::use_data(hosp_data_ind, overwrite = TRUE) diff --git a/data/hosp_data.rda b/data/hosp_data.rda index fdde74ca..9721e64f 100644 Binary files a/data/hosp_data.rda and b/data/hosp_data.rda differ diff --git a/data/hosp_data_eval.rda b/data/hosp_data_eval.rda index 705b474f..739e9d1c 100644 Binary files a/data/hosp_data_eval.rda and b/data/hosp_data_eval.rda differ diff --git a/data/hosp_data_eval_ind.rda b/data/hosp_data_eval_ind.rda index d5ecacaf..337b605a 100644 Binary files a/data/hosp_data_eval_ind.rda and b/data/hosp_data_eval_ind.rda differ diff --git a/data/hosp_data_ind.rda b/data/hosp_data_ind.rda index 24161d97..1a63d13e 100644 Binary files a/data/hosp_data_ind.rda and b/data/hosp_data_ind.rda differ diff --git a/data/ww_data.rda b/data/ww_data.rda index eadf5bc5..ed368649 100644 Binary files a/data/ww_data.rda and b/data/ww_data.rda differ diff --git a/data/ww_data_eval.rda b/data/ww_data_eval.rda new file mode 100644 index 00000000..176a52b4 Binary files /dev/null and b/data/ww_data_eval.rda differ diff --git a/data/ww_data_ind.rda b/data/ww_data_ind.rda index 94871746..3acdbe51 100644 Binary files a/data/ww_data_ind.rda and b/data/ww_data_ind.rda differ diff --git a/inst/extdata/example_params.toml b/inst/extdata/example_params.toml index 1975fc60..4f054711 100644 --- a/inst/extdata/example_params.toml +++ b/inst/extdata/example_params.toml @@ -50,9 +50,11 @@ offset_ref_initial_exp_growth_rate_prior_sd = 0.025 autoreg_p_hosp_a = 1 # shape1 parameter of autoreg term on IHR(t) trend autoreg_p_hosp_b = 100 # shape2 parameter of autoreg term on IHR(t) trend +eta_sd_mean = 0.0278 # from posterior of fit to long time series eta_sd_sd = 0.01 -infection_feedback_prior_logmean = 6.37408 # log(mode) + q^2 mode = 500, q = 0.4 -infection_feedback_prior_logsd = 0.4 +infection_feedback_prior_logmean = 4.498 # log(~90) from posterior of fit to long +# time series +infection_feedback_prior_logsd = 0.636 # log(~1.9) [hospital_admission_observation_process] diff --git a/inst/stan/wwinference.stan b/inst/stan/wwinference.stan index 2e205ce8..aeda96f6 100644 --- a/inst/stan/wwinference.stan +++ b/inst/stan/wwinference.stan @@ -89,6 +89,7 @@ data { real sd_log_sigma_ww_site_prior_mode; real sd_log_sigma_ww_site_prior_sd; real eta_sd_sd; + real eta_sd_mean; real p_hosp_prior_mean; real p_hosp_sd_logit; real p_hosp_w_sd_sd; @@ -303,10 +304,10 @@ transformed parameters { log_r_subpop_t_in_weeks = log_r_t_in_weeks + (n_subpops > 1 ? offset_ref_log_r_t[1] : 0); } else { + log_r_subpop_t_in_weeks = to_vector(log_r_subpop_t_in_weeks_matrix[i-1, :]); } - //convert from weekly to daily unadj_r_subpop_t = exp(to_row_vector(ind_m*(log_r_subpop_t_in_weeks))); @@ -402,7 +403,7 @@ model { offset_ref_logit_i_first_obs_prior_sd); offset_ref_initial_exp_growth_rate ~ normal(offset_ref_initial_exp_growth_rate_prior_mean, offset_ref_initial_exp_growth_rate_prior_sd); - eta_sd ~ normal(0, eta_sd_sd); + eta_sd ~ normal(eta_sd_mean, eta_sd_sd); autoreg_rt_subpop ~ beta(autoreg_rt_subpop_a, autoreg_rt_subpop_b); autoreg_rt ~ beta(autoreg_rt_a, autoreg_rt_b); diff --git a/man/figures/wwinference_package_workflow.svg b/man/figures/wwinference_package_workflow.svg new file mode 100644 index 00000000..83ffbb27 --- /dev/null +++ b/man/figures/wwinference_package_workflow.svg @@ -0,0 +1 @@ +Site-lab level wastewater data“total” population count dataPre-processed wastewater dataPre-processedcountdataPre-processed wastewater data with exclusions indicatedwwinference_fitobjectModel parameters & delay distributions via get_model_spec()forecast_date, calibration_time,forecast_horizonMCMC fitting specifications,fit_optspreprocess_ww_data()indicate_ww_exclusions()preprocess_count_data()Fit model via wwinference()raw_input_datafit(CmdStanobject)stan_data_listfit_optsget_draws()wwinference_drawsobjectpredicted_countspredicted_wwsubpop_rtglobal_rtget_plot_forecasted_counts()get_plot_ww_conc()get_plot_global_rt()get_plot_subpop_rt()get_model_diagnostic_flags()summary_diagnostics()parameter_diagnostics()ModeldiagnosticsPlot model outputs diff --git a/man/figures/wwinference_workflow.png b/man/figures/wwinference_workflow.png new file mode 100644 index 00000000..a70e2587 Binary files /dev/null and b/man/figures/wwinference_workflow.png differ diff --git a/man/hosp_data.Rd b/man/hosp_data.Rd index cde27d78..1b6d161a 100644 --- a/man/hosp_data.Rd +++ b/man/hosp_data.Rd @@ -28,9 +28,9 @@ to match this format. } \details{ This data is generated via the default values in the -\code{generate_simulated_data()} function. They represent the bare minumum +\code{generate_simulated_data()} function. They represent the bare minimum required fields needed to pass to the model, and we recommend that users -try to format their own data to match this formate. +try to format their own data to match this format. The variables are as follows: \describe{ diff --git a/man/ww_data_eval.Rd b/man/ww_data_eval.Rd new file mode 100644 index 00000000..2afdd3d1 --- /dev/null +++ b/man/ww_data_eval.Rd @@ -0,0 +1,55 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/data.R +\docType{data} +\name{ww_data_eval} +\alias{ww_data_eval} +\title{Example evaluation wastewater dataset.} +\format{ +\subsection{ww_data_eval}{ + +A tibble with 126 rows and 6 columns +\describe{ +\item{date}{Sample collection date, formatted in ISO8601 standards as +YYYY-MM-DD} +\item{site}{The wastewater treatment plant where the sample was collected} +\item{lab}{The lab where the sample was processed} +\item{log_genome_copies_per_ml_eval}{The natural log of the wastewater +concentration measured on the date specified, collected in the site +specified, and processed in the lab specified. The package expects +this quantity in units of log estimated genome copies per mL.} +\item{log_lod}{The log of the limit of detection in the site and lab on a +particular day of the quantification device (e.g. PCR). This should be in +units of log estimated genome copies per mL.} +\item{site_pop}{The population size of the wastewater catchment area +represented by the site variable} +\item{location}{ A string indicating the location that all of the +data is coming from. This is not a necessary column, but instead is +included to more realistically mirror a typical workflow} +} +} +} +\source{ +vignette_data.R +} +\usage{ +ww_data_eval +} +\description{ +A dataset containing the simulated retrospective wastewater concentrations +(labeled here as \code{log_genome_copies_per_ml_eval}) by sample collection date +(\code{date}), the site where the sample was collected (\code{site}) and the lab +where the samples were processed (\code{lab}). Additional columns that are +required attributes needed for the model are the limit of detection for +that lab on each day (labeled here as \code{log_lod}) and the population size of +the wastewater catchment area represented by the wastewater concentrations +in each \code{site}. +} +\details{ +This data is generated via the default values in the +\code{generate_simulated_data()} function. They represent the bare minumum +required fields needed to pass to the model, and we recommend that users +try to format their own data to match this format. + +The variables are as follows: +} +\keyword{datasets} diff --git a/model_definition.md b/model_definition.md index 8136ed11..fa8f0186 100644 --- a/model_definition.md +++ b/model_definition.md @@ -70,8 +70,10 @@ In the case where the sum of the wastewater site catchment populations meets or This amounts to modeling the wastewater catchments populations as approximately non-overlapping; every infected individual either does not contribute to measured wastewater or contributes principally to one wastewater catchment. This approximation is reasonable if we restrict our analyses to primary wastewaster treatment plants, which avoids the possibility that an individual might be sampled once in a sample taken upstream and then sampled again in a more aggregated sample taken further downstream. + If the sum of the wastewater site catchment populations meets or exceeds the reported jurisdiction population ($\sum\nolimits_{k=1}^{K_\mathrm{sites}} n_k \ge n$) the model does not use a final subpopulation without sampled wastewater. In that case, the total number of subpopulations $K_\mathrm{total} = K_\mathrm{sites}$. + When converting from predicted per capita incident hospital admissions $H(t)$ to predicted hospitalization counts, we use the jurisdiction population size $n$, even in the case where $\sum n_k > n$. This amounts to making two key additional modeling assumptions: diff --git a/tests/testthat/test_preprocess_ww_data.R b/tests/testthat/test_preprocess_ww_data.R index 00849bc8..be1592c2 100644 --- a/tests/testthat/test_preprocess_ww_data.R +++ b/tests/testthat/test_preprocess_ww_data.R @@ -20,7 +20,6 @@ test_that("Function returns site indices in order of largest site pop", { spine <- processed |> dplyr::distinct(site_pop, site_index) - expect_true(spine$site_pop[spine$site_index == 1] == max(spine$site_pop)) }) @@ -383,8 +382,8 @@ test_that("Function handles LOD values equal to concentration values", { test_that("Constant population per site", { wrong_pop <- ww_data + wrong_pop$site_pop[1] <- ww_data$site_pop[1] + 1000 - wrong_pop$site_pop <- 1e6 + seq_len(nrow(ww_data)) expect_error( preprocess_ww_data( diff --git a/vignettes/wwinference.Rmd b/vignettes/wwinference.Rmd index 1750ebad..4b73b861 100644 --- a/vignettes/wwinference.Rmd +++ b/vignettes/wwinference.Rmd @@ -73,7 +73,8 @@ by the hospital admissions data, in this case, the size of the theoretical state Additionally, we provide the `hosp_data_eval` dataset which contains the simulated hospital admissions 28 days ahead of the forecast date, which can be used to evaluate the model. -For the wastewater data, the expcted format is a table of observations with the + +For the wastewater data, the expected format is a table of observations with the following columns. The wastewater data should not contain `NA` values for days with missing observations, instead these should be excluded: - a date (column `date`): the date the sample was collected @@ -155,7 +156,6 @@ ww_data_preprocessed <- preprocess_ww_data( ``` Note that this function assumes that there are no missing values in the concentration column. The package expects observations below the LOD will -be replaced with a numeric value below the LOD. If there are NAs in your dataset when observations are below the LOD, we suggest replacing them with a value be replaced with a numeric value below the LOD. If there are NAs in your dataset when observations are below the LOD, we suggest replacing them with a value below the LOD in upstream pre-processing.