Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 184: Add outputs to generate_simulated_data() fxn and package data #220

Merged
merged 13 commits into from
Oct 31, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# wwinference 0.1.0.99 (dev)

## User-visible changes

- Add wastewater data into the forecast period to output in `generate_simulated_data()` function and as package data. Also adds subpopulation-level
hospital admissions to output of function and package data. ([#184](https://github.com/CDCgov/ww-inference-model/issues/184))
- `wwinference` now checks whether `site_pop` is fixed per site (see issue [#223](https://github.com/CDCgov/ww-inference-model/issues/226) reported by [@akeyel](https://github.com/akeyel)).

## Internal changes

- Updated the workflow for posting the pages artifact to PRs (issue [#229](https://github.com/CDCgov/ww-inference-model/issues/229)).
- Modify `plot_forecasted_counts()` so that it does not require an evaluation dataset ([#218](https://github.com/CDCgov/ww-inference-model/pull/218))

Expand Down
116 changes: 114 additions & 2 deletions R/data.R
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,47 @@
#' @source vignette_data.R
"ww_data"

#' Example evaluation wastewater dataset.
#'
#' A dataset containing the simulated retrospective wastewater concentrations
#' (labeled here as `log_genome_copies_per_ml_eval`) by sample collection date
#' (`date`), the site where the sample was collected (`site`) and the lab
#' where the samples were processed (`lab`). Additional columns that are
#' required attributes needed for the model are the limit of detection for
#' that lab on each day (labeled here as `log_lod`) and the population size of
#' the wastewater catchment area represented by the wastewater concentrations
#' in each `site`.
#'
#' This data is generated via the default values in the
#' `generate_simulated_data()` function. They represent the bare minumum
#' required fields needed to pass to the model, and we recommend that users
#' try to format their own data to match this format.
#'
#' The variables are as follows:
#'
#' @format ## ww_data_eval
#' A tibble with 126 rows and 6 columns
#' \describe{
#' \item{date}{Sample collection date, formatted in ISO8601 standards as
#' YYYY-MM-DD}
#' \item{site}{The wastewater treatment plant where the sample was collected}
#' \item{lab}{The lab where the sample was processed}
#' \item{log_genome_copies_per_ml_eval}{The natural log of the wastewater
#' concentration measured on the date specified, collected in the site
#' specified, and processed in the lab specified. The package expects
#' this quantity in units of log estimated genome copies per mL.}
#' \item{log_lod}{The log of the limit of detection in the site and lab on a
#' particular day of the quantification device (e.g. PCR). This should be in
#' units of log estimated genome copies per mL.}
#' \item{site_pop}{The population size of the wastewater catchment area
#' represented by the site variable}
#' \item{location}{ A string indicating the location that all of the
#' data is coming from. This is not a necessary column, but instead is
#' included to more realistically mirror a typical workflow}
#' }
#' @source vignette_data.R
"ww_data_eval"




Expand All @@ -57,9 +98,9 @@
#' to match this format.
#'
#' This data is generated via the default values in the
#' `generate_simulated_data()` function. They represent the bare minumum
#' `generate_simulated_data()` function. They represent the bare minimum
#' required fields needed to pass to the model, and we recommend that users
#' try to format their own data to match this formate.
#' try to format their own data to match this format.
#'
#' The variables are as follows:
#' \describe{
Expand Down Expand Up @@ -132,6 +173,77 @@
#' @source vignette_data.R
"hosp_data_eval"




#' Example subpopulation level hospital admissions dataset
#'
#' A dataset containing the simulated daily hospital admissions
#' (labeled here as `daily_hosp_admits`) by date of admission (`date`) in
#' each subpopulation.
#' Additional columns that are the population size of the
#' population contributing to the hospital admissions. In this instance,
#' the subpopulations here are each of the wastewater catchment areas plus
#' an additional subpopulation for the portion of the population not captured
#' by wastewater surveillance. The data generated are daily hospital
#' admissions but they could be any other epidemiological count dataset e.g.
#' cases. This data should only contain hospital admissions that would have
#' been available as of the date that the forecast was made.
#'
#' This data is generated via the default values in the
#' `generate_simulated_data()` function.
#'
#' The variables are as follows:
#' \describe{
#' \item{date}{Date the hospital admissions occurred, formatted in ISO8601
#' standards as YYYY-MM-DD}
#' \item{subpop_name}{A string indicating the subpopulation the hospital
#' admissiosn corresponds to. This is either a wastewater site, or the
#' remainder of the population}
#' \item{daily_hosp_admits}{The number of individuals admitted to the
#' hospital on that date, available as of the forecast date}
#' \item{subpop_pop}{The number of people contributing to the daily hospital
#' admissions in each subpopulation}
#' }
#' @source vignette_data.R
"subpop_hosp_data"


#' Example subpopulation level retrospective hospital admissions dataset
#'
#' A dataset containing the simulated daily hospital admissions
#' (labeled here as `daily_hosp_admits`) by date of admission (`date`) in
#' each subpopulation observed retrospectively.
#' Additional columns that are required are the population size of the
#' population contributing to the hospital admissions. In this instance,
#' the subpopulations here are each of the wastewater catchment areas plus
#' an additional subpopulation for the portion of the population not captured
#' by wastewater surveillance. The data generated are daily hospital
#' admissions but they could be any other epidemiological count dataset e.g.
#' cases.This data should contain hospital admissions retrospectively beyond
#' the forecast date in order to evaluate the forecasts.
#'
#' This data is generated via the default values in the
#' `generate_simulated_data()` function. They represent the bare minimumum
#' required fields needed to pass to the model, and we recommend that users
#' try to format their own data to match this format.
#'
#' The variables are as follows:
#' \describe{
#' \item{date}{Date the hospital admissions occurred, formatted in ISO8601
#' standards as YYYY-MM-DD}
#' \item{subpop_name}{A string indicating the subpopulation the hospital
#' admissions corresponds to. This is either a wastewater site, or the
#' remainder of the population}
#' \item{daily_hosp_admits_for_eval}{The number of individuals admitted to the
#' hospital on that date, available as of the forecast date}
#' \item{subpop_pop}{The number of people contributing to the daily hospital
#' admissions in each subpopulation}
#' }
#' @source vignette_data.R
"subpop_hosp_data_eval"


#' COVID-19 post-Omicron generation interval probability mass function
#'
#' \describe{
Expand Down
131 changes: 125 additions & 6 deletions R/generate_simulated_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,9 @@
#' infection feedback into the infection process, default is `FALSE`, which
#' sets the strength of the infection feedback to 0.
#' If `TRUE`, this will apply an infection feedback drawn from the prior.
#' @param subpop_phi Vector of numeric values indicating the overdispersion
#' parameter phi in the hospital admissions observation process in each
#' subpopulation
#' @param input_params_path path to the toml file with the parameters to use
#' to generate the simulated data
#'
Expand Down Expand Up @@ -121,6 +124,7 @@ generate_simulated_data <- function(r_in_weeks = # nolint
sigma_eps = 0.05,
sd_i0_over_n = 0.5,
if_feedback = FALSE,
subpop_phi = c(25, 50, 70, 40, 100),
input_params_path =
fs::path_package("extdata",
"example_params.toml",
Expand Down Expand Up @@ -322,25 +326,72 @@ generate_simulated_data <- function(r_in_weeks = # nolint
)

## Latent per capita admissions--------------------------------------------
# This won't be used other than for the unit test
model_hosp_over_n <- model$functions$convolve_dot_product(
p_hosp_days * new_i_over_n, # individuals who will be hospitalized
rev(inf_to_hosp),
uot + ot + ht
)[(uot + 1):(uot + ot + ht)]
dylanhmorris marked this conversation as resolved.
Show resolved Hide resolved

# Also compute per capita hosps for each subpopulation
model_hosp_subpop_over_n <- matrix(
nrow = n_subpops,
ncol = (ot + ht)
)
for (i in 1:n_subpops) {
model_hosp_subpop_over_n[i, ] <- model$functions$convolve_dot_product(
p_hosp_days * new_i_over_n_site[i, ],
rev(inf_to_hosp),
uot + ot + ht
)[(uot + 1):(uot + ot + ht)]
}

# unit test to make sure these are equivalent
dylanhmorris marked this conversation as resolved.
Show resolved Hide resolved
if (!all.equal(
colSums(model_hosp_subpop_over_n * pop_fraction),
model_hosp_over_n,
tolerance = 1e-8
)) {
cli::cli_abort("Sum of convolutions not equal to convolution of sums")
}


## Add weekday effect on hospital admissions-------------------------------
pred_hosp <- pop_size * model$functions$day_of_week_effect(
model_hosp_over_n,
day_of_week_vector,
hosp_wday_effect
)

pred_hosp_subpop <- matrix(
nrow = n_subpops,
ncol = (ot + ht)
)
for (i in 1:n_subpops) {
pred_hosp_subpop[i, ] <- pop_fraction[i] * pop_size *
model$functions$day_of_week_effect(
model_hosp_subpop_over_n[i, ],
day_of_week_vector,
hosp_wday_effect
)
}


## Add observation error---------------------------------------------------
# This is negative binomial but could swap out for a different obs error
pred_obs_hosp <- rnbinom(
n = length(pred_hosp), mu = pred_hosp,
size = 1 / ((params$inv_sqrt_phi_prior_mean)^2)
# Use negative binomial but could swap out for a different obs error.
# Each subpopulation has its own dispersion parameter, then we sum
# the observations to get the population total
pred_obs_hosp_subpop <- matrix(
nrow = n_subpops,
ncol = (ot + ht)
)
for (i in 1:n_subpops) {
pred_obs_hosp_subpop[i, ] <- rnbinom(
n = length(pred_hosp_subpop[i, ]), mu = pred_hosp_subpop[i, ],
size = subpop_phi[i]
)
}
pred_obs_hosp <- colSums(pred_obs_hosp_subpop)



Expand Down Expand Up @@ -381,6 +432,18 @@ generate_simulated_data <- function(r_in_weeks = # nolint
lab_site_reporting_latency = lab_site_reporting_latency
)

# Create evaluation data with same reporting freq but go through the entire
# time period
log_obs_conc_lab_site_eval <- downsample_ww_obs(
log_conc_lab_site = log_conc_lab_site,
n_lab_sites = n_lab_sites,
ot = ot + ht,
ht = 0,
nt = 0,
lab_site_reporting_freq = lab_site_reporting_freq,
lab_site_reporting_latency = rep(0, n_lab_sites)
)



# Global adjusted R(t) --------------------------------------------------
Expand All @@ -406,6 +469,18 @@ generate_simulated_data <- function(r_in_weeks = # nolint
lod_lab_site = lod_lab_site
)

ww_data_eval <- format_ww_data(
log_obs_conc_lab_site = log_obs_conc_lab_site_eval,
ot = ot + ht,
ht = 0,
date_df = date_df,
site_lab_map = site_lab_map,
lod_lab_site = lod_lab_site
) |>
dplyr::rename(
"log_genome_copies_per_ml_eval" = "log_genome_copies_per_ml"
)

# Artificially add values below the LOD----------------------------------
# Replace it with an NA, will be used as an example of how to format data
# properly.
Expand All @@ -419,16 +494,27 @@ generate_simulated_data <- function(r_in_weeks = # nolint
TRUE ~ .data$log_genome_copies_per_ml
)
)
ww_data_eval <- ww_data_eval |>
dplyr::mutate(
"log_genome_copies_per_ml_eval" =
dplyr::case_when(
.data$log_genome_copies_per_ml_eval ==
!!min_ww_val ~ 0.5 * .data$log_lod,
TRUE ~ .data$log_genome_copies_per_ml_eval
)
)


# Make a hospital admissions dataframe for model calibration
hosp_data <- format_hosp_data(pred_obs_hosp,
hosp_data <- format_hosp_data(
pred_obs_hosp = pred_obs_hosp,
dur_obs = ot,
pop_size = pop_size,
date_df = date_df
)

hosp_data_eval <- format_hosp_data(pred_obs_hosp,
hosp_data_eval <- format_hosp_data(
pred_obs_hosp = pred_obs_hosp,
dur_obs = (ot + ht),
pop_size = pop_size,
date_df = date_df
Expand All @@ -437,6 +523,36 @@ generate_simulated_data <- function(r_in_weeks = # nolint
"daily_hosp_admits_for_eval" = "daily_hosp_admits"
)

# Make a subpopulation level hospital admissions data
# For now this will only be used for evaluation, eventually, can add
# feature to use this in calibration
subpop_map <- tibble::tibble(
subpop_index = as.character(1:n_subpops),
subpop_pop = pop_size * pop_fraction,
subpop_name = c(1:n_sites, NA)
) |>
dplyr::mutate(subpop_name = ifelse(!is.na(subpop_name),
glue::glue("Site: {subpop_name}"),
"remainder of population"
))

subpop_hosp_data <- format_subpop_hosp_data(
pred_obs_hosp_subpop = pred_obs_hosp_subpop,
dur_obs = ot,
subpop_map = subpop_map,
date_df = date_df
)

subpop_hosp_data_eval <- format_subpop_hosp_data(
pred_obs_hosp_subpop = pred_obs_hosp_subpop,
dur_obs = (ot + ht),
subpop_map = subpop_map,
date_df = date_df
) |>
dplyr::rename(
"daily_hosp_admits_for_eval" = "daily_hosp_admits"
)

# Global R(t)
true_rt <- tibble::tibble(
unadj_rt_daily = as.numeric(unadj_r_daily),
Expand All @@ -453,8 +569,11 @@ generate_simulated_data <- function(r_in_weeks = # nolint

example_data <- list(
ww_data = ww_data,
ww_data_eval = ww_data_eval,
hosp_data = hosp_data,
hosp_data_eval = hosp_data_eval,
subpop_hosp_data = subpop_hosp_data,
subpop_hosp_data_eval = subpop_hosp_data_eval,
true_global_rt = true_rt
)

Expand Down
Loading
Loading