diff --git a/html_outputs/index.html b/html_outputs/index.html index aa0ddd8b..c6cb52a3 100644 --- a/html_outputs/index.html +++ b/html_outputs/index.html @@ -794,11 +794,11 @@

website and join our contact list

  • contact@appliedepi.org, tweet @appliedepi, or LinkedIn
  • Submit issues to our Github repository

We offer live R training from instructors with decades of applied epidemiology experience - www.appliedepi.org/live.
    @@ -1463,7 +1463,7 @@

    Contribution

diff --git a/html_outputs/new_pages/characters_strings.html b/html_outputs/new_pages/characters_strings.html
index a1291eb4..6ab6e3ec 100644
--- a/html_outputs/new_pages/characters_strings.html
+++ b/html_outputs/new_pages/characters_strings.html
@@ -2,12 +2,12 @@
-The Epidemiologist R Handbook - 10  Characters and strings
+10  Characters and strings – The Epidemiologist R Handbook
@@ -1886,43 +1887,43 @@

19.2.1 Cross-tabulation

The gtsummary package also allows us to quickly and easily create tables of counts. This is useful for summarising the data and putting it in context with the regression we have carried out.

#Carry out our regression
univ_tab <- linelist %>% 
  dplyr::select(explanatory_vars, outcome) %>% ## select variables of interest

  tbl_uvregression(                         ## produce univariate table
    method = glm,                           ## define the regression to run (generalised linear model)
    y = outcome,                            ## define outcome variable
    method.args = list(family = binomial),  ## define which type of glm to run (logistic)
    exponentiate = TRUE                     ## exponentiate to produce odds ratios (rather than log odds)
  )

#Create our cross tabulation
cross_tab <- linelist %>%
  dplyr::select(explanatory_vars, outcome) %>%   ## select variables of interest
  tbl_summary(by = outcome)                      ## create summary table

tbl_merge(tbls = list(cross_tab,
                      univ_tab),
          tab_spanner = c("Summary", "Univariate regression"))

@@ -2578,34 +2579,34 @@

Sometimes in your analysis, you will want to investigate whether the relationship between an outcome and its explanatory variables differs across strata, such as gender, age group, or source of infection.

To do this, you will want to split your dataset into the strata of interest. For example, creating two separate datasets, one where gender == "f" and one where gender == "m" (coded here as 0 and 1 respectively), would be done by:

f_linelist <- linelist %>%
     filter(gender == 0) %>%                   ## subset to only where gender == "f"
  dplyr::select(explanatory_vars, outcome)     ## select variables of interest
     
m_linelist <- linelist %>%
     filter(gender == 1) %>%                   ## subset to only where gender == "m"
  dplyr::select(explanatory_vars, outcome)     ## select variables of interest
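Equivalently, a minimal sketch using dplyr's group_split(), which returns a list of data frames (one per stratum), if you prefer not to filter twice:

gender_split <- linelist %>%
  dplyr::select(explanatory_vars, outcome, gender) %>%
  group_split(gender)   # assuming gender is coded 0/1 as above: element 1 is "f", element 2 is "m"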

    Once this has been done, you can carry out your regression in either base R or gtsummary.

    19.3.1 base R

To carry this out in base R, you run two separate regressions: one where gender == "f" and one where gender == "m".

#Run model for f
f_model <- glm(outcome ~ vomit, family = "binomial", data = f_linelist) %>% 
     tidy(exponentiate = TRUE, conf.int = TRUE) %>%                    # exponentiate and produce CIs
     mutate(across(where(is.numeric), \(x) round(x, digits = 2))) %>%  # round all numeric columns
     mutate(gender = "f")                                              # create a column which identifies these results as using the f dataset
 
#Run model for m
m_model <- glm(outcome ~ vomit, family = "binomial", data = m_linelist) %>% 
     tidy(exponentiate = TRUE, conf.int = TRUE) %>%                    # exponentiate and produce CIs
     mutate(across(where(is.numeric), \(x) round(x, digits = 2))) %>%  # round all numeric columns
     mutate(gender = "m")                                              # create a column which identifies these results as using the m dataset

#Combine the results
rbind(f_model,
      m_model)
    # A tibble: 4 × 8
       term        estimate std.error statistic p.value conf.low conf.high gender
    @@ -2621,54 +2622,54 @@ 

    19.3.2 gtsummary

The same approach can be repeated using gtsummary; however, gtsummary makes it easier to produce publication-ready tables, and the two tables can be compared with the function tbl_merge().

#Run model for f
f_model_gt <- f_linelist %>% 
     dplyr::select(vomit, outcome) %>%            ## select variables of interest
     tbl_uvregression(                            ## produce univariate table
          method = glm,                           ## define the regression to run (generalised linear model)
          y = outcome,                            ## define outcome variable
          method.args = list(family = binomial),  ## define which type of glm to run (logistic)
          exponentiate = TRUE                     ## exponentiate to produce odds ratios (rather than log odds)
     )

#Run model for m
m_model_gt <- m_linelist %>% 
     dplyr::select(vomit, outcome) %>%            ## select variables of interest
     tbl_uvregression(                            ## produce univariate table
          method = glm,                           ## define the regression to run (generalised linear model)
          y = outcome,                            ## define outcome variable
          method.args = list(family = binomial),  ## define which type of glm to run (logistic)
          exponentiate = TRUE                     ## exponentiate to produce odds ratios (rather than log odds)
     )

#Combine gtsummary tables
f_and_m_table <- tbl_merge(
     tbls = list(f_model_gt,
                 m_model_gt),
     tab_spanner = c("Female",
                     "Male")
)

#Print
f_and_m_table

@@ -3190,9 +3191,9 @@

    Conduct m

    Here we use glm() but add more variables to the right side of the equation, separated by plus symbols (+).

    To run the model with all of our explanatory variables we would run:

mv_reg <- glm(outcome ~ gender + fever + chills + cough + aches + vomit + age_cat, family = "binomial", data = linelist)

summary(mv_reg)
    
     Call:
    @@ -3227,26 +3228,26 @@ 

    Conduct m

    If you want to include two variables and an interaction between them you can separate them with an asterisk * instead of a +. Separate them with a colon : if you are only specifying the interaction. For example:

glm(outcome ~ gender + age_cat * fever, family = "binomial", data = linelist)
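As a minimal sketch of the colon syntax, this call is equivalent to the asterisk version above, since age_cat * fever expands to the two main effects plus their interaction:

glm(outcome ~ gender + age_cat + fever + age_cat:fever, family = "binomial", data = linelist)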

    Optionally, you can use this code to leverage the pre-defined vector of column names and re-create the above command using str_c(). This might be useful if your explanatory variable names are changing, or you don’t want to type them all out again.

## run a regression with all variables of interest 
mv_reg <- explanatory_vars %>%  ## begin with vector of explanatory column names
  str_c(collapse = "+") %>%     ## combine all names of the variables of interest separated by a plus
  str_c("outcome ~ ", .) %>%    ## combine the names of variables of interest with outcome in formula style
  glm(family = "binomial",      ## define type of glm as logistic
      data = linelist)          ## define your dataset
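A base R alternative, sketched here, builds the same formula object with reformulate() instead of pasting strings:

## reformulate() builds "outcome ~ var1 + var2 + ..." from a character vector
mv_reg <- glm(reformulate(explanatory_vars, response = "outcome"),
              family = "binomial", data = linelist)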

    Building the model

    You can build your model step-by-step, saving various models that include certain explanatory variables. You can compare these models with likelihood-ratio tests using lrtest() from the package lmtest, as below:

NOTE: Using base anova(model1, model2, test = "Chisq") produces the same results (see the sketch below).

model1 <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
model2 <- glm(outcome ~ age_cat + gender, family = "binomial", data = linelist)

lmtest::lrtest(model1, model2)
    Likelihood ratio test
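The equivalent base R call, using model1 and model2 from above:

anova(model1, model2, test = "Chisq")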
     
    @@ -3259,26 +3260,26 @@ 

    Building the

Another option is to take the model object and apply the step() function from the stats package. Specify which variable selection direction you want to use when building the model.

## choose a model using forward selection based on AIC
## you can also do "backward" or "both" by adjusting the direction
final_mv_reg <- mv_reg %>%
  step(direction = "forward", trace = FALSE)
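Note that forward selection starting from the full model has nothing left to add; to actually build up from a smaller model you also supply a scope. A minimal sketch of that pattern (null_model and forward_reg are illustrative names):

null_model <- glm(outcome ~ 1, family = "binomial", data = linelist)   # intercept-only starting point
forward_reg <- step(null_model,
                    scope = list(lower = formula(null_model),
                                 upper = formula(mv_reg)),             # search up to the full model
                    direction = "forward", trace = FALSE)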

    You can also turn off scientific notation in your R session, for clarity:

options(scipen=999)

    As described in the section on univariate analysis, pass the model output to tidy() to exponentiate the log odds and CIs. Finally we round all numeric columns to two decimal places. Scroll through to see all the rows.

mv_tab_base <- final_mv_reg %>% 
  broom::tidy(exponentiate = TRUE, conf.int = TRUE) %>%          ## get a tidy dataframe of estimates 
  mutate(across(where(is.numeric), \(x) round(x, digits = 2)))   ## round all numeric columns

    Here is what the resulting data frame looks like:

[interactive table not shown]

@@ -3290,30 +3291,30 @@

    Combine with gtsummary

    The gtsummary package provides the tbl_regression() function, which will take the outputs from a regression (glm() in this case) and produce a nice summary table.

## show results table of final regression 
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)

    Let’s see the table:

mv_tab

[interactive table not shown]

@@ -3877,28 +3878,28 @@

    Combine

    You can also combine several different output tables produced by gtsummary with the tbl_merge() function. We now combine the multivariable results with the gtsummary univariate results that we created above:

## combine with univariate results 
tbl_merge(
  tbls = list(univ_tab, mv_tab),                          # combine
  tab_spanner = c("**Univariate**", "**Multivariable**")) # set header names

@@ -4568,23 +4569,23 @@

    Combine with
  • Use round() with two decimal places on all the columns that are class Double.
## combine univariate and multivariable tables 
left_join(univ_tab_base, mv_tab_base, by = "term") %>% 
  ## choose columns and rename them
  select( # new name =  old name
    "characteristic" = term, 
    "recovered"      = "0", 
    "dead"           = "1", 
    "univ_or"        = estimate.x, 
    "univ_ci_low"    = conf.low.x, 
    "univ_ci_high"   = conf.high.x,
    "univ_pval"      = p.value.x, 
    "mv_or"          = estimate.y, 
    "mv_ci_low"      = conf.low.y, 
    "mv_ci_high"     = conf.high.y,
    "mv_pval"        = p.value.y 
  ) %>% 
  mutate(across(where(is.double), \(x) round(x, 2)))
    # A tibble: 20 × 11
        characteristic recovered  dead univ_or univ_ci_low univ_ci_high univ_pval
    @@ -4634,30 +4635,30 @@ 

    ggplot2

    Before plotting, you may want to use fct_relevel() from the forcats package to set the order of the variables/levels on the y-axis. ggplot() may display them in alpha-numeric order which would not work well for these age category values (“30” would appear before “5”). See the page on Factors for more details.

## remove the intercept term from your multivariable results
mv_tab_base %>% 
  
  # set order of levels to appear along y-axis
  mutate(term = fct_relevel(
    term,
    "vomit", "gender", "fever", "cough", "chills", "aches",
    "age_cat5-9", "age_cat10-14", "age_cat15-19", "age_cat20-29",
    "age_cat30-49", "age_cat50-69", "age_cat70+")) %>%
  
  # remove "intercept" row from plot
  filter(term != "(Intercept)") %>% 
  
  ## plot with variable on the y axis and estimate (OR) on the x axis
  ggplot(aes(x = estimate, y = term)) +
  
  ## show the estimate as a point
  geom_point() + 
  
  ## add in an error bar for the confidence intervals
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) + 
  
  ## show where OR = 1 is for reference as a dashed line
  geom_vline(xintercept = 1, linetype = "dashed")
    @@ -4673,12 +4674,12 @@

    easy

An alternative, if you do not want the fine level of control that ggplot2 provides, is to use a combination of easystats packages.

    The function model_parameters() from the parameters package does the equivalent of the broom package function tidy(). The see package then accepts those outputs and creates a default forest plot as a ggplot() object.

pacman::p_load(easystats)

## remove the intercept term from your multivariable results
final_mv_reg %>% 
  model_parameters(exponentiate = TRUE) %>% 
  plot()
    @@ -4695,22 +4696,22 @@

While there are many different functions, and many different packages, to assess model fit, one package that nicely combines several different metrics and approaches into a single source is the performance package. This package allows you to assess model assumptions (such as linearity and homogeneity, and to highlight outliers) and check how well the model performs (Akaike Information Criterion values, R2, RMSE, etc.) with a few simple functions.

Unfortunately, we are unable to use this package with gtsummary, but it readily accepts objects generated by other packages such as stats, lmerMod and tidymodels. Here we will demonstrate its application using the function glm() for a multivariable regression. To do this we can use the function performance() to assess model fit, and compare_performance() to compare the two models.

#Load in packages
pacman::p_load(performance)

#Set up regression models
regression_one <- linelist %>%
     select(outcome, gender, fever, chills, cough) %>%
     glm(formula = outcome ~ .,
         family = binomial)

regression_two <- linelist %>%
     select(outcome, days_onset_hosp, aches, vomit, age_years) %>%
     glm(formula = outcome ~ .,
         family = binomial)

#Assess model fit
performance(regression_one)
    # Indices of model performance
     
    @@ -4718,7 +4719,7 @@ 

performance(regression_two)
    # Indices of model performance
     
    @@ -4726,9 +4727,9 @@ 

#Compare model fit
compare_performance(regression_one,
                    regression_two)
    When comparing models, please note that probably not all models were fit
       from same data.
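A sketch of the assumption checks mentioned above, using the package's check_model() function on one of the models:

#Visual diagnostics of model assumptions (linearity, influential observations, etc.)
performance::check_model(regression_one)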
    @@ -5347,7 +5348,7 @@

diff --git a/html_outputs/new_pages/standardization.html b/html_outputs/new_pages/standardization.html
index 99324f7f..7e9abfff 100644
--- a/html_outputs/new_pages/standardization.html
+++ b/html_outputs/new_pages/standardization.html
@@ -669,6 +669,12 @@ 43  Dashboards with Shiny

@@ -845,8 +851,8 @@

    Load popul

    @@ -855,8 +861,8 @@

    Load popul


    @@ -866,15 +872,15 @@

    Load death co

    Deaths in Country A

[interactive table not shown]

    Deaths in Country B

[interactive table not shown]

    @@ -905,8 +911,8 @@

    Cl

    The combined population data now look like this (click through to see countries A and B):

[interactive table not shown]

    And now we perform similar operations on the two deaths datasets.

    @@ -922,8 +928,8 @@

    Cl

    The deaths data now look like this, and contain data from both countries:

[interactive table not shown]

    We now join the deaths and population data based on common columns Country, age_cat5, and Sex. This adds the column Deaths.

    @@ -951,8 +957,8 @@

    Cl


CAUTION: If you have few deaths per stratum, consider using 10- or 15-year age categories instead of 5-year categories.

    @@ -966,8 +972,8 @@

    Load

    @@ -999,8 +1005,8 @@

    Create dataset wit

    This complete dataset looks like this:

[interactive table not shown]
    @@ -1008,7 +1014,7 @@

    Create dataset wit

    21.3 PHEindicatormethods package

One way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates. We will show both.

    This section will use the all_data data frame created at the end of the Preparation section. This data frame includes the country populations, death events, and the world standard reference population. You can view it here.
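As a preview of the direct method, a minimal sketch with phe_dsr(), assuming all_data contains the stratum columns Deaths, Population, and pop (the world standard reference population):

all_data %>% 
     group_by(Country) %>% 
     PHEindicatormethods::phe_dsr(
          x = Deaths,            # column of observed deaths per stratum
          n = Population,        # column of stratum populations
          stdpop = pop,          # column holding the standard population
          stdpoptype = "field")  # stdpop is a field in the data, not a separate vector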

diff --git a/html_outputs/new_pages/stat_tests.html b/html_outputs/new_pages/stat_tests.html
index 2471d86a..656d638c 100644
--- a/html_outputs/new_pages/stat_tests.html
+++ b/html_outputs/new_pages/stat_tests.html
@@ -680,6 +680,12 @@ 43  Dashboards with Shiny

@@ -804,8 +810,7 @@

    18  Load packages

  janitor,    # adding totals and percents to tables
  flextable   # converting tables to HTML
)

    Import data

    We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

    The first 50 rows of the linelist are displayed below.

[interactive table not shown]
    @@ -919,8 +874,8 @@

    T-tests

    A t-test, also called “Student’s t-Test”, is typically used to determine if there is a significant difference between the means of some numeric variable between two groups. Here we’ll show the syntax to do this test depending on whether the columns are in the same data frame.

Syntax 1: This is the syntax when your numeric and categorical columns are in the same data frame. Provide the numeric column on the left side of the equation and the categorical column on the right side. Specify the dataset to data =. Optionally, set paired = TRUE, conf.level = (0.95 default), and alternative = (either “two.sided”, “less”, or “greater”). Enter ?t.test for more details.

## compare mean age by gender group with a t-test
t.test(age_years ~ gender, data = linelist)
    
         Welch Two Sample t-test
    @@ -937,26 +892,26 @@ 

    T-tests

    Syntax 2: You can compare two separate numeric vectors using this alternative syntax. For example, if the two columns are in different data sets.

t.test(df1$age_years, df2$age_years)

    You can also use a t-test to determine whether a sample mean is significantly different from some specific value. Here we conduct a one-sample t-test with the known/hypothesized population mean as mu =:

t.test(linelist$age_years, mu = 45)

    Shapiro-Wilk test

The Shapiro-Wilk test can be used to determine whether a sample came from a normally-distributed population (an assumption of many other tests and analyses, such as the t-test). However, it can only be used on a sample of between 3 and 5000 observations. For larger samples a quantile-quantile plot may be helpful (see the sketch below).

shapiro.test(linelist$age_years)
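For larger samples, the quantile-quantile plot mentioned above is a quick visual check (a base R sketch):

qqnorm(linelist$age_years)   # sample quantiles against theoretical normal quantiles
qqline(linelist$age_years)   # reference line; points far off the line suggest non-normality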

    Wilcoxon rank sum test

    The Wilcoxon rank sum test, also called the Mann–Whitney U test, is often used to help determine if two numeric samples are from the same distribution when their populations are not normally distributed or have unequal variance.

## compare age distribution by outcome group with a wilcox test
wilcox.test(age_years ~ outcome, data = linelist)
    
         Wilcoxon rank sum test with continuity correction
    @@ -971,8 +926,8 @@ 

    Wilcoxon

    Kruskal-Wallis test

    The Kruskal-Wallis test is an extension of the Wilcoxon rank sum test that can be used to test for differences in the distribution of more than two samples. When only two samples are used it gives identical results to the Wilcoxon rank sum test.

## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age_years ~ outcome, linelist)
    
         Kruskal-Wallis rank sum test
    @@ -986,8 +941,8 @@ 

    Kruskal-Wal

    Chi-squared test

Pearson’s Chi-squared test is used in testing for significant differences between categorical groups.

## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
    
         Pearson's Chi-squared test with Yates' continuity correction
    @@ -1006,8 +961,8 @@ 

    Summary stat

    The function get_summary_stats() is a quick way to return summary statistics. Simply pipe your dataset to this function and provide the columns to analyse. If no columns are specified, the statistics are calculated for all columns.

    By default, a full range of summary statistics are returned: n, max, min, median, 25%ile, 75%ile, IQR, median absolute deviation (mad), mean, standard deviation, standard error, and a confidence interval of the mean.

linelist %>%
  rstatix::get_summary_stats(age, temp)
    # A tibble: 2 × 13
       variable     n   min   max median    q1    q3   iqr    mad  mean     sd    se
    @@ -1020,9 +975,9 @@ 

    Summary stat

    You can specify a subset of summary statistics to return by providing one of the following values to type =: “full”, “common”, “robust”, “five_number”, “mean_sd”, “mean_se”, “mean_ci”, “median_iqr”, “median_mad”, “quantile”, “mean”, “median”, “min”, “max”.
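For example, a sketch requesting only the median and IQR:

linelist %>%
  rstatix::get_summary_stats(age, temp, type = "median_iqr")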

    It can be used with grouped data as well, such that a row is returned for each grouping-variable:

linelist %>%
  group_by(hospital) %>%
  rstatix::get_summary_stats(age, temp, type = "common")
    # A tibble: 12 × 11
        hospital     variable     n   min   max median   iqr  mean     sd    se    ci
    @@ -1047,8 +1002,8 @@ 

    Summary stat

    T-test

    Use a formula syntax to specify the numeric and categorical columns:

linelist %>% 
  t_test(age_years ~ gender)
    # A tibble: 1 × 10
       .y.   group1 group2    n1    n2 statistic    df        p    p.adj p.adj.signif
    @@ -1058,8 +1013,8 @@ 

    T-test

    Or use ~ 1 and specify mu = for a one-sample T-test. This can also be done by group.

linelist %>% 
  t_test(age_years ~ 1, mu = 30)
    # A tibble: 1 × 7
       .y.       group1 group2         n statistic    df     p
    @@ -1069,9 +1024,9 @@ 

    T-test

    If applicable, the statistical tests can be done by group, as shown below:

linelist %>% 
  group_by(gender) %>% 
  t_test(age_years ~ 1, mu = 18)
    # A tibble: 3 × 8
       gender .y.       group1 group2         n statistic    df         p
    @@ -1086,9 +1041,9 @@ 

    T-test

    Shapiro-Wilk test

    As stated above, sample size must be between 3 and 5000.

linelist %>% 
  head(500) %>%            # first 500 rows of case linelist, for example only
  shapiro_test(age_years)
    # A tibble: 1 × 3
       variable  statistic        p
    @@ -1100,8 +1055,8 @@ 

    Shapiro-Wil

    Wilcoxon rank sum test

linelist %>% 
  wilcox_test(age_years ~ gender)
    # A tibble: 1 × 9
       .y.       group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
    @@ -1114,8 +1069,8 @@ 

    Wilcox

    Kruskal-Wallis test

An extension of the Wilcoxon rank sum (Mann-Whitney U) test to more than two groups.

linelist %>% 
  kruskal_test(age_years ~ outcome)
    # A tibble: 1 × 6
       .y.           n statistic    df     p method        
    @@ -1128,10 +1083,10 @@ 

    Kruskal-W

    Chi-squared test

    The chi-square test function accepts a table, so first we create a cross-tabulation. There are many ways to create a cross-tabulation (see Descriptive tables) but here we use tabyl() from janitor and remove the left-most column of value labels before passing to chisq_test().

linelist %>% 
  tabyl(gender, outcome) %>% 
  select(-1) %>% 
  chisq_test()
    # A tibble: 1 × 6
           n statistic     p    df method          p.signif
    @@ -1150,34 +1105,31 @@ 

    Chi-squared test

Compare the proportions of a categorical variable in two groups. The default statistical test for add_p() when applied to a categorical variable is to perform a chi-squared test of independence with continuity correction, but if any expected cell count is below 5 then a Fisher’s exact test is used.

linelist %>% 
  select(gender, outcome) %>%    # keep variables of interest
  tbl_summary(by = outcome) %>%  # produce summary table and specify grouping variable
  add_p()                        # specify what test to perform

1323 missing rows in the "outcome" column have been removed.
@@ -1633,7 +1585,8 @@

[gt table: gender by outcome, N = 1,983; p-value 0.9; footnotes: 1 n (%); 2 Pearson’s Chi-squared test]

@@ -1684,36 +1640,33 @@

    Chi-squared

    T-tests

    Compare the difference in means for a continuous variable in two groups. For example, compare the mean age by patient outcome.

linelist %>% 
  select(age_years, outcome) %>%             # keep variables of interest
  tbl_summary(                               # produce summary table
    statistic = age_years ~ "{mean} ({sd})", # specify what statistics to show
    by = outcome) %>%                        # specify the grouping variable
  add_p(age_years ~ "t.test")                # specify what tests to perform

1323 missing rows in the "outcome" column have been removed.
@@ -2169,7 +2122,8 @@

    T-tests

[gt table: mean age by outcome, N = 1,983; age_years 16 (12) vs 16 (13), p-value 0.6; footnotes: 1 Mean (SD); 2 Welch Two Sample t-test]

@@ -2204,36 +2161,33 @@

    T-tests

    Wilcoxon rank sum test

Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However for non-normally distributed data or comparisons of multiple groups, the Kruskal-Wallis test is more appropriate.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (this is default so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "wilcox.test")                     # specify what test to perform (default so could leave brackets empty)

1323 missing rows in the "outcome" column have been removed.
@@ -2689,7 +2643,8 @@

[gt table: median age by outcome, N = 1,983; age_years 13 (6, 23) vs 13 (6, 23), p-value 0.8; footnotes: 1 Median (Q1, Q3); 2 Wilcoxon rank sum test]

@@ -2724,36 +2682,33 @@

    Wilcox

Kruskal-Wallis test

    Compare the distribution of a continuous variable in two or more groups, regardless of whether the data is normally distributed.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (default, so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "kruskal.test")                    # specify what test to perform

1323 missing rows in the "outcome" column have been removed.
@@ -3209,7 +3164,8 @@

[gt table: median age by outcome, N = 1,983; age_years 13 (6, 23) vs 13 (6, 23), p-value 0.8; footnotes: 1 Median (Q1, Q3); 2 Kruskal-Wallis rank sum test]

@@ -3371,11 +3330,11 @@

Correlation between numeric variables can be investigated using the tidyverse corrr package. It allows you to compute correlations using Pearson, Kendall tau or Spearman rho. The package creates a table and also has a function to automatically plot the values.

correlation_tab <- linelist %>% 
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>%   # keep numeric variables of interest
  correlate()      # create correlation table (using default pearson)

correlation_tab    # print
    # A tibble: 6 × 7
       term            generation      age ct_blood days_onset_hosp    wt_kg    ht_cm
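To use a rank-based method instead of the default Pearson, pass method = (a sketch):

linelist %>% 
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>% 
  correlate(method = "spearman")   # Spearman rho; "kendall" is also accepted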
    @@ -3387,12 +3346,12 @@ 

## remove duplicate entries (the table above is mirrored) 
correlation_tab <- correlation_tab %>% 
  shave()

## view correlation table 
correlation_tab
    # A tibble: 6 × 7
       term            generation       age ct_blood days_onset_hosp  wt_kg ht_cm
    @@ -3404,8 +3363,8 @@ 

## plot correlations 
rplot(correlation_tab)
    @@ -4018,7 +3977,7 @@

diff --git a/html_outputs/new_pages/survey_analysis.html b/html_outputs/new_pages/survey_analysis.html
index 8d9cf575..086d9f0c 100644
--- a/html_outputs/new_pages/survey_analysis.html
+++ b/html_outputs/new_pages/survey_analysis.html
@@ -672,6 +672,12 @@ 43  Dashboards with Shiny

@@ -846,9 +852,9 @@

    Packages

    Load data

    The example dataset used in this section:

  • Fictional mortality survey data.
  • Fictional population counts for the survey area.
  • Data dictionary for the fictional mortality survey data.

This is based on the MSF OCA ethical review board pre-approved survey. The fictional dataset was produced as part of the “R4Epis” project. It is all based on data collected using KoboToolbox, a data collection software built on Open Data Kit.

    Kobo allows you to export both the collected data, as well as the data dictionary for that dataset. We strongly recommend doing this as it simplifies data cleaning and is useful for looking up variables/questions.

    @@ -865,8 +871,8 @@

    Load data

    The first 10 rows of the survey are displayed below.

[interactive table not shown]

We also want to import the data on the sampling population so that we can produce appropriate weights. This data can be in different formats, however we would suggest having it as seen below (this can just be typed into an Excel sheet).

    @@ -877,8 +883,8 @@

    Load data

    The first 10 rows of the survey are displayed below.

[interactive table not shown]

For cluster surveys you may want to add survey weights at the cluster level. You could read this data in as above. Alternatively, if there are only a few counts, these could be entered into a tibble as sketched below. In any case you will need to have one column with a cluster identifier which matches your survey data, and another column with the number of households in each cluster.
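A minimal sketch of such a tibble (village names and counts here are illustrative; the cluster identifier must match your survey data):

cluster_counts <- tibble(cluster    = c("village_1", "village_2", "village_3", "village_4"),
                         households = c(700, 400, 900, 500))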

    @@ -942,19 +948,6 @@

    Clean data

mutate(across(all_of(YNVARS), \(x) str_detect(x, pattern = "yes")))

    @@ -969,79 +962,79 @@

For mortality surveys we want to know how long each individual was present in the location, to be able to calculate an appropriate mortality rate for our period of interest. This is not relevant to all surveys, but it is particularly important for mortality surveys, as these are frequently conducted among mobile or displaced populations.

    To do this we first define our time period of interest, also known as a recall period (i.e. the time that participants are asked to report on when answering questions). We can then use this period to set inappropriate dates to missing, i.e. if deaths are reported from outside the period of interest.

## set the start/end of recall period
## can be changed to date variables from dataset 
## (e.g. arrival date & date questionnaire)
survey_data <- survey_data %>% 
  mutate(recall_start = as.Date("2018-01-01"), 
         recall_end   = as.Date("2018-05-01")
  )


# set inappropriate dates to NA based on rules 
## e.g. arrivals before start, departures after end
survey_data <- survey_data %>%
      mutate(
           arrived_date = if_else(arrived_date < recall_start, 
                                 as.Date(NA),
                                  arrived_date),
           birthday_date = if_else(birthday_date < recall_start,
                                  as.Date(NA),
                                  birthday_date),
           left_date = if_else(left_date > recall_end,
                              as.Date(NA),
                               left_date),
           death_date = if_else(death_date > recall_end,
                               as.Date(NA),
                               death_date)
           )

We can then use our date variables to define start and end dates for each individual. We can use the find_start_date() function from sitrep to find the causes for the dates, and then use those to calculate the difference between days (person-time).

    start date: Earliest appropriate arrival event within your recall period. Either the beginning of your recall period (which you define in advance), or a date after the start of recall if applicable (e.g. arrivals or births).

    end date: Earliest appropriate departure event within your recall period. Either the end of your recall period, or a date before the end of recall if applicable (e.g. departures, deaths).

## create new variables for start and end dates/causes
survey_data <- survey_data %>% 
     ## choose earliest date entered in survey
     ## from births, household arrivals, and camp arrivals 
     find_start_date("birthday_date",
                  "arrived_date",
                  period_start = "recall_start",
                  period_end   = "recall_end",
                  datecol      = "startdate",
                  datereason   = "startcause" 
                 ) %>%
     ## choose earliest date entered in survey
     ## from camp departures, death and end of the study
     find_end_date("left_date",
                "death_date",
                period_start = "recall_start",
                period_end   = "recall_end",
                datecol      = "enddate",
                datereason   = "endcause" 
               )


## label those that were present at the start/end (except births/deaths)
survey_data <- survey_data %>% 
     mutate(
       ## fill in start date to be the beginning of recall period (for those empty) 
       startdate = if_else(is.na(startdate), recall_start, startdate), 
       ## set the start cause to present at start if equal to recall period 
       ## unless it is equal to the birth date 
       startcause = if_else(startdate == recall_start & startcause != "birthday_date",
                              "Present at start", startcause), 
       ## fill in end date to be end of recall period (for those empty) 
       enddate = if_else(is.na(enddate), recall_end, enddate), 
       ## set the end cause to present at end if equal to recall end 
       ## unless it is equal to the death date
       endcause = if_else(enddate == recall_end & endcause != "death_date", 
                            "Present at end", endcause))


## Define observation time in days
survey_data <- survey_data %>% 
  mutate(obstime = as.numeric(enddate - startdate))

    @@ -1051,52 +1044,52 @@

DANGER: You can't have missing values in your weight variable, or in any of the variables relevant to your survey design (e.g. age, sex, strata or cluster variables).

## store the cases that you drop so you can describe them (e.g. non-consenting 
## or wrong village/cluster)
dropped <- survey_data %>% 
  filter(!consent | is.na(startdate) | is.na(enddate) | village_name == "other")

## use the dropped cases to remove the unused rows from the survey data set  
survey_data <- anti_join(survey_data, dropped, by = names(dropped))

As mentioned above, we demonstrate how to add weights for three different study designs (stratified, cluster, and stratified cluster). These require information on the source population and/or the clusters surveyed. We use the stratified cluster code in this example, but use whichever is most appropriate for your study design.

# stratified ------------------------------------------------------------------
# create a variable called "surv_weight_strata"
# contains weights for each individual - by age group, sex and health district
survey_data <- add_weights_strata(x = survey_data,
                                         p = population,
                                         surv_weight = "surv_weight_strata",
                                         surv_weight_ID = "surv_weight_ID_strata",
                                         age_group, sex, health_district)

## cluster ---------------------------------------------------------------------

# get the number of individuals interviewed per household
# adds a variable with counts of the household (parent) index variable
survey_data <- survey_data %>%
  add_count(index, name = "interviewed")


## create cluster weights
survey_data <- add_weights_cluster(x = survey_data,
                                          cl = cluster_counts,
                                          eligible = member_number,
                                          interviewed = interviewed,
                                          cluster_x = village_name,
                                          cluster_cl = cluster,
                                          household_x = index,
                                          household_cl = households,
                                          surv_weight = "surv_weight_cluster",
                                          surv_weight_ID = "surv_weight_ID_cluster",
                                          ignore_cluster = FALSE,
                                          ignore_household = FALSE)


# stratified and cluster ------------------------------------------------------
# create a survey weight for cluster and strata
survey_data <- survey_data %>%
  mutate(surv_weight_cluster_strata = surv_weight_strata * surv_weight_cluster)

    The survey package effectively uses base R coding, and so it is not possible to use pipes (%>%) or other dplyr syntax. With the survey package we use the svydesign() function to define a survey object with appropriate clusters, weights and strata.

NOTE: We need to use the tilde (~) in front of variables because the package uses the base R syntax of assigning variables based on formulae.

# simple random ---------------------------------------------------------------
base_survey_design_simple <- svydesign(ids = ~1, # 1 for no cluster ids
                   weights = NULL,               # no weight added
                   strata = NULL,                # sampling was simple (no strata)
                   data = survey_data            # have to specify the dataset
                  )

## stratified ------------------------------------------------------------------
base_survey_design_strata <- svydesign(ids = ~1,  # 1 for no cluster ids
                   weights = ~surv_weight_strata, # weight variable created above
                   strata = ~health_district,     # sampling was stratified by district
                   data = survey_data             # have to specify the dataset
                  )

# cluster ---------------------------------------------------------------------
base_survey_design_cluster <- svydesign(ids = ~village_name, # cluster ids
                   weights = ~surv_weight_cluster, # weight variable created above
                   strata = NULL,                  # sampling was simple (no strata)
                   data = survey_data              # have to specify the dataset
                  )

# stratified cluster ----------------------------------------------------------
base_survey_design <- svydesign(ids = ~village_name,      # cluster ids
                   weights = ~surv_weight_cluster_strata, # weight variable created above
                   strata = ~health_district,             # sampling was stratified by district
                   data = survey_data                     # have to specify the dataset
                  )

    26.6.2 Srvyr package

    With the srvyr package we can use the as_survey_design() function, which has all the same arguments as above but allows pipes (%>%), and so we do not need to use the tilde (~).

## simple random ---------------------------------------------------------------
survey_design_simple <- survey_data %>% 
  as_survey_design(ids = 1,        # 1 for no cluster ids 
                   weights = NULL, # no weight added
                   strata = NULL   # sampling was simple (no strata)
                  )

## stratified ------------------------------------------------------------------
survey_design_strata <- survey_data %>%
  as_survey_design(ids = 1,                      # 1 for no cluster ids
                   weights = surv_weight_strata, # weight variable created above
                   strata = health_district      # sampling was stratified by district
                  )

## cluster ---------------------------------------------------------------------
survey_design_cluster <- survey_data %>%
  as_survey_design(ids = village_name,            # cluster ids
                   weights = surv_weight_cluster, # weight variable created above
                   strata = NULL                  # sampling was simple (no strata)
                  )

## stratified cluster ----------------------------------------------------------
survey_design <- survey_data %>%
  as_survey_design(ids = village_name,                   # cluster ids
                   weights = surv_weight_cluster_strata, # weight variable created above
                   strata = health_district              # sampling was stratified by district
                  )

    Compare the proportions in each age group between your sample and the source population. This is important to be able to highlight potential sampling bias. You could similarly repeat this looking at distributions by sex.

    Note that these p-values are just indicative, and a descriptive discussion (or visualisation with age-pyramids below) of the distributions in your study sample compared to the source population is more important than the binomial test itself. This is because increasing sample size will more often than not lead to differences that may be irrelevant after weighting your data.

## counts and props of the study population
ag <- survey_data %>% 
  group_by(age_group) %>% 
  drop_na(age_group) %>% 
  tally() %>% 
  mutate(proportion = n / sum(n), 
         n_total = sum(n))

## counts and props of the source population
propcount <- population %>% 
  group_by(age_group) %>%
    tally(population) %>%
    mutate(proportion = n / sum(n))

## bind together the columns of two tables, group by age, and perform a 
## binomial test to see if n/total is significantly different from population
## proportion.
  ## suffix here adds text to the end of columns in each of the two datasets
left_join(ag, propcount, by = "age_group", suffix = c("", "_pop")) %>%
  group_by(age_group) %>%
  ## broom::tidy(binom.test()) makes a data frame out of the binomial test and
  ## will add the variables p.value, parameter, conf.low, conf.high, method, and
  ## alternative. We will only use p.value here. You can include other
  ## columns if you want to report confidence intervals
  mutate(binom = list(broom::tidy(binom.test(n, n_total, proportion_pop)))) %>%
  unnest(cols = c(binom)) %>% # important for expanding the binom.test data frame
  mutate(proportion_pop = proportion_pop * 100) %>%
  ## adjusting the p-values to correct for false positives 
  ## (because testing multiple age groups). This will only make 
  ## a difference if you have many age categories
  mutate(p.value = p.adjust(p.value, method = "holm")) %>%
                      
  ## only show p-values over 0.001 (those under report as <0.001)
  mutate(p.value = ifelse(p.value < 0.001, 
                          "<0.001", 
                          as.character(round(p.value, 3)))) %>% 
  
  ## rename the columns appropriately
  select(
    "Age group" = age_group,
    "Study population (n)" = n,
    "Study population (%)" = proportion,
    "Source population (n)" = n_pop,
    "Source population (%)" = proportion_pop,
    "P-value" = p.value
  )
    # A tibble: 5 × 6
     # Groups:   Age group [5]

As with the formal binomial test of difference seen above in the sampling bias section, we are interested here in visualising whether our sampled population is substantially different from the source population, and whether weighting corrects this difference. To do this we will use the patchwork package to show our ggplot visualisations side-by-side; for details see the section on combining plots in the ggplot tips chapter of the handbook. We will visualise our source population, our un-weighted survey population and our weighted survey population. You may also consider visualising by each stratum of your survey - in our example here that would be by using the argument stack_by = "health_district" (see ?plot_age_pyramid for details).

    NOTE: The x and y axes are flipped in pyramids

## define x-axis limits and labels ---------------------------------------------
## (update these numbers to be the values for your graph)
max_prop <- 35      # choose the highest proportion you want to show 
step <- 5           # choose the space you want between labels 

## this part defines vector using the above numbers with axis breaks
breaks <- c(
    seq(max_prop/100 * -1, 0 - step/100, step/100), 
    0, 
    seq(0 + step / 100, max_prop/100, step/100)
    )

## this part defines vector using the above numbers with axis limits
limits <- c(max_prop/100 * -1, max_prop/100)

## this part defines vector using the above numbers with axis labels
labels <-  c(
      seq(max_prop, step, -step), 
      0, 
      seq(step, max_prop, step)
    )


## create plots individually  --------------------------------------------------

## plot the source population 
## nb: this needs to be collapsed for the overall population (i.e. removing health districts)
source_population <- population %>%
  ## ensure that age and sex are factors
  mutate(age_group = factor(age_group, 
                            levels = c("0-2", 
                                       "3-14", 
                                       "15-29",
                                       "30-44", 
                                       "45+")), 
         sex = factor(sex)) %>% 
  group_by(age_group, sex) %>% 
  ## add the counts for each health district together 
  summarise(population = sum(population)) %>% 
  ## remove the grouping so can calculate overall proportion
  ungroup() %>% 
  mutate(proportion = population / sum(population)) %>% 
  ## plot pyramid 
  age_pyramid(
            age_group = age_group, 
            split_by = sex, 
            count = proportion, 
            proportional = TRUE) +
  ## only show the y axis label (otherwise repeated in all three plots)
  labs(title = "Source population", 
       y = "", 
       x = "Age group (years)") + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)
  
  
## plot the unweighted sample population 
sample_population <- age_pyramid(survey_data, 
                 age_group = "age_group", 
                 split_by = "sex",
                 proportion = TRUE) + 
  ## only show the x axis label (otherwise repeated in all three plots)
  labs(title = "Unweighted sample population", 
       y = "Proportion (%)", 
       x = "") + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)


## plot the weighted sample population 
weighted_population <- survey_design %>% 
  ## make sure the variables are factors
  mutate(age_group = factor(age_group), 
         sex = factor(sex)) %>%
  age_pyramid(
    age_group = "age_group",
    split_by = "sex", 
    proportion = TRUE) +
  ## only show the x axis label (otherwise repeated in all three plots)
  labs(title = "Weighted sample population", 
       y = "", 
       x = "")  + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)

## combine all three plots  ----------------------------------------------------
## combine three plots next to each other using + 
source_population + sample_population + weighted_population + 
  ## only show one legend and define theme 
  ## note the use of & for combining theme with plot_layout()
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom",                    # move legend to bottom
        legend.title = element_blank(),                # remove title
        text = element_text(size = 18),                # change text size
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1) # turn x-axis text
       )

    26.7.3 Alluvial/sankey diagram

Visualising starting points and outcomes for individuals can be very helpful for getting an overview. There is an obvious application for mobile populations, but there are numerous other applications, such as cohorts or any other situation where individuals transition between states. These diagrams have several different names, including alluvial, sankey and parallel sets - the details are in the handbook chapter on diagrams and charts.

## summarize data
flow_table <- survey_data %>%
  count(startcause, endcause, sex) %>%  # get counts 
  gather_set_data(x = c("startcause", "endcause"))     # change format for plotting


## plot your dataset 
  ## on the x axis is the start and end causes
  ## gather_set_data generates an ID for each possible combination
  ## splitting by y gives the possible start/end combos
  ## value as n gives it as counts (could also be changed to proportion)
ggplot(flow_table, aes(x, id = id, split = y, value = n)) +
  ## colour lines by sex 
  geom_parallel_sets(aes(fill = sex), alpha = 0.5, axis.width = 0.2) +
  ## fill in the label boxes grey
  geom_parallel_sets_axes(axis.width = 0.15, fill = "grey80", color = "grey80") +
  ## change text colour and angle (needs to be adjusted)
  geom_parallel_sets_labels(color = "black", angle = 0, size = 5) +
  ## remove axis labels
  theme_void() +
  ## move legend to bottom
  theme(legend.position = "bottom")               

We can use the svyciprop() function from survey to get weighted proportions and accompanying 95% confidence intervals. An appropriate design effect can be extracted using svymean() rather than svyciprop(). It is worth noting that svyciprop() only appears to accept variables between 0 and 1 (or TRUE/FALSE), so categorical variables will not work directly (see the sketch below for one workaround).

NOTE: Functions from survey also accept srvyr design objects, but here we have used the survey design object just for consistency.

## produce weighted counts 
svytable(~died, base_survey_design)
    died
          FALSE       TRUE 
     1406244.43   76213.01 
## produce weighted proportions
svyciprop(~died, base_survey_design, na.rm = T)
                  2.5%  97.5%
     died 0.0514 0.0208 0.1213
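For a categorical variable, one workaround is to test one level at a time as a TRUE/FALSE indicator inside the formula. A minimal sketch, assuming education_level has a level called "primary" (both the variable and the level name here are illustrative):

## weighted proportion for a single level of a categorical variable
## ("primary" is a hypothetical level of education_level)
svyciprop(~I(education_level == "primary"), base_survey_design, na.rm = TRUE)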
## get the design effect 
svymean(~died, base_survey_design, na.rm = T, deff = T) %>% 
  deff()
    diedFALSE  diedTRUE 
      3.755508  3.755508 

We can combine the functions from survey shown above into a function that we define ourselves below, called svy_prop; we can then use that function together with map() from the purrr package to iterate over several variables and create a table. See the handbook iteration chapter for details on purrr.

# Define function to calculate weighted counts, proportions, CI and design effect
# x is the variable in quotation marks 
# design is your survey design object

svy_prop <- function(design, x) {
  
  ## put the variable of interest in a formula 
  form <- as.formula(paste0("~", x))
  ## only keep the TRUE column of counts from svytable
  weighted_counts <- svytable(form, design)[[2]]
  ## calculate proportions (multiply by 100 to get percentages)
  weighted_props <- svyciprop(form, design, na.rm = TRUE) * 100
  ## extract the confidence intervals and multiply to get percentages
  weighted_confint <- confint(weighted_props) * 100
  ## use svymean to calculate design effect and only keep the TRUE column
  design_eff <- deff(svymean(form, design, na.rm = TRUE, deff = TRUE))[[TRUE]]
  
  ## combine into one data frame
  full_table <- cbind(
    "Variable"        = x,
    "Count"           = weighted_counts,
    "Proportion"      = weighted_props,
    weighted_confint, 
    "Design effect"   = design_eff
    )
  
  ## return table as a dataframe
  full_table <- data.frame(full_table, 
             ## remove the variable names from rows (is a separate column now)
             row.names = NULL)
  
  ## change numerics back to numeric
  full_table[ , 2:6] <- as.numeric(full_table[, 2:6])
  
  ## return dataframe
  full_table
}

## iterate over several variables to create a table 
purrr::map(
  ## define variables of interest
  c("left", "died", "arrived"), 
  ## state the function and any other arguments for that function (design)
  svy_prop, design = base_survey_design) %>% 
  ## collapse list into a single data frame
  bind_rows() %>% 
  ## round 
  mutate(across(where(is.numeric), round, digits = 1))
      Variable    Count Proportion X2.5. X97.5. Design.effect
     1     left 701199.1       47.3  39.2   55.5           2.4

    With srvyr we can use dplyr syntax to create a table. Note that the survey_mean() function is used and the proportion argument is specified, and also that the same function is used to calculate design effect. This is because srvyr wraps around both of the survey package functions svyciprop() and svymean(), which are used in the above section.

NOTE: It does not seem to be possible to get proportions from categorical variables directly using srvyr either; if you need this, check out the section below using sitrep, or see the grouped workaround sketched just below.
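One partial workaround, depending on your srvyr version, is to group by the categorical variable and call survey_mean() with no analysis variable, which returns the weighted proportion of each group. A sketch (education_level is used purely for illustration):

## weighted proportions of a categorical variable via grouping
survey_design %>% 
  group_by(education_level) %>% 
  summarise(prop = survey_mean(vartype = "ci"))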

## use the srvyr design object
survey_design %>% 
  summarise(
    ## produce the weighted counts 
    counts = survey_total(died), 
    ## produce weighted proportions and confidence intervals 
    ## multiply by 100 to get a percentage 
    props = survey_mean(died, 
                        proportion = TRUE, 
                        vartype = "ci") * 100, 
    ## produce the design effect 
    deff = survey_mean(died, deff = TRUE)) %>% 
  ## only keep the columns of interest
  ## (drop standard errors and the repeated proportion calculation)
  select(counts, props, props_low, props_upp, deff_deff)
    # A tibble: 1 × 5
       counts props props_low props_upp deff_deff

Here too we could write a function to iterate over multiple variables, using the purrr package. See the handbook iteration chapter for details on purrr.

# Define function to calculate weighted counts, proportions, CI and design effect
# design is your survey design object
# x is the variable in quotation marks 


srvyr_prop <- function(design, x) {
  
  summarise(
    ## using the survey design object
    design, 
    ## produce the weighted counts 
    counts = survey_total(.data[[x]]), 
    ## produce weighted proportions and confidence intervals 
    ## multiply by 100 to get a percentage 
    props = survey_mean(.data[[x]], 
                        proportion = TRUE, 
                        vartype = "ci") * 100, 
    ## produce the design effect 
    deff = survey_mean(.data[[x]], deff = TRUE)) %>% 
  ## add in the variable name
  mutate(variable = x) %>% 
  ## only keep the columns of interest
  ## (drop standard errors and the repeated proportion calculation)
  select(variable, counts, props, props_low, props_upp, deff_deff)
  
}
  

## iterate over several variables to create a table 
purrr::map(
  ## define variables of interest
  c("left", "died", "arrived"), 
  ## state the function and any other arguments for that function (design)
  ~srvyr_prop(.x, design = survey_design)) %>% 
  ## collapse list into a single data frame
  bind_rows()
    # A tibble: 3 × 6
       variable  counts props props_low props_upp deff_deff

    26.8.3 Sitrep package

    The tab_survey() function from sitrep is a wrapper for srvyr, allowing you to create weighted tables with minimal coding. It also allows you to calculate weighted proportions for categorical variables.

## using the survey design object
survey_design %>% 
  ## pass the names of the variables of interest
  tab_survey(
       "arrived", 
       "left", 
       "died", 
       "education_level",
       deff = TRUE,   # calculate the design effect
       pretty = TRUE  # merge the proportion and 95%CI
       )

    26.8.4 Gtsummary package

With gtsummary we can use the function tbl_svysummary() and then add_ci() to add confidence intervals.

## using the survey package design object
tbl_svysummary(base_survey_design, 
               include = c(arrived, left, died),   ## define the variables we want to include
               statistic = list(everything() ~ c("{n} ({p}%)"))) %>% ## define stats of interest
     add_ci() %>% ## add confidence intervals
     add_n() %>%
     modify_header(label = list(
          n ~ "**Weighted total (N)**",
          stat_0 ~ "**Weighted count**"
     ))
    Warning: The `update` argument of `modify_header()` is deprecated as of gtsummary 2.0.0.
     ℹ Use `modify_header(...)` input instead. Dynamic dots allow for syntax like


    26.9.1 Survey package

ratio <- svyratio(~died, 
         denominator = ~obstime, 
         design = base_survey_design)

ci <- confint(ratio)

cbind(
  ratio$ratio * 10000, 
  ci * 10000
)
          obstime    2.5 %   97.5 %
     died 5.981922 1.194294 10.76955

    26.9.2 Srvyr package

survey_design %>% 
  ## survey ratio used to account for observation time 
  summarise(
    mortality = survey_ratio(
      as.numeric(died) * 10000, 
      obstime, 
      vartype = "ci")
    )
    # A tibble: 1 × 3
       mortality mortality_low mortality_upp

To carry out a univariate regression, we can use the package survey for the function svyglm(), and the package gtsummary, which allows us to call svyglm() inside the function tbl_uvregression(). To do this we first use the survey_design object created above. This is then provided to the function tbl_uvregression(), as in the Univariate and multivariable regression chapter. We then make one key change: we change method = glm to method = survey::svyglm in order to carry out our survey-weighted regression.

    Here we will be using the previously created object survey_design to predict whether the value in the column died is TRUE, using the columns malaria_treatment, bednet, and age_years.

survey_design %>%
     tbl_uvregression(                             # carry out a univariate regression; for a multivariable regression we would use svyglm() with tbl_regression() (see below)
          method = survey::svyglm,                 # set to survey::svyglm to carry out a weighted regression on the survey data
          y = died,                                # the column we are trying to predict
          method.args = list(family = binomial),   # we are carrying out a logistic regression, so the family is binomial
          include = c(malaria_treatment,           # these are the columns we want to evaluate
                      bednet,
                      age_years),
          exponentiate = TRUE                      # transform the log odds to odds ratios for easier interpretation
     )

    If we wanted to carry out a multivariable regression, we would have to first use the function svyglm() and pipe (%>%) the results into the function tbl_regression. Note that we need to specify the formula.

survey_design %>%
     svyglm(formula = died ~ malaria_treatment + 
                 bednet + 
                 age_years,
            family = binomial) %>%                 # we are carrying out a logistic regression, so the family is binomial
     tbl_regression( 
          exponentiate = TRUE                      # transform the log odds to odds ratios for easier interpretation
     )

  • tdc() creates the time-dependent covariate column, agvhd, to go with the newly created time intervals.
td_dat <- 
  tmerge(
    data1 = bmt %>% select(my_id, T1, delta1), 
    data2 = bmt %>% select(my_id, T1, delta1, TA, deltaA), 
    id = my_id, 
    death = event(T1, delta1),
    agvhd = tdc(TA)
    )

    To see what this does, let’s look at the data for the first 5 individual patients.

    The variables of interest in the original data looked like this:

bmt %>% 
  select(my_id, T1, delta1, TA, deltaA) %>% 
  filter(my_id %in% seq(1, 5))
      my_id   T1 delta1   TA deltaA
     1     1 2081      0   67      1


    The new dataset for these same patients looks like this:

td_dat %>% 
  filter(my_id %in% seq(1, 5))
      my_id   T1 delta1 tstart tstop death agvhd
     1     1 2081      0      0    67     0     0


    Cox regression with time-dependent covariates

Now that we've reshaped our data and added the new time-dependent agvhd variable, let's fit a simple single-variable Cox regression model. We can use the same coxph() function as before; we just need to change our Surv() function to specify both the start and stop time for each interval, using the time = and time2 = arguments.

bmt_td_model = coxph(
  Surv(time = tstart, time2 = tstop, event = death) ~ agvhd, 
  data = td_dat
  )

summary(bmt_td_model)
    Call:
     coxph(formula = Surv(time = tstart, time2 = tstop, event = death) ~ 

ggforest(bmt_td_model, data = td_dat)

17  Descriptive tables

Age Category/Gender            f            m          NA_          Total
0-4                  640 (22.8%)  416 (14.8%)   39 (14.0%)  1,095 (18.6%)
5-9                  641 (22.8%)  412 (14.7%)   42 (15.1%)  1,095 (18.6%)
10-14                518 (18.5%)  383 (13.7%)   40 (14.4%)    941 (16.0%)
15-19                359 (12.8%)  364 (13.0%)    20 (7.2%)    743 (12.6%)
20-29                468 (16.7%)  575 (20.5%)   30 (10.8%)  1,073 (18.2%)
30-49                 179 (6.4%)  557 (19.9%)    18 (6.5%)    754 (12.8%)
50-69                   2 (0.1%)    91 (3.2%)     2 (0.7%)      95 (1.6%)
70+                     0 (0.0%)     5 (0.2%)     1 (0.4%)       6 (0.1%)
<NA>                    0 (0.0%)     0 (0.0%)   86 (30.9%)     86 (1.5%)



    Use on other tables

    You can use janitor’s adorn_*() functions on other tables, such as those created by summarise() and count() from dplyr, or table() from base R. Simply pipe the table to the desired janitor function. For example:

linelist %>% 
  count(hospital) %>%   # dplyr function
  adorn_totals()        # janitor function
                                 hospital    n
                          Central Hospital  454


    Saving the tabyl

If you convert the table to a "pretty" image with a package like flextable, you can save it with functions from that package - like save_as_html(), save_as_docx(), save_as_pptx(), and save_as_image() from flextable (as discussed more extensively in the Tables for presentation page). Below, the table is saved as a Word document, in which it can be further hand-edited.

linelist %>%
  tabyl(age_cat, gender) %>% 
  adorn_totals(where = "col") %>% 
  adorn_percentages(denominator = "col") %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  adorn_title(
    row_name = "Age Category",
    col_name = "Gender",
    placement = "combined") %>% 
  flextable::flextable() %>%                     # convert to image
  flextable::autofit() %>%                       # ensure only one line per row
  flextable::save_as_docx(path = "tabyl.docx")   # save as Word document to filepath
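The other save_as_*() functions follow the same pattern. For instance, a sketch of exporting a simpler version of the table as an image (saving as an image requires the webshot or webshot2 package to be installed; the file path is illustrative):

## the same approach can end in a different export format
linelist %>%
  tabyl(age_cat, gender) %>% 
  flextable::flextable() %>% 
  flextable::save_as_image(path = "tabyl.png")   # save as PNG image to filepath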


    Statistics

    You can apply statistical tests on tabyls, like chisq.test() or fisher.test() from the stats package, as shown below. Note missing values are not allowed so they are excluded from the tabyl with show_na = FALSE.

age_by_outcome <- linelist %>% 
  tabyl(age_cat, outcome, show_na = FALSE) 

chisq.test(age_by_outcome)
    
         Pearson's Chi-squared test
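fisher.test() can be applied to the same tabyl. Note that exact tests can be slow or fail on larger tables; setting simulate.p.value = TRUE (a standard stats-package argument) computes a Monte Carlo p-value instead:

## Fisher's exact test on the same table
## (simulate.p.value = TRUE uses Monte Carlo simulation, helpful for larger tables)
fisher.test(age_by_outcome, simulate.p.value = TRUE)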

    Get counts

The simplest function to apply within summarise() is n(). Leave the parentheses empty to count the number of rows.

linelist %>%                 # begin with linelist
  summarise(n_rows = n())    # return new summary dataframe with column n_rows
      n_rows
     1   5888


    This gets more interesting if we have grouped the data beforehand.

linelist %>% 
  group_by(age_cat) %>%     # group data by unique values in column age_cat
  summarise(n_rows = n())   # return number of rows *per group*
    # A tibble: 9 × 2
       age_cat n_rows


  • Un-groups the data.
linelist %>% 
  count(age_cat)
      age_cat    n
     1     0-4 1095


You can change the name of the counts column from the default n to something else by passing it to the name = argument (a brief example follows the output below).

Counts tabulated by two or more grouping columns are still returned in "long" format, with the counts in the n column. See the page on Pivoting data to learn about "long" and "wide" data formats.

linelist %>% 
  count(age_cat, outcome)
       age_cat outcome   n
     1      0-4   Death 471
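As a brief example of the name = argument mentioned above (the name "cases" is just an illustration):

linelist %>% 
  count(age_cat, name = "cases")   # counts column is named "cases" instead of "n"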

    Proportions

    Note that in this case, sum() in the mutate() command will return the sum of the whole column n for use as the proportion denominator. As explained in the Grouping data page, if sum() is used in grouped data (e.g. if the mutate() immediately followed a group_by() command), it will return sums by group. As stated just above, count() finishes its actions by ungrouping. Thus, in this scenario we get full column proportions.

To easily display percents, you can wrap the proportion in the function percent() from the package scales (note this converts the value to class character).

    -
    age_summary <- linelist %>% 
-  count(age_cat) %>%                     # group and count by age_cat (produces "n" column)
    -  mutate(                                # create percent of column - note the denominator
    -    percent = scales::percent(n / sum(n))
    -    ) 
    -
    -# print
    -age_summary
    +
    age_summary <- linelist %>% 
+  count(age_cat) %>%                     # group and count by age_cat (produces "n" column)
    +  mutate(                                # create percent of column - note the denominator
    +    percent = scales::percent(n / sum(n))
    +    ) 
    +
    +# print
    +age_summary
      age_cat    n percent
     1     0-4 1095  18.60%
    @@ -1805,15 +1802,15 @@ 

    Proportions

    Below is a method to calculate proportions within groups. It relies on different levels of data grouping being selectively applied and removed. First, the data are grouped on outcome via group_by(). Then, count() is applied. This function further groups the data by age_cat and returns counts for each outcome-age-cat combination. Importantly - as it finishes its process, count() also ungroups the age_cat grouping, so the only remaining data grouping is the original grouping by outcome. Thus, the final step of calculating proportions (denominator sum(n)) is still grouped by outcome.

    -
    age_by_outcome <- linelist %>%                  # begin with linelist
    -  group_by(outcome) %>%                         # group by outcome 
    -  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
    -  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group
    +
    age_by_outcome <- linelist %>%                  # begin with linelist
    +  group_by(outcome) %>%                         # group by outcome 
    +  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
    +  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group
    @@ -1821,14 +1818,14 @@

    Proportions

    Plotting

Displaying a “long” table output like the one above with ggplot() is relatively straightforward, because the data are already in the “long” format that ggplot() accepts natively. See further examples in the pages ggplot basics and ggplot tips.

    -
    linelist %>%                      # begin with linelist
    -  count(age_cat, outcome) %>%     # group and tabulate counts by two columns
    -  ggplot()+                       # pass new data frame to ggplot
    -    geom_col(                     # create bar plot
    -      mapping = aes(   
    -        x = outcome,              # map outcome to x-axis
    -        fill = age_cat,           # map age_cat to the fill
    -        y = n))                   # map the counts column `n` to the height
    +
    linelist %>%                      # begin with linelist
    +  count(age_cat, outcome) %>%     # group and tabulate counts by two columns
    +  ggplot()+                       # pass new data frame to ggplot
    +    geom_col(                     # create bar plot
    +      mapping = aes(   
    +        x = outcome,              # map outcome to x-axis
    +        fill = age_cat,           # map age_cat to the fill
    +        y = n))                   # map the counts column `n` to the height
    @@ -1852,18 +1849,18 @@

Summary statistics

    Below, linelist data are summarised to describe the days delay from symptom onset to hospital admission (column days_onset_hosp), by hospital.

    -
    summary_table <- linelist %>%                                        # begin with linelist, save out as new object
    -  group_by(hospital) %>%                                             # group all calculations by hospital
    -  summarise(                                                         # only the below summary columns will be returned
    -    cases       = n(),                                                # number of rows per group
    -    delay_max   = max(days_onset_hosp, na.rm = T),                    # max delay
    -    delay_mean  = round(mean(days_onset_hosp, na.rm=T), digits = 1),  # mean delay, rounded
    -    delay_sd    = round(sd(days_onset_hosp, na.rm = T), digits = 1),  # standard deviation of delays, rounded
    -    delay_3     = sum(days_onset_hosp >= 3, na.rm = T),               # number of rows with delay of 3 or more days
    -    pct_delay_3 = scales::percent(delay_3 / cases)                    # convert previously-defined delay column to percent 
    -  )
    -
    -summary_table  # print
    +
    summary_table <- linelist %>%                                        # begin with linelist, save out as new object
    +  group_by(hospital) %>%                                             # group all calculations by hospital
    +  summarise(                                                         # only the below summary columns will be returned
    +    cases       = n(),                                                # number of rows per group
    +    delay_max   = max(days_onset_hosp, na.rm = T),                    # max delay
    +    delay_mean  = round(mean(days_onset_hosp, na.rm=T), digits = 1),  # mean delay, rounded
    +    delay_sd    = round(sd(days_onset_hosp, na.rm = T), digits = 1),  # standard deviation of delays, rounded
    +    delay_3     = sum(days_onset_hosp >= 3, na.rm = T),               # number of rows with delay of 3 or more days
    +    pct_delay_3 = scales::percent(delay_3 / cases)                    # convert previously-defined delay column to percent 
    +  )
    +
    +summary_table  # print
    # A tibble: 6 × 7
       hospital               cases delay_max delay_mean delay_sd delay_3 pct_delay_3
    @@ -1897,12 +1894,12 @@ 

Summary statistics

    Conditional statistics

You may want to return conditional statistics - e.g. the maximum of rows that meet certain criteria. This can be done by subsetting the column with brackets [ ]. The example below returns the maximum temperature for patients classified as having or not having fever. Be aware however - it may be more appropriate to add another column to the group_by() command and use pivot_wider() (as demonstrated below).

    -
    linelist %>% 
    -  group_by(hospital) %>% 
    -  summarise(
    -    max_temp_fvr = max(temp[fever == "yes"], na.rm = T),
    -    max_temp_no = max(temp[fever == "no"], na.rm = T)
    -  )
    +
    linelist %>% 
    +  group_by(hospital) %>% 
    +  summarise(
    +    max_temp_fvr = max(temp[fever == "yes"], na.rm = T),
    +    max_temp_no = max(temp[fever == "no"], na.rm = T)
    +  )
    # A tibble: 6 × 3
       hospital                             max_temp_fvr max_temp_no
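
The alternative mentioned above might look like the sketch below (a hypothetical re-write of the same summary, not output shown in the original):

## group by hospital AND fever, then pivot the fever levels into columns
linelist %>% 
  group_by(hospital, fever) %>% 
  summarise(max_temp = max(temp, na.rm = TRUE)) %>% 
  tidyr::pivot_wider(names_from = fever, values_from = max_temp)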
    @@ -1924,18 +1921,18 @@ 

Glueing together

Then, to make the table more presentable, a total row is added with adorn_totals() from janitor (which ignores non-numeric columns). Lastly, we use select() from dplyr to both re-order the columns and rename them to nicer column names.

    Now you could pass to flextable and print the table to Word, .png, .jpeg, .html, Powerpoint, RMarkdown, etc.! (see the Tables for presentation page).

    -
    summary_table %>% 
    -  mutate(delay = str_glue("{delay_mean} ({delay_sd})")) %>%  # combine and format other values
    -  select(-c(delay_mean, delay_sd)) %>%                       # remove two old columns   
    -  adorn_totals(where = "row") %>%                            # add total row
    -  select(                                                    # order and rename cols
    -    "Hospital Name"   = hospital,
    -    "Cases"           = cases,
    -    "Max delay"       = delay_max,
    -    "Mean (sd)"       = delay,
    -    "Delay 3+ days"   = delay_3,
    -    "% delay 3+ days" = pct_delay_3
    -    )
    +
    summary_table %>% 
    +  mutate(delay = str_glue("{delay_mean} ({delay_sd})")) %>%  # combine and format other values
    +  select(-c(delay_mean, delay_sd)) %>%                       # remove two old columns   
    +  adorn_totals(where = "row") %>%                            # add total row
    +  select(                                                    # order and rename cols
    +    "Hospital Name"   = hospital,
    +    "Cases"           = cases,
    +    "Max delay"       = delay_max,
    +    "Mean (sd)"       = delay,
    +    "Delay 3+ days"   = delay_3,
    +    "% delay 3+ days" = pct_delay_3
    +    )
                            Hospital Name Cases Max delay Mean (sd) Delay 3+ days
                          Central Hospital   454        12 1.9 (1.9)           108
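
For example, a minimal sketch of that export step (the file path is an assumption):

summary_table %>% 
  mutate(delay = str_glue("{delay_mean} ({delay_sd})")) %>%  # combine and format values
  select(-c(delay_mean, delay_sd)) %>%                       # remove two old columns
  adorn_totals(where = "row") %>%                            # add total row
  flextable::flextable() %>%                                 # convert to flextable
  flextable::autofit() %>%                                   # one line per row
  flextable::save_as_docx(path = "summary_table.docx")       # save as Word document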
    @@ -1959,9 +1956,9 @@ 

Glueing together

    Percentiles

    Percentiles and quantiles in dplyr deserve a special mention. To return quantiles, use quantile() with the defaults or specify the value(s) you would like with probs =.

    -
    # get default percentile values of age (0%, 25%, 50%, 75%, 100%)
    -linelist %>% 
    -  summarise(age_percentiles = quantile(age_years, na.rm = TRUE))
    +
    # get default percentile values of age (0%, 25%, 50%, 75%, 100%)
    +linelist %>% 
    +  summarise(age_percentiles = quantile(age_years, na.rm = TRUE))
      age_percentiles
     1               0
    @@ -1970,14 +1967,14 @@ 

    Percentiles

 4              23
 5              84
    -
    # get manually-specified percentile values of age (5%, 50%, 75%, 98%)
    -linelist %>% 
    -  summarise(
    -    age_percentiles = quantile(
    -      age_years,
    -      probs = c(.05, 0.5, 0.75, 0.98), 
    -      na.rm=TRUE)
    -    )
    +
    # get manually-specified percentile values of age (5%, 50%, 75%, 98%)
    +linelist %>% 
    +  summarise(
    +    age_percentiles = quantile(
    +      age_years,
    +      probs = c(.05, 0.5, 0.75, 0.98), 
    +      na.rm=TRUE)
    +    )
      age_percentiles
     1               1
    @@ -1988,15 +1985,15 @@ 

    Percentiles

    If you want to return quantiles by group, you may encounter long and less useful outputs if you simply add another column to group_by(). So, try this approach instead - create a column for each quantile level desired.

    -
    # get manually-specified percentile values of age (5%, 50%, 75%, 98%)
    -linelist %>% 
    -  group_by(hospital) %>% 
    -  summarise(
    -    p05 = quantile(age_years, probs = 0.05, na.rm=T),
    -    p50 = quantile(age_years, probs = 0.5, na.rm=T),
    -    p75 = quantile(age_years, probs = 0.75, na.rm=T),
    -    p98 = quantile(age_years, probs = 0.98, na.rm=T)
    -    )
    +
    # get manually-specified percentile values of age (5%, 50%, 75%, 98%)
    +linelist %>% 
    +  group_by(hospital) %>% 
    +  summarise(
    +    p05 = quantile(age_years, probs = 0.05, na.rm=T),
    +    p50 = quantile(age_years, probs = 0.5, na.rm=T),
    +    p75 = quantile(age_years, probs = 0.75, na.rm=T),
    +    p98 = quantile(age_years, probs = 0.98, na.rm=T)
    +    )
    # A tibble: 6 × 5
       hospital                               p05   p50   p75   p98
    @@ -2011,9 +2008,9 @@ 

    Percentiles

While dplyr summarise() certainly offers more fine control, you may find that all the summary statistics you need can be produced with get_summary_stats() from the rstatix package. If operating on grouped data, it will return 0%, 25%, 50%, 75%, and 100%. If applied to ungrouped data, you can specify the percentiles with probs = c(.05, .5, .75, .98).

    -
    linelist %>% 
    -  group_by(hospital) %>% 
    -  rstatix::get_summary_stats(age, type = "quantile")
    +
    linelist %>% 
    +  group_by(hospital) %>% 
    +  rstatix::get_summary_stats(age, type = "quantile")
    # A tibble: 6 × 8
       hospital                         variable     n  `0%` `25%` `50%` `75%` `100%`
    @@ -2027,8 +2024,8 @@ 

    Percentiles

    -
    linelist %>% 
    -  rstatix::get_summary_stats(age, type = "quantile")
    +
    linelist %>% 
    +  rstatix::get_summary_stats(age, type = "quantile")
    # A tibble: 1 × 7
       variable     n  `0%` `25%` `50%` `75%` `100%`
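
On ungrouped data, manually specifying the percentiles might look like this sketch (passing probs = as described above):

linelist %>% 
  rstatix::get_summary_stats(
    age,
    type = "quantile",
    probs = c(.05, .5, .75, .98))   # manually-specified percentiles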
    @@ -2044,11 +2041,11 @@ 

Summarise aggregated data

    For example, let’s say you are beginning with the data frame of counts below, called linelist_agg - it shows in “long” format the case counts by outcome and gender.

    Below we create this example data frame of linelist case counts by outcome and gender (missing values removed for clarity).

    -
    linelist_agg <- linelist %>% 
    -  drop_na(gender, outcome) %>% 
    -  count(outcome, gender)
    -
    -linelist_agg
    +
    linelist_agg <- linelist %>% 
    +  drop_na(gender, outcome) %>% 
    +  count(outcome, gender)
    +
    +linelist_agg
      outcome gender    n
     1   Death      f 1227
    @@ -2059,12 +2056,12 @@ 

Summarise aggregated data

To sum the counts (in column n) by group, you can use summarise() and set the new column equal to sum(n, na.rm=T). To add a conditional element to the sum operation, you can use the subset bracket [ ] syntax on the counts column.

    -
    linelist_agg %>% 
    -  group_by(outcome) %>% 
    -  summarise(
    -    total_cases  = sum(n, na.rm=T),
    -    male_cases   = sum(n[gender == "m"], na.rm=T),
    -    female_cases = sum(n[gender == "f"], na.rm=T))
    +
    linelist_agg %>% 
    +  group_by(outcome) %>% 
    +  summarise(
    +    total_cases  = sum(n, na.rm=T),
    +    male_cases   = sum(n[gender == "m"], na.rm=T),
    +    female_cases = sum(n[gender == "f"], na.rm=T))
    # A tibble: 2 × 4
       outcome total_cases male_cases female_cases
    @@ -2085,11 +2082,11 @@ 

across() multiple columns

Below, mean() is applied to several numeric columns. A vector of columns is named explicitly to .cols = and a single function mean is specified (no parentheses) to .fns =. Any additional arguments for the function (e.g. na.rm=TRUE) are provided after .fns =, separated by a comma.

    It can be difficult to get the order of parentheses and commas correct when using across(). Remember that within across() you must include the columns, the functions, and any extra arguments needed for the functions.

    -
    linelist %>% 
    -  group_by(outcome) %>% 
    -  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm),  # columns
    -                   .fns = mean,                               # function
    -                   na.rm=T))                                  # extra arguments
    +
    linelist %>% 
    +  group_by(outcome) %>% 
    +  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm),  # columns
    +                   .fns = mean,                               # function
    +                   na.rm=T))                                  # extra arguments
    # A tibble: 3 × 5
       outcome age_years  temp wt_kg ht_cm
    @@ -2101,11 +2098,11 @@ 

across() multiple columns

Multiple functions can be run at once. Below, the functions mean and sd are provided to .fns = within a list(). You have the opportunity to provide character names (e.g. “mean” and “sd”), which are appended to the new column names.

    -
    linelist %>% 
    -  group_by(outcome) %>% 
    -  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm), # columns
    -                   .fns = list("mean" = mean, "sd" = sd),    # multiple functions 
    -                   na.rm=T))                                 # extra arguments
    +
    linelist %>% 
    +  group_by(outcome) %>% 
    +  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm), # columns
    +                   .fns = list("mean" = mean, "sd" = sd),    # multiple functions 
    +                   na.rm=T))                                 # extra arguments
    # A tibble: 3 × 9
       outcome age_years_mean age_years_sd temp_mean temp_sd wt_kg_mean wt_kg_sd
    @@ -2135,12 +2132,12 @@ 

across() multiple columns

For example, to return the mean of every numeric column, use where() and provide the function is.numeric (without parentheses). All this remains within the across() command.

    -
    linelist %>% 
    -  group_by(outcome) %>% 
    -  summarise(across(
    -    .cols = where(is.numeric),  # all numeric columns in the data frame
    -    .fns = mean,
    -    na.rm=T))
    +
    linelist %>% 
    +  group_by(outcome) %>% 
    +  summarise(across(
    +    .cols = where(is.numeric),  # all numeric columns in the data frame
    +    .fns = mean,
    +    na.rm=T))
    # A tibble: 3 × 12
       outcome generation   age age_years   lon   lat wt_kg ht_cm ct_blood  temp
    @@ -2157,22 +2154,22 @@ 

Pivot wider

If you prefer your table in “wide” format you can transform it using the tidyr pivot_wider() function. You will likely need to re-name the columns with rename(). For more information see the page on Pivoting data.

    The example below begins with the “long” table age_by_outcome from the proportions section. We create it again and print, for clarity:

    -
    age_by_outcome <- linelist %>%                  # begin with linelist
    -  group_by(outcome) %>%                         # group by outcome 
    -  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
    -  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group
    +
    age_by_outcome <- linelist %>%                  # begin with linelist
    +  group_by(outcome) %>%                         # group by outcome 
    +  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
    +  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group

    To pivot wider, we create the new columns from the values in the existing column age_cat (by setting names_from = age_cat). We also specify that the new table values will come from the existing column n, with values_from = n. The columns not mentioned in our pivoting command (outcome) will remain unchanged on the far left side.

    -
    age_by_outcome %>% 
    -  select(-percent) %>%   # keep only counts for simplicity
    -  pivot_wider(names_from = age_cat, values_from = n)  
    +
    age_by_outcome %>% 
    +  select(-percent) %>%   # keep only counts for simplicity
    +  pivot_wider(names_from = age_cat, values_from = n)  
    # A tibble: 3 × 10
     # Groups:   outcome [3]
    @@ -2192,17 +2189,17 @@ 

janitor

    If your table consists only of counts or proportions/percents that can be summed into a total, then you can add sum totals using janitor’s adorn_totals() as described in the section above. Note that this function can only sum the numeric columns - if you want to calculate other total summary statistics see the next approach with dplyr.

Below, linelist is grouped by gender and summarised into a table that describes the number of cases with known outcome, deaths, and recoveries. Piping the table to adorn_totals() adds a total row at the bottom reflecting the sum of each column. The further adorn_*() functions adjust the display as noted in the code.

    -
    linelist %>% 
    -  group_by(gender) %>%
    -  summarise(
    -    known_outcome = sum(!is.na(outcome)),           # Number of rows in group where outcome is not missing
    -    n_death  = sum(outcome == "Death", na.rm=T),    # Number of rows in group where outcome is Death
    -    n_recover = sum(outcome == "Recover", na.rm=T), # Number of rows in group where outcome is Recovered
    -  ) %>% 
    -  adorn_totals() %>%                                # Adorn total row (sums of each numeric column)
    -  adorn_percentages("col") %>%                      # Get column proportions
    -  adorn_pct_formatting() %>%                        # Convert proportions to percents
    -  adorn_ns(position = "front")                      # display % and counts (with counts in front)
    +
    linelist %>% 
    +  group_by(gender) %>%
    +  summarise(
    +    known_outcome = sum(!is.na(outcome)),           # Number of rows in group where outcome is not missing
    +    n_death  = sum(outcome == "Death", na.rm=T),    # Number of rows in group where outcome is Death
    +    n_recover = sum(outcome == "Recover", na.rm=T), # Number of rows in group where outcome is Recovered
    +  ) %>% 
    +  adorn_totals() %>%                                # Adorn total row (sums of each numeric column)
    +  adorn_percentages("col") %>%                      # Get column proportions
    +  adorn_pct_formatting() %>%                        # Convert proportions to percents
    +  adorn_ns(position = "front")                      # display % and counts (with counts in front)
     gender  known_outcome        n_death      n_recover
           f 2,180  (47.8%) 1,227  (47.5%)   953  (48.1%)
    @@ -2217,14 +2214,15 @@ 

    Joining data page. Below is an example:

    You can make a summary table of outcome by hospital with group_by() and summarise() like this:

    -
    by_hospital <- linelist %>% 
    -  filter(!is.na(outcome) & hospital != "Missing") %>%  # Remove cases with missing outcome or hospital
    -  group_by(hospital, outcome) %>%                      # Group data
    -  summarise(                                           # Create new summary columns of indicators of interest
    -    N = n(),                                            # Number of rows per hospital-outcome group     
    -    ct_value = median(ct_blood, na.rm=T))               # median CT value per group
    -  
    -by_hospital # print table
    +
    by_hospital <- linelist %>% 
    +  filter(!is.na(outcome) & hospital != "Missing") %>%  # Remove cases with missing outcome or hospital
    +  group_by(hospital, outcome) %>%                      # Group data
    +  summarise(                                           # Create new summary columns of indicators of interest
    +    N = n(),                                           # Number of rows per hospital-outcome group     
    +    ct_value = median(ct_blood, na.rm=T)               # median CT value per group
    +    )               
    +  
    +by_hospital # print table
    # A tibble: 10 × 4
     # Groups:   hospital [5]
    @@ -2244,14 +2242,14 @@ 

    -
    totals <- linelist %>% 
    -      filter(!is.na(outcome) & hospital != "Missing") %>%
    -      group_by(outcome) %>%                            # Grouped only by outcome, not by hospital    
    -      summarise(
    -        N = n(),                                       # These statistics are now by outcome only     
    -        ct_value = median(ct_blood, na.rm=T))
    -
    -totals # print table
    +
    totals <- linelist %>% 
    +      filter(!is.na(outcome) & hospital != "Missing") %>%
    +      group_by(outcome) %>%                            # Grouped only by outcome, not by hospital    
    +      summarise(
    +        N = n(),                                       # These statistics are now by outcome only     
    +        ct_value = median(ct_blood, na.rm=T))
    +
    +totals # print table
    # A tibble: 2 × 3
       outcome     N ct_value
    @@ -2262,35 +2260,35 @@ 

    Cleaning data and core functions page).

    -
    table_long <- bind_rows(by_hospital, totals) %>% 
    -  mutate(hospital = replace_na(hospital, "Total"))
    +
    table_long <- bind_rows(by_hospital, totals) %>% 
    +  mutate(hospital = replace_na(hospital, "Total"))

    Here is the new table with “Total” rows at the bottom.


    This table is in a “long” format, which may be what you want. Optionally, you can pivot this table wider to make it more readable. See the section on pivoting wider above, and the Pivoting data page. You can also add more columns, and arrange it nicely. This code is below.

    -
    table_long %>% 
    -  
    -  # Pivot wider and format
    -  ########################
    -  mutate(hospital = replace_na(hospital, "Total")) %>% 
    -  pivot_wider(                                         # Pivot from long to wide
    -    values_from = c(ct_value, N),                       # new values are from ct and count columns
    -    names_from = outcome) %>%                           # new column names are from outcomes
    -  mutate(                                              # Add new columns
    -    N_Known = N_Death + N_Recover,                               # number with known outcome
    -    Pct_Death = scales::percent(N_Death / N_Known, 0.1),         # percent cases who died (to 1 decimal)
    -    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% # percent who recovered (to 1 decimal)
    -  select(                                              # Re-order columns
    -    hospital, N_Known,                                   # Intro columns
    -    N_Recover, Pct_Recover, ct_value_Recover,            # Recovered columns
    -    N_Death, Pct_Death, ct_value_Death)  %>%             # Death columns
    -  arrange(N_Known)                                  # Arrange rows from lowest to highest (Total row at bottom)
    +
    table_long %>% 
    +  
    +  # Pivot wider and format
    +  ########################
    +  mutate(hospital = replace_na(hospital, "Total")) %>% 
    +  pivot_wider(                                         # Pivot from long to wide
    +    values_from = c(ct_value, N),                       # new values are from ct and count columns
    +    names_from = outcome) %>%                           # new column names are from outcomes
    +  mutate(                                              # Add new columns
    +    N_Known = N_Death + N_Recover,                               # number with known outcome
    +    Pct_Death = scales::percent(N_Death / N_Known, 0.1),         # percent cases who died (to 1 decimal)
    +    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% # percent who recovered (to 1 decimal)
    +  select(                                              # Re-order columns
    +    hospital, N_Known,                                   # Intro columns
    +    N_Recover, Pct_Recover, ct_value_Recover,            # Recovered columns
    +    N_Death, Pct_Death, ct_value_Death)  %>%             # Death columns
    +  arrange(N_Known)                                  # Arrange rows from lowest to highest (Total row at bottom)
    # A tibble: 6 × 8
     # Groups:   hospital [6]
    @@ -2308,7 +2306,7 @@ 

    Tables for presentation page.


Hospital                               Total cases with   Recovered                        Died
                                       known outcome      Total   % of cases  Median CT    Total   % of cases  Median CT
St. Mark's Maternity Hospital (SMMH)      325               126     38.8%        22          199     61.2%        22
Central Hospital                          358               165     46.1%        22          193     53.9%        22
Other                                     685               290     42.3%        21          395     57.7%        22
Military Hospital                         708               309     43.6%        22          399     56.4%        21
Missing                                 1,125               514     45.7%        21          611     54.3%        21
Port Hospital                           1,364               579     42.4%        21          785     57.6%        22
Total                                   3,440             1,469     42.7%        22        1,971     57.3%        22

    @@ -2323,27 +2321,27 @@

    Summary table

    The default behavior of tbl_summary() is quite incredible - it takes the columns you provide and creates a summary table in one command. The function prints statistics appropriate to the column class: median and inter-quartile range (IQR) for numeric columns, and counts (%) for categorical columns. Missing values are converted to “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top.

    -
    linelist %>% 
    -  select(age_years, gender, outcome, fever, temp, hospital) %>%  # keep only the columns of interest
    -  tbl_summary()                                                  # default
    +
    linelist %>% 
    +  select(age_years, gender, outcome, fever, temp, hospital) %>%  # keep only the columns of interest
    +  tbl_summary()                                                  # default
@@ -2906,28 +2904,28 @@

    Adjustments

    A simple example of a statistic = equation might look like below, to only print the mean of column age_years:

    -
    linelist %>% 
    -  select(age_years) %>%         # keep only columns of interest 
    -  tbl_summary(                  # create summary table
    -    statistic = age_years ~ "{mean}") # print mean of age
    +
    linelist %>% 
    +  select(age_years) %>%         # keep only columns of interest 
    +  tbl_summary(                  # create summary table
    +    statistic = age_years ~ "{mean}") # print mean of age
@@ -3398,28 +3396,28 @@

    Adjustments

    A slightly more complex equation might look like "({min}, {max})", incorporating the max and min values within parentheses and separated by a comma:

    -
    linelist %>% 
    -  select(age_years) %>%                       # keep only columns of interest 
    -  tbl_summary(                                # create summary table
    -    statistic = age_years ~ "({min}, {max})") # print min and max of age
    +
    linelist %>% 
    +  select(age_years) %>%                       # keep only columns of interest 
    +  tbl_summary(                                # create summary table
    +    statistic = age_years ~ "({min}, {max})") # print min and max of age
@@ -3898,48 +3896,47 @@

    Adjustments

    type =
    This is used to adjust how many levels of the statistics are shown. The syntax is similar to statistic = in that you provide an equation with columns on the left and a value on the right. Two common scenarios include:

  • type = all_categorical() ~ "categorical" Forces dichotomous columns (e.g. fever yes/no) to show all levels instead of only the “yes” row.
  • type = all_continuous() ~ "continuous2" Allows multi-line statistics per variable, as shown in a later section.

    In the example below, each of these arguments is used to modify the original summary table:

    -
    linelist %>% 
    -  select(age_years, gender, outcome, fever, temp, hospital) %>% # keep only columns of interest
    -  tbl_summary(     
    -    by = outcome,                                               # stratify entire table by outcome
    -    statistic = list(all_continuous() ~ "{mean} ({sd})",        # stats and format for continuous columns
    -                     all_categorical() ~ "{n} / {N} ({p}%)"),   # stats and format for categorical columns
    -    digits = all_continuous() ~ 1,                              # rounding for continuous columns
    -    type   = all_categorical() ~ "categorical",                 # force all categorical levels to display
    -    label  = list(                                              # display labels for column names
    -      age_years ~ "Age (years)",
    -      gender    ~ "Gender",
    -      temp      ~ "Temperature",
    -      hospital  ~ "Hospital"),
    -    missing_text = "Missing"                                    # how missing values should display
    -  )
    +
    linelist %>% 
    +  select(age_years, gender, outcome, fever, temp, hospital) %>% # keep only columns of interest
    +  tbl_summary(     
    +    by = outcome,                                               # stratify entire table by outcome
    +    statistic = list(all_continuous() ~ "{mean} ({sd})",        # stats and format for continuous columns
    +                     all_categorical() ~ "{n} / {N} ({p}%)"),   # stats and format for categorical columns
    +    digits = all_continuous() ~ 1,                              # rounding for continuous columns
    +    type   = all_categorical() ~ "categorical",                 # force all categorical levels to display
    +    label  = list(                                              # display labels for column names
    +      age_years ~ "Age (years)",
    +      gender    ~ "Gender",
    +      temp      ~ "Temperature",
    +      hospital  ~ "Hospital"),
    +    missing_text = "Missing"                                    # how missing values should display
    +  )
    1323 missing rows in the "outcome" column have been removed.
@@ -4514,33 +4511,33 @@

    Adjustments

    Multi-line stats for continuous variables

If you want to print multiple lines of statistics for continuous variables, you can indicate this by setting type = to “continuous2”. You can combine all of the previously shown elements in one table by choosing which statistics you want to show. The number of missing values is shown as “Unknown”.

    -
    linelist %>% 
    -  select(age_years, temp) %>%                      # keep only columns of interest
    -  tbl_summary(                                     # create summary table
    -    type = all_continuous() ~ "continuous2",       # indicate that you want to print multiple statistics 
    -    statistic = all_continuous() ~ c(
    -      "{mean} ({sd})",                             # line 1: mean and SD
    -      "{median} ({p25}, {p75})",                   # line 2: median and IQR
    -      "{min}, {max}")                              # line 3: min and max
    -    )
    +
    linelist %>% 
    +  select(age_years, temp) %>%                      # keep only columns of interest
    +  tbl_summary(                                     # create summary table
    +    type = all_continuous() ~ "continuous2",       # indicate that you want to print multiple statistics 
    +    statistic = all_continuous() ~ c(
    +      "{mean} ({sd})",                             # line 1: mean and SD
    +      "{median} ({p25}, {p75})",                   # line 2: median and IQR
    +      "{min}, {max}")                              # line 3: min and max
    +    )
@@ -5041,27 +5038,27 @@

    17.5.1 tbl_wide_summary()

    You may also want to display your results in wide format, rather than long. To do so in gtsummary you can use the function tbl_wide_summary().

    -
    linelist %>% 
    -     select(age_years, temp) %>%
    -     tbl_wide_summary()
    +
    linelist %>% 
    +     select(age_years, temp) %>%
    +     tbl_wide_summary()
@@ -5537,7 +5534,7 @@

    CAUTION: NA (missing) values will not be tabulated unless you include the argument useNA = "always" (which could also be set to “no” or “ifany”).

TIP: You can use the %$% operator from magrittr to remove the need for repeating data frame calls within base functions. For example, the below could be written as linelist %$% table(outcome, useNA = "always"), as sketched after the output.

    -
    table(linelist$outcome, useNA = "always")
    +
    table(linelist$outcome, useNA = "always")
    
       Death Recover    <NA> 
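
As a minimal sketch of that tip (magrittr must be attached for %$% to be available):

library(magrittr)                               # provides the %$% exposition operator
linelist %$% table(outcome, useNA = "always")   # same result, without repeating linelist$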
    @@ -5546,8 +5543,8 @@ 

    Once these files have been imported as the object data, we will convert them to a data frame.

    -
    ## change to a data frame 
    -temp_data <- as_tibble(data) %>% 
    -  ## add in variables and correct units
    -  mutate(
-    ## create a calendar week variable 
    -    epiweek = tsibble::yearweek(time), 
    -    ## create a date variable (start of calendar week)
    -    date = as.Date(epiweek),
    -    ## change temperature from kelvin to celsius
    -    t2m = set_units(t2m, celsius), 
    -    ## change precipitation from metres to millimetres 
    -    tp  = set_units(tp, mm)) %>% 
    -  ## group by week (keep the date too though)
    -  group_by(epiweek, date) %>% 
    -  ## get the average per week
    -  summarise(t2m = as.numeric(mean(t2m)), 
    -            tp = as.numeric(mean(tp)))
    -
    -
    `summarise()` has grouped output by 'epiweek'. You can override using the
    -`.groups` argument.
    -
    +
    ## change to a data frame 
    +temp_data <- as_tibble(data) %>% 
    +  ## add in variables and correct units
    +  mutate(
+    ## create a calendar week variable 
    +    epiweek = tsibble::yearweek(time), 
    +    ## create a date variable (start of calendar week)
    +    date = as.Date(epiweek),
    +    ## change temperature from kelvin to celsius
    +    t2m = set_units(t2m, celsius), 
    +    ## change precipitation from metres to millimetres 
    +    tp  = set_units(tp, mm)) %>% 
    +  ## group by week (keep the date too though)
    +  group_by(epiweek, date) %>% 
    +  ## get the average per week
    +  summarise(t2m = as.numeric(mean(t2m)), 
    +            tp = as.numeric(mean(tp)))
    @@ -1078,15 +1011,15 @@

To do this we use the tsibble() function and specify the “index”, i.e. the variable defining the time unit of interest. In our case this is the epiweek variable.

    If we had a data set with weekly counts by province, for example, we would also be able to specify the grouping variable using the key = argument. This would allow us to do analysis for each group.
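
A hypothetical sketch of that scenario (counts_by_province and its province column are assumptions, not objects created above):

## one time series per province; rows must be unique within each epiweek-province pair
counts_by_province <- tsibble(counts_by_province, 
                              index = epiweek,   # time unit
                              key   = province)  # grouping variable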

    -
    ## define time series object 
    -counts <- tsibble(counts, index = epiweek)
    +
    ## define time series object 
    +counts <- tsibble(counts, index = epiweek)

    Looking at class(counts) tells you that on top of being a tidy data frame (“tbl_df”, “tbl”, “data.frame”), it has the additional properties of a time series data frame (“tbl_ts”).
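
As a quick sketch of that check:

class(counts)
## expect: "tbl_ts" "tbl_df" "tbl" "data.frame"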

    You can take a quick look at your data by using ggplot2. We see from the plot that there is a clear seasonal pattern, and that there are no missings. However, there seems to be an issue with reporting at the beginning of each year; cases drop in the last week of the year and then increase for the first week of the next year.

    -
    ## plot a line graph of cases by week
    -ggplot(counts, aes(x = epiweek, y = case)) + 
    -     geom_line()
    +
    ## plot a line graph of cases by week
    +ggplot(counts, aes(x = epiweek, y = case)) + 
    +     geom_line()
    @@ -1102,11 +1035,11 @@

    Duplicates

    tsibble does not allow duplicate observations. So each row will need to be unique, or unique within the group (key variable). The package has a few functions that help to identify duplicates. These include are_duplicated() which gives you a TRUE/FALSE vector of whether the row is a duplicate, and duplicates() which gives you a data frame of the duplicated rows.

    See the page on De-duplication for more details on how to select rows you want.

    -
    ## get a vector of TRUE/FALSE whether rows are duplicates
    -are_duplicated(counts, index = epiweek) 
    -
    -## get a data frame of any duplicated rows 
    -duplicates(counts, index = epiweek) 
    +
    ## get a vector of TRUE/FALSE whether rows are duplicates
    +are_duplicated(counts, index = epiweek) 
    +
    +## get a data frame of any duplicated rows 
    +duplicates(counts, index = epiweek) 
    @@ -1116,27 +1049,27 @@

    Missings

    See the Missing data page for other options for imputation.

    Another alternative would be to calculate a moving average, to try and smooth over these apparent reporting issues (see next section, and the page on Moving averages).

    -
    ## create a variable with missings instead of weeks with reporting issues
    -counts <- counts %>% 
    -     mutate(case_miss = if_else(
-          ## if epiweek contains weeks 51, 52, 53, 1 or 2
    -          str_detect(epiweek, "W51|W52|W53|W01|W02"), 
    -          ## then set to missing 
    -          NA_real_, 
    -          ## otherwise keep the value in case
    -          case
    -     ))
    -
    -## alternatively interpolate missings by linear trend 
    -## between two nearest adjacent points
    -counts <- counts %>% 
    -  mutate(case_int = imputeTS::na_interpolation(case_miss)
    -         )
    -
    -## to check what values have been imputed compared to the original
    -ggplot_na_imputations(counts$case_miss, counts$case_int) + 
    -  ## make a traditional plot (with black axes and white background)
    -  theme_classic()
    +
    ## create a variable with missings instead of weeks with reporting issues
    +counts <- counts %>% 
    +     mutate(case_miss = if_else(
+          ## if epiweek contains weeks 51, 52, 53, 1 or 2
    +          str_detect(epiweek, "W51|W52|W53|W01|W02"), 
    +          ## then set to missing 
    +          NA_real_, 
    +          ## otherwise keep the value in case
    +          case
    +     ))
    +
    +## alternatively interpolate missings by linear trend 
    +## between two nearest adjacent points
    +counts <- counts %>% 
    +  mutate(case_int = imputeTS::na_interpolation(case_miss)
    +         )
    +
    +## to check what values have been imputed compared to the original
    +ggplot_na_imputations(counts$case_miss, counts$case_int) + 
    +  ## make a traditional plot (with black axes and white background)
    +  theme_classic()
    @@ -1155,20 +1088,20 @@

    Moving averages

If data are very noisy (counts jumping up and down), it can be helpful to calculate a moving average. In the example below, for each week we calculate the average number of cases from the four previous weeks. This smooths the data, to make it more interpretable. In our case this does not really add much, so we will stick to the interpolated data for further analysis. See the Moving averages page for more detail.

    -
    ## create a moving average variable (deals with missings)
    -counts <- counts %>% 
-     ## create the ma_4wk variable 
    -     ## slide over each row of the case variable
    -     mutate(ma_4wk = slider::slide_dbl(case, 
-                               ## for each row calculate the mean
    -                               ~ mean(.x, na.rm = TRUE),
    -                               ## use the four previous weeks
    -                               .before = 4))
    -
    -## make a quick visualisation of the difference 
    -ggplot(counts, aes(x = epiweek)) + 
    -     geom_line(aes(y = case)) + 
    -     geom_line(aes(y = ma_4wk), colour = "red")
    +
    ## create a moving average variable (deals with missings)
    +counts <- counts %>% 
+     ## create the ma_4wk variable 
    +     ## slide over each row of the case variable
    +     mutate(ma_4wk = slider::slide_dbl(case, 
+                               ## for each row calculate the mean
    +                               ~ mean(.x, na.rm = TRUE),
    +                               ## use the four previous weeks
    +                               .before = 4))
    +
    +## make a quick visualisation of the difference 
    +ggplot(counts, aes(x = epiweek)) + 
    +     geom_line(aes(y = case)) + 
    +     geom_line(aes(y = ma_4wk), colour = "red")
    @@ -1184,62 +1117,62 @@

    Periodicity

    Below we define a custom function to create a periodogram. See the Writing functions page for information about how to write functions in R.

First, the function is defined. Its arguments include x (a dataset with a counts column), counts (the column of count data or rates within x), start_week = (the first week of the dataset), period = (the number of units in a year, e.g. 52 or 12), and lastly output = (the output style; see details in the code below).

    -
    ## Function arguments
    -#####################
    -## x is a dataset
    -## counts is variable with count data or rates within x 
    -## start_week is the first week in your dataset
    -## period is how many units in a year 
-## output is whether you want to return the spectral periodogram or the peak weeks
    -  ## "periodogram" or "weeks"
    -
    -# Define function
    -periodogram <- function(x, 
    -                        counts, 
    -                        start_week = c(2002, 1), 
    -                        period = 52, 
    -                        output = "weeks") {
    -  
    -
-    ## make sure it is not a tsibble, and keep only the columns of interest
    -    prepare_data <- dplyr::as_tibble(x)
    -    
    -    # prepare_data <- prepare_data[prepare_data[[strata]] == j, ]
    -    prepare_data <- dplyr::select(prepare_data, {{counts}})
    -    
    -    ## create an intermediate "zoo" time series to be able to use with spec.pgram
    -    zoo_cases <- zoo::zooreg(prepare_data, 
    -                             start = start_week, frequency = period)
    -    
    -    ## get a spectral periodogram not using fast fourier transform 
    -    periodo <- spec.pgram(zoo_cases, fast = FALSE, plot = FALSE)
    -    
    -    ## return the peak weeks 
    -    periodo_weeks <- 1 / periodo$freq[order(-periodo$spec)] * period
    -    
    -    if (output == "weeks") {
    -      periodo_weeks
    -    } else {
    -      periodo
    -    }
    -    
    -}
    -
    -## get spectral periodogram for extracting weeks with the highest frequencies 
    -## (checking of seasonality) 
    -periodo <- periodogram(counts, 
    -                       case_int, 
    -                       start_week = c(2002, 1),
    -                       output = "periodogram")
    -
-## pull spectrum and frequency into a dataframe for plotting
    -periodo <- data.frame(periodo$freq, periodo$spec)
    -
-## plot a periodogram showing the most frequently occurring periodicity 
    -ggplot(data = periodo, 
    -                aes(x = 1/(periodo.freq/52),  y = log(periodo.spec))) + 
    -  geom_line() + 
    -  labs(x = "Period (Weeks)", y = "Log(density)")
    +
    ## Function arguments
    +#####################
    +## x is a dataset
    +## counts is variable with count data or rates within x 
    +## start_week is the first week in your dataset
    +## period is how many units in a year 
+## output is whether you want to return the spectral periodogram or the peak weeks
    +  ## "periodogram" or "weeks"
    +
    +# Define function
    +periodogram <- function(x, 
    +                        counts, 
    +                        start_week = c(2002, 1), 
    +                        period = 52, 
    +                        output = "weeks") {
    +  
    +
+    ## make sure it is not a tsibble, and keep only the columns of interest
    +    prepare_data <- dplyr::as_tibble(x)
    +    
    +    # prepare_data <- prepare_data[prepare_data[[strata]] == j, ]
    +    prepare_data <- dplyr::select(prepare_data, {{counts}})
    +    
    +    ## create an intermediate "zoo" time series to be able to use with spec.pgram
    +    zoo_cases <- zoo::zooreg(prepare_data, 
    +                             start = start_week, frequency = period)
    +    
    +    ## get a spectral periodogram not using fast fourier transform 
    +    periodo <- spec.pgram(zoo_cases, fast = FALSE, plot = FALSE)
    +    
    +    ## return the peak weeks 
    +    periodo_weeks <- 1 / periodo$freq[order(-periodo$spec)] * period
    +    
    +    if (output == "weeks") {
    +      periodo_weeks
    +    } else {
    +      periodo
    +    }
    +    
    +}
    +
    +## get spectral periodogram for extracting weeks with the highest frequencies 
    +## (checking of seasonality) 
    +periodo <- periodogram(counts, 
    +                       case_int, 
    +                       start_week = c(2002, 1),
    +                       output = "periodogram")
    +
+## pull spectrum and frequency into a dataframe for plotting
    +periodo <- data.frame(periodo$freq, periodo$spec)
    +
+## plot a periodogram showing the most frequently occurring periodicity 
    +ggplot(data = periodo, 
    +                aes(x = 1/(periodo.freq/52),  y = log(periodo.spec))) + 
    +  geom_line() + 
    +  labs(x = "Period (Weeks)", y = "Log(density)")
    @@ -1247,11 +1180,11 @@

    Periodicity

    -
## get a vector of weeks in ascending order 
    -peak_weeks <- periodogram(counts, 
    -                          case_int, 
    -                          start_week = c(2002, 1), 
    -                          output = "weeks")
    +
## get a vector of weeks in ascending order 
    +peak_weeks <- periodogram(counts, 
    +                          case_int, 
    +                          start_week = c(2002, 1), 
    +                          output = "weeks")

NOTE: It is possible to use the above weeks in sine and cosine terms directly, however we will use a function to generate these terms (see the regression section below). A manual sketch follows.
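
Purely as an illustration, a manual version might look like the sketch below (week_num is an assumed sequential week index, not a column created above):

## manual sine and cosine terms for an annual (52-week) period
counts_manual <- counts %>% 
  mutate(week_num = row_number(),                # assumed sequential week index
         sin52 = sin(2 * pi * week_num / 52),    # sine term for annual periodicity
         cos52 = cos(2 * pi * week_num / 52))    # cosine term for annual periodicity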

    @@ -1266,14 +1199,14 @@

    Decomposition

    The random (what is left after removing trend and season).
    -
    ## decompose the counts dataset 
    -counts %>% 
    -  # using an additive classical decomposition model
    -  model(classical_decomposition(case_int, type = "additive")) %>% 
    -  ## extract the important information from the model
    -  components() %>% 
    -  ## generate a plot 
    -  autoplot()
    +
    ## decompose the counts dataset 
    +counts %>% 
    +  # using an additive classical decomposition model
    +  model(classical_decomposition(case_int, type = "additive")) %>% 
    +  ## extract the important information from the model
    +  components() %>% 
    +  ## generate a plot 
    +  autoplot()
    @@ -1290,12 +1223,12 @@

    Autocorrelation

    Using the ACF() function, we can produce a plot which shows us a number of lines for the relation at different lags. Where the lag is 0 (x = 0), this line would always be 1 as it shows the relation between an observation and itself (not shown here). The first line shown here (x = 1) shows the relation between each observation and the observation before it (lag of 1), the second shows the relation between each observation and the observation before last (lag of 2) and so on until lag of 52 which shows the relation between each observation and the observation from 1 year (52 weeks before).

    Using the PACF() function (for partial autocorrelation) shows the same type of relation but adjusted for all other weeks between. This is less informative for determining periodicity.

    -
    ## using the counts dataset
    -counts %>% 
-  ## calculate autocorrelation using a full year's worth of lags
    -  ACF(case_int, lag_max = 52) %>% 
    -  ## show a plot
    -  autoplot()
    +
    ## using the counts dataset
    +counts %>% 
+  ## calculate autocorrelation using a full year's worth of lags
    +  ACF(case_int, lag_max = 52) %>% 
    +  ## show a plot
    +  autoplot()
    @@ -1303,12 +1236,12 @@

    Autocorrelation

    -
    ## using the counts data set 
    -counts %>% 
-  ## calculate the partial autocorrelation using a full year's worth of lags
    -  PACF(case_int, lag_max = 52) %>% 
    -  ## show a plot
    -  autoplot()
    +
    ## using the counts data set 
    +counts %>% 
+  ## calculate the partial autocorrelation using a full year's worth of lags
    +  PACF(case_int, lag_max = 52) %>% 
    +  ## show a plot
    +  autoplot()
    @@ -1319,8 +1252,8 @@

    Autocorrelation

You can formally test the null hypothesis of independence in a time series (i.e. that it is not autocorrelated) using the Ljung-Box test (in the stats package). A significant p-value suggests that there is autocorrelation in the data.

    -
-## test for independence 
    -Box.test(counts$case_int, type = "Ljung-Box")
    +
+## test for independence 
    +Box.test(counts$case_int, type = "Ljung-Box")
    
         Box-Ljung test
    @@ -1342,9 +1275,9 @@ 

    Fourier terms

    If only fitting one fourier term, this would be the equivalent of fitting a sine and a cosine for your most frequently occurring lag seen in your periodogram (in our case 52 weeks). We use the fourier() function from the forecast package.

In the code below we assign using the $, as fourier() returns two columns (one for sine, one for cosine) and so these are added to the dataset as a list, called “fourier” - but this list can then be used as a normal variable in regression.

    -
-## add in fourier terms using the epiweek and case_int variables
    -counts$fourier <- select(counts, epiweek, case_int) %>% 
    -  fourier(K = 1)
    +
+## add in fourier terms using the epiweek and case_int variables
    +counts$fourier <- select(counts, epiweek, case_int) %>% 
    +  fourier(K = 1)
    @@ -1355,37 +1288,37 @@

Negative binomial regression

TIP: If you wanted to use rates rather than counts, you could include the population variable as a logarithmic offset term, by adding offset(log(population)). You would then need to set population to 1 before using predict(), in order to produce a rate.
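
As a hypothetical sketch of that tip (population is an assumed column, not present in the data used above):

## rate model: weekly counts with a log-population offset
rate_model <- glm_nb_model(
  case_int ~ 
    epiweek + 
    fourier + 
    offset(log(population)))   # population is an assumed column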

    TIP: For fitting more complex models such as ARIMA or prophet, see the fable package.

    -
    ## define the model you want to fit (negative binomial) 
    -model <- glm_nb_model(
    -  ## set number of cases as outcome of interest
    -  case_int ~
    -    ## use epiweek to account for the trend
    -    epiweek +
    -    ## use the fourier terms to account for seasonality
    -    fourier)
    -
    -## fit your model using the counts dataset
    -fitted_model <- trending::fit(model, data.frame(counts))
    -
    -## calculate confidence intervals and prediction intervals 
    -observed <- predict(fitted_model, simulate_pi = FALSE)
    -
    -estimate_res <- data.frame(observed$result)
    -
    -## plot your regression 
    -ggplot(data = estimate_res, aes(x = epiweek)) + 
    -  ## add in a line for the model estimate
    -  geom_line(aes(y = estimate),
    -            col = "Red") + 
    -  ## add in a band for the prediction intervals 
    -  geom_ribbon(aes(ymin = lower_pi, 
    -                  ymax = upper_pi), 
    -              alpha = 0.25) + 
    -  ## add in a line for your observed case counts
    -  geom_line(aes(y = case_int), 
    -            col = "black") + 
    -  ## make a traditional plot (with black axes and white background)
    -  theme_classic()
    +
    ## define the model you want to fit (negative binomial) 
    +model <- glm_nb_model(
    +  ## set number of cases as outcome of interest
    +  case_int ~
    +    ## use epiweek to account for the trend
    +    epiweek +
    +    ## use the fourier terms to account for seasonality
    +    fourier)
    +
    +## fit your model using the counts dataset
    +fitted_model <- trending::fit(model, data.frame(counts))
    +
    +## calculate confidence intervals and prediction intervals 
    +observed <- predict(fitted_model, simulate_pi = FALSE)
    +
    +estimate_res <- data.frame(observed$result)
    +
    +## plot your regression 
    +ggplot(data = estimate_res, aes(x = epiweek)) + 
    +  ## add in a line for the model estimate
    +  geom_line(aes(y = estimate),
    +            col = "Red") + 
    +  ## add in a band for the prediction intervals 
    +  geom_ribbon(aes(ymin = lower_pi, 
    +                  ymax = upper_pi), 
    +              alpha = 0.25) + 
    +  ## add in a line for your observed case counts
    +  geom_line(aes(y = case_int), 
    +            col = "black") + 
    +  ## make a traditional plot (with black axes and white background)
    +  theme_classic()
    @@ -1401,16 +1334,16 @@

    Residuals

    To see how well our model fits the observed data we need to look at the residuals. The residuals are the difference between the observed counts and the counts estimated from the model. We could calculate this simply by using case_int - estimate, but the residuals() function extracts this directly from the regression for us.

    What we see from the below, is that we are not explaining all of the variation that we could with the model. It might be that we should fit more fourier terms, and address the amplitude. However for this example we will leave it as is. The plots show that our model does worse in the peaks and troughs (when counts are at their highest and lowest) and that it might be more likely to underestimate the observed counts.

## calculate the residuals 
estimate_res <- estimate_res %>% 
  mutate(resid = fitted_model$result[[1]]$residuals)

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
estimate_res %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")

    Residuals

## is there autocorrelation in the residuals (is there a pattern to the error?)  
estimate_res %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

    Residuals

## are the residuals normally distributed (are we under- or over-estimating?)  
estimate_res %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 

    Residuals

## compare observed counts to their residuals 
## should also be no pattern 
estimate_res %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")

    Residuals

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(estimate_res$resid, type = "Ljung-Box")
    
         Box-Ljung test

    Merging datasets

    We can join our datasets using the week variable. For more on merging see the handbook section on joining.

## left join so that we only have the rows already existing in counts
## drop the date variable from temp_data (otherwise is duplicated)
counts <- left_join(counts, 
                    select(temp_data, -date),
                    by = "epiweek")


    Descriptive analysis

First plot your data to see if there is any obvious relationship. The plot below shows a clear relationship in the seasonality of the two variables, and suggests that temperature might peak a few weeks before case counts. For more on pivoting data, see the handbook section on pivoting data.

counts %>% 
  ## keep the variables we are interested in 
  select(epiweek, case_int, t2m) %>% 
  ## change your data in to long format
  pivot_longer(
    ## use epiweek as your key
    !epiweek,
    ## move column names to the new "measure" column
    names_to = "measure", 
    ## move cell values to the new "values" column
    values_to = "value") %>% 
  ## create a plot with the dataset above
  ## plot epiweek on the x axis and values (counts/celsius) on the y 
  ggplot(aes(x = epiweek, y = value)) + 
    ## create a separate plot for temperature and case counts 
    ## let them set their own y-axes
    facet_grid(measure ~ ., scales = "free_y") +
    ## plot both as a line
    geom_line()


    Lags and cross-correlation

To formally test which weeks are most highly related between cases and temperature, we can use the cross-correlation function (CCF()) from the feasts package. You could also visualise the result (rather than using arrange()) with the autoplot() function.

counts %>% 
  ## calculate cross-correlation between interpolated counts and temperature
  CCF(case_int, t2m,
      ## set the maximum lag to be 52 weeks
      lag_max = 52, 
      ## return the correlation coefficient 
      type = "correlation") %>% 
  ## arrange in descending order of the correlation coefficient 
  ## show the most associated lags
  arrange(-ccf) %>% 
  ## only show the top ten 
  slice_head(n = 10)
    # A tsibble: 10 x 2 [1W]
             lag   ccf


    We see from this that a lag of 4 weeks is most highly correlated, so we make a lagged temperature variable to include in our regression.

DANGER: Note that the first four weeks of our data are missing (NA) in the lagged temperature variable, as there are not four prior weeks to take data from. In order to use this dataset with the trending predict() function, we need to use the simulate_pi = FALSE argument within predict() further down. If we did want to use the simulate option, we would have to drop these missing values and store the result as a new dataset, by adding drop_na(t2m_lag4) to the code chunk below.

counts <- counts %>% 
  ## create a new variable for temperature lagged by four weeks
  mutate(t2m_lag4 = lag(t2m, n = 4))
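A hedged sketch of the alternative mentioned in the DANGER note above (the dataset name counts_complete is hypothetical):

## drop the first four weeks, which have a missing lagged temperature,
## if you want to use the simulate_pi = TRUE option later
counts_complete <- counts %>% 
  drop_na(t2m_lag4)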

    CAUTION: Note the use of simulate_pi = FALSE within the predict() argument. This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## use the temperature lagged by four weeks 
    t2m_lag4
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, data.frame(counts))

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model, simulate_pi = FALSE)

To investigate the individual terms, we can pull the original negative binomial regression out of the trending format using get_fitted_model(), and pass it to the broom package's tidy() function to retrieve exponentiated estimates and associated confidence intervals.

What this shows us is that, after controlling for trend and seasonality, lagged temperature is significantly associated with case counts (estimate ~ 1). This suggests it might be a good variable to use in predicting future case numbers (as climate forecasts are readily available).

fitted_model %>% 
  ## extract original negative binomial regression
  get_fitted_model() #%>% 
  ## get a tidy dataframe of results
  #tidy(exponentiate = TRUE, 
  #     conf.int = TRUE)

[[1]]
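A sketch of the commented-out step above, using the same extraction pattern that appears later on this page:

## tidy the extracted regression to get exponentiated estimates
## and associated confidence intervals
fitted_model$result[[1]] %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)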

    A quick visual inspection of the model shows that it might do a better job of estimating the observed case counts.

estimate_res <- data.frame(observed$result)

## plot your regression 
ggplot(data = estimate_res, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "Red") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

    Residuals

    We investigate the residuals again to see how well our model fits the observed data. The results and interpretation here are similar to those of the previous regression, so it may be more feasible to stick with the simpler model without temperature.

## calculate the residuals 
estimate_res <- estimate_res %>% 
  mutate(resid = case_int - estimate)

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
estimate_res %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")

    Residuals

## is there autocorrelation in the residuals (is there a pattern to the error?)  
estimate_res %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

    Residuals

## are the residuals normally distributed (are we under- or over-estimating?)  
estimate_res %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 

    Residuals

## compare observed counts to their residuals 
## should also be no pattern 
estimate_res %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")

    Residuals

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(estimate_res$resid, type = "Ljung-Box")
    
         Box-Ljung test

    Cut-off date

Here we define a start date (when our observations began) and a cut-off date (the end of our baseline period, and the start of the period we want to predict for). We also define how many weeks lie between our baseline cut-off and the end date we are interested in predicting for.

    NOTE: In this example we pretend to currently be at the end of September 2011 (“2011 W39”).

## define start date (when observations began)
start_date <- min(counts$epiweek)

## define a cut-off week (end of baseline, start of prediction period)
cut_off <- yearweek("2010-12-31")

## define the last date interested in (i.e. end of prediction)
end_date <- yearweek("2011-12-31")

## find how many weeks in period (year) of interest
num_weeks <- as.numeric(end_date - cut_off)


    Add rows

To be able to forecast in a tidyverse format, we need to have the right number of rows in our dataset, i.e. one row for each week up to the end_date defined above. The code below allows you to add these rows by a grouping variable - for example, if we had multiple countries in one dataset, we could group by country and then add rows appropriately for each. The group_by_key() function from tsibble allows us to do this grouping and then pass the grouped data to the dplyr functions group_modify() and add_row(). We then specify the sequence of weeks from one after the maximum week currently available in the data to the end week.

## add in missing weeks till end of year 
counts <- counts %>%
  ## group by the region
  group_by_key() %>%
  ## for each group add rows from the highest epiweek to the end of year
  group_modify(~add_row(.,
                        epiweek = seq(max(.$epiweek) + 1, 
                                      end_date,
                                      by = 1)))

Fourier terms

We need to redefine our fourier terms, as we want to fit them to the baseline data only and then predict (extrapolate) those terms for the next year. To do this we need to combine two output lists from the fourier() function: the first is for the baseline data, and the second predicts for the year of interest (by defining the h argument).

N.b. to bind rows we have to use rbind() (rather than tidyverse bind_rows()), as the fourier columns are a list (so not named individually).

## define fourier terms (sin and cos) 
counts <- counts %>% 
  mutate(
    ## combine fourier terms for weeks prior to and after the 2010 cut-off date
    ## (nb. 2011 fourier terms are predicted)
    fourier = rbind(
      ## get fourier terms for previous years
      fourier(
        ## only keep the rows before 2011
        filter(counts, 
               epiweek <= cut_off), 
        ## include one set of sin cos terms 
        K = 1
        ), 
      ## predict the fourier terms for 2011 (using baseline data)
      fourier(
        ## only keep the rows before 2011
        filter(counts, 
               epiweek <= cut_off),
        ## include one set of sin cos terms 
        K = 1, 
        ## predict 52 weeks ahead
        h = num_weeks
        )
      )
    )


    See the page on Iteration, loops, and lists to learn more about purrr.

    CAUTION: Note the use of simulate_pi = FALSE within the predict() argument. This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

# split data for fitting and prediction
dat <- counts %>% 
  group_by(epiweek <= cut_off) %>%
  group_split()

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier
)

# define which data to use for fitting and which for predicting
fitting_data <- pluck(dat, 2)
pred_data <- pluck(dat, 1) %>% 
  select(case_int, epiweek, fourier)

# fit model 
fitted_model <- trending::fit(model, data.frame(fitting_data))

# get confint and estimates for fitted data
observed <- fitted_model %>% 
  predict(simulate_pi = FALSE)

# forecast with the data you want to predict with 
forecasts <- fitted_model %>% 
  predict(data.frame(pred_data), simulate_pi = FALSE)

## combine baseline and predicted datasets
observed <- bind_rows(observed$result, forecasts$result)

    As previously, we can visualise our model with ggplot. We highlight alerts with red dots for observed counts above the 95% prediction interval. This time we also add a vertical line to label when the forecast starts.

## plot your regression 
ggplot(data = observed, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "grey") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## plot in points for the observed counts above expected
  geom_point(
    data = filter(observed, case_int > upper_pi), 
    aes(y = case_int), 
    colour = "red", 
    size = 2) + 
  ## add vertical line and label to show where forecasting started
  geom_vline(
           xintercept = as.Date(cut_off), 
           linetype = "dashed") + 
  annotate(geom = "text", 
           label = "Forecast", 
           x = cut_off, 
           y = max(observed$upper_pi) - 250, 
           angle = 90, 
           vjust = 1
           ) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()
    Warning: Removed 13 rows containing missing values or values outside the scale range
     (`geom_line()`).


Below we use the purrr package's map() function to loop over each dataset. We then put the estimates in one dataset and merge it with the original case counts, in order to use the yardstick package to compute measures of accuracy. We compute four measures: root mean squared error (RMSE), mean absolute error (MAE), mean absolute scaled error (MASE), and mean absolute percent error (MAPE).

    CAUTION: Note the use of simulate_pi = FALSE within the predict() argument. This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

## Cross validation: predicting week(s) ahead based on sliding window

## expand your data by rolling over in 52 week windows (before + after) 
## to predict 52 weeks ahead
## (creates longer and longer chains of observations - keeps older data)

## define window want to roll over
roll_window <- 52

## define weeks ahead want to predict 
weeks_ahead <- 52

## create a data set of repeating, increasingly long data
## label each data set with a unique id
## only use cases before year of interest (i.e. 2011)
case_roll <- counts %>% 
  filter(epiweek < cut_off) %>% 
  ## only keep the week and case counts variables
  select(epiweek, case_int) %>% 
    ## drop the last x observations 
    ## depending on how many weeks ahead forecasting 
    ## (otherwise will be an actual forecast to "unknown")
    slice(1:(n() - weeks_ahead)) %>%
    as_tsibble(index = epiweek) %>% 
    ## roll over each week in x after windows to create grouping ID 
    ## depending on what rolling window specify
    stretch_tsibble(.init = roll_window, .step = 1) %>% 
  ## drop the first couple - as have no "before" cases
  filter(.id > roll_window)


## for each of the unique data sets run the code below
forecasts <- purrr::map(unique(case_roll$.id), 
                        function(i) {
  
  ## only keep the current fold being fit 
  mini_data <- filter(case_roll, .id == i) %>% 
    as_tibble()
  
  ## create an empty data set for forecasting on 
  forecast_data <- tibble(
    epiweek = seq(max(mini_data$epiweek) + 1,
                  max(mini_data$epiweek) + weeks_ahead,
                  by = 1),
    case_int = rep.int(NA, weeks_ahead),
    .id = rep.int(i, weeks_ahead)
  )
  
  ## add the forecast data to the original 
  mini_data <- bind_rows(mini_data, forecast_data)
  
  ## define the cut off based on latest non missing count data 
  cv_cut_off <- mini_data %>% 
    ## only keep non-missing rows
    drop_na(case_int) %>% 
    ## get the latest week
    summarise(max(epiweek)) %>% 
    ## extract so is not in a dataframe
    pull()
  
  ## make mini_data back in to a tsibble
  mini_data <- tsibble(mini_data, index = epiweek)
  
  ## define fourier terms (sin and cos) 
  mini_data <- mini_data %>% 
    mutate(
    ## combine fourier terms for weeks prior to and after cut-off date
    fourier = rbind(
      ## get fourier terms for previous years
      forecast::fourier(
        ## only keep the rows before cut-off
        filter(mini_data, 
               epiweek <= cv_cut_off), 
        ## include one set of sin cos terms 
        K = 1
        ), 
      ## predict the fourier terms for following year (using baseline data)
      fourier(
        ## only keep the rows before cut-off
        filter(mini_data, 
               epiweek <= cv_cut_off),
        ## include one set of sin cos terms 
        K = 1, 
        ## predict 52 weeks ahead
        h = weeks_ahead
        )
      )
    )
  
  
  # split data for fitting and prediction
  dat <- mini_data %>% 
    group_by(epiweek <= cv_cut_off) %>%
    group_split()

  ## define the model you want to fit (negative binomial) 
  model <- glm_nb_model(
    ## set number of cases as outcome of interest
    case_int ~
      ## use epiweek to account for the trend
      epiweek +
      ## use the fourier terms to account for seasonality
      fourier
  )

  # define which data to use for fitting and which for predicting
  fitting_data <- pluck(dat, 2)
  pred_data <- pluck(dat, 1)
  
  # fit model 
  fitted_model <- trending::fit(model, fitting_data)
  
  # forecast with the data you want to predict with 
  forecasts <- fitted_model %>% 
    predict(data.frame(pred_data), simulate_pi = FALSE)
  forecasts <- data.frame(forecasts$result[[1]]) %>% 
       ## only keep the week and the forecast estimate
    select(epiweek, estimate)
    
  }
  )

## make the list in to a data frame with all the forecasts
forecasts <- bind_rows(forecasts)

## join the forecasts with the observed
forecasts <- left_join(forecasts, 
                       select(counts, epiweek, case_int),
                       by = "epiweek")

## using {yardstick} compute metrics
  ## RMSE: Root mean squared error
  ## MAE:  Mean absolute error  
  ## MASE: Mean absolute scaled error
  ## MAPE: Mean absolute percent error
model_metrics <- bind_rows(
  ## in your forecasted dataset compare the observed to the predicted
  rmse(forecasts, case_int, estimate), 
  mae( forecasts, case_int, estimate),
  mase(forecasts, case_int, estimate),
  mape(forecasts, case_int, estimate),
  ) %>% 
  ## only keep the metric type and its output
  select(Metric  = .metric, 
         Measure = .estimate) %>% 
  ## make in to wide format so can bind rows after
  pivot_wider(names_from = Metric, values_from = Measure)

## return model metrics 
model_metrics
    # A tibble: 1 × 4
        rmse   mae  mase  mape


The second option uses the glrnb method. This also fits a negative binomial glm, but includes trend and fourier terms (so is favoured here). The regression is used to calculate the "control mean" (~fitted values); a computed generalised likelihood ratio statistic is then used to assess whether there is a shift in the mean for each week. Note that the threshold for each week takes into account previous weeks, so a sustained shift will trigger an alarm. (Also note that after each alarm the algorithm is reset.)

    In order to work with the surveillance package, we first need to define a “surveillance time series” object (using the sts() function) to fit within the framework.

## define surveillance time series object
## nb. you can include a denominator with the population object (see ?sts)
counts_sts <- sts(observed = counts$case_int[!is.na(counts$case_int)],
                  start = c(
                    ## subset to only keep the year from start_date 
                    as.numeric(str_sub(start_date, 1, 4)), 
                    ## subset to only keep the week from start_date
                    as.numeric(str_sub(start_date, 7, 8))), 
                  ## define the type of data (in this case weekly)
                  freq = 52)

## define the week range that you want to include (ie. prediction period)
## nb. the sts object only counts observations without assigning a week or 
## year identifier to them - so we use our data to define the appropriate observations
weekrange <- cut_off - start_date

    Farrington method

We then define each of our parameters for the Farrington method in a list, and run the algorithm using farringtonFlexible(). We can extract the threshold for an alert from farringtonmethod@upperbound to include it in our dataset, and we can also extract a TRUE/FALSE for whether each week triggered an alert (was above the threshold) from farringtonmethod@alarm.

## define control
ctrl <- list(
  ## define what time period that want threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  b = 9, ## how many years backwards for baseline
  w = 2, ## rolling window size in weeks
  weightsThreshold = 2.58, ## reweighting past outbreaks (improved noufaily method - original suggests 1)
  ## pastWeeksNotIncluded = 3, ## use all weeks available (noufaily suggests drop 26)
  trend = TRUE,
  pThresholdTrend = 1, ## 0.05 normally, however 1 is advised in the improved method (i.e. always keep)
  thresholdMethod = "nbPlugin",
  populationOffset = TRUE
  )

## apply farrington flexible method
farringtonmethod <- farringtonFlexible(counts_sts, ctrl)

## create a new variable in the original dataset called threshold
## containing the upper bound from farrington 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off & 
               !is.na(counts$case_int)),
              "threshold"] <- farringtonmethod@upperbound
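A small hedged sketch of the alarm extraction mentioned above (the new column name alarm is hypothetical):

## also store the TRUE/FALSE alarm indicator for the same weeks
counts[which(counts$epiweek >= cut_off & 
               !is.na(counts$case_int)),
              "alarm"] <- farringtonmethod@alarm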

    We can then visualise the results in ggplot as done previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

    GLRNB method

    CAUTION: This method uses “brute force” (similar to bootstrapping) for calculating thresholds, so can take a long time!

    See the GLRNB vignette for details.

## define control options
ctrl <- list(
  ## define what time period that want threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  mu0 = list(S = 1,    ## number of fourier terms (harmonics) to include
  trend = TRUE,   ## whether to include trend or not
  refit = FALSE), ## whether to refit model after each alarm
  ## cARL = threshold for GLR statistic (arbitrary)
     ## 3 ~ middle ground for minimising false positives
     ## 1 fits to the 99%PI of glm.nb - with changes after peaks (threshold lowered for alert)
   c.ARL = 2,
   # theta = log(1.5), ## equates to a 50% increase in cases in an outbreak
   ret = "cases"     ## return threshold upperbound as case counts
  )

## apply the glrnb method
glrnbmethod <- glrnb(counts_sts, control = ctrl, verbose = FALSE)

## create a new variable in the original dataset called threshold
## containing the upper bound from glrnb 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off & 
               !is.na(counts$case_int)),
              "threshold_glrnb"] <- glrnbmethod@upperbound

    Visualise the outputs as previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold_glrnb, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

\(\beta_2 \times \delta(t-t_0) + \beta_3 \times (t-t_0)^+\) is the generalised linear part of the post-period and is zero in the pre-period. This means that the \(\beta_2\) and \(\beta_3\) estimates are the effects of the intervention.
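Putting the pieces together, a sketch of the full model consistent with the regression fitted below (seasonality entering through the fourier terms, and the log link coming from the negative binomial regression) is:

\[\log E[Y_t] = \beta_0 + \beta_1 t + \beta_2\,\delta(t-t_0) + \beta_3\,(t-t_0)^+ + \text{fourier}(t)\]

where \(Y_t\) is the weekly case count and \(t_0\) is the intervention week.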

    We need to re-calculate the fourier terms without forecasting here, as we will use all the data available to us (i.e. retrospectively). Additionally we need to calculate the extra terms needed for the regression.

## add in fourier terms using the epiweek and case_int variables
counts$fourier <- select(counts, epiweek, case_int) %>% 
  as_tsibble(index = epiweek) %>% 
  fourier(K = 1)

## define intervention week 
intervention_week <- yearweek("2008-12-31")

## define variables for regression 
counts <- counts %>% 
  mutate(
    ## corresponds to t in the formula
      ## count of weeks (could probably also just use straight epiweeks var)
    # linear = row_number(epiweek), 
    ## corresponds to delta(t-t0) in the formula
      ## pre or post intervention period
    intervention = as.numeric(epiweek >= intervention_week), 
    ## corresponds to (t-t0)^+ in the formula
      ## count of weeks post intervention
      ## (choose the larger number between 0 and whatever comes from calculation)
    time_post = pmax(0, epiweek - intervention_week + 1))

    We then use these terms to fit a negative binomial regression, and produce a table with percentage change. What this example shows is that there was no significant change.

    CAUTION: Note the use of simulate_pi = FALSE within the predict() argument. This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## add in whether in the pre- or post-period 
    intervention + 
    ## add in the time post intervention 
    time_post
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model, simulate_pi = FALSE)
## extract original negative binomial regression
fitted_model$result[[1]] %>%
  ## get a tidy dataframe of results
  tidy(exponentiate = TRUE, 
       conf.int = TRUE) %>% 
  ## only keep the intervention value 
  filter(term == "intervention") %>% 
  ## change the IRR to percentage change for estimate and CIs 
  mutate(
    ## for each of the columns of interest - create a new column
    across(
      all_of(c("estimate", "conf.low", "conf.high")), 
      ## apply the formula to calculate percentage change
            .f = function(i) 100 * (i - 1), 
      ## add a suffix to new column names with "_perc"
      .names = "{.col}_perc")
    ) %>% 
  ## only keep (and rename) certain columns 
  select("IRR" = estimate, 
         "95%CI low" = conf.low, 
         "95%CI high" = conf.high,
         "Percentage change" = estimate_perc, 
         "95%CI low (perc)" = conf.low_perc, 
         "95%CI high (perc)" = conf.high_perc,
         "p-value" = p.value)

As previously, we can visualise the outputs of the regression.

estimate_res <- data.frame(observed$result)

ggplot(estimate_res, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate, col = "Estimate")) + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add vertical line and label to show where the intervention occurred
  geom_vline(
           xintercept = as.Date(intervention_week), 
           linetype = "dashed") + 
  annotate(geom = "text", 
           label = "Intervention", 
           x = intervention_week, 
           y = max(observed$upper_pi), 
           angle = 90, 
           vjust = 1
           ) + 
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Estimate" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()
    Warning: Unknown or uninitialised column: `upper_pi`.
23.9 Resources

• Forecasting: Principles and Practice textbook
• EPIET timeseries analysis case studies
• Penn State course
• Surveillance package manuscript

    10  Characters and strings

    This page demonstrates use of the stringr package to evaluate and handle character values (“strings”).

1. Combine, order, split, arrange - str_c(), str_glue(), str_order(), str_split().
2. Clean and standardise.
   • Adjust length - str_pad(), str_trunc(), str_wrap().
   • Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence().
3. Evaluate and extract by position - str_length(), str_sub(), word().
4. Patterns.
   • Detect and locate - str_detect(), str_subset(), str_match(), str_extract().
   • Modify and replace - str_sub(), str_replace_all().
5. Regular expressions ("regex").

For ease of display, most examples are shown acting on a short, defined character vector; however, they can easily be adapted to a column within a data frame.

This stringr vignette provided much of the inspiration for this page.

    10.1 Preparation

Load packages

Install or load the stringr and other tidyverse packages.

# install/load packages
pacman::p_load(
  stringr,    # many functions for handling strings
  tidyverse,  # for optional data manipulation
  tools       # alternative for converting to title case
  )

Import data

In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the "clean" linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

    10.2 Unite, split, and arrange

This section covers:

• Using str_c(), str_glue(), and unite() to combine strings.
• Using str_order() to arrange strings.
• Using str_split() and separate() to split strings.

Combine strings

To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr. If you have distinct character values to combine, simply provide them as unique arguments, separated by commas.

str_c("String1", "String2", "String3")

[1] "String1String2String3"

The argument sep = inserts a character value between each of the arguments you provided (e.g. inserting a comma, space, or newline "\n").

str_c("String1", "String2", "String3", sep = ", ")

[1] "String1, String2, String3"

The argument collapse = is relevant if you are inputting multiple vectors as arguments to str_c(). It separates the elements of what would otherwise be an output vector of several elements, so that the output contains only one long character element.

The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts. In this example:

• The sep = value appears between each first and last name.
• The collapse = value appears between each person.

first_names <- c("abdul", "fahruk", "janice") 
last_names  <- c("hussein", "akinleye", "okeke")

# sep displays between the respective input strings, while collapse displays between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")

[1] "abdul hussein;  fahruk akinleye;  janice okeke"

Note: Depending on your desired display context, when printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:

# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))

abdul hussein;
fahruk akinleye;
janice okeke

Dynamic strings

Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.

• All content goes between double quotation marks str_glue("").
• Any dynamic code or references to pre-defined values are placed within curly brackets {} within the double quotation marks. There can be many curly brackets in the same str_glue() command.
• To display literal quote marks, use single quotes within the surrounding double quotes (e.g. when providing a date format - see example below).
• Tip: You can use \n to force a new line.
• Tip: You can use format() to adjust date display, and Sys.Date() to display the current date.

A simple example, of a dynamic plot caption:

str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")

Data include 5888 cases and are current to 08 Sep 2024.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.

str_glue("Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )

Linelist as of 08 Sep 2024.
Last case hospitalized on 30 Apr 2015.
256 cases are missing date of onset and not shown

Pulling from a data frame

Sometimes, it is useful to pull data from a data frame and have it pasted together in sequence. Below is an example data frame. We will use it to make a summary statement about the jurisdictions and the new and total case counts.

# make case data frame
case_table <- data.frame(
  zone        = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
  )

Use str_glue_data(), which is specially made for taking data from data frame rows:

case_table %>% 
  str_glue_data("{zone}: {new_cases} ({total_cases} total cases)")

Zone 1: 3 (40 total cases)
Zone 2: 0 (4 total cases)
Zone 3: 7 (25 total cases)
Zone 4: 0 (10 total cases)
Zone 5: 15 (103 total cases)

    Combine strings across rows

If you are trying to "roll-up" values in a data frame column, e.g. combine values from multiple rows into just one row by pasting them together with a separator, see the section of the De-duplication page on "rolling-up" values. A minimal sketch is shown below.
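As a brief, hypothetical sketch of the idea, using the case_table data frame defined above (the zones column name is our own choice):

# collapse all zone names into one summary row
case_table %>% 
  dplyr::summarise(zones = str_c(zone, collapse = ", "))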


Data frame to one line

You can make the statement appear in one line using str_c() (specifying the data frame and column names), and providing sep = and collapse = arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")

[1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

You could add the prefix text "New Cases:" to the beginning of the statement by wrapping with a separate str_c() (if "New Cases:" was within the original str_c() it would appear multiple times).

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))

[1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

Unite columns

Within a data frame, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().

Provide the name of the new united column. Then provide the names of the columns you wish to unite.

• By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
• remove = removes the input columns from the data frame (TRUE by default).
• na.rm = removes missing values while uniting (FALSE by default).

Below, we define a mini-data frame to demonstrate with:

df <- data.frame(
  case_ID = c(1:6),
  symptoms  = c("jaundice, fever, chills",     # patient 1
                "chills, aches, pains",        # patient 2 
                "fever",                       # patient 3
                "vomiting, diarrhoea",         # patient 4
                "bleeding from gums, fever",   # patient 5
                "rapid pulse, headache"),      # patient 6
  outcome = c("Recover", "Death", "Death", "Recover", "Recover", "Recover"))

df_split <- separate(df, symptoms, into = c("sym_1", "sym_2", "sym_3"), extra = "merge")

Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].

Below, we unite the three symptom columns:

df_split %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )

  case_ID                all_symptoms outcome
1       1     jaundice, fever, chills Recover
2       2        chills, aches, pains   Death
3       3                       fever   Death
4       4         vomiting, diarrhoea Recover
5       5 bleeding, from, gums, fever Recover
6       6      rapid, pulse, headache Recover

Split

To split a string based on a pattern, use str_split(). It evaluates the string(s) and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string and splits it into three. By default it returns an object of class list with one element (a character vector) for each string initially provided. If simplify = TRUE it returns a character matrix.

In this example, one string is provided, and the function returns a list with one element - a character vector with three values.

str_split(string = "jaundice, fever, chills",
          pattern = ",")

[[1]]
[1] "jaundice" " fever"   " chills" 

If the output is saved, you can then access the nth split value with bracket syntax. To access a specific value you can use syntax like this: the_returned_object[[1]][2], which would access the second value from the first evaluated string (" fever"). See the R basics page for more detail on accessing elements.

pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list

[1] " fever"

If multiple strings are provided to str_split(), there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms

[[1]]
[1] "jaundice" " fever"   " chills" 

[[2]]
[1] "chills" " aches" " pains"

[[3]]
[1] "fever"

[[4]]
[1] "vomiting"   " diarrhoea"

[[5]]
[1] "bleeding from gums" " fever"            

[[6]]
[1] "rapid pulse" " headache"  

To return a "character matrix" instead, which may be useful if creating data frame columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = TRUE)

     [,1]                 [,2]         [,3]     
[1,] "jaundice"           " fever"     " chills"
[2,] "chills"             " aches"     " pains" 
[3,] "fever"              ""           ""       
[4,] "vomiting"           " diarrhoea" ""       
[5,] "bleeding from gums" " fever"     ""       
[6,] "rapid pulse"        " headache"  ""       

You can also adjust the number of splits to create with the n = argument. For example, this restricts the number of splits to 2. Any further commas remain within the second values.

str_split(symptoms, ",", simplify = TRUE, n = 2)

     [,1]                 [,2]            
[1,] "jaundice"           " fever, chills"
[2,] "chills"             " aches, pains" 
[3,] "fever"              ""              
[4,] "vomiting"           " diarrhoea"    
[5,] "bleeding from gums" " fever"        
[6,] "rapid pulse"        " headache"     

Note - the same output can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n). The result is the same matrix as shown just above.

str_split_fixed(symptoms, ",", n = 2)

Split columns

If you are trying to split a data frame column, it is best to use the separate() function from tidyr. It is used to split one character column into other columns.

Let's say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.

Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.

• sep = the separator, can be a character, or a number (interpreted as the character position to split at).
• remove = whether the input column is removed from the data frame (TRUE by default).
• convert = if TRUE, converts the classes of the new columns, so that string "NA"s become true NA (FALSE by default).
• extra = this controls what happens if there are more values created by the separation than new columns named.
  • extra = "warn" means you will see a warning and excess values are dropped (the default).
  • extra = "drop" means the excess values will be dropped with no warning.
  • extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data.

An example with extra = "merge" is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].

  case_ID              sym_1          sym_2 outcome
1       1           jaundice  fever, chills Recover
2       2             chills   aches, pains   Death
3       3              fever           <NA>   Death
4       4           vomiting      diarrhoea Recover
5       5 bleeding from gums          fever Recover
6       6        rapid pulse       headache Recover

When the default extra = "warn" is used below, warnings are given and the third symptoms are lost:

# third symptoms are lost
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",")

Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].

  case_ID              sym_1      sym_2 outcome
1       1           jaundice      fever Recover
2       2             chills      aches   Death
3       3              fever       <NA>   Death
4       4           vomiting  diarrhoea Recover
5       5 bleeding from gums      fever Recover
6       6        rapid pulse   headache Recover

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.

Arrange alphabetically

Several strings can be sorted in alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)

[1] 1 3 2

# return the strings in alphabetical order
str_sort(health_zones)

[1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
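As a brief sketch (output not shown), sorting with a Czech locale treats "ch" as its own letter, which - if we recall the collation rules correctly - sorts after "h":

# sort using Czech ("cs") collation rules
str_sort(c("hrad", "chata", "cena"), locale = "cs")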


base R functions

It is common to see the base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax is arguably more complicated - in the parentheses each part is separated by a comma. The parts are either character text (in quotes) or pre-defined code objects (no quotes). For example:

n_beds <- 10
n_masks <- 20

paste0("Regional hospital needs ", n_beds, " beds and ", n_masks, " masks.")

[1] "Regional hospital needs 10 beds and 20 masks."

sep = and collapse = arguments can be specified. paste() is simply paste0() with a default separator of one space (sep = " ").
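For instance, paste() can also collapse a vector into one string, similar to str_c() with collapse = (a small sketch using the case_table defined earlier):

# collapse all zone names into a single string, separated by semicolons
paste(case_table$zone, collapse = "; ")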


    10.3 Clean and standardise

Change case

Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title() from stringr, as shown below:

str_to_upper("California")

[1] "CALIFORNIA"

str_to_lower("California")

[1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case

Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")

[1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like "to", "the", and "of" are not capitalized).

tools::toTitleCase("This is the US state of california")

[1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")

[1] "The patient must be transported"

Pad length

Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")

[1] "R10.13 " "R10.819" "R17    "

# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")

[1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to a minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 

[1] "04"

# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")

Truncate

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (...) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string ("left", "right", or "center").

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")

[1] "Symp...ing"

Standardize length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then one very short value is padded to achieve a length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2

[1] "R10.13" "R10..." "R17"   

# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3

[1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string. Add "right", "left", or "both" to the command to specify which side(s) to trim (e.g. str_trim(x, "right")).

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces (by default, on both sides)
str_trim(IDs)

[1] "provA_1852" "provA_2345" "provA_9460"

Remove repeated whitespace within

Use str_squish() to remove repeated spaces that appear inside a string - for example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string, like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 

[1] "Pt requires IV saline"

Enter ?str_trim or ?str_pad in your R console to see further details.

Wrap into paragraphs

Use str_wrap() to wrap long unstructured text into a structured paragraph with a fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)

[1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))

Symptom onset 1/4/2020 vomiting chills
fever. Pt saw traditional healer in
home village on 2/4/2020. On 5/4/2020
pt symptoms worsened and was admitted
to Lumta clinic. Sample was taken and pt
was transported to regional hospital on
6/4/2020. Pt died at regional hospital
on 7/4/2020.

    10.4 Handle by position

Extract by character position

Use str_sub() to return only a part of a string. The function takes three main arguments:

1. the character vector(s)
2. start position
3. end position

A few notes on position numbers:

• If a position number is positive, the position is counted starting from the left end of the string.
• If a position number is negative, it is counted starting from the right end of the string.
• Position numbers are inclusive.
• Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string "pneumonia":

# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)

[1] "e"

# 0 is not present
str_sub("pneumonia", 0, 0)

[1] ""

# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)

[1] "onia"

# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)

[1] "moni"

# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)

[1] "umonia"

Extract by word position

To extract the nth 'word', use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between 'words' is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")

[1] "I just got"       "My stomach hurts" "Severe ear pain" 

Replace by character position

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

# print
word

[1] "pnXXmonia"

An example applied to multiple strings (e.g. a column). Note the expansion in length of "HIV".

words <- c("pneumonia", "tubercolosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words

[1] "pnXXmonia"    "tuXXrcolosis" "HIXX"        

Evaluate length

Use str_length() to return the number of characters in a string:

str_length("abc")

[1] 3

Alternatively, use nchar() from base R.
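For example, the base equivalent gives the same result:

nchar("abc")   # also returns 3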


    10.5 Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect a pattern

Use str_detect() as below to detect presence/absence of a pattern within a string. First provide the string or vector to search in (string =), and then the pattern to look for (pattern =). Note that by default the search is case sensitive!

str_detect(string = "primary school teacher", pattern = "teach")

[1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect(string = "primary school teacher", pattern = "teach", negate = TRUE)

[1] FALSE

To ignore case/capitalization, wrap the pattern within regex(), and within regex() add the argument ignore_case = TRUE (or T as shorthand).

str_detect(string = "Teacher", pattern = regex("teach", ignore_case = T))

[1] TRUE

When str_detect() is applied to a character vector or a data frame column, it will return TRUE or FALSE for each of the values.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")

 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count the TRUEs, simply sum() the output. This counts the number of TRUE values.

sum(str_detect(occupations, "teach"))

[1] 1

To search for multiple terms at once, include them separated by OR bars (|) within the pattern = argument, as shown below:

sum(str_detect(string = occupations, pattern = "teach|professor|tutor"))

[1] 3

If you need to build a long list of search terms, you can combine them using str_c() with sep = "|", define this as a character object, and then reference the object later more succinctly. The example below includes possible occupation search terms for front-line medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                               "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                               "cna", "pa", "physician assistant", "mental health",
                               "emergency department technician", "resp therapist", "respiratory",
                                "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                               "rehab", "activity", "elderly", "subacute", "sub acute",
                                "clinic", "post acute", "therapist", "extended care",
                                "dental", "dential", "dentist", sep = "|")

occupation_med_frontline

[1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(string = occupations, pattern = occupation_med_frontline))

[1] 2

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
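A brief sketch of the base equivalents, using the occupations vector defined above (output not shown):

# grepl() returns a TRUE/FALSE vector, like str_detect()
grepl("teach", occupations, ignore.case = TRUE)

# gsub() replaces every match, like str_replace_all()
gsub("tutor", "instructor", occupations)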


Convert commas to periods

Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain.

The inner gsub(), which acts first on lengths, converts any periods to nothing (""). The period character "." has to be "escaped" with two backslashes to actually signify a period, because "." in regex means "any character". Then, the result (with only commas) is passed to the outer gsub(), in which commas are replaced by periods.

lengths <- c("2.454,56", "1,2", "6.096,5")

as.numeric(gsub(pattern = ",",                # find commas     
                replacement = ".",            # replace with periods
                x = gsub("\\.", "", lengths)  # vector with other periods removed (periods escaped)
                )
           )                                  # convert outcome to numeric

Replace all

Use str_replace_all() as a "find and replace" tool. First, provide the strings to be evaluated to string =, then the pattern to be replaced to pattern =, and then the replacement value to replacement =. The example below replaces all instances of "dead" with "deceased". Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(string = outcome, pattern = "dead", replacement = "deceased")

[1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

Notes:

• To replace a pattern with NA, use str_replace_na().
• The function str_replace() replaces only the first instance of the pattern within each evaluated string.
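A small sketch of str_replace_na(), which converts NA values into replacement text (by default the text "NA"):

# NA becomes "unknown"; other values are untouched
str_replace_na(c("Karl: dead", NA, "Marco: not dead"), "unknown")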

Detect within logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let's say occupations is a column in the linelist. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic (using negate = TRUE):

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    
    # AND must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = TRUE)                          ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                               ~ "Not an educator"))

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")

     start end
[1,]     5   6

Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern

     start end
[1,]     6   6
[2,]     3   3
[3,]     1   1
[4,]     4   4

str_locate_all(phrases, "h" ) # position of *every* instance of the pattern

[[1]]
     start end
[1,]     6   6

[[2]]
     start end
[1,]     3   3

[[3]]
     start end
[1,]     1   1
[2,]     4   4

[[4]]
     start end
[1,]     4   4

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via "OR" conditions. For example, looking in the string vector of occupations (see the previous section) for either "teach", "prof", or "tutor".

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")

[[1]]
character(0)

[[2]]
[1] "prof"

[[3]]
[1] "teach" "tutor"

[[4]]
[1] "tutor"

[[5]]
character(0)

[[6]]
character(0)

[[7]]
character(0)

[[8]]
character(0)

[[9]]
character(0)

[[10]]
character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3's matches is not shown.

str_extract(occupations, "teach|prof|tutor")

 [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA     
[10] NA     

Subset and count

Aligned functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")

[1] "university professor"           "primary school teacher & tutor"
[3] "tutor"                         

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))

 [1] 0 1 2 1 0 0 0 0 0 0

    10.6 Special characters

Backslash \ as escape

The backslash \ is used to "escape" the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not "break" the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one. A short demonstration follows.
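A minimal demonstration with cat(), which prints strings as they would display:

cat("He said \"the ward is full\"")   # prints: He said "the ward is full"
cat("C:\\Users\\data")                # prints: C:\Users\data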

Special characters

Special character | Represents
"\\" | backslash
"\n" | a new line (newline)
"\"" | double-quote within double quotes
'\'' | single-quote within single quotes
"\`" | grave accent
"\r" | carriage return
"\t" | tab
"\v" | vertical tab
"\b" | backspace

    Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).


    10.7 Regular expressions (regex) and special characters

Regular expressions ("regex") are a concise language for describing patterns in strings. If you are not familiar with it, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.

Much of this section is adapted from this tutorial and this cheatsheet. We adapt selectively here, knowing that this handbook might be viewed by people without internet access to view the other tutorials.

A regular expression is often applied to extract specific patterns from "unstructured" text - for example medical notes, chief complaints, patient history, or other free text columns in a data frame.

There are four basic tools one can use to create a basic regular expression:

1. Character sets.
2. Meta characters.
3. Quantifiers.
4. Groups.

Character sets

Character sets are a way of listing options for a character match, within square brackets. A match is triggered if any of the characters within the brackets are found in the string. For example, to look for vowels one could use this character set: "[aeiou]". Some other common character sets are:

Character set | Matches
"[A-Z]"   | any single capital letter
"[a-z]"   | any single lowercase letter
"[0-9]"   | any digit
[:alnum:] | any alphanumeric character
[:digit:] | any numeric digit
[:alpha:] | any letter (upper or lowercase)
[:upper:] | any uppercase letter
[:lower:] | any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
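A quick sketch of a combined character set (output not shown):

# extract each character that is a lowercase t-z or a digit 0-5
str_extract_all("xyz789 tuv123", "[t-z0-5]")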


Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character | Represents
"\\s" | a single space
"\\w" | any single alphanumeric character (A-Z, a-z, or 0-9)
"\\d" | any single numeric digit (0-9)

Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to designate the length of letters/numbers to allow for the match.

Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example:

• "A{2}" will return instances of two capital A letters.
• "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
• "A{2,}" will return instances of two or more capital A letters.
• "A+" will return instances of one or more capital A letters (the group extends until a different character is encountered).
• An asterisk * after a character will return zero or more matches of it (useful if you are not sure the pattern is present; see the sketch after the examples below).

Using the + plus symbol as a quantifier, the match will occur until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+"

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A's are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")

[[1]]
[1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A's that are two to four in length are returned.

str_extract_all(test, "A{2,4}")

[[1]]
[1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")

[[1]]
[1] "A"    "AA"   "AAA"  "AAAA"
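A brief sketch of the asterisk: "u*" matches zero or more u's, so the pattern below matches both spellings (output not shown):

# matches "color" (zero u's) and "colour" (one u)
str_extract_all("color colour", "colou*r")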

Relative position

These "lookaround" statements express requirements for what precedes or follows a pattern, without including it in the match. For example, the pattern "(?<=\.)\s(?=[A-Z])" matches a space that is preceded by a period and followed by a capital letter - useful when splitting text into sentences.

Position statement | Matches to
"(?<=b)a" | "a" that is preceded by a "b"
"(?<!b)a" | "a" that is NOT preceded by a "b"
"a(?=b)"  | "a" that is followed by a "b"
"a(?!b)"  | "a" that is NOT followed by a "b"

Groups

Capturing groups in your regular expression - created with parentheses - are a way to have a more organized output upon extraction. A brief sketch is below.
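A short hypothetical example using str_match(): each pair of parentheses defines a capturing group, and the output matrix contains the full match followed by each captured group.

# capture day, month, and year from a date embedded in text
str_match("admitted on 6/12/2005", "([0-9]+)/([0-9]+)/([0-9]+)")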


Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches all words (any alpha characters until hitting a non-alpha character such as a space):

str_extract_all(pt_note, "[A-Za-z]+")

[[1]]
 [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"   
 [6] "emergency"   "ward"        "at"          "on"          "Patient"    
[11] "presented"   "with"        "radiating"   "abdominal"   "pain"       
[16] "from"        "LR"          "quadrant"    "Patient"     "skin"       
[21] "was"         "pale"        "cool"        "and"         "clammy"     
[26] "Patient"     "temperature" "was"         "degrees"     "farinheit"  
[31] "Patient"     "pulse"       "rate"        "was"         "bpm"        
[36] "and"         "thready"     "Respiratory" "rate"        "was"        
[41] "per"         "minute"     

The expression "[0-9]{1,2}" matches consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")

[[1]]
 [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
You can view a useful list of regex expressions and tips on page 2 of this cheatsheet.

Also see this tutorial.

    10.8 Resources

A reference sheet for stringr functions can be found here.

A vignette on stringr can be found here.
    8  Cleaning data and core functions

This page demonstrates common steps used in the process of "cleaning" a dataset, and also explains the use of many essential R data management functions.

To demonstrate data cleaning, this page begins by importing a raw case linelist dataset, and proceeds step-by-step through the cleaning process. In the R code, this manifests as a "pipe" chain, in which the "pipe" operator %>% passes a dataset from one operation to the next.

Core functions

This handbook emphasizes use of the functions from the tidyverse family of R packages. The essential R functions demonstrated in this page are listed below.

Many of these functions belong to the dplyr R package, which provides "verb" functions to solve data manipulation challenges (the name is a reference to a "data frame plier"). dplyr is part of the tidyverse family of R packages (which also includes ggplot2, tidyr, stringr, tibble, purrr, magrittr, and forcats among others).

Function | Utility | Package
%>% | "pipe" (pass) data from one function to the next | magrittr
mutate() | create, transform, and re-define columns | dplyr
select() | keep, remove, select, or re-name columns | dplyr
rename() | rename columns | dplyr
clean_names() | standardize the syntax of column names | janitor
as.character(), as.numeric(), as.Date(), etc. | convert the class of a column | base R
across() | transform multiple columns at one time | dplyr
tidyselect functions | use logic to select columns | tidyselect
filter() | keep certain rows | dplyr
distinct() | de-duplicate rows | dplyr
rowwise() | operations by/within each row | dplyr
add_row() | add rows manually | tibble
arrange() | sort rows | dplyr
recode() | re-code values in a column | dplyr
case_when() | re-code values in a column using more complex logical criteria | dplyr
replace_na(), na_if(), coalesce() | special functions for re-coding | tidyr
age_categories() and cut() | create categorical groups from a numeric column | epikit and base R
match_df() | re-code/clean values using a data dictionary | matchmaker
which() | apply logical criteria; return indices | base R

If you want to see how these functions compare to Stata or SAS commands, see the page on Transition to R.

You may encounter an alternative data management framework from the data.table R package, with operators like := and frequent use of brackets [ ]. This approach and syntax is briefly explained in the Data Table page.

    Nomenclature

    -

    In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.

    -

    Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). So these aspects can be more difficult to tangibly define.

    -

    In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.

    -

    Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.


    8.1 Cleaning pipeline


    This page proceeds through typical cleaning steps, adding them sequentially to a cleaning pipe chain.


    In epidemiological analysis and data processing, cleaning steps are often performed sequentially, linked together. In R, this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another.


    Such chains utilize dplyr “verb” functions and the magrittr pipe operator %>%. This pipe begins with the “raw” data (“linelist_raw.xlsx”) and ends with a “clean” R data frame (linelist) that can be used, saved, exported, etc.


    In a cleaning pipeline the order of the steps is important. Cleaning steps might include:

• Importing of data.
• Column names cleaned or changed.
• De-duplication.
• Column creation and transformation (e.g. re-coding or standardising values).
• Rows filtered or added.

    8.2 Load packages


    This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,        # importing data  
  here,       # relative file pathways  
  janitor,    # data cleaning and tables
  lubridate,  # working with dates
  matchmaker, # dictionary-based cleaning
  epikit,     # age_categories() function
  tidyverse   # data management and visualization
)

    8.3 Import data


    Import


Here we import the “raw” case linelist Excel file using the import() function from the package rio. The rio package flexibly handles many types of files (e.g. .xlsx, .csv, .tsv, .rds). See the page on Import and export for more information and tips on unusual situations (e.g. skipping rows, setting missing values, importing Google sheets, etc.).


    If you want to follow along, click to download the “raw” linelist (as .xlsx file).


If your dataset is large and takes a long time to import, it can be useful to keep the import command separate from the cleaning pipe chain and to save the “raw” data as a distinct object. This also allows easy comparison between the original and cleaned versions.


    Below we import the raw Excel file and save it as the data frame linelist_raw. We assume the file is located in your working directory or R project root, and so no sub-folders are specified in the file path.

linelist_raw <- import("linelist_raw.xlsx")

You can view the first rows of the data frame using the base R function head(), which displays just the first n rows in the R console.
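For example:

head(linelist_raw, 50)   # print the first 50 rows in the R console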


    Review


You can use the function skim() from the package skimr to get an overview of the entire data frame (see the page on Descriptive tables for more info). Columns are summarised by class/type, such as character and numeric. Note: “POSIXct” is a raw date-time class (see Working with dates).

skimr::skim(linelist_raw)
| Data summary | |
|---|---|
| Name | linelist_raw |
| Number of rows | 6611 |
| Number of columns | 28 |
| Column type frequency: | |
| character | 17 |
| numeric | 8 |
| POSIXct | 3 |
| Group variables | None |

    Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| case_id | 137 | 0.98 | 6 | 6 | 0 | 5888 | 0 |
| date onset | 293 | 0.96 | 10 | 10 | 0 | 580 | 0 |
| outcome | 1500 | 0.77 | 5 | 7 | 0 | 2 | 0 |
| gender | 324 | 0.95 | 1 | 1 | 0 | 2 | 0 |
| hospital | 1512 | 0.77 | 5 | 36 | 0 | 13 | 0 |
| infector | 2323 | 0.65 | 6 | 6 | 0 | 2697 | 0 |
| source | 2323 | 0.65 | 5 | 7 | 0 | 2 | 0 |
| age | 107 | 0.98 | 1 | 2 | 0 | 75 | 0 |
| age_unit | 7 | 1.00 | 5 | 6 | 0 | 2 | 0 |
| fever | 258 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| chills | 258 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| cough | 258 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| aches | 258 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| vomit | 258 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| time_admission | 844 | 0.87 | 5 | 5 | 0 | 1091 | 0 |
| merged_header | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
| …28 | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |

    Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| generation | 7 | 1.00 | 16.60 | 5.71 | 0.00 | 13.00 | 16.00 | 20.00 | 37.00 |
| lon | 7 | 1.00 | -13.23 | 0.02 | -13.27 | -13.25 | -13.23 | -13.22 | -13.21 |
| lat | 7 | 1.00 | 8.47 | 0.01 | 8.45 | 8.46 | 8.47 | 8.48 | 8.49 |
| row_num | 0 | 1.00 | 3240.91 | 1857.83 | 1.00 | 1647.50 | 3241.00 | 4836.50 | 6481.00 |
| wt_kg | 7 | 1.00 | 52.69 | 18.59 | -11.00 | 41.00 | 54.00 | 66.00 | 111.00 |
| ht_cm | 7 | 1.00 | 125.25 | 49.57 | 4.00 | 91.00 | 130.00 | 159.00 | 295.00 |
| ct_blood | 7 | 1.00 | 21.26 | 1.67 | 16.00 | 20.00 | 22.00 | 22.00 | 26.00 |
| temp | 158 | 0.98 | 38.60 | 0.95 | 35.20 | 38.30 | 38.80 | 39.20 | 40.80 |

    Variable type: POSIXct

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| infection date | 2322 | 0.65 | 2012-04-09 | 2015-04-27 | 2014-10-04 | 538 |
| hosp date | 7 | 1.00 | 2012-04-20 | 2015-04-30 | 2014-10-15 | 570 |
| date_of_outcome | 1068 | 0.84 | 2012-05-14 | 2015-06-04 | 2014-10-26 | 575 |

    8.4 Column names


    In R, column names are the “header” or “top” value of a column. They are used to refer to columns in the code, and serve as a default label in figures.


Other statistical software such as SAS and Stata use “labels” that co-exist as longer printed versions of the shorter column names. While R does offer the possibility of adding column labels to the data, this is not emphasized in most practice. To make column names “printer-friendly” for figures, one typically adjusts their display within the plotting commands that create the outputs (e.g. axis or legend titles of a plot, or column headers in a printed table - see the scales section of the ggplot tips page and the Tables for presentation page). If you want to assign column labels in the data, read more online here and here.
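As a hedged illustration only (the labelled package is one option among several; the label text here is hypothetical):

# assign and view a column label with the labelled package
library(labelled)
var_label(linelist_raw$gender) <- "Gender of the case"
var_label(linelist_raw$gender)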


Because R column names are used very often, they must have “clean” syntax. We suggest the following:

• Short names.
• No spaces (replace with underscores _ ).
• No unusual characters (&, #, <, >, …).
• Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…).

The column names of linelist_raw are printed below using names() from base R. We can see that initially:

• Some names contain spaces (e.g. infection date).
• Different naming patterns are used for dates (date onset vs. infection date).
• There must have been a merged header across the two last columns in the .xlsx. We know this because the name of two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column).
names(linelist_raw)

 [1] "case_id"         "generation"      "infection date"  "date onset"     
 [5] "hosp date"       "date_of_outcome" "outcome"         "gender"         
 [9] "hospital"        "lon"             "lat"             "infector"       
[13] "source"          "age"             "age_unit"        "row_num"        
[17] "wt_kg"           "ht_cm"           "ct_blood"        "fever"          
[21] "chills"          "cough"           "aches"           "vomit"          
[25] "temp"            "time_admission"  "merged_header"   "...28"          

NOTE: To reference a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. Note that on your keyboard, the back-tick (`) is different from the single quotation mark (').


    Automatic cleaning


    The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

• Converts all names to consist of only underscores, numbers, and letters.
• Accented characters are transliterated to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”).
• Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…).
• You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset")).
• Here is an online vignette.

    Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# pipe the raw dataset through the function clean_names(), assign result as "linelist"  
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new column names
names(linelist)

 [1] "case_id"         "generation"      "infection_date"  "date_onset"     
 [5] "hosp_date"       "date_of_outcome" "outcome"         "gender"         
 [9] "hospital"        "lon"             "lat"             "infector"       
[13] "source"          "age"             "age_unit"        "row_num"        
[17] "wt_kg"           "ht_cm"           "ct_blood"        "fever"          
[21] "chills"          "cough"           "aches"           "vomit"          
[25] "temp"            "time_admission"  "merged_header"   "x28"            

    NOTE: The last column name “…28” was changed to “x28”.


    Manual name cleaning


Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD: the new column name is given before the old column name.


    Below, a re-naming command is added to the cleaning pipeline. Spaces have been added strategically to align code for easier reading.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the column names have been changed:

 [1] "case_id"              "generation"           "date_infection"      
 [4] "date_onset"           "date_hospitalisation" "date_outcome"        
 [7] "outcome"              "gender"               "hospital"            
[10] "lon"                  "lat"                  "infector"            
[13] "source"               "age"                  "age_unit"            
[16] "row_num"              "wt_kg"                "ht_cm"               
[19] "ct_blood"             "fever"                "chills"              
[22] "cough"                "aches"                "vomit"               
[25] "temp"                 "time_admission"       "merged_header"       
[28] "x28"                 

    Rename by column position


    You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn  = 1,
       newNameForSecondColumn = 2)

    Rename via select() and summarise()


As a shortcut, you can also rename columns within the dplyr select() and summarise() functions. select() is used to keep only certain columns (and is covered later in this page). summarise() is covered in the Grouping data and Descriptive tables pages. These functions also use the format new_name = old_name. Here is an example:

linelist_raw %>% 
  # rename and KEEP ONLY these columns
  select(# NEW name             # OLD name
         date_infection       = `infection date`,    
         date_hospitalisation = `hosp date`)

    Other challenges


    Empty Excel column names


    R cannot have dataset columns that do not have column names (headers). So, if you import an Excel dataset with data but no column headers, R will fill-in the headers with names like “…1” or “…2”. The number represents the column number (e.g. if the 4th column in the dataset has no header, then R will name it “…4”).


    You can clean these names manually by referencing their position number (see example above), or their assigned name (linelist_raw$...1).
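For example, a brief sketch (the new name is hypothetical; back-ticks are needed because names like “…28” are not syntactic):

linelist_raw %>% 
  rename(extra_col = `...28`)   # NEW = OLD, referencing the auto-assigned name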


    Merged Excel column names and cells


    Merged cells in an Excel file are a common occurrence when receiving data. As explained in Transition to R, merged cells can be nice for human reading of data, but are not “tidy data” and cause many problems for machine reading of data. R cannot accommodate merged cells.


Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users in the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells:

• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.

    When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.


    One solution to deal with merged cells is to import the data with the function readWorkbook() from the package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)

    DANGER: If column names are merged with readWorkbook(), you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.
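As a hedged sketch using base R (the positions and new names here are hypothetical):

# repair duplicate names by assigning new names by position
names(linelist_raw)[27] <- "merged_part_1"
names(linelist_raw)[28] <- "merged_part_2"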


    8.5 Select or re-order columns


    Use select() from dplyr to select the columns you want to retain, and to specify their order in the data frame.


    CAUTION: In the examples below, the linelist data frame is modified with select() and displayed, but not saved. This is for demonstration purposes. The modified column names are printed by piping the data frame to names().


    Here are ALL the column names in the linelist at this point in the cleaning pipe chain:

names(linelist)

 [1] "case_id"              "generation"           "date_infection"      
 [4] "date_onset"           "date_hospitalisation" "date_outcome"        
 [7] "outcome"              "gender"               "hospital"            
[10] "lon"                  "lat"                  "infector"            
[13] "source"               "age"                  "age_unit"            
[16] "row_num"              "wt_kg"                "ht_cm"               
[19] "ct_blood"             "fever"                "chills"              
[22] "cough"                "aches"                "vomit"               
[25] "temp"                 "time_admission"       "merged_header"       
[28] "x28"                 

    Keep columns


Select only the columns you want to retain


    Put their names in the select() command, with no quotation marks. They will appear in the data frame in the order you provide. Note that if you include a column that does not exist, R will return an error (see use of any_of() below if you want no error in this situation).

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names()  # display the column names

[1] "case_id"              "date_onset"           "date_hospitalisation"
[4] "fever"               

    “tidyselect” helper functions


    These helper functions exist to make it easy to specify columns to keep, discard, or transform. They are from the package tidyselect, which is included in tidyverse and underlies how columns are selected in dplyr functions.


    For example, if you want to re-order the columns, everything() is a useful function to signify “all other columns not yet mentioned”. The command below moves columns date_onset and date_hospitalisation to the beginning (left) of the dataset, but keeps all the other columns afterward. Note that everything() is written with empty parentheses:

# move date_onset and date_hospitalisation to beginning
linelist %>% 
  select(date_onset, date_hospitalisation, everything()) %>% 
  names()

 [1] "date_onset"           "date_hospitalisation" "case_id"             
 [4] "generation"           "date_infection"       "date_outcome"        
 [7] "outcome"              "gender"               "hospital"            
[10] "lon"                  "lat"                  "infector"            
[13] "source"               "age"                  "age_unit"            
[16] "row_num"              "wt_kg"                "ht_cm"               
[19] "ct_blood"             "fever"                "chills"              
[22] "cough"                "aches"                "vomit"               
[25] "temp"                 "time_admission"       "merged_header"       
[28] "x28"                 

    Here are other “tidyselect” helper functions that also work within dplyr functions like select(), across(), and summarise():

• everything() - all other columns not mentioned.
• last_col() - the last column.
• where() - applies a function to all columns and selects those which are TRUE.
• contains() - columns containing a character string.
  • example: select(contains("time")).
• starts_with() - matches to a specified prefix.
  • example: select(starts_with("date_")).
• ends_with() - matches to a specified suffix.
  • example: select(ends_with("_post")).
• matches() - to apply a regular expression (regex).
  • example: select(matches("[pt]al")).
• num_range() - a numerical range like x01, x02, x03.
• any_of() - matches IF column exists but returns no error if it is not found.
  • example: select(any_of(c("date_onset", "date_death", "cardiac_arrest"))).

    In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
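For example, a brief sketch using columns from linelist:

# c() and : keep case_id plus the consecutive columns fever through vomit
linelist %>% 
  select(c(case_id, fever:vomit)) %>% 
  names()

# ! keeps everything EXCEPT the matched columns
linelist %>% 
  select(!contains("date")) %>% 
  names()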


    Use where() to specify logical criteria for columns. If providing a function inside where(), do not include the function’s empty parentheses. The command below selects columns that are class Numeric.

# select columns that are class Numeric
linelist %>% 
  select(where(is.numeric)) %>% 
  names()

[1] "generation" "lon"        "lat"        "row_num"    "wt_kg"     
[6] "ht_cm"      "ct_blood"   "temp"      

    Use contains() to select only columns in which the column name contains a specified character string. ends_with() and starts_with() provide more nuance.

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()

[1] "date_infection"       "date_onset"           "date_hospitalisation"
[4] "date_outcome"        

    The function matches() works similarly to contains() but can be provided a regular expression (see page on Characters and strings), such as multiple strings separated by OR bars within the parentheses:

# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()

[1] "date_onset"           "date_hospitalisation" "hospital"            
[4] "fever"               

    CAUTION: If a column name that you specifically provide does not exist in the data, it can return an error and stop your code. Consider using any_of() to cite columns that may or may not exist, especially useful in negative (remove) selections.


Below, only one of these columns exists, but no error is produced and the code continues without stopping your cleaning chain.

linelist %>% 
  select(any_of(c("date_onset", "village_origin", "village_detection", "village_residence", "village_travel"))) %>% 
  names()

[1] "date_onset"

    Remove columns


    Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove date_onset and all columns from fever to vomit
  names()

 [1] "case_id"              "generation"           "date_infection"      
 [4] "date_hospitalisation" "date_outcome"         "outcome"             
 [7] "gender"               "hospital"             "lon"                 
[10] "lat"                  "infector"             "source"              
[13] "age"                  "age_unit"             "row_num"             
[16] "wt_kg"                "ht_cm"                "ct_blood"            
[19] "temp"                 "time_admission"       "merged_header"       
[22] "x28"                 

    You can also remove a column using base R syntax, by defining it as NULL. For example:

linelist$date_onset <- NULL   # deletes column with base R syntax 

    Standalone


    select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)

[1] "case_id"  "age"      "age_unit"

    Add to the pipe chain


    In the linelist_raw, there are a few columns we do not need: row_num, merged_header, and x28. We remove them with a select() command in the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################

    # remove column
    select(-c(row_num, merged_header, x28))

    8.6 Deduplication


    See the handbook page on De-duplication for extensive options on how to de-duplicate data. Only a very simple row de-duplication example is presented here.


The package dplyr offers the distinct() function. This function examines every row and reduces the data frame to only the unique rows. That is, it removes rows that are 100% duplicates.


    When evaluating duplicate rows, it takes into account a range of columns - by default it considers all columns. As shown in the de-duplication page, you can adjust this column range so that the uniqueness of rows is only evaluated in regards to certain columns.


    In this simple example, we just add the empty command distinct() to the pipe chain. This ensures there are no rows that are 100% duplicates of other rows (evaluated across all columns).


You can check the number of rows in linelist before and after this step with nrow(linelist).

linelist <- linelist %>% 
  distinct()

Any rows removed by this step would have been 100% duplicates of other rows.


    Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################
    
    # de-duplicate
    distinct()

    8.7 Column creation and transformation


    We recommend using the dplyr function mutate() to add a new column, or to modify an existing one.


    Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation).


    In Stata, this is similar to the command generate, but R’s mutate() can also be used to modify an existing column.


    New columns


    The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.

linelist <- linelist %>% 
  mutate(new_col = 10)

    You can also reference values in other columns, to perform calculations. Below, a new column bmi is created to hold the Body Mass Index (BMI) for each case - as calculated using the formula BMI = kg/m^2, using column ht_cm and column wt_kg.

linelist <- linelist %>% 
  mutate(bmi = wt_kg / (ht_cm/100)^2)

If creating multiple new columns, separate each with a comma and new line. Below are examples of new columns, including ones that consist of values from other columns combined using str_glue() from the stringr package (see the page on Characters and strings).

new_col_demo <- linelist %>%                       
  mutate(
    new_var_dup    = case_id,             # new column = duplicate/copy another existing column
    new_var_static = 7,                   # new column = all values the same
    new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
    new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
    ) %>% 
  select(case_id, hospital, date_hospitalisation, contains("new"))        # show only new columns, for demonstration purposes

Review the new columns by printing new_col_demo. For demonstration purposes, the select() above keeps only the new columns and the columns used to create them.


    TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
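For example, a minimal sketch:

# transmute() keeps ONLY the columns created or named within it
linelist %>% 
  transmute(case_id,                         # kept as-is
            bmi = wt_kg / (ht_cm/100)^2) %>% # new column; all other columns are dropped
  names()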


    Convert column class


    Columns containing values that are dates, numbers, or logical values (TRUE/FALSE) will only behave as expected if they are correctly classified. There is a difference between “2” of class character and 2 of class numeric!


    There are ways to set column class during the import commands, but this is often cumbersome. See the R Basics section on object classes to learn more about converting the class of objects and columns.


    First, let’s run some checks on important columns to see if they are the correct class. We also saw this in the beginning when we ran skim().


    Currently, the class of the age column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric!

class(linelist$age)

[1] "character"

    The class of the date_onset column is also character! To perform analyses, these dates must be recognized as dates!

class(linelist$date_onset)

[1] "character"

    To resolve this, use the ability of mutate() to re-define a column with a transformation. We define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric:

linelist <- linelist %>% 
  mutate(age = as.numeric(age))

    In a similar way, you can use as.character() and as.logical(). To convert to class Factor, you can use factor() from base R or as_factor() from forcats. Read more about this in the Factors page.


You must be careful when converting to class Date. Several methods are explained on the page Working with dates. Typically, the raw date values must all be in the same format for conversion to work correctly (e.g. “MM/DD/YYYY”, or “DD MM YYYY”). After converting to class Date, check your data to confirm that each value was converted correctly.


    Grouped data


    If your data frame is already grouped (see page on Grouping data), mutate() may behave differently than if the data frame is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will calculate by group, not by all the rows.

# age normalized to mean of ALL rows
linelist %>% 
  mutate(age_norm = age / mean(age, na.rm=T))

# age normalized to mean of hospital group
linelist %>% 
  group_by(hospital) %>% 
  mutate(age_norm = age / mean(age, na.rm=T))

Read more about using mutate() on grouped data frames in this tidyverse mutate documentation.


    Transform multiple columns


Often, to write concise code, you want to apply the same transformation to multiple columns at once. This can be done with the across() function from the package dplyr (also contained within the tidyverse package). across() can be used with any dplyr function, but is commonly used within select(), mutate(), filter(), or summarise(). See how it is applied to summarise() in the page on Descriptive tables.


    Specify the columns to the argument .cols = and the function(s) to apply to .fns =. Any additional arguments to provide to the .fns function can be included after a comma, still within across().


    across() column selection


Specify the columns to the argument .cols =. You can name them individually, or use “tidyselect” helper functions. Specify the function to .fns =. Note that in the mode demonstrated below, the function is written without its parentheses ( ).


    Here the transformation as.character() is applied to specific columns named within across().

linelist <- linelist %>% 
  mutate(across(.cols = c(temp, ht_cm, wt_kg), .fns = as.character))

    The “tidyselect” helper functions are available to assist you in specifying columns. They are detailed above in the section on Selecting and re-ordering columns, and they include: everything(), last_col(), where(), starts_with(), ends_with(), contains(), matches(), num_range() and any_of().


    Here is an example of how one would change all columns to character class:

# to change all columns to character class
linelist <- linelist %>% 
  mutate(across(.cols = everything(), .fns = as.character))

    Convert to character all columns where the name contains the string “date” (note the placement of commas and parentheses):

# to convert to character all columns whose names contain "date"
linelist <- linelist %>% 
  mutate(across(.cols = contains("date"), .fns = as.character))

Below is an example of mutating the columns that are currently class POSIXct (a raw date-time class that shows timestamps) - in other words, where the function is.POSIXct() evaluates to TRUE. We then apply the function as.Date() to these columns to convert them to the normal class Date.

linelist <- linelist %>% 
  mutate(across(.cols = where(is.POSIXct), .fns = as.Date))
• Note that within across() we also use the function where(), as is.POSIXct evaluates to either TRUE or FALSE.
• Note that is.POSIXct() is from the package lubridate. Other similar “is” functions like is.character(), is.numeric(), and is.logical() are from base R.

    across() functions


You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column, and you can even define your own functions (see the sketch after this list):

• You can provide the function name alone (e.g. mean or as.character).
• You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page).
• You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))).
  • If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see the page on Characters and strings), where {.col} and {.fn} are shorthand for the input column and function.
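Below is a brief sketch combining these points, assuming wt_kg, ht_cm, and temp are numeric at this point in your data:

# multiple functions per column, with output names controlled by .names =
linelist %>% 
  summarise(across(
    .cols  = c(wt_kg, ht_cm, temp),
    .fns   = list(mean   = ~ mean(.x, na.rm = TRUE),
                  n_miss = ~ sum(is.na(.x))),
    .names = "{.col}_{.fn}"))   # returns e.g. wt_kg_mean, wt_kg_n_miss, ...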

    Here are a few online resources on using across(): creator Hadley Wickham’s thoughts/rationale


    coalesce()


    This dplyr function finds the first non-missing value at each position. It “fills-in” missing values with the first available value in an order you specify.


Here is an example outside the context of a data frame: let us say you have two vectors, one containing the patient’s village of detection and another containing the patient’s village of residence. You can use coalesce() to pick the first non-missing value for each index:

village_detection <- c("a", "b", NA,  NA)
village_residence <- c("a", "c", "a", "d")

village <- coalesce(village_detection, village_residence)
village    # print

[1] "a" "b" "a" "d"

    This works the same if you provide data frame columns: for each row, the function will assign the new column value with the first non-missing value in the columns you provided (in order provided).

linelist <- linelist %>% 
  mutate(village = coalesce(village_detection, village_residence))

    This is an example of a “row-wise” operation. For more complicated row-wise calculations, see the section below on Row-wise calculations.


    Cumulative math


If you want a column to reflect the cumulative sum/mean/min/max etc. as assessed down the rows of a data frame up to that point, use the following functions:


    cumsum() returns the cumulative sum, as shown below:

sum(c(2,4,15,10))     # returns only one number

[1] 31

cumsum(c(2,4,15,10))  # returns the cumulative sum at each step

[1]  2  6 21 31

    This can be used in a dataframe when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this:

cumulative_case_counts <- linelist %>%  # begin with case linelist
  count(date_onset) %>%                 # count of rows per day, as column 'n'   
  mutate(cumulative_cases = cumsum(n))  # new column, of the cumulative sum at each row

    Below are the first 10 rows:

head(cumulative_case_counts, 10)

   date_onset n cumulative_cases
1  2012-04-15 1                1
2  2012-05-05 1                2
3  2012-05-08 1                3
4  2012-05-31 1                4
5  2012-06-02 1                5
6  2012-06-07 1                6
7  2012-06-14 1                7
8  2012-06-21 1                8
9  2012-06-24 1                9
10 2012-06-25 1               10

    See the page on Epidemic curves for how to plot cumulative incidence with the epicurve.


See also: cumsum(), cummean(), cummin(), cummax(), cumany(), cumall().


    Using base R


    To define a new column (or re-define a column) using base R, write the name of data frame, connected with $, to the new column (or the column to be modified). Use the assignment operator <- to define the new value(s). Remember that when using base R you must specify the data frame name before the column name every time (e.g. dataframe$column). Here is an example of creating the bmi column using base R:

linelist$bmi <- linelist$wt_kg / (linelist$ht_cm / 100) ^ 2

    Add to pipe chain


    Below, a new column is added to the pipe chain and some classes are converted.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    # add new column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>% 
  
    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) 

    8.8 Re-code values


    Here are a few scenarios where you need to re-code (change) values:

• to edit one specific value (e.g. one date with an incorrect year or format).
• to reconcile values not spelled the same.
• to create a new column of categorical values.
• to create a new column of numeric categories (e.g. age categories).

    Specific values


    To change values manually you can use the recode() function within the mutate() function.


    Imagine there is a nonsensical date in the data (e.g. “2014-14-15”): you could fix the date manually in the raw source data, or, you could write the change into the cleaning pipeline via mutate() and recode(). The latter is more transparent and reproducible to anyone else seeking to understand or repeat your analysis.

# fix incorrect values                   # old value       # new value
linelist <- linelist %>% 
  mutate(date_onset = recode(date_onset, "2014-14-15" = "2014-04-15"))

    The mutate() line above can be read as: “mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this.


    Here is another example re-coding multiple values within one column.


    In linelist the values in the column “hospital” must be cleaned. There are several different spellings and many missing values.

table(linelist$hospital, useNA = "always")  # print table of all unique values, including missing  

                     Central Hopital                     Central Hospital 
                                  11                                  457 
                          Hospital A                           Hospital B 
                                 290                                  289 
                    Military Hopital                    Military Hospital 
                                  32                                  798 
                    Mitylira Hopital                    Mitylira Hospital 
                                   1                                   79 
                               Other                         Port Hopital 
                                 907                                   48 
                       Port Hospital St. Mark's Maternity Hospital (SMMH) 
                                1756                                  417 
  St. Marks Maternity Hopital (SMMH)                                 <NA> 
                                  11                                 1512 

    The recode() command below re-defines the column “hospital” as the current column “hospital”, but with the specified recode changes. Don’t forget commas after each!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                     # for reference: OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

    Now we see the spellings in the hospital column have been corrected and consolidated:

table(linelist$hospital, useNA = "always")

                    Central Hospital                           Hospital A 
                                 468                                  290 
                          Hospital B                    Military Hospital 
                                 289                                  910 
                               Other                        Port Hospital 
                                 907                                 1804 
St. Mark's Maternity Hospital (SMMH)                                 <NA> 
                                 428                                 1512 

    TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.


TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
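For example, a brief sketch (na_if() is covered further below):

# convert empty strings "" in the hospital column to NA
linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, ""))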


    By logic


    Below we demonstrate how to re-code values in a column using logic and conditions:

• Using replace(), ifelse() and if_else() for simple logic.
• Using case_when() for more complex logic.

    Simple logic


    replace()


To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logical condition to specify the rows to change. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value))

One common situation in which to use replace() is changing just one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.

# Example: change gender of one specific observation to "Female" 
linelist <- linelist %>% 
  mutate(gender = replace(gender, case_id == "2195", "Female"))

The equivalent command using base R syntax and indexing brackets [ ] is below. It reads as: “change the value of the data frame linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.

linelist$gender[linelist$case_id == "2195"] <- "Female"

    ifelse() and if_else()


Another tool for simple logic is ifelse() and its partner if_else(). However, in most cases for re-coding it is more clear to use case_when() (detailed below). These “if else” commands are simplified versions of an if and else programming statement. The general syntax is: ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE).


    Below, the column source_known is defined. Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

if_else() is a special version from dplyr that is stricter about value types, which matters when working with dates. Note that if the ‘true’ value is a date, the ‘false’ value must also qualify as a date, hence the use of the special value NA_real_ instead of just NA.

# Create a date of death column, which is NA if patient has not died.
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

    Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.


    Outside of the context of a data frame, if you want to have an object used in your code switch its value, consider using switch() from base R.
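As a brief base R sketch (the object and values here are hypothetical):

severity <- "moderate"
follow_up_days <- switch(severity,
                         "mild"     = 7,
                         "moderate" = 14,
                         "severe"   = 21)
follow_up_days   # returns 14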


    Complex logic


Use dplyr’s case_when() if you are re-coding into many new groups, or if you need to use complex logical statements to re-code values. This function evaluates every row in the data frame, assesses whether the row meets the specified criteria, and assigns the correct new value.


case_when() commands consist of statements that have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a “tilde” ~. The logical criteria go on the left side and the resulting values on the right side of each statement. Statements are separated by commas.


    For example, here we utilize the columns age and age_unit to create a column age_years:

linelist <- linelist %>% 
  mutate(age_years = case_when(
       age_unit == "years"  ~ age,       # if age unit is years
       age_unit == "months" ~ age/12,    # if age unit is months, divide age by 12
       is.na(age_unit)      ~ age))      # if age unit is missing, assume years
                                         # any other circumstance, assign NA (missing)

As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written, from top-to-bottom. If the top criterion evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row in the data. Thus, it is best to write the most specific criteria first, and the most general last. A data row that does not meet any of the LHS criteria will be assigned NA.


Sometimes, you may wish to write a final statement that assigns a value for all other scenarios not described by one of the previous lines. To do this, place TRUE on the left side, which will capture any row that did not meet any of the previous criteria. The right side of this statement could be assigned a value like “check me!” or missing.


    Below is another example of case_when() used to create a new column with the patient classification, according to a case definition for confirmed and suspect cases:

linelist <- linelist %>% 
     mutate(case_status = case_when(
          
          # if patient had lab test and it is positive,
          # then they are marked as a confirmed case 
          ct_blood < 20                   ~ "Confirmed",
          
          # given that a patient does not have a positive lab result,
          # if patient has a "source" (epidemiological link) AND has fever, 
          # then they are marked as a suspect case
          !is.na(source) & fever == "yes" ~ "Suspect",
          
          # any other patient not addressed above 
          # is marked for follow up
          TRUE                            ~ "To investigate"))

    DANGER: Values on the right-side must all be the same class - either numeric, character, date, logical, etc. To assign missing (NA), you may need to use special variations of NA such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA). Read more in Working with dates.


    Missing values


    Below are special functions for handling missing values in the context of data cleaning.


See the page on Missing data for more detailed tips on identifying and handling missing values - for example, the is.na() function, which logically tests for missingness.
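For example:

sum(is.na(linelist$date_onset))   # count the missing values in one column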


    replace_na()


    To change missing values (NA) to a specific value, such as “Missing”, use the dplyr function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

    fct_explicit_na()


    This is a function from the forcats package. The forcats package handles columns of class Factor. Factors are R’s way to handle ordered values such as c("First", "Second", "Third") or to set the order that values (e.g. hospitals) appear in tables and plots. See the page on Factors.


    If your data are class Factor and you try to convert NA to “Missing” by using replace_na(), you will get this error: invalid factor level, NA generated. You have tried to add “Missing” as a value, when it was not defined as a possible level of the factor, and it was rejected.


    The easiest way to solve this is to use the forcats function fct_explicit_na() which converts a column to class factor, and converts NA values to the character “(Missing)”.

linelist %>% 
  mutate(hospital = fct_explicit_na(hospital))

    A slower alternative would be to add the factor level using fct_expand() and then convert the missing values.
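As a hedged sketch of that alternative (assuming hospital is already a factor):

linelist %>% 
  mutate(hospital = fct_expand(hospital, "Missing"),                # add the new level
         hospital = replace(hospital, is.na(hospital), "Missing"))  # then convert the NAs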


    na_if()


    To convert a specific value to NA, use dplyr’s na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

    Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:

# Convert temperatures above 40 to NA 
linelist <- linelist %>% 
  mutate(temp = replace(temp, temp > 40, NA))

# Convert onset dates earlier than 1 Jan 2000 to missing
linelist <- linelist %>% 
  mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))

    Cleaning dictionary


    Use the R package matchmaker and its function match_df() to clean a data frame with a cleaning dictionary.

1. Create a cleaning dictionary with 3 columns (a hypothetical example is sketched below the note):
  • A “from” column (the incorrect value).
  • A “to” column (the correct value).
  • A column specifying the column for the changes to be applied (or “.global” to apply to all columns).

    Note: .global dictionary entries will be overridden by column-specific dictionary entries.
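As a purely hypothetical sketch of what such a dictionary might contain:

cleaning_dict_demo <- tibble::tribble(
  ~from,   ~to,      ~col,
  "femme", "Female", "gender",    # applies only to the gender column
  "homme", "Male",   "gender",
  "oui",   "yes",    ".global",   # applies to all columns
  "non",   "no",     ".global")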

2. Import the dictionary file into R. This example can be downloaded via instructions on the Download handbook and data page.
cleaning_dict <- import("cleaning_dict.csv")
3. Pipe the raw linelist to match_df(), specifying the cleaning dictionary data frame to the dictionary = argument. The from = argument should be the name of the dictionary column which contains the “old” values, the to = argument the dictionary column which contains the corresponding “new” values, and the by = argument the dictionary column which lists the linelist column in which to make the change. Use .global in the by = column to apply a change across all columns. A fourth dictionary column order can be used to specify factor order of new values.

    Read more details in the package documentation by running ?match_df. Note this function can take a long time to run for a large dataset.

linelist <- linelist %>%     # provide or pipe your dataset
     matchmaker::match_df(
          dictionary = cleaning_dict,  # name of your dictionary
          from = "from",               # column with values to be replaced (default is col 1)
          to = "to",                   # column with final values (default is col 2)
          by = "col"                   # column with column names (default is col 3)
  )

    Now scroll to the right in the data viewer to see how values have changed - particularly gender (lowercase to uppercase), and how all the symptom columns have been transformed from yes/no to 1/0.


    Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.


    Add to pipe chain


    Below, some new columns and column transformations are added to the pipe chain.

    # CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
    ##################################################################################

    # begin cleaning pipe chain
    ###########################
    linelist <- linelist_raw %>%
        
        # standardize column name syntax
        janitor::clean_names() %>% 
        
        # manually re-name columns
               # NEW name             # OLD name
        rename(date_infection       = infection_date,
               date_hospitalisation = hosp_date,
               date_outcome         = date_of_outcome) %>% 
        
        # remove column
        select(-c(row_num, merged_header, x28)) %>% 
      
        # de-duplicate
        distinct() %>% 
      
        # add column
        mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

        # convert class of columns
        mutate(across(contains("date"), as.Date), 
               generation = as.numeric(generation),
               age        = as.numeric(age)) %>% 
        
        # add column: delay to hospitalisation
        mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
        
       # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
       ###################################################

        # clean values of hospital column
        mutate(hospital = recode(hospital,
                          # OLD = NEW
                          "Mitylira Hopital"  = "Military Hospital",
                          "Mitylira Hospital" = "Military Hospital",
                          "Military Hopital"  = "Military Hospital",
                          "Port Hopital"      = "Port Hospital",
                          "Central Hopital"   = "Central Hospital",
                          "other"             = "Other",
                          "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                          )) %>% 
        
        mutate(hospital = replace_na(hospital, "Missing")) %>% 

        # create age_years column (from age and age_unit)
        mutate(age_years = case_when(
              age_unit == "years" ~ age,
              age_unit == "months" ~ age/12,
              is.na(age_unit) ~ age,
              TRUE ~ NA_real_))

    8.9 Numeric categories


    Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:

    • age_categories(), from the epikit package.
    • cut(), from base R.
    • case_when().
    • quantile breaks with quantile() and ntile().

    Review distribution


    For this example we will create an age_cat column using the age_years column.

    # check the class of the linelist variable age
    class(linelist$age_years)

    [1] "numeric"

    First, examine the distribution of your data, to make appropriate cut-points. See the page on ggplot basics.

    # examine the distribution
    hist(linelist$age_years)

    [Figure: histogram of linelist$age_years]
    summary(linelist$age_years, na.rm=T)

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
       0.00    6.00   13.00   16.04   23.00   84.00     107 

    CAUTION: Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values - for example an entry of “2 months” for age - or (depending on your R locale settings) if a comma is used as the decimal mark (e.g. “4,5” to mean four and one half years).
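
    If this happens, one option is readr’s parse_number(), which extracts the numeric part of each value and can be told which decimal mark to expect - a sketch, assuming the column arrived as character:

    # sketch: coerce a character age column to numeric ("4,5" becomes 4.5; "2 months" becomes 2)
    # note: a unit like "months" still needs separate handling (e.g. via an age_unit column)
    linelist %>% 
      mutate(age_years = readr::parse_number(
        as.character(age_years),
        locale = readr::locale(decimal_mark = ",")))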


    age_categories()


    With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). As a bonus, the output column is automatically an ordered factor.


    Here are the required inputs:

    • A numeric vector (column).
    • The breakers = argument - a numeric vector of break points for the new groups.

    First, the simplest example:

    # Simple example
    ################
    pacman::p_load(epikit)                    # load package

    linelist <- linelist %>% 
      mutate(
        age_cat = age_categories(             # create new column
          age_years,                            # numeric column to make groups from
          breakers = c(0, 5, 10, 15, 20,        # break points
                       30, 40, 50, 60, 70)))

    # show table
    table(linelist$age_cat, useNA = "always")


      0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-69   70+  <NA> 
     1227  1223  1048   827  1216   597   251    78    27     7   107 

    The break values you specify are by default the lower bounds - each is included in the group it begins, so the groups are “closed” on the lower/left side and “open” on the upper/right side. As shown below, you can add 1 to each break value to achieve groups that instead include their upper value.

    # Include upper ends for the same categories
    ############################################
    linelist <- linelist %>% 
      mutate(
        age_cat = age_categories(
          age_years, 
          breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))

    # show table
    table(linelist$age_cat, useNA = "always")


      0-5  6-10 11-15 16-20 21-30 31-40 41-50 51-60 61-70   71+  <NA> 
     1469  1195  1040   770  1149   547   231    70    24     6   107 

    You can adjust how the labels are displayed with separator =; the default is “-”.
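
    For example, a sketch with a wordier separator:

    # sketch: labels displayed as "0 to 4", "5 to 9", ...
    linelist %>% 
      mutate(age_cat = age_categories(
        age_years,
        breakers = c(0, 5, 10, 15, 20),
        separator = " to "))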

    You can adjust how the top numbers are handled with the ceiling = argument. To set an upper cut-off, set ceiling = TRUE. In this case, the highest break value provided acts as a “ceiling” and no “XX+” category is created - any values above the highest break value (or above upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned NA.

    # With ceiling set to TRUE
    ##########################
    linelist <- linelist %>% 
      mutate(
        age_cat = age_categories(
          age_years, 
          breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
          ceiling = TRUE)) # 70 is ceiling, all above become NA

    # show table
    table(linelist$age_cat, useNA = "always")


      0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-70  <NA> 
     1227  1223  1048   827  1216   597   251    78    28   113 

    Alternatively, instead of breakers =, you can provide all of lower =, upper =, and by =:

    • lower = The lowest number you want considered (default is 0).
    • upper = The highest number you want considered.
    • by = The number of years between groups.

    linelist <- linelist %>% 
      mutate(
        age_cat = age_categories(
          age_years, 
          lower = 0,
          upper = 100,
          by = 10))

    # show table
    table(linelist$age_cat, useNA = "always")


      0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99  100+  <NA> 
     2450  1875  1216   597   251    78    27     6     1     0     0   107 

    See the function’s Help page for more details (enter ?age_categories in the R console).


    cut()


    cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:

    • You do not need to install/load another package.
    • You can specify whether groups are open/closed on the right/left.
    • You must provide accurate labels yourself.
    • If you want 0 included in the lowest group you must specify this.

    The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is an ordered factor.


    By default, the categorization occurs so that the upper/right side of each group is closed (inclusive) and the lower/left side is open (exclusive). This is the opposite behavior from the age_categories() function. The default labels use the notation “(A, B]”, which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.


    Thus, by default, “0” values are excluded from the lowest group and categorized as NA! “0” values could be infants coded as age 0, so be careful! To change this, add the argument include.lowest = TRUE so that any “0” values will be included in the lowest group. The automatically-generated label for the lowest category will then be “[A,B]”. Note that if you combine include.lowest = TRUE with right = FALSE, the extreme inclusion applies to the highest break point value and category, not the lowest.


    You can provide a vector of customized labels using the labels = argument. As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described below.


    An example of cut() applied to age_years to make the new variable age_cat is below:

    # Create new variable, by cutting the numeric age variable
    # lower break is excluded but upper break is included in each category
    linelist <- linelist %>% 
      mutate(
        age_cat = cut(
          age_years,
          breaks = c(0, 5, 10, 15, 20,
                     30, 50, 70, 100),
          include.lowest = TRUE         # include 0 in lowest group
          ))

    # tabulate the number of observations per group
    table(linelist$age_cat, useNA = "always")


       [0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100] 
        1469     1195     1040      770     1149      778       94        6 
        <NA> 
         107 

    Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 16-20).

    # Cross tabulation of the numeric and category columns. 
    table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
          "Categories"     = linelist$age_cat,
          useNA = "always")                        # don't forget to examine NA values

                        Categories
    Numeric Values       [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70]
      0                    136      0       0       0       0       0       0
      0.0833333333333333     1      0       0       0       0       0       0
      0.25                   2      0       0       0       0       0       0
      0.333333333333333      6      0       0       0       0       0       0
      0.416666666666667      1      0       0       0       0       0       0
      0.5                    6      0       0       0       0       0       0
      0.583333333333333      3      0       0       0       0       0       0
      0.666666666666667      3      0       0       0       0       0       0
      0.75                   3      0       0       0       0       0       0
      0.833333333333333      1      0       0       0       0       0       0
      0.916666666666667      1      0       0       0       0       0       0
      1                    275      0       0       0       0       0       0
      1.5                    2      0       0       0       0       0       0
      2                    308      0       0       0       0       0       0
      3                    246      0       0       0       0       0       0
      4                    233      0       0       0       0       0       0
      5                    242      0       0       0       0       0       0
      6                      0    241       0       0       0       0       0
      7                      0    256       0       0       0       0       0
      8                      0    239       0       0       0       0       0
      9                      0    245       0       0       0       0       0
      10                     0    214       0       0       0       0       0
      11                     0      0     220       0       0       0       0
      12                     0      0     224       0       0       0       0
      13                     0      0     191       0       0       0       0
      14                     0      0     199       0       0       0       0
      15                     0      0     206       0       0       0       0
      16                     0      0       0     186       0       0       0
      17                     0      0       0     164       0       0       0
      18                     0      0       0     141       0       0       0
      19                     0      0       0     130       0       0       0
      20                     0      0       0     149       0       0       0
      21                     0      0       0       0     158       0       0
      22                     0      0       0       0     149       0       0
      23                     0      0       0       0     125       0       0
      24                     0      0       0       0     144       0       0
      25                     0      0       0       0     107       0       0
      26                     0      0       0       0     100       0       0
      27                     0      0       0       0     117       0       0
      28                     0      0       0       0      85       0       0
      29                     0      0       0       0      82       0       0
      30                     0      0       0       0      82       0       0
      31                     0      0       0       0       0      68       0
      32                     0      0       0       0       0      84       0
      33                     0      0       0       0       0      78       0
      34                     0      0       0       0       0      58       0
      35                     0      0       0       0       0      58       0
      36                     0      0       0       0       0      33       0
      37                     0      0       0       0       0      46       0
      38                     0      0       0       0       0      45       0
      39                     0      0       0       0       0      45       0
      40                     0      0       0       0       0      32       0
      41                     0      0       0       0       0      34       0
      42                     0      0       0       0       0      26       0
      43                     0      0       0       0       0      31       0
      44                     0      0       0       0       0      24       0
      45                     0      0       0       0       0      27       0
      46                     0      0       0       0       0      25       0
      47                     0      0       0       0       0      16       0
      48                     0      0       0       0       0      21       0
      49                     0      0       0       0       0      15       0
      50                     0      0       0       0       0      12       0
      51                     0      0       0       0       0       0      13
      52                     0      0       0       0       0       0       7
      53                     0      0       0       0       0       0       4
      54                     0      0       0       0       0       0       6
      55                     0      0       0       0       0       0       9
      56                     0      0       0       0       0       0       7
      57                     0      0       0       0       0       0       9
      58                     0      0       0       0       0       0       6
      59                     0      0       0       0       0       0       5
      60                     0      0       0       0       0       0       4
      61                     0      0       0       0       0       0       2
      62                     0      0       0       0       0       0       1
      63                     0      0       0       0       0       0       5
      64                     0      0       0       0       0       0       1
      65                     0      0       0       0       0       0       5
      66                     0      0       0       0       0       0       3
      67                     0      0       0       0       0       0       2
      68                     0      0       0       0       0       0       1
      69                     0      0       0       0       0       0       3
      70                     0      0       0       0       0       0       1
      72                     0      0       0       0       0       0       0
      73                     0      0       0       0       0       0       0
      76                     0      0       0       0       0       0       0
      84                     0      0       0       0       0       0       0
      <NA>                   0      0       0       0       0       0       0
                        Categories
    Numeric Values       (70,100] <NA>
      0                         0    0
      0.0833333333333333        0    0
      0.25                      0    0
      0.333333333333333         0    0
      0.416666666666667         0    0
      0.5                       0    0
      0.583333333333333         0    0
      0.666666666666667         0    0
      0.75                      0    0
      0.833333333333333         0    0
      0.916666666666667         0    0
      1                         0    0
      1.5                       0    0
      2                         0    0
      3                         0    0
      4                         0    0
      5                         0    0
      6                         0    0
      7                         0    0
      8                         0    0
      9                         0    0
      10                        0    0
      11                        0    0
      12                        0    0
      13                        0    0
      14                        0    0
      15                        0    0
      16                        0    0
      17                        0    0
      18                        0    0
      19                        0    0
      20                        0    0
      21                        0    0
      22                        0    0
      23                        0    0
      24                        0    0
      25                        0    0
      26                        0    0
      27                        0    0
      28                        0    0
      29                        0    0
      30                        0    0
      31                        0    0
      32                        0    0
      33                        0    0
      34                        0    0
      35                        0    0
      36                        0    0
      37                        0    0
      38                        0    0
      39                        0    0
      40                        0    0
      41                        0    0
      42                        0    0
      43                        0    0
      44                        0    0
      45                        0    0
      46                        0    0
      47                        0    0
      48                        0    0
      49                        0    0
      50                        0    0
      51                        0    0
      52                        0    0
      53                        0    0
      54                        0    0
      55                        0    0
      56                        0    0
      57                        0    0
      58                        0    0
      59                        0    0
      60                        0    0
      61                        0    0
      62                        0    0
      63                        0    0
      64                        0    0
      65                        0    0
      66                        0    0
      67                        0    0
      68                        0    0
      69                        0    0
      70                        0    0
      72                        1    0
      73                        3    0
      76                        1    0
      84                        1    0
      <NA>                      0  107

    Re-labeling NA values


    You may want to assign NA values a label such as “Missing”. Because the new column is class Factor (restricted values), you cannot simply mutate it with replace_na(), as this value will be rejected. Instead, use fct_na_value_to_level() from forcats as explained in the Factors page.

    linelist <- linelist %>% 
      
      # cut() creates age_cat, automatically of class Factor      
      mutate(age_cat = cut(
        age_years,
        breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
        right = FALSE,
        include.lowest = TRUE,        
        labels = c("0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100")),
             
        # make missing values explicit
        age_cat = fct_na_value_to_level(
          age_cat,
          level = "Missing age")  # you can specify the label
      )    

    # table to view counts
    table(linelist$age_cat, useNA = "always")


            0-4         5-9       10-14       15-19       20-29       30-49 
           1227        1223        1048         827        1216         848 
          50-69      70-100 Missing age        <NA> 
            105           7         107           0 

    Quickly make breaks and labels


    For a fast way to make breaks and label vectors, use something like below. See the R basics page for references on seq() and rep().

    # Make break points from 0 to 90 by 5
    age_seq = seq(from = 0, to = 90, by = 5)
    age_seq

    # Make labels for the above categories, assuming default cut() settings
    age_labels = paste0(age_seq + 1, "-", age_seq + 5)
    age_labels

    # check that both vectors are the same length
    length(age_seq) == length(age_labels)
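
    A sketch of plugging these into cut() - note that cut() expects one fewer label than break points, so the last label must be dropped:

    # sketch: use the generated breaks and labels in cut()
    linelist %>% 
      mutate(age_cat = cut(
        age_years,
        breaks = age_seq,
        labels = head(age_labels, -1),   # 18 labels for 19 break points
        include.lowest = TRUE))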

    Read more about cut() in its Help page by entering ?cut in the R console.


    Quantile breaks


    In common understanding, “quantiles” or “percentiles” typically refer to a value below which a proportion of values fall. For example, the 95th percentile of ages in linelist would be the age below which 95% of the ages fall.


    However, in common speech, “quartiles” and “deciles” can also refer to the data as divided into 4 or 10 equal groups (note there will be one more break point than there are groups).


    To get quantile break points, you can use quantile() from the stats package of base R. Provide a numeric vector (e.g. a column in a dataset) and a vector of numeric probability values ranging from 0 to 1.0. The break points are returned as a numeric vector. Explore the details of the statistical methodologies by entering ?quantile.

    • If your input numeric vector has any missing values, it is best to set na.rm = TRUE.
    • Set names = FALSE to get an un-named numeric vector.

    quantile(linelist$age_years,               # specify numeric vector to work on
      probs = c(0, .25, .50, .75, .90, .95),   # specify the percentiles you want
      na.rm = TRUE)                            # ignore missing values 

     0% 25% 50% 75% 90% 95% 
      0   6  13  23  33  41 

    You can use the results of quantile() as break points in age_categories() or cut(). Below we create a new column deciles using cut(), where the breaks are defined using quantile() on age_years. Below, we display the results using tabyl() from janitor so you can see the percentages (see the Descriptive tables page). Note how they are not exactly 10% in each group.

    linelist %>%                                # begin with linelist
      mutate(deciles = cut(age_years,           # create new column decile as cut() on column age_years
        breaks = quantile(                      # define cut breaks using quantile()
          age_years,                               # operate on age_years
          probs = seq(0, 1, by = 0.1),             # 0.0 to 1.0 by 0.1
          na.rm = TRUE),                           # ignore missing values
        include.lowest = TRUE)) %>%             # for cut() include age 0
      janitor::tabyl(deciles)                   # pipe to table to display

     deciles   n    percent valid_percent
       [0,2] 748 0.11319613    0.11505922
       (2,5] 721 0.10911017    0.11090601
       (5,7] 497 0.07521186    0.07644978
      (7,10] 698 0.10562954    0.10736810
     (10,13] 635 0.09609564    0.09767728
     (13,17] 755 0.11425545    0.11613598
     (17,21] 578 0.08746973    0.08890940
     (21,26] 625 0.09458232    0.09613906
     (26,33] 596 0.09019370    0.09167820
     (33,84] 648 0.09806295    0.09967697
        <NA> 107 0.01619249            NA

    Evenly-sized groups


    Another tool to make numeric groups is the dplyr function ntile(), which attempts to break your data into n evenly-sized groups - but be aware that, unlike with quantile(), the same value could appear in more than one group. Provide the numeric vector and then the number of groups. The values in the new column are just group “numbers” (e.g. 1 to 10), not the ranges of values themselves as when using cut().

    # make groups with ntile()
    ntile_data <- linelist %>% 
      mutate(even_groups = ntile(age_years, 10))

    # make table of counts and proportions by group
    ntile_table <- ntile_data %>% 
      janitor::tabyl(even_groups)
      
    # attach min/max values to demonstrate ranges
    ntile_ranges <- ntile_data %>% 
      group_by(even_groups) %>% 
      summarise(
        min = min(age_years, na.rm=T),
        max = max(age_years, na.rm=T)
      )
    Warning: There were 2 warnings in `summarise()`.
    The first warning was:
    ℹ In argument: `min = min(age_years, na.rm = T)`.
    ℹ In group 11: `even_groups = NA`.
    Caused by warning in `min()`:
    ! no non-missing arguments to min; returning Inf
    ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
    # combine and print - note that values are present in multiple groups
    left_join(ntile_table, ntile_ranges, by = "even_groups")

     even_groups   n    percent valid_percent min  max
               1 651 0.09851695    0.10013844   0    2
               2 650 0.09836562    0.09998462   2    5
               3 650 0.09836562    0.09998462   5    7
               4 650 0.09836562    0.09998462   7   10
               5 650 0.09836562    0.09998462  10   13
               6 650 0.09836562    0.09998462  13   17
               7 650 0.09836562    0.09998462  17   21
               8 650 0.09836562    0.09998462  21   26
               9 650 0.09836562    0.09998462  26   33
              10 650 0.09836562    0.09998462  33   84
              NA 107 0.01619249            NA Inf -Inf

    case_when()


    It is possible to use the dplyr function case_when() to create categories from a numeric column, but it is easier to use age_categories() from epikit or cut() because these will create an ordered factor automatically.


    If using case_when(), please review the proper use as described earlier in the Re-code values section of this page. Also be aware that all right-hand side values must be of the same class. Thus, if the other right-hand side values are character and you want a truly missing result, use the special character NA value NA_character_ (or assign a label such as “Missing” instead).
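
    A sketch of this approach, with hypothetical break points (note the result is class character, not an ordered factor):

    # sketch: age categories via case_when() - every right-hand side value is character
    linelist %>% 
      mutate(age_cat = case_when(
        age_years < 5                    ~ "0-4",
        age_years >= 5 & age_years < 18  ~ "5-17",
        age_years >= 18                  ~ "18+",
        TRUE                             ~ NA_character_))  # missing ages stay missing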


    Add to pipe chain


    Below, code to create two categorical age columns is added to the cleaning pipe chain:

    # CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
    ##################################################################################

    # begin cleaning pipe chain
    ###########################
    linelist <- linelist_raw %>%
        
        # standardize column name syntax
        janitor::clean_names() %>% 
        
        # manually re-name columns
               # NEW name             # OLD name
        rename(date_infection       = infection_date,
               date_hospitalisation = hosp_date,
               date_outcome         = date_of_outcome) %>% 
        
        # remove column
        select(-c(row_num, merged_header, x28)) %>% 
      
        # de-duplicate
        distinct() %>% 

        # add column
        mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

        # convert class of columns
        mutate(across(contains("date"), as.Date), 
               generation = as.numeric(generation),
               age        = as.numeric(age)) %>% 
        
        # add column: delay to hospitalisation
        mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
        
        # clean values of hospital column
        mutate(hospital = recode(hospital,
                          # OLD = NEW
                          "Mitylira Hopital"  = "Military Hospital",
                          "Mitylira Hospital" = "Military Hospital",
                          "Military Hopital"  = "Military Hospital",
                          "Port Hopital"      = "Port Hospital",
                          "Central Hopital"   = "Central Hospital",
                          "other"             = "Other",
                          "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                          )) %>% 
        
        mutate(hospital = replace_na(hospital, "Missing")) %>% 

        # create age_years column (from age and age_unit)
        mutate(age_years = case_when(
              age_unit == "years" ~ age,
              age_unit == "months" ~ age/12,
              is.na(age_unit) ~ age)) %>% 
      
        # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
        ###################################################   
        mutate(
              # age categories: custom
              age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
            
              # age categories: 0 to 85 by 5s
              age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

    8.10 Add rows


    One-by-one


    Adding rows one-by-one manually is tedious but can be done with add_row() from dplyr. Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

    linelist <- linelist %>% 
      add_row(row_num = 666,
              case_id = "abc",
              generation = 4,
              `infection date` = as.Date("2020-10-10"),
              .before = 2)

    Use .before and .after to specify the placement of the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).


    The new row number may look strange (“…23”) but the row numbers in the pre-existing rows have changed. So if using the command twice, examine/test the insertion carefully.


    If a class you provide is off you will see an error like this:

    Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.

    (when inserting a row with a date value, remember to wrap the date in the function as.Date() like as.Date("2020-10-10")).


    Bind rows


    To combine datasets together by binding the rows of one dataframe to the bottom of another data frame, you can use bind_rows() from dplyr. This is explained in more detail in the page Joining data.
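
    A minimal sketch, assuming a second data frame linelist_new whose columns have the same names and classes:

    # sketch: stack the rows of linelist_new underneath linelist (columns matched by name)
    combined <- bind_rows(linelist, linelist_new)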


    8.11 Filter rows


    A typical cleaning step after you have cleaned the columns and re-coded values is to filter the data frame for specific rows using the dplyr verb filter().


    Within filter(), specify the logic that must be TRUE for a row in the dataset to be kept. Below we show how to filter rows based on simple and complex logical conditions.


    Simple filter


    This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses evaluates to TRUE are kept.


    In this example, the logical statement is gender == "f", which is asking whether the value in the column gender is equal to “f” (case sensitive).


    Before the filter is applied, check the number of rows in linelist with nrow(linelist).

    linelist <- linelist %>% 
      filter(gender == "f")   # keep only rows where gender is equal to "f"

    After the filter is applied, re-run nrow(linelist) to see how many rows remain.


    Filter out missing values


    It is fairly common to want to filter out rows that have missing values. Resist the urge to write filter(!is.na(column) & !is.na(column)) and instead use the tidyr function that is custom-built for this purpose: drop_na(). If run with empty parentheses, it removes rows with any missing values. Alternatively, you can provide names of specific columns to be evaluated for missingness, or use the “tidyselect” helper functions described above.

    linelist %>% 
      drop_na(case_id, age_years)  # drop rows with missing values for case_id or age_years
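
    A sketch using a tidyselect helper instead of naming each column:

    # sketch: drop rows missing a value in any column whose name contains "date"
    linelist %>% 
      drop_na(contains("date"))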

    See the page on Missing data for many techniques to analyse and manage missingness in your data.


    Filter by row number


    In a data frame or tibble, each row will usually have a “row number” that (when seen in R Viewer) appears to the left of the first column. It is not itself a true column in the data, but it can be used in a filter() statement.


    To filter based on “row number”, you can use the dplyr function row_number() with open parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the special dplyr function head().

    # View first 100 rows
    linelist %>% head(100)     # or use tail() to see the n last rows

    # Show row 5 only
    linelist %>% filter(row_number() == 5)

    # View rows 2 through 20, and three specific columns
    linelist %>% filter(row_number() %in% 2:20) %>% select(date_onset, outcome, age)

    You can also convert the row numbers to a true column by piping your data frame to the tibble function rownames_to_column() (do not put anything in the parentheses).
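
    A minimal sketch:

    # sketch: convert row numbers to a true column (named "rowname" by default, class character)
    linelist %>% 
      tibble::rownames_to_column()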


    Complex filter


    More complex logical statements can be constructed using parentheses ( ), OR |, negate !, %in%, and AND & operators. An example is below:


    Note: You can use the ! operator in front of a logical criteria to negate it. For example, !is.na(column) evaluates to true if the column value is not missing. Likewise !column %in% c("a", "b", "c") evaluates to true if the column value is not in the vector.
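
    A sketch combining these operators (the criteria here are hypothetical):

    # sketch: keep children at two specific hospitals, OR any row with a recorded outcome
    linelist %>% 
      filter(
        (age_years < 18 & hospital %in% c("Port Hospital", "Central Hospital")) |
        !is.na(date_outcome))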


    Examine the data


    Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.

    hist(linelist$date_onset, breaks = 50)

    [Figure: histogram of onset dates, showing the small 2012-2013 outbreak and the larger 2014-2015 outbreak]

    How filters handle missing numeric and date values


    Can we just filter date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!


    DANGER: Filtering to greater than (>) or less than (<) a date or number will also remove any rows with missing values (NA)! This is because a comparison with NA evaluates to NA, and filter() keeps only rows where the condition evaluates to TRUE.
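
    If you intend to keep rows with missing values, say so explicitly in the logic - a sketch:

    # sketch: keep rows with onset after the cut-off OR with a missing onset date
    linelist %>% 
      filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))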


    (See the page on Working with dates for more information on working with dates and the package lubridate)


    Design the filter


    Examine a cross-tabulation to make sure we exclude only the correct rows:

    table(Hospital  = linelist$hospital,                     # hospital name
          YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
          useNA     = "always")                              # show missing values

                                          YearOnset
    Hospital                               2012 2013 2014 2015 <NA>
      Central Hospital                        0    0  351   99   18
      Hospital A                            229   46    0    0   15
      Hospital B                            227   47    0    0   15
      Military Hospital                       0    0  676  200   34
      Missing                                 0    0 1117  318   77
      Other                                   0    0  684  177   46
      Port Hospital                           9    1 1372  347   75
      St. Mark's Maternity Hospital (SMMH)    0    0  322   93   13
      <NA>                                    0    0    0    0    0

    What other criteria can we filter on to remove the first outbreak (in 2012 & 2013) from the dataset? We see that:

    • The first epidemic in 2012 & 2013 occurred at Hospital A and Hospital B, and there were also 10 cases at Port Hospital.
    • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

    We want to exclude:

    • The rows with onset in 2012 and 2013 at either Hospital A, Hospital B, or Port Hospital: nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01")))
      • Exclude rows with onset in 2012 and 2013: nrow(linelist %>% filter(date_onset < as.Date("2013-06-01")))
      • Exclude rows from Hospitals A & B with missing onset dates: nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))
      • Do not exclude other rows with missing onset dates: nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset)))

    We start with a linelist of nrow(linelist) rows. Here is our filter statement:

    linelist <- linelist %>% 
      # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
      filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

    nrow(linelist)

    [1] 6019

    When we re-make the cross-tabulation, we see that Hospitals A & B have been removed completely, the 10 Port Hospital cases from 2012 & 2013 have been removed, and all other values are the same - just as we wanted.

    table(Hospital  = linelist$hospital,                     # hospital name
          YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
          useNA     = "always")                              # show missing values

                                          YearOnset
    Hospital                               2014 2015 <NA>
      Central Hospital                      351   99   18
      Military Hospital                     676  200   34
      Missing                              1117  318   77
      Other                                 684  177   46
      Port Hospital                        1372  347   75
      St. Mark's Maternity Hospital (SMMH)  322   93   13
      <NA>                                    0    0    0

    Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.
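
    A sketch of both styles:

    # conditions separated by commas are combined with AND
    linelist %>% 
      filter(!is.na(case_id),
             gender == "f")

    # equivalent, with a separate filter() step for each condition
    linelist %>% 
      filter(!is.na(case_id)) %>% 
      filter(gender == "f")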


    Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete with no missing values. This is true. But date_onset is used for purposes of demonstrating a complex filter.


    Standalone


    Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

    # dataframe <- filter(dataframe, condition(s) for rows to keep)

    linelist <- filter(linelist, !is.na(case_id))

    You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

    # dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

    linelist <- linelist[!is.na(case_id), ]

    Quickly review records


    Often you want to quickly review a few records, for only a few columns. The base R function View() will open a data frame for viewing in RStudio.


    View the linelist in RStudio:

    View(linelist)

    Here are two examples of viewing specific cells (specific rows, and specific columns):


    With dplyr functions filter() and select():


    Within View(), pipe the dataset to filter() to keep certain rows, and then to select() to keep certain columns. For example, to review onset and hospitalization dates of 3 specific cases:

    View(linelist %>%
           filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
           select(date_onset, date_hospitalisation))

    You can achieve the same with base R syntax, using brackets [ ] to subset what you want to see.

    View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

    Add to pipe chain

    # CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
    ##################################################################################

    # begin cleaning pipe chain
    ###########################
    linelist <- linelist_raw %>%
        
        # standardize column name syntax
        janitor::clean_names() %>% 
        
        # manually re-name columns
               # NEW name             # OLD name
        rename(date_infection       = infection_date,
               date_hospitalisation = hosp_date,
               date_outcome         = date_of_outcome) %>% 
        
        # remove column
        select(-c(row_num, merged_header, x28)) %>% 
      
        # de-duplicate
        distinct() %>% 

        # add column
        mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

        # convert class of columns
        mutate(across(contains("date"), as.Date), 
               generation = as.numeric(generation),
               age        = as.numeric(age)) %>% 
        
        # add column: delay to hospitalisation
        mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
        
        # clean values of hospital column
        mutate(hospital = recode(hospital,
                          # OLD = NEW
                          "Mitylira Hopital"  = "Military Hospital",
                          "Mitylira Hospital" = "Military Hospital",
                          "Military Hopital"  = "Military Hospital",
                          "Port Hopital"      = "Port Hospital",
                          "Central Hopital"   = "Central Hospital",
                          "other"             = "Other",
                          "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                          )) %>% 
        
        mutate(hospital = replace_na(hospital, "Missing")) %>% 

        # create age_years column (from age and age_unit)
        mutate(age_years = case_when(
              age_unit == "years" ~ age,
              age_unit == "months" ~ age/12,
              is.na(age_unit) ~ age)) %>% 
      
        mutate(
              # age categories: custom
              age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
            
              # age categories: 0 to 85 by 5s
              age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
        
        # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
        ###################################################
        filter(
              # keep only rows where case_id is not missing
              !is.na(case_id),  
              
              # also filter to keep only the second outbreak
              date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

    8.12 Row-wise calculations


    If you want to perform a calculation within a row, you can use rowwise() from dplyr. See this online vignette on row-wise calculations. For example, this code applies rowwise() and then creates a new column that sums the number of the specified symptom columns that have value “yes”, for each row in the linelist. The columns are specified within sum() by name within a vector c(). rowwise() is essentially a special kind of group_by(), so it is best to use ungroup() when you are done (page on Grouping data).

    linelist %>%
      rowwise() %>%
      mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>% 
      ungroup() %>% 
      select(fever, chills, cough, aches, vomit, num_symptoms) # for display

    # A tibble: 5,888 × 6
       fever chills cough aches vomit num_symptoms
       <chr> <chr>  <chr> <chr> <chr>        <int>
     1 no    no     yes   no    yes              2
     2 <NA>  <NA>   <NA>  <NA>  <NA>            NA
     3 <NA>  <NA>   <NA>  <NA>  <NA>            NA
     4 no    no     no    no    no               0
     5 no    no     yes   no    yes              2
     6 no    no     yes   no    yes              2
     7 <NA>  <NA>   <NA>  <NA>  <NA>            NA
     8 no    no     yes   no    yes              2
     9 no    no     yes   no    yes              2
    10 no    no     yes   no    no               1
    # ℹ 5,878 more rows

    As you specify the columns to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).


    Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:

    • Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns).
    • Creates a new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
    • ungroup() to remove the effects of rowwise() for subsequent steps.
    linelist %>%
      rowwise() %>%
      mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>% 
      ungroup() %>% 
      select(num_NA_dates, contains("date")) # for display

    # A tibble: 5,888 × 5
       num_NA_dates date_infection date_onset date_hospitalisation date_outcome
              <int> <date>         <date>     <date>               <date>      
     1            1 2014-05-08     2014-05-13 2014-05-15           NA          
     2            1 NA             2014-05-13 2014-05-14           2014-05-18  
     3            1 NA             2014-05-16 2014-05-18           2014-05-30  
     4            1 2014-05-04     2014-05-18 2014-05-20           NA          
     5            0 2014-05-18     2014-05-21 2014-05-22           2014-05-29  
     6            0 2014-05-03     2014-05-22 2014-05-23           2014-05-24  
     7            0 2014-05-22     2014-05-27 2014-05-29           2014-06-01  
     8            0 2014-05-28     2014-06-02 2014-06-03           2014-06-07  
     9            1 NA             2014-06-05 2014-06-06           2014-06-18  
    10            1 NA             2014-06-05 2014-06-07           2014-06-09  
    # ℹ 5,878 more rows

    You could also provide other functions, such as max() to get the latest or most recent date for each row:

    linelist %>%
      rowwise() %>%
      mutate(latest_date = max(c_across(contains("date")), na.rm=T)) %>% 
      ungroup() %>% 
      select(latest_date, contains("date"))  # for display

    # A tibble: 5,888 × 5
       latest_date date_infection date_onset date_hospitalisation date_outcome
       <date>      <date>         <date>     <date>               <date>      
     1 2014-05-15  2014-05-08     2014-05-13 2014-05-15           NA          
     2 2014-05-18  NA             2014-05-13 2014-05-14           2014-05-18  
     3 2014-05-30  NA             2014-05-16 2014-05-18           2014-05-30  
     4 2014-05-20  2014-05-04     2014-05-18 2014-05-20           NA          
     5 2014-05-29  2014-05-18     2014-05-21 2014-05-22           2014-05-29  
     6 2014-05-24  2014-05-03     2014-05-22 2014-05-23           2014-05-24  
     7 2014-06-01  2014-05-22     2014-05-27 2014-05-29           2014-06-01  
     8 2014-06-07  2014-05-28     2014-06-02 2014-06-03           2014-06-07  
     9 2014-06-18  NA             2014-06-05 2014-06-06           2014-06-18  
    10 2014-06-09  NA             2014-06-05 2014-06-07           2014-06-09  
    # ℹ 5,878 more rows

    8.13 Arrange and sort


    Use the dplyr function arrange() to sort or order the rows by column values.


    Simply list the columns in the order they should be sorted on. Specify .by_group = TRUE if you want the sorting to first occur by any groupings applied to the data (see the page on Grouping data).


    By default, columns will be sorted in “ascending” order (which applies to numeric and character columns alike). You can sort a variable in “descending” order by wrapping it in desc().


    Sorting data with arrange() is particularly useful when making tables for presentation, using slice() to take the “top” rows per group, or setting factor level order by order of appearance.


    For example, to sort our linelist rows by hospital, then by date_onset in descending order, we would use:

    linelist %>% 
       arrange(hospital, desc(date_onset))
\ No newline at end of file
diff --git a/new_pages/cleaning.qmd b/new_pages/cleaning.qmd
index 95fe5af2..75cfea22 100644
--- a/new_pages/cleaning.qmd
+++ b/new_pages/cleaning.qmd
@@ -1007,17 +1007,17 @@ linelist <- linelist %>%
 ```
 
-**fct_explicit_na()**
+**fct_na_value_to_level()**
 
 This is a function from the **forcats** package. The **forcats** package handles columns of class Factor. Factors are R’s way to handle *ordered* values such as `c("First", "Second", "Third")` or to set the order that values (e.g. hospitals) appear in tables and plots. See the page on [Factors](factors.qmd).
 
 If your data are class Factor and you try to convert `NA` to "Missing" by using `replace_na()`, you will get this error: `invalid factor level, NA generated`. You have tried to add "Missing" as a value, when it was not defined as a possible level of the factor, and it was rejected.
 
-The easiest way to solve this is to use the **forcats** function `fct_explicit_na()` which converts a column to class factor, and converts `NA` values to the character "(Missing)".
+The easiest way to solve this is to use the **forcats** function `fct_na_value_to_level()` which converts a column to class factor, and converts `NA` values to the character "(Missing)".
 
 ```{r, eval=F}
 linelist %>% 
-  mutate(hospital = fct_explicit_na(hospital))
+  mutate(hospital = fct_na_value_to_level(hospital))
 ```
 
 A slower alternative would be to add the factor level using `fct_expand()` and then convert the missing values.
@@ -1342,7 +1342,7 @@ table("Numeric Values" = linelist$age_years,   # names specified in table for cl
 
 **Re-labeling `NA` values**
 
-You may want to assign `NA` values a label such as "Missing". Because the new column is class Factor (restricted values), you cannot simply mutate it with `replace_na()`, as this value will be rejected. Instead, use `fct_explicit_na()` from **forcats** as explained in the [Factors](factors.qmd) page.
+You may want to assign `NA` values a label such as "Missing". Because the new column is class Factor (restricted values), you cannot simply mutate it with `replace_na()`, as this value will be rejected. Instead, use `fct_na_value_to_level()` from **forcats** as explained in the [Factors](factors.qmd) page.
 
 ```{r}
 linelist <- linelist %>% 
@@ -1356,9 +1356,9 @@ linelist <- linelist %>% 
     labels = c("0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100")),
 
     # make missing values explicit
-    age_cat = fct_explicit_na(
+    age_cat = fct_na_value_to_level(
       age_cat,
-      na_level = "Missing age")  # you can specify the label
+      level = "Missing age")  # you can specify the label
   )
 
 # table to view counts
diff --git a/new_pages/cleaning_files/figure-html/unnamed-chunk-69-1.png b/new_pages/cleaning_files/figure-html/unnamed-chunk-69-1.png
deleted file mode 100644
index e06a3121..00000000
Binary files a/new_pages/cleaning_files/figure-html/unnamed-chunk-69-1.png and /dev/null differ
diff --git a/new_pages/cleaning_files/figure-html/unnamed-chunk-87-1.png b/new_pages/cleaning_files/figure-html/unnamed-chunk-87-1.png
deleted file mode 100644
index cdfad7a8..00000000
Binary files a/new_pages/cleaning_files/figure-html/unnamed-chunk-87-1.png and /dev/null differ
diff --git a/new_pages/contact_tracing.qmd b/new_pages/contact_tracing.qmd
index b2dadb54..4ed1c1ec 100644
--- a/new_pages/contact_tracing.qmd
+++ b/new_pages/contact_tracing.qmd
@@ -17,7 +17,7 @@ You can read more about the Go.Data project on the [Github Documentation site](h
 This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize `p_load()` from **pacman**, which installs the package if necessary *and* loads it for use. You can also load installed packages with `library()` from **base** R. See the page on [R basics](basics.qmd) for more information on R packages.
 
-```{r, message = F}
+```{r, message = F, warning=F}
 pacman::p_load(
   rio,  # importing data
   here, # relative file pathways
@@ -69,7 +69,7 @@ Below, the datasets are imported using the `import()` function from the **rio**
 These data are a table of the cases, and information about them.
 
-```{r}
+```{r, warning=F, message=F}
 cases <- import(here("data", "godata", "cases_clean.rds")) %>%
   select(case_id, firstName, lastName, gender, age, age_class,
          occupation, classification, was_contact, hospitalization_typeid)
@@ -90,7 +90,7 @@ These data are a table of all the contacts and information about them. Again, pr
 * Artificially assign rows with missing admin level 2 to "Djembe", to improve clarity of some example visualisations.
 
-```{r}
+```{r, warning=F, message=F}
 contacts <- import(here("data", "godata", "contacts_clean.rds")) %>%
   mutate(age_class = forcats::fct_rev(age_class)) %>%
   select(contact_id, contact_status, firstName, lastName, gender, age,
@@ -112,7 +112,7 @@ These data are records of the "follow-up" interactions with the contacts. Each c
 We import and perform a few cleaning steps. We select certain columns, and also convert a character column to all lowercase values.
 
-```{r}
+```{r, warning=F, message=F}
 followups <- rio::import(here::here("data", "godata", "followups_clean.rds")) %>%
   select(contact_id, followup_status, followup_number,
          date_of_followup, admin_2_name, admin_1_name) %>%
@@ -610,4 +610,5 @@ ggplot(data = long_prop) +    # use long data, with proportions as Freq
 ## Resources
 
 [Go.Data](https://worldhealthorganization.github.io/godata/)
+
 [Automated R Reporting using Go.Data API](https://github.com/WorldHealthOrganization/godata/tree/master/analytics/r-reporting)
diff --git a/new_pages/dates.qmd b/new_pages/dates.qmd
index f3cc3f55..fd3de052 100644
--- a/new_pages/dates.qmd
+++ b/new_pages/dates.qmd
@@ -35,14 +35,15 @@ pacman::p_load(
   zoo,          # additional date/time functions
   here,         # file management
   rio,          # data import/export
-  tidyverse)    # data management and visualization
+  tidyverse     # data management and visualization
+  )
 ```
 
 ### Import data {.unnumbered}
 
 We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow along step-by-step, see instruction in the [Download handbook and data](data_used.qmd) page. We assume the file is in the working directory so no sub-folders are specified in this file path.
 
-```{r, echo=F}
+```{r, echo=F, warning=F, message=F}
 linelist <- rio::import(here::here("data", "case_linelists", "linelist_cleaned.rds"))
 ```
 
@@ -214,7 +215,12 @@ You can use the **lubridate** functions `make_date()` and `make_datetime()` to c
 ```{r, eval=F}
 linelist <- linelist %>% 
-  mutate(onset_date = make_date(year = onset_year, month = onset_month, day = onset_day))
+  mutate(
+    onset_date = make_date(
+      year = onset_year,
+      month = onset_month,
+      day = onset_day)
+  )
 ```
 
diff --git a/new_pages/descriptive_statistics.qmd b/new_pages/descriptive_statistics.qmd
index 001e7cee..0c4da8c6 100644
--- a/new_pages/descriptive_statistics.qmd
+++ b/new_pages/descriptive_statistics.qmd
@@ -29,7 +29,7 @@ pacman::p_load(
 We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the "clean" linelist (as .rds file). The dataset is imported using the `import()` function from the **rio** package. See the page on [Import and export](importing.qmd) for various ways to import data.
 
-```{r, echo=F}
+```{r, echo=F, warning=F, message=F}
 # import the linelist into R
 linelist <- rio::import(here::here("data", "case_linelists", "linelist_cleaned.rds"))
 ```
diff --git a/new_pages/editorial_style.qmd b/new_pages/editorial_style.qmd
index 1769aea7..4e1bd2a2 100644
--- a/new_pages/editorial_style.qmd
+++ b/new_pages/editorial_style.qmd
@@ -139,7 +139,7 @@ With version 1.0.1 the following changes have been implemented:
 * Interactive plots: added `ungroup()` to chunk that makes `agg_weeks` so that `expand()` works as intended.
 * Time series: added `data.frame()` around objects within all `trending::fit()` and `predict()` commands.
 * Combinations analysis: Switch `case_when()` to `ifelse()` and added optional `across()` code for preparing the data.
-* Transmission chains: Update to more recent version of **epicontacts**.
+* Transmission chains: Update to more recent version of [**epicontacts**](https://www.repidemicsconsortium.org/epicontacts/).
diff --git a/new_pages/epidemic_models.qmd b/new_pages/epidemic_models.qmd
index 5606f730..a82a02a9 100644
--- a/new_pages/epidemic_models.qmd
+++ b/new_pages/epidemic_models.qmd
@@ -9,9 +9,9 @@ There exists a growing body of tools for epidemic modelling that lets us
 conduct fairly complex analyses with minimal effort.
 
 This section will provide an overview on how to use these tools to:
 
-* estimate the effective reproduction number Rt and related statistics
+* Estimate the effective reproduction number Rt and related statistics
   such as the doubling time.
-* produce short-term projections of future incidence.
+* Produce short-term projections of future incidence.
 
 It is *not* intended as an overview of the methodologies and statistical
 methods underlying these tools, so please refer to the Resources tab for links to some
@@ -139,7 +139,7 @@ pacman::p_load(
 We will use the cleaned case linelist for all analyses in this section. If you want to follow along, click to download the "clean" linelist (as .rds file). See the [Download handbook and data](data_used.qmd) page to download all example data used in this handbook.
 
-```{r, echo=F}
+```{r, echo=F, warning=F, message=F}
 # import the linelist into R
 linelist <- rio::import(here::here("data", "case_linelists", "linelist_cleaned.rds"))
 ```
@@ -473,7 +473,8 @@ cases <- incidence2::incidence(linelist, date_index = "date_onset") %>% # get ca
        by = "day"), fill = list(count = 0)) %>% # convert NA counts to 0
   rename(I = count, # rename to names expected by estimateR
-        dates = date_index)
+        dates = date_index
+        )
 ```
 
 The package provides several options for specifying the serial interval, the
diff --git a/new_pages/factors.html b/new_pages/factors.html
deleted file mode 100644
index f456e7cf..00000000
--- a/new_pages/factors.html
+++ /dev/null
@@ -1,1962 +0,0 @@
11  Factors
In R, factors are a class of data that allow for ordered categories with a fixed set of acceptable values.

Typically, you would convert a column from character or numeric class to a factor if you want to set an intrinsic order to the values ("levels") so they can be displayed non-alphabetically in plots and tables. Another common use of factors is to standardise the legends of plots so they do not fluctuate if certain values are temporarily absent from the data.

This page demonstrates use of functions from the package forcats (a short name for "For categorical variables") and some base R functions. We also touch upon the use of lubridate and aweek for special factor cases related to epidemiological weeks.

A complete list of forcats functions can be found online here. Below we demonstrate some of the most common ones.

11.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
pacman::p_load(
  rio,           # import/export
  here,          # filepaths
  lubridate,     # working with dates
  forcats,       # factors
  aweek,         # create epiweeks with automatic factor levels
  janitor,       # tables
  tidyverse      # data mgmt and viz
  )
Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the "clean" linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import your dataset
linelist <- import("linelist_cleaned.rds")
New categorical variable

For demonstration in this page we will use a common scenario - the creation of a new categorical variable.

Note that if you convert a numeric column to class factor, you will not be able to calculate numeric statistics on it, as the quick illustration below shows.
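A quick illustration of this point (a made-up vector, not from the linelist):

ages <- factor(c(10, 25, 40))          # numeric values stored as factor levels
mean(ages)                             # returns NA with a warning - no numeric calculation

# to recover the numbers, convert via character first
mean(as.numeric(as.character(ages)))   # 25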
Create column

We use the existing column days_onset_hosp (days from symptom onset to hospital admission) and create a new column delay_cat by classifying each row into one of several categories. We do this with the dplyr function case_when(), which sequentially applies logical criteria (right-side) to each row and returns the corresponding left-side value for the new column delay_cat. Read more about case_when() in Cleaning data and core functions.
linelist <- linelist %>% 
  mutate(delay_cat = case_when(
    # criteria                                   # new value if TRUE
    days_onset_hosp < 2                        ~ "<2 days",
    days_onset_hosp >= 2 & days_onset_hosp < 5 ~ "2-5 days",
    days_onset_hosp >= 5                       ~ ">5 days",
    is.na(days_onset_hosp)                     ~ NA_character_,
    TRUE                                       ~ "Check me"))

    Default value order

    -

    As created with case_when(), the new column delay_cat is a categorical column of class Character - not yet a factor. Thus, in a frequency table, we see that the unique values appear in a default alpha-numeric order - an order that does not make much intuitive sense:

    -
    -
table(linelist$delay_cat, useNA = "always")

 <2 days  >5 days 2-5 days     <NA> 
    2990      602     2040      256 
Likewise, if we make a bar plot, the values also appear in this order on the x-axis (see the ggplot basics page for more on ggplot2 - the most common visualization package in R).

ggplot(data = linelist) +
  geom_bar(mapping = aes(x = delay_cat))
11.2 Convert to factor

To convert a character or numeric column to class factor, you can use any function from the forcats package (many are detailed below). They will convert to class factor and then also perform or allow certain ordering of the levels - for example using fct_relevel() lets you manually specify the level order. The function as_factor() simply converts the class without any further capabilities.

The base R function factor() converts a column to factor and allows you to manually specify the order of the levels, as a character vector to its levels = argument.
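For example, a minimal sketch of this base R approach, re-using the delay_cat levels from this page:

# base R equivalent: convert to factor and set the level order in one step
linelist$delay_cat <- factor(
  linelist$delay_cat,
  levels = c("<2 days", "2-5 days", ">5 days"))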
Below we use mutate() and fct_relevel() to convert the column delay_cat from class character to class factor. The column delay_cat is created in the Preparation section above.

linelist <- linelist %>%
  mutate(delay_cat = fct_relevel(delay_cat))
The unique "values" in this column are now considered "levels" of the factor. The levels have an order, which can be printed with the base R function levels(), or alternatively viewed in a count table via table() from base R or tabyl() from janitor. By default, the order of the levels will be alpha-numeric, as before. Note that NA is not a factor level.

levels(linelist$delay_cat)

[1] "<2 days"  ">5 days"  "2-5 days"
The function fct_relevel() has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. If you want to create levels that do not exist in the data, use fct_expand() instead.

linelist <- linelist %>%
  mutate(delay_cat = fct_relevel(delay_cat, "<2 days", "2-5 days", ">5 days"))
We can now see that the levels are now ordered sensibly, as specified in the previous command.

levels(linelist$delay_cat)

[1] "<2 days"  "2-5 days" ">5 days" 
Now the plot order makes more intuitive sense as well.

ggplot(data = linelist) +
  geom_bar(mapping = aes(x = delay_cat))
11.3 Add or drop levels

Add

If you need to add levels to a factor, you can do this with fct_expand(). Just write the column name followed by the new levels (separated by commas). By tabulating the values, we can see the new levels and the zero counts. You can use table() from base R, or tabyl() from janitor:
linelist %>% 
  mutate(delay_cat = fct_expand(delay_cat, "Not admitted to hospital", "Transfer to other jurisdiction")) %>% 
  tabyl(delay_cat)   # print table

                      delay_cat    n    percent valid_percent
                        <2 days 2990 0.50781250     0.5308949
                       2-5 days 2040 0.34646739     0.3622159
                        >5 days  602 0.10224185     0.1068892
       Not admitted to hospital    0 0.00000000     0.0000000
 Transfer to other jurisdiction    0 0.00000000     0.0000000
                           <NA>  256 0.04347826            NA
Note: there is a special forcats function to easily add missing values (NA) as a level. See the section on Missing values below.
Drop

If you use fct_drop(), the "unused" levels with zero counts will be dropped from the set of levels. The levels we added above ("Not admitted to hospital" and "Transfer to other jurisdiction") exist as levels, but no rows actually have those values, so they will be dropped by applying fct_drop() to our factor column:
linelist %>% 
  mutate(delay_cat = fct_drop(delay_cat)) %>% 
  tabyl(delay_cat)

 delay_cat    n    percent valid_percent
   <2 days 2990 0.50781250     0.5308949
  2-5 days 2040 0.34646739     0.3622159
   >5 days  602 0.10224185     0.1068892
      <NA>  256 0.04347826            NA

11.4 Adjust level order

The package forcats offers useful functions to easily adjust the order of a factor's levels (after a column has been defined as class factor).

These functions can be applied to a factor column in two contexts:

1. To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data.
2. Inside of a plot, so that the change is applied only within the plot.
Manually

The function fct_relevel() is used to manually order the factor levels. If used on a non-factor column, the column will first be converted to class factor.

Within the parentheses first provide the factor column name, then provide either:

• All the levels in the desired order (as a character vector c()), or
• One level and its corrected placement using the after = argument.

Here is an example of redefining the column delay_cat (which is already class Factor) and specifying the desired order of all the levels.
# re-define level order
linelist <- linelist %>% 
  mutate(delay_cat = fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days")))
If you only want to move one level, you can specify it to fct_relevel() alone and give a number to the after = argument to indicate where in the order it should be. For example, the command below shifts "<2 days" to the second position:

# re-define level order
linelist %>% 
  mutate(delay_cat = fct_relevel(delay_cat, "<2 days", after = 1)) %>% 
  tabyl(delay_cat)
Within a plot

The forcats commands can be used to set the level order in the data frame, or only within a plot. By using the command to "wrap around" the column name within the ggplot() plotting command, you can reverse/relevel/etc.; the transformation will apply only within that plot.

Below, two plots are created with ggplot() (see the ggplot basics page). In the first, the delay_cat column is mapped to the x-axis of the plot, with its default level order as in the data linelist. In the second example it is wrapped within fct_relevel() and the order is changed in the plot.
# Alpha-numeric default order - no adjustment within ggplot
ggplot(data = linelist) +
    geom_bar(mapping = aes(x = delay_cat))

# Factor level order adjusted within ggplot
ggplot(data = linelist) +
  geom_bar(mapping = aes(x = fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days"))))
Note that the default x-axis title is now quite complicated - you can overwrite this title with the ggplot2 labs() function.
Reverse

It is rather common that you want to reverse the level order. Simply wrap the factor with fct_rev(), as in the sketch below.
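A brief sketch, re-using the bar plot from above (the plot itself is an assumed illustration):

ggplot(data = linelist) +
  geom_bar(mapping = aes(x = fct_rev(delay_cat)))   # levels reversed within this plot only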
Note that if you want to reverse only a plot legend but not the actual factor levels, you can do that with guides() (see ggplot tips).
By frequency

To order by the frequency with which each value appears in the data, use fct_infreq(). Any missing values (NA) will automatically be included at the end, unless they are converted to an explicit level (see this section). You can reverse the order by further wrapping with fct_rev().

This function can be used within a ggplot(), as shown below.
# ordered by frequency
ggplot(data = linelist, aes(x = fct_infreq(delay_cat))) +
  geom_bar() +
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by frequency")

# reversed frequency
ggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat)))) +
  geom_bar() +
  labs(x = "Delay onset to admission (days)",
       title = "Reverse of order by frequency")

By appearance

Use fct_inorder() to set the level order to match the order of appearance in the data, starting from the first row. This can be useful if you first carefully arrange() the data in the data frame, and then use this to set the factor order, as in the sketch below.
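A minimal sketch of that workflow (sorting by date_onset is an assumed example, not prescribed by this page):

linelist %>% 
  arrange(date_onset) %>%                        # sort the rows as desired
  mutate(delay_cat = fct_inorder(delay_cat))     # levels set by order of first appearance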
By summary statistic of another column

You can use fct_reorder() to order the levels of one column by a summary statistic of another column. Visually, this can result in pleasing plots where the bars/points ascend or descend steadily across the plot.

In the examples below, the x-axis is delay_cat, and the y-axis is numeric column ct_blood (cycle-threshold value). Box plots show the CT value distribution by delay_cat group. We want to order the box plots in ascending order by the group median CT value.

In the first example below, the default alpha-numeric level order is used. You can see the box plot heights are jumbled and not in any particular order. In the second example, the delay_cat column (mapped to the x-axis) has been wrapped in fct_reorder(), the column ct_blood is given as the second argument, and "median" is given as the third argument (you could also use "max", "mean", "min", etc). Thus, the order of the levels of delay_cat will now reflect the ascending median CT value of each delay_cat group. This is reflected in the second plot - the box plots have been re-arranged to ascend. Note how NA (missing) will appear at the end, unless converted to an explicit level.
# boxplots ordered by original factor levels
ggplot(data = linelist) +
  geom_boxplot(
    aes(x = delay_cat,
        y = ct_blood, 
        fill = delay_cat)) +
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by original alpha-numeric levels") +
  theme_classic() +
  theme(legend.position = "none")

# boxplots ordered by median CT value
ggplot(data = linelist) +
  geom_boxplot(
    aes(x = fct_reorder(delay_cat, ct_blood, "median"),
        y = ct_blood,
        fill = delay_cat)) +
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by median CT value in group") +
  theme_classic() +
  theme(legend.position = "none")

Note that in this example above there are no steps required prior to the ggplot() call - the grouping and calculations are all done internally to the ggplot command.
By "end" value

Use fct_reorder2() for grouped line plots. It orders the levels (and therefore the legend) to align with the vertical ordering of the lines at the "end" of the plot. Technically speaking, it "orders by the y-values associated with the largest x values."

For example, if you have lines showing case counts by hospital over time, you can apply fct_reorder2() to the color = argument within aes(), such that the vertical order of hospitals appearing in the legend aligns with the order of lines at the terminal end of the plot. Read more in the online documentation.
epidemic_data <- linelist %>%                     # begin with the linelist   
    filter(date_onset < as.Date("2014-09-21")) %>%     # cut-off date, for visual clarity
    count(                                              # get case counts per week and by hospital
      epiweek = lubridate::floor_date(date_onset, "week"),  
      hospital                                            
    ) 
  
ggplot(data = epidemic_data) +                    # start plot
  geom_line(                                      # make lines
    aes(
      x = epiweek,                                # x-axis epiweek
      y = n,                                      # height is number of cases per week
      color = fct_reorder2(hospital, epiweek, n))) + # data grouped and colored by hospital, with factor order by height at end of plot
  labs(title = "Factor levels (and legend display) by line height at end of plot",
       color = "Hospital")                        # change legend title

11.5 Missing values

If you have NA values in your factor column, you can easily convert them to a named level such as "Missing" with fct_na_value_to_level() (which replaces the now-deprecated fct_explicit_na()). The NA values are converted to "(Missing)" at the end of the level order by default. You can adjust the level name with the argument level =.

Below, this operation is performed on the column delay_cat and a table is printed with tabyl() with NA converted to "Missing delay".
linelist %>% 
  mutate(delay_cat = fct_na_value_to_level(delay_cat, level = "Missing delay")) %>% 
  tabyl(delay_cat)

     delay_cat    n    percent
      2-5 days 2040 0.34646739
       <2 days 2990 0.50781250
       >5 days  602 0.10224185
 Missing delay  256 0.04347826
11.6 Combine levels

Manually

You can adjust the level displays manually with fct_recode(). This is like the dplyr function recode() (see Cleaning data and core functions), but it allows the creation of new factor levels. If you use the simple recode() on a factor, new re-coded values will be rejected unless they have already been set as permissible levels.

This tool can also be used to "combine" levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).

DANGER: fct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD.

The current levels of delay_cat are:
levels(linelist$delay_cat)

[1] "<2 days"  "2-5 days" ">5 days" 
The new levels are created using syntax fct_recode(column, "new" = "old", "new" = "old", "new" = "old") and printed:

linelist %>% 
  mutate(delay_cat = fct_recode(
    delay_cat,
    "Less than 2 days" = "<2 days",
    "2 to 5 days"      = "2-5 days",
    "More than 5 days" = ">5 days")) %>% 
  tabyl(delay_cat)

        delay_cat    n    percent valid_percent
 Less than 2 days 2990 0.50781250     0.5308949
      2 to 5 days 2040 0.34646739     0.3622159
 More than 5 days  602 0.10224185     0.1068892
             <NA>  256 0.04347826            NA
Here they are manually combined with fct_recode(). Note there is no error raised at the creation of a new level "Less than 5 days".

linelist %>% 
  mutate(delay_cat = fct_recode(
    delay_cat,
    "Less than 5 days" = "<2 days",
    "Less than 5 days" = "2-5 days",
    "More than 5 days" = ">5 days")) %>% 
  tabyl(delay_cat)

        delay_cat    n    percent valid_percent
 Less than 5 days 5030 0.85427989     0.8931108
 More than 5 days  602 0.10224185     0.1068892
             <NA>  256 0.04347826            NA

Reduce into "Other"

You can use fct_other() to manually assign factor levels to an "Other" level. Below, all levels in the column hospital, aside from "Port Hospital" and "Central Hospital", are combined into "Other". You can provide a vector to either keep =, or drop =. You can change the display of the "Other" level with other_level =.
linelist %>%    
  mutate(hospital = fct_other(                      # adjust levels
    hospital,
    keep = c("Port Hospital", "Central Hospital"),  # keep these separate
    other_level = "Other Hospital")) %>%            # all others as "Other Hospital"
  tabyl(hospital)                                   # print table

         hospital    n    percent
 Central Hospital  454 0.07710598
    Port Hospital 1762 0.29925272
   Other Hospital 3672 0.62364130

Reduce by frequency

You can combine the least-frequent factor levels automatically using fct_lump().

To "lump" together many low-frequency levels into an "Other" group, do one of the following:

• Set n = as the number of levels you want to keep. The n most-frequent levels will be kept, and all others will combine into "Other".
• Set prop = as the threshold frequency proportion above which levels will be kept. All other values will combine into "Other".

You can change the display of the "Other" level with other_level =. Below, all but the two most-frequent hospitals are combined into "Other Hospital".
linelist %>%    
  mutate(hospital = fct_lump(                      # adjust levels
    hospital,
    n = 2,                                         # keep top 2 levels
    other_level = "Other Hospital")) %>%           # all others as "Other Hospital"
  tabyl(hospital)                                  # print table

       hospital    n   percent
        Missing 1469 0.2494905
  Port Hospital 1762 0.2992527
 Other Hospital 2657 0.4512568

11.7 Show all levels

One benefit of using factors is to standardise the appearance of plot legends and tables, regardless of which values are actually present in a dataset.

If you are preparing many figures (e.g. for multiple jurisdictions) you will want the legends and tables to appear identically even with varying levels of data completion or data composition.
In plots

In a ggplot() figure, simply add the argument drop = FALSE in the relevant scale_xxxx() function. All factor levels will be displayed, regardless of whether they are present in the data. If your factor column levels are displayed using fill =, then in scale_fill_discrete() you include drop = FALSE, as shown below. If your levels are displayed with x = (to the x-axis), color =, or size =, you would provide this to scale_x_discrete(), scale_color_discrete(), or scale_size_discrete() accordingly.

This example is a stacked bar plot of age category, by hospital. Adding scale_fill_discrete(drop = FALSE) ensures that all age groups appear in the legend, even if not present in the data.
ggplot(data = linelist) +
  geom_bar(mapping = aes(x = hospital, fill = age_cat)) +
  scale_fill_discrete(drop = FALSE) +   # show all age groups in the legend, even those not present
  labs(
    title = "All age groups will appear in legend, even if not present in data")
In tables

Both the base R table() and tabyl() from janitor will show all factor levels (even unused levels).

If you use count() or summarise() from dplyr to make a table, add the argument .drop = FALSE to include counts for all factor levels, even those unused, as in the sketch below.
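A minimal sketch (assuming delay_cat has been converted to class factor as above):

linelist %>% 
  count(delay_cat, .drop = FALSE)   # includes zero-count rows for unused levels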
Read more in the Descriptive tables page, or at the scale_discrete documentation, or the count() documentation. You can see another example in the Contact tracing page.
11.8 Epiweeks

Please see the extensive discussion of how to create epidemiological weeks in the Grouping data page. Also see the Working with dates page for tips on how to create and format epidemiological weeks.
Epiweeks in a plot

If your goal is to create epiweeks to display in a plot, you can do this simply with lubridate's floor_date(), as explained in the Grouping data page. The values returned will be of class Date with format YYYY-MM-DD. If you use this column in a plot, the dates will naturally order correctly, and you do not need to worry about levels or converting to class Factor. See the ggplot() histogram of onset dates below.

In this approach, you can adjust the display of the dates on an axis with scale_x_date(). See the page on Epidemic curves for more information. You can specify a "strptime" display format to the date_labels = argument of scale_x_date(). These formats use "%" placeholders and are covered in the Working with dates page. Use "%Y" to represent a 4-digit year, and either "%W" or "%U" to represent the week number (Monday or Sunday weeks respectively).
linelist %>% 
  mutate(epiweek_date = floor_date(date_onset, "week")) %>%  # create week column
  ggplot() +                                                 # begin ggplot
  geom_histogram(mapping = aes(x = epiweek_date)) +          # histogram of date of onset
  scale_x_date(date_labels = "%Y-W%W")                       # adjust display of dates to be YYYY-WWw

Epiweeks in the data

However, if your purpose in factoring is not to plot, you can approach this one of two ways:

1. For fine control over the display, convert the lubridate epiweek column (YYYY-MM-DD) to the desired display format (YYYY-Www) within the data frame itself, and then convert it to class Factor.

First, use format() from base R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the Working with dates page). In this process the class will be converted to character. Then, convert from character to class Factor with factor().
linelist <- linelist %>% 
  mutate(epiweek_date = floor_date(date_onset, "week"),       # create epiweeks (YYYY-MM-DD)
         epiweek_formatted = format(epiweek_date, "%Y-W%W"),  # Convert to display (YYYY-WWw)
         epiweek_formatted = factor(epiweek_formatted))       # Convert to factor

# Display levels
levels(linelist$epiweek_formatted)

 [1] "2014-W13" "2014-W14" "2014-W15" "2014-W16" "2014-W17" "2014-W18"
 [7] "2014-W19" "2014-W20" "2014-W21" "2014-W22" "2014-W23" "2014-W24"
[13] "2014-W25" "2014-W26" "2014-W27" "2014-W28" "2014-W29" "2014-W30"
[19] "2014-W31" "2014-W32" "2014-W33" "2014-W34" "2014-W35" "2014-W36"
[25] "2014-W37" "2014-W38" "2014-W39" "2014-W40" "2014-W41" "2014-W42"
[31] "2014-W43" "2014-W44" "2014-W45" "2014-W46" "2014-W47" "2014-W48"
[37] "2014-W49" "2014-W50" "2014-W51" "2015-W00" "2015-W01" "2015-W02"
[43] "2015-W03" "2015-W04" "2015-W05" "2015-W06" "2015-W07" "2015-W08"
[49] "2015-W09" "2015-W10" "2015-W11" "2015-W12" "2015-W13" "2015-W14"
[55] "2015-W15" "2015-W16"
DANGER: If you place the weeks ahead of the years ("Www-YYYY") ("%W-%Y"), the default alpha-numeric level ordering will be incorrect (e.g. 01-2015 will be before 35-2014). You would need to manually adjust the order, which would be a long painful process.

2. For fast default display, use the aweek package and its function date2week(). You can set the week_start = day, and if you set factor = TRUE then the output column is an ordered factor. As a bonus, the factor includes levels for all possible weeks in the span - even if there are no cases that week.
df <- linelist %>% 
  mutate(epiweek = date2week(date_onset, week_start = "Monday", factor = TRUE))

levels(df$epiweek)

See the Working with dates page for more information about aweek. It also offers the reverse function week2date().
11.9 Resources

R for Data Science page on factors
aweek package vignette
    - - - - - - - \ No newline at end of file diff --git a/new_pages/factors.qmd b/new_pages/factors.qmd index b740dbbb..b94408da 100644 --- a/new_pages/factors.qmd +++ b/new_pages/factors.qmd @@ -39,7 +39,7 @@ pacman::p_load( We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the "clean" linelist (as .rds file). Import your data with the `import()` function from the **rio** package (it accepts many file types like .xlsx, .rds, .csv - see the [Import and export](importing.qmd) page for details). -```{r, echo=F} +```{r, echo=F, warning=F, message=F} # import the linelist into R linelist <- rio::import(here::here("data", "case_linelists", "linelist_cleaned.rds")) ``` @@ -109,7 +109,7 @@ linelist <- linelist %>% levels(linelist$delay_cat) ``` -The function `fct_relevel()` has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. If you want to create levels that do not exist in the data, use [`fct_expand()` instead](#fct_add)). +The function `fct_relevel()` has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. If you want to create levels that do not exist in the data, use [`fct_expand()` instead](#fct_add). ```{r} linelist <- linelist %>% @@ -518,5 +518,6 @@ See the [Working with dates](dates.qmd) page for more information about **aweek* ## Resources {} -R for Data Science page on [factors](https://r4ds.had.co.nz/factors.html) +R for Data Science page on [factors](https://r4ds.had.co.nz/factors.html) + [aweek package vignette](https://cran.r-project.org/web/packages/aweek/vignettes/introduction.html) diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-12-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-12-1.png deleted file mode 100644 index 3357f8f7..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-12-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-18-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-18-1.png deleted file mode 100644 index f0f7aeb8..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-18-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-18-2.png b/new_pages/factors_files/figure-html/unnamed-chunk-18-2.png deleted file mode 100644 index d1b7f211..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-18-2.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-19-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-19-1.png deleted file mode 100644 index fc4a23ea..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-19-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-19-2.png b/new_pages/factors_files/figure-html/unnamed-chunk-19-2.png deleted file mode 100644 index ea8cb26d..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-19-2.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-20-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-20-1.png deleted file mode 100644 index d8ec035b..00000000 Binary files 
a/new_pages/factors_files/figure-html/unnamed-chunk-20-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-20-2.png b/new_pages/factors_files/figure-html/unnamed-chunk-20-2.png deleted file mode 100644 index b6c7768b..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-20-2.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-21-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-21-1.png deleted file mode 100644 index 427e9202..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-21-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png deleted file mode 100644 index f913527c..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-29-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-30-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-30-1.png deleted file mode 100644 index 223af729..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-30-1.png and /dev/null differ diff --git a/new_pages/factors_files/figure-html/unnamed-chunk-7-1.png b/new_pages/factors_files/figure-html/unnamed-chunk-7-1.png deleted file mode 100644 index c585d0d1..00000000 Binary files a/new_pages/factors_files/figure-html/unnamed-chunk-7-1.png and /dev/null differ diff --git a/new_pages/gis.qmd b/new_pages/gis.qmd index ef3ebb0c..d8de8cb0 100644 --- a/new_pages/gis.qmd +++ b/new_pages/gis.qmd @@ -108,7 +108,7 @@ knitr::include_graphics(here::here("images", "gis_heatmap.png")) # proportional symbols img here ``` -You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large red circles show *health facility catchment areas* of a certain radius, and the bright red case-dots those that were outside any catchment range: +You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large black circles show *health facility catchment areas* of a certain radius, and the bright red case-dots those that were outside any catchment range: ```{r, fig.align = "center", echo=F} knitr::include_graphics(here::here("images", "gis_hf_catchment.png")) @@ -1054,5 +1054,8 @@ knitr::include_graphics(here::here("images", "gis_lmflowchart.jpg")) * **SpatialEpiApp** - a [Shiny app that is downloadable as an R package](https://github.com/Paula-Moraga/SpatialEpiApp), allowing you to provide your own data and conduct mapping, cluster analysis, and spatial statistics. 
+[Spatial Statistics for Data Science: Theory and Practice with R](https://www.paulamoraga.com/book-spatial/index.html) + +[Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny](https://www.paulamoraga.com/book-geospatial/) * An Introduction to Spatial Econometrics in R [workshop](http://www.econ.uiuc.edu/~lab/workshop/Spatial_in_R.html) diff --git a/new_pages/grouping.qmd b/new_pages/grouping.qmd index dac4f421..2b5b5ee8 100644 --- a/new_pages/grouping.qmd +++ b/new_pages/grouping.qmd @@ -40,7 +40,8 @@ pacman::p_load( rio, # to import data here, # to locate files tidyverse, # to clean, handle, and plot the data (includes dplyr) - janitor) # adding total rows and columns + janitor # adding total rows and columns + ) ``` @@ -50,7 +51,7 @@ pacman::p_load( We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the "clean" linelist (as .rds file). The dataset is imported using the `import()` function from the **rio** package. See the page on [Import and export](importing.qmd) for various ways to import data. -```{r, echo=F} +```{r, echo=F, warning=F, message=F} linelist <- rio::import(here("data", "case_linelists", "linelist_cleaned.rds")) ``` @@ -139,7 +140,7 @@ by_outcome_gender <- by_outcome %>% ``` -** Keep all groups** +**Keep all groups** If you group on a column of class factor there may be levels of the factor that are not currently present in the data. If you group on this column, by default those non-present levels are dropped and not included as groups. To change this so that all levels appear as groups (even if not present in the data), set `.drop = FALSE` in your `group_by()` command. @@ -211,7 +212,7 @@ linelist %>% ## Counts and tallies -`count()` and `tally()` provide similar functionality but are different. Read more about the distinction between `tally()` and `count()` [here](https://dplyr.tidyverse.org/reference/count.html) +`count()` and `tally()` provide similar functionality but are different. Read more about the distinction between `tally()` and `count()` [here](https://dplyr.tidyverse.org/reference/count.html). ### `tally()` {.unnumbered} diff --git a/new_pages/importing.html b/new_pages/importing.html deleted file mode 100644 index b0de66ec..00000000 --- a/new_pages/importing.html +++ /dev/null @@ -1,2405 +0,0 @@ - - - - - - - - - -The Epidemiologist R Handbook - 7  Import and export - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7  Import and export
In this page we describe ways to locate, import, and export files:

• Use of the rio package to flexibly import() and export() many types of files.
• Use of the here package to locate files relative to an R project root, to prevent complications from file paths that are specific to one computer.
• Specific import scenarios, such as:
  • Specific Excel sheets.
  • Messy headers and skipping rows.
  • From Google sheets.
  • From data posted to websites.
  • With APIs.
  • Importing the most recent file.
• Manual data entry.
• R-specific file types such as RDS and RData.
• Exporting/saving files and plots.
7.1 Overview

When you import a "dataset" into R, you are generally creating a new data frame object in your R environment and defining it as an imported file (e.g. Excel, CSV, TSV, RDS) that is located in your folder directories at a certain file path/address.

You can import/export many types of files, including those created by other statistical programs (SAS, STATA, SPSS). You can also connect to relational databases.
R even has its own data formats:

• An RDS file (.rds) stores a single R object such as a data frame. These are useful to store cleaned data, as they maintain R column classes. Read more in this section.
• An RData file (.Rdata) can be used to store multiple objects, or even a complete R workspace. Read more in this section.

A quick sketch of saving and re-importing an RDS file follows below.
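A minimal sketch using rio (the file name is an assumption for illustration):

rio::export(linelist, "my_linelist.rds")     # save one object, preserving column classes
linelist <- rio::import("my_linelist.rds")   # read it back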
7.2 The rio package

The R package we recommend is: rio. The name "rio" is an abbreviation of "R I/O" (input/output).

Its functions import() and export() can handle many different file types (e.g. .xlsx, .csv, .rds, .tsv). When you provide a file path to either of these functions (including the file extension like ".csv"), rio will read the extension and use the correct tool to import or export the file.

The alternative to using rio is to use functions from many other packages, each of which is specific to a type of file. For example, read.csv() (base R), read.xlsx() (openxlsx package), and write_csv() (readr package), etc. These alternatives can be difficult to remember, whereas using import() and export() from rio is easy.

rio's functions import() and export() use the appropriate package and function for a given file, based on its file extension. See the end of this page for a complete table of which packages/functions rio uses in the background. It can also be used to import STATA, SAS, and SPSS files, among dozens of other file types.

Import/export of shapefiles requires other packages, as detailed in the page on GIS basics.
7.3 The here package

The package here and its function here() make it easy to tell R where to find and to save your files - in essence, it builds file paths.

Used in conjunction with an R project, here allows you to describe the location of files in your R project in relation to the R project's root directory (the top-level folder). This is useful when the R project may be shared or accessed by multiple people/computers. It prevents complications due to the unique file paths on different computers (e.g. "C:/Users/Laura/Documents...") by "starting" the file path in a place common to all users (the R project root).

This is how here() works within an R project:

• When the here package is first loaded within the R project, it places a small file called ".here" in the root folder of your R project as a "benchmark" or "anchor".
• In your scripts, to reference a file in the R project's sub-folders, you use the function here() to build the file path in relation to that anchor.
• To build the file path, write the names of folders beyond the root, within quotes, separated by commas, finally ending with the file name and file extension as shown below.
• here() file paths can be used for both importing and exporting.
For example, below, the function import() is being provided a file path constructed with here().

linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))

The command here("data", "linelists", "ebola_linelist.xlsx") is actually providing the full file path that is unique to the user's computer:

"C:/Users/Laura/Documents/my_R_project/data/linelists/ebola_linelist.xlsx"
The beauty is that the R command using here() can be successfully run on any computer accessing the R project.

TIP: If you are unsure where the ".here" root is set to, run the function here() with empty parentheses, as shown below.
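For example:

here::here()   # with empty parentheses, prints the path of the project root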
Read more about the here package at this link.
7.4 File paths

When importing or exporting data, you must provide a file path. You can do this one of three ways:

1. Recommended: provide a "relative" file path with the here package.
2. Provide the "full" / "absolute" file path.
3. Manual file selection.

"Relative" file paths

In R, "relative" file paths consist of the file path relative to the root of an R project. They allow for simpler file paths that can work on different computers (e.g. if the R project is on a shared drive or is sent by email). As described above, relative file paths are facilitated by use of the here package.

An example of a relative file path constructed with here() is below. We assume the work is in an R project that contains a sub-folder "data" and within that a subfolder "linelists", in which there is the .xlsx file of interest.

linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))

"Absolute" file paths

Absolute or "full" file paths can be provided to functions like import() but they are "fragile" as they are unique to the user's specific computer and therefore not recommended.

Below is an example of an absolute file path, where in Laura's computer there is a folder "analysis", a sub-folder "data" and within that a sub-folder "linelists", in which there is the .xlsx file of interest.

linelist <- import("C:/Users/Laura/Documents/analysis/data/linelists/ebola_linelist.xlsx")
A few things to note about absolute file paths:

• Avoid using absolute file paths as they will not work if the script is run on a different computer.
• Use forward slashes (/), as in the example above (note: this is NOT the default for Windows file paths).
• File paths that begin with double slashes (e.g. "//…") will likely not be recognized by R and will produce an error. Consider moving your work to a "named" or "lettered" drive that begins with a letter (e.g. "J:" or "C:"). See the page on Directory interactions for more details on this issue.

One scenario where absolute file paths may be appropriate is when you want to import a file from a shared drive that has the same full file path for all users.

TIP: To quickly convert all \ to /, highlight the code of interest, use Ctrl+f (in Windows), check the option box for "In selection", and then use the replace functionality to convert them.
Select file manually

You can import data manually via one of these methods:

1. In the RStudio Environment pane, click "Import Dataset", and select the type of data.
2. Click File / Import Dataset / (select the type of data).
3. To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:

# Manual selection of a file. When this command is run, a POP-UP window will appear. 
# The file path selected will be supplied to the import() command.

my_data <- import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

7.5 Import data

To use import() to import a dataset is quite simple. Simply provide the path to the file (including the file name and file extension) in quotes. If using here() to build the file path, follow the instructions above. Below are a few examples:

Importing a csv file that is located in your "working directory" or in the R project root folder:

linelist <- import("linelist_cleaned.csv")

Importing the first sheet of an Excel workbook that is located in "data" and "linelists" sub-folders of the R project (the file path built using here()):

linelist <- import(here("data", "linelists", "linelist_cleaned.xlsx"))

Importing a data frame (a .rds file) using an absolute file path:

linelist <- import("C:/Users/Laura/Documents/tuberculosis/data/linelists/linelist_cleaned.rds")
Specific Excel sheets

By default, if you provide an Excel workbook (.xlsx) to import(), the workbook's first sheet will be imported. If you want to import a specific sheet, include the sheet name to the which = argument. For example:

my_data <- import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parentheses of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelist.xlsx"), which = "Sheet1")
To export a data frame from R to a specific Excel sheet and have the rest of the Excel workbook remain unchanged, you will have to import, edit, and export with an alternative package catered to this purpose such as openxlsx; a sketch of that workflow follows below. See more information in the page on Directory interactions or at this github page.
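A minimal sketch of that openxlsx workflow (the file and sheet names are assumptions for illustration):

pacman::p_load(openxlsx)

wb <- loadWorkbook("my_workbook.xlsx")                   # read the existing workbook
writeData(wb, sheet = "Data", x = linelist)              # replace the contents of one sheet only
saveWorkbook(wb, "my_workbook.xlsx", overwrite = TRUE)   # other sheets remain unchanged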
If your Excel workbook is .xlsb (binary format Excel workbook) you may not be able to import it using rio. Consider re-saving it as .xlsx, or using a package like readxlsb which is built for this purpose.
Missing values

You may want to designate which value(s) in your dataset should be considered as missing. As explained in the page on Missing data, the value in R for missing data is NA, but perhaps the dataset you want to import uses 99, "Missing", or just empty character space "" instead.

Use the na = argument for import() and provide the value(s) within quotes (even if they are numbers). You can specify multiple values by including them within a vector, using c() as shown below.

Here, the value "99" in the imported dataset is considered missing and converted to NA in R.

linelist <- import(here("data", "my_linelist.xlsx"), na = "99")

Here, any of the values "Missing", "" (empty cell), or " " (single space) in the imported dataset are converted to NA in R.

linelist <- import(here("data", "my_linelist.csv"), na = c("Missing", "", " "))
Skip rows

Sometimes, you may want to avoid importing a row of data. You can do this with the argument skip = if using import() from rio on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row
Unfortunately skip = only accepts one integer value, not a range (e.g. "2:10" does not work). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr - a rough sketch follows below. See also the next section for the related case of removing a second header row.
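A rough sketch of the import-twice-and-bind approach (the file name and row numbers are assumptions for illustration; here spreadsheet rows 11-20 are skipped, with row 1 being the header):

linelist_names <- import("linelist_raw.csv") %>% names()    # store the true column names

upper <- import("linelist_raw.csv") %>% dplyr::slice(1:9)   # data rows above the unwanted block (sheet rows 2-10)
lower <- import("linelist_raw.csv", skip = 20, col.names = linelist_names)  # sheet rows 21 onward

combined <- dplyr::bind_rows(upper, lower)                  # stack the two pieces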

    Manage a second header row

    -

    Sometimes, your data may have a second row, for example if it is a “data dictionary” row as shown below. This situation can be problematic because it can result in all columns being imported as class “character”.

    -
    -
    -
    Warning: The `trust` argument of `import()` should be explicit for serialization formats
    -as of rio 1.0.3.
    -ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
    -ℹ The deprecated feature was likely used in the rio package.
    -  Please report the issue at <https://github.com/gesistsa/rio/issues>.
    -
    -
    -

    Below is an example of this kind of dataset (with the first row being the data dictionary).


    Remove the second header row


    To drop the second header row, you will likely need to import the data twice.

1. Import the data in order to store the correct column names.
2. Import the data again, skipping the first two rows (the header and second rows).
3. Bind the correct names onto the reduced dataframe.

    The exact argument used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). This is because rio is using a different function for the different file types (see table above).


    For Excel files: (col_names =)

# import first time; store the column names
linelist_raw_names <- import("linelist_raw.xlsx") %>% 
     names()  # save true column names

# import second time; skip row 2, and assign column names to argument col_names =
linelist_raw <- import("linelist_raw.xlsx",
                       skip = 2,
                       col_names = linelist_raw_names
                       ) 

    For CSV files: (col.names =)

# import first time; store the column names
linelist_raw_names <- import("linelist_raw.csv") %>% 
     names() # save true column names

# note argument for csv files is 'col.names = '
linelist_raw <- import("linelist_raw.csv",
                       skip = 2,
                       col.names = linelist_raw_names
                       ) 

    Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

    Make a data dictionary


    Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it. This tip is adapted from this post.

dict <- linelist_2headers %>%             # begin: linelist with dictionary as first row
  head(1) %>%                             # keep only column names and first dictionary row
  pivot_longer(cols = everything(),       # pivot all columns to long format
               names_to = "Column",       # assign new column names
               values_to = "Description")

    Combine the two header rows


    In some cases when your raw dataset has two header rows (or more specifically, the 2nd row of data is a secondary header), you may want to “combine” them or add the values in the second header row into the first header row.


    The command below will define the data frame’s column names as the combination (pasting together) of the first (true) headers with the value immediately underneath (in the first row).

names(my_data) <- paste(names(my_data), my_data[1, ], sep = "_")
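After combining the names, you would typically also drop the now-redundant first row; a minimal sketch:

my_data <- my_data[-1, ]   # remove the first row, which held the secondary header values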

    Google sheets


You can import data from an online Google spreadsheet with the googlesheets4 package and by authenticating your access to the spreadsheet.

pacman::p_load("googlesheets4")

Below, a demo Google sheet is imported and saved. This command may prompt you to confirm authentication of your Google account. Follow prompts and pop-ups in your internet browser to grant Tidyverse API packages permissions to edit, create, and delete your spreadsheets in Google Drive.


    The sheet below is “viewable for anyone with the link” and you can try to import it.

Gsheets_demo <- read_sheet("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")

    The sheet can also be imported using only the sheet ID, a shorter part of the URL:

Gsheets_demo <- read_sheet("1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY")

Another package, googledrive, offers useful functions for writing, editing, and deleting Google sheets. The related gs4_create() and sheet_write() functions (from googlesheets4) are useful for creating sheets and writing data to them.
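A minimal sketch (the sheet name is hypothetical; both functions are from googlesheets4 and will prompt for authentication):

new_sheet <- googlesheets4::gs4_create("gsheets_demo")                # create a new, empty Sheet in your Drive
googlesheets4::sheet_write(linelist, ss = new_sheet, sheet = "data")  # write a data frame to a named tab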


Here are some other helpful online tutorials:
Google sheets importing tutorial. More detailed tutorial.
Interaction between the googlesheets4 and tidyverse.


You can also use import() from the rio package.

Gsheets_demo <- import("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")

    7.6 Multiple files - import, export, split, combine


    See the page on Iteration, loops, and lists for examples of how to import and combine multiple files, or multiple Excel workbook files.


    That page also has examples on how to split a data frame into parts and export each one separately, or as named sheets in an Excel workbook.
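As a flavor of what is covered there, below is a minimal sketch that imports all .csv files in a folder and binds them into one data frame (the folder is hypothetical, and the files are assumed to share the same columns):

all_files <- dir(here("data", "linelists"), pattern = "\\.csv$", full.names = TRUE) # list the full file paths
combined  <- all_files %>% 
  purrr::map(import) %>%     # import each file into a list of data frames
  dplyr::bind_rows()         # stack them into one data frame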


    7.7 Import from Github


    Importing data directly from Github into R can be very easy or can require a few steps - depending on the file type. Below are some approaches:


    CSV files


It is easy to import a .csv file directly from Github into R with a single command.

1. Go to the Github repo, locate the file of interest, and click on it.
2. Click on the “Raw” button (you will then see the “raw” csv data).
3. Copy the URL (web address).
4. Place the URL in quotes within the import() R command, as in the sketch below.
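A minimal sketch - the URL below is a placeholder; use the “Raw” URL of your own file:

linelist_raw <- import("https://raw.githubusercontent.com/user/repo/main/linelist_raw.csv")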

    XLSX files


You may not be able to view the “Raw” data for some files (e.g. .xlsx, .rds, .nwk, .shp).

1. Go to the Github repo, locate the file of interest, and click on it.
2. Click the “Download” button.
3. Save the file on your computer, and import it into R, as in the sketch below.
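A minimal sketch of the final step (assuming the downloaded file was saved into the R project’s “data” folder):

linelist_raw <- import(here("data", "linelist_raw.xlsx"))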

    Shapefiles


Shapefiles have many sub-component files, each with a different file extension. One file will have the “.shp” extension, but others may have “.dbf”, “.prj”, etc. To download a shapefile from Github, you will need to download each of the sub-component files individually, and save them in the same folder on your computer. In Github, click on each file individually and download it by clicking on the “Download” button.


    Once saved to your computer you can import the shapefile as shown in the GIS basics page using st_read() from the sf package. You only need to provide the filepath and name of the “.shp” file - as long as the other related files are within the same folder on your computer.
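A minimal sketch (the path is hypothetical - point to wherever you saved the component files):

pacman::p_load(sf)
sle_adm3 <- sf::st_read(here("data", "gis", "sle_adm3.shp"))  # the .dbf, .prj, etc. files are found automatically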


The shapefile “sle_adm3”, for example, consists of many files - each of which must be downloaded from Github.


    7.8 Manual data entry


    Entry by rows


Use the tribble() function from the tibble package, part of the tidyverse (online tibble reference).


    Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.). You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. Spaces do not matter between values, but each row is represented by a new line of code. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
  )

You can then display the new dataset by printing manual_entry_rows.

    Entry by columns


    Since a data frame consists of vectors (vertical columns), the base approach to manual dataframe creation in R expects you to define each column and then bind them together. This can be counter-intuitive in epidemiology, as we usually think about our data in rows (as above).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

    CAUTION: All vectors must be the same length (same number of values).


    The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

You can then display the new dataset by printing manual_entry_cols.

    Pasting from clipboard


    If you copy data from elsewhere and have it on your clipboard, you can try one of the two ways below:


From the clipr package, you can use read_clip_tbl() to import as a data frame, or just read_clip() to import as a character vector. In both cases, leave the parentheses empty.

linelist <- clipr::read_clip_tbl()  # imports current clipboard as data frame
linelist <- clipr::read_clip()      # imports as character vector

    You can also easily export to your system’s clipboard with clipr. See the section below on Export.


Alternatively, you can use the read.table() function from base R with file = "clipboard" to import as a data frame:

df_from_clipboard <- read.table(
  file = "clipboard",  # specify this as "clipboard"
  sep = "\t",          # separator could be tab, or commas, etc.
  header = TRUE)       # if there is a header row

    7.9 Import most recent file


    Often you may receive daily updates to your datasets. In this case you will want to write code that imports the most recent file. Below we present two ways to approach this:

• Selecting the file based on the date in the file name.
• Selecting the file based on file metadata (last modification).

    Dates in file name


    This approach depends on three premises:

1. You trust the dates in the file names.
2. The dates are numeric and appear in generally the same format (e.g. year then month then day).
3. There are no other numbers in the file name.

    We will explain each step, and then show you them combined at the end.


    First, use dir() from base R to extract just the file names for each file in the folder of interest. See the page on Directory interactions for more details about dir(). In this example, the folder of interest is the folder “linelists” within the folder “example” within “data” within the R project.

linelist_filenames <- dir(here("data", "example", "linelists")) # get file names from folder
linelist_filenames                                              # print

[1] "20201007linelist.csv"          "case_linelist_2020-10-02.csv" 
[3] "case_linelist_2020-10-03.csv"  "case_linelist_2020-10-04.csv" 
[5] "case_linelist_2020-10-05.csv"  "case_linelist_2020-10-08.xlsx"
[7] "case_linelist20201006.csv"    

    Once you have this vector of names, you can extract the dates from them by applying str_extract() from stringr using this regular expression. It extracts any numbers in the file name (including any other characters in the middle such as dashes or slashes). You can read more about stringr in the Strings and characters page.

linelist_dates_raw <- stringr::str_extract(linelist_filenames, "[0-9].*[0-9]") # extract numbers and any characters in between
linelist_dates_raw  # print

[1] "20201007"   "2020-10-02" "2020-10-03" "2020-10-04" "2020-10-05"
[6] "2020-10-08" "20201006"  

    Assuming the dates are written in generally the same date format (e.g. Year then Month then Day) and the years are 4-digits, you can use lubridate’s flexible conversion functions (ymd(), dmy(), or mdy()) to convert them to dates. For these functions, the dashes, spaces, or slashes do not matter, only the order of the numbers. Read more in the Working with dates page.

linelist_dates_clean <- lubridate::ymd(linelist_dates_raw)
linelist_dates_clean

[1] "2020-10-07" "2020-10-02" "2020-10-03" "2020-10-04" "2020-10-05"
[6] "2020-10-08" "2020-10-06"

    The base R function which.max() can then be used to return the index position (e.g. 1st, 2nd, 3rd, …) of the maximum date value. The latest file is correctly identified as the 6th file - “case_linelist_2020-10-08.xlsx”.

index_latest_file <- which.max(linelist_dates_clean)
index_latest_file

[1] 6

    If we condense all these commands, the complete code could look like below. Note that the . in the last line is a placeholder for the piped object at that point in the pipe sequence. At that point the value is simply the number 6. This is placed in double brackets to extract the 6th element of the vector of file names produced by dir().

# load packages
pacman::p_load(
  tidyverse,         # data management
  stringr,           # work with strings/characters
  lubridate,         # work with dates
  rio,               # import / export
  here,              # relative file paths
  fs)                # directory interactions

# extract the file name of latest file
latest_file <- dir(here("data", "example", "linelists")) %>%  # file names from "linelists" sub-folder
  str_extract("[0-9].*[0-9]") %>%                  # pull out dates (numbers)
  ymd() %>%                                        # convert numbers to dates (assuming year-month-day format)
  which.max() %>%                                  # get index of max date (latest file)
  dir(here("data", "example", "linelists"))[[.]]   # return the filename of latest linelist

latest_file  # print name of latest file

[1] "case_linelist_2020-10-08.xlsx"

    You can now use this name to finish the relative file path, with here():

here("data", "example", "linelists", latest_file) 

    And you can now import the latest file:

# import
import(here("data", "example", "linelists", latest_file)) # import 

    Use the file info


    If your files do not have dates in their names (or you do not trust those dates), you can try to extract the last modification date from the file metadata. Use functions from the package fs to examine the metadata information for each file, which includes the last modification time and the file path.


    Below, we provide the folder of interest to fs’s dir_info(). In this case, the folder of interest is in the R project in the folder “data”, the sub-folder “example”, and its sub-folder “linelists”. The result is a data frame with one line per file and columns for modification_time, path, etc. You can see a visual example of this in the page on Directory interactions.


    We can sort this data frame of files by the column modification_time, and then keep only the top/latest row (file) with base R’s head(). Then we can extract the file path of this latest file only with the dplyr function pull() on the column path. Finally we can pass this file path to import(). The imported file is saved as latest_file.

latest_file <- dir_info(here("data", "example", "linelists")) %>%  # collect file info on all files in directory
  arrange(desc(modification_time)) %>%      # sort by modification time
  head(1) %>%                               # keep only the top (latest) file
  pull(path) %>%                            # extract only the file path
  import()                                  # import the file

    7.10 APIs


An “Application Programming Interface” (API) can be used to directly request data from a website. APIs are sets of rules that allow one software application to interact with another. The client (you) sends a “request” and receives a “response” containing content. The R packages httr and jsonlite can facilitate this process.


    Each API-enabled website will have its own documentation and specifics to become familiar with. Some sites are publicly available and can be accessed by anyone. Others, such as platforms with user IDs and credentials, require authentication to access their data.


    Needless to say, it is necessary to have an internet connection to import data via API. We will briefly give examples of use of APIs to import data, and link you to further resources.


Note: recall that data may be posted on a website without an API, which may be easier to retrieve. For example a posted CSV file may be accessible simply by providing the site URL to import() as described in the section on importing from Github.


    HTTP request


    The API exchange is most commonly done through an HTTP request. HTTP is Hypertext Transfer Protocol, and is the underlying format of a request/response between a client and a server. The exact input and output may vary depending on the type of API but the process is the same - a “Request” (often HTTP Request) from the user, often containing a query, followed by a “Response”, containing status information about the request and possibly the requested content.


    Here are a few components of an HTTP request:

• The URL of the API endpoint.
• The “Method” (or “Verb”).
• Headers.
• Body.

The HTTP request “method” is the action you want to perform. The two most common HTTP methods are GET and POST, but others include PUT, DELETE, PATCH, etc. When importing data into R it is most likely that you will use GET.


    After your request, your computer will receive a “response” in a format similar to what you sent, including URL, HTTP status (Status 200 is what you want!), file type, size, and the desired content. You will then need to parse this response and turn it into a workable data frame within your R environment.


    Packages


    The httr package works well for handling HTTP requests in R. It requires little prior knowledge of Web APIs and can be used by people less familiar with software development terminology. In addition, if the HTTP response is .json, you can use jsonlite to parse the response.

# load packages
pacman::p_load(httr, jsonlite, tidyverse)

    Publicly-available data


Below is an example of an HTTP request, borrowed from a tutorial from the Trafford Data Lab. That site has several other learning resources and API exercises.


    Scenario: We want to import a list of fast food outlets in the city of Trafford, UK. The data can be accessed from the API of the Food Standards Agency, which provides food hygiene rating data for the United Kingdom.


    Here are the parameters for our request:

• HTTP verb: GET
• API endpoint URL: http://api.ratings.food.gov.uk/Establishments
• Selected parameters: name, address, longitude, latitude, businessTypeId, ratingKey, localAuthorityId
• Headers: “x-api-version”, 2
• Data format(s): JSON, XML
• Documentation: http://api.ratings.food.gov.uk/help

    The R code would be as follows:

# prepare the request
path <- "http://api.ratings.food.gov.uk/Establishments"
request <- GET(url = path,
             query = list(
               localAuthorityId = 188,
               BusinessTypeId = 7844,
               pageNumber = 1,
               pageSize = 5000),
             add_headers("x-api-version" = "2"))

# check for any server error ("200" is good!)
request$status_code

# submit the request, parse the response, and convert to a data frame
response <- content(request, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  pluck("establishments") %>%
  as_tibble()

    You can now clean and use the response data frame, which contains one row per fast food facility.


    Authentication required


    Some APIs require authentication - for you to prove who you are, so you can access restricted data. To import these data, you may need to first use a POST method to provide a username, password, or code. This will return an access token, that can be used for subsequent GET method requests to retrieve the desired data.


    Below is an example of querying data from Go.Data, which is an outbreak investigation tool. Go.Data uses an API for all interactions between the web front-end and smartphone applications used for data collection. Go.Data is used throughout the world. Because outbreak data are sensitive and you should only be able to access data for your outbreak, authentication is required.


    Below is some sample R code using httr and jsonlite for connecting to the Go.Data API to import data on contact follow-up from your outbreak.

# set credentials for authorization
url <- "https://godatasampleURL.int/"           # valid Go.Data instance url
username <- "username"                          # valid Go.Data username 
password <- "password"                          # valid Go.Data password 
outbreak_id <- "xxxxxx-xxxx-xxxx-xxxx-xxxxxxx"  # valid Go.Data outbreak ID

# get access token
url_request <- paste0(url,"api/oauth/token?access_token=123") # define base URL request

# prepare request
response <- POST(
  url = url_request,  
  body = list(
    username = username,    # use saved username/password from above to authorize
    password = password),
    encode = "json")

# execute request and parse response
content <-
  content(response, as = "text") %>%
  fromJSON(flatten = TRUE) %>%          # flatten nested JSON
  glimpse()

# save access token from response
access_token <- content$access_token    # save access token to allow subsequent API calls below

# import outbreak contacts, using the access token
response_contacts <- GET(
  paste0(url,"api/outbreaks/",outbreak_id,"/contacts"),          # GET request
  add_headers(
    Authorization = paste("Bearer", access_token, sep = " ")))

json_contacts <- content(response_contacts, as = "text")         # convert to text JSON

contacts <- as_tibble(fromJSON(json_contacts, flatten = TRUE))   # flatten JSON to tibble

    CAUTION: If you are importing large amounts of data from an API requiring authentication, it may time-out. To avoid this, retrieve access_token again before each API GET request and try using filters or limits in the query.


TIP: The fromJSON() function in the jsonlite package does not fully un-nest the first time it is executed, so you will likely still have list items in your resulting tibble. You will need to further un-nest certain variables, depending on how nested your .json is. For more information, view the documentation for the jsonlite package, such as the flatten() function.
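For example, a minimal sketch of further un-nesting with tidyr (“addresses” is a hypothetical list-column name - substitute one from your own data):

contacts_flat <- contacts %>% 
  tidyr::unnest(cols = c(addresses), names_sep = "_")  # expand the list-column into regular columns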


For more details, view the documentation on LoopBack Explorer, the Contact Tracing page, or the API tips on the Go.Data Github repository.


You can read more about the httr package here.


This section was also informed by two other online tutorials.


    7.11 Export


    With rio package


    With rio, you can use the export() function in a very similar way to import(). First give the name of the R object you want to save (e.g. linelist) and then in quotes put the file path where you want to save the file, including the desired file name and file extension. For example:


    This saves the data frame linelist as an Excel workbook to the working directory/R project root folder:

export(linelist, "my_linelist.xlsx") # will save to working directory

    You could save the same data frame as a csv file by changing the extension. For example, we also save it to a file path constructed with here():

export(linelist, here("data", "clean", "my_linelist.csv"))

    To clipboard


    To export a data frame to your computer’s “clipboard” (to then paste into another software like Excel, Google Spreadsheets, etc.) you can use write_clip() from the clipr package.

# export the linelist data frame to your system's clipboard
clipr::write_clip(linelist)

    7.12 RDS files


    Along with .csv, .xlsx, etc, you can also export (save) R data frames as .rds files. This is a file format specific to R, and is very useful if you know you will work with the exported data again in R.


The classes of columns are stored, so you don’t have to do the cleaning again when it is imported (with an Excel or even a CSV file this can be a headache!). It is also a smaller file, which is useful for export and import if your dataset is large.


    For example, if you work in an Epidemiology team and need to send files to a GIS team for mapping, and they use R as well, just send them the .rds file! Then all the column classes are retained and they have less work to do.

export(linelist, here("data", "clean", "my_linelist.rds"))

    7.13 Rdata files and lists


    .Rdata files can store multiple R objects - for example multiple data frames, model results, lists, etc. This can be very useful to consolidate or share a lot of your data for a given project.


    In the below example, multiple R objects are stored within the exported file “my_objects.Rdata”:

rio::export(
  list(my_list = my_list,            # a named list: each element is saved as its own object
       my_dataframe = my_dataframe,
       my_vector = my_vector),
  "my_objects.Rdata")

    Note: if you are trying to import a list, use import_list() from rio to import it with the complete original structure and contents.

rio::import_list("my_list.Rdata")

    7.14 Saving plots


    Instructions on how to save plots, such as those created by ggplot(), are discussed in depth in the ggplot basics page.


    In brief, run ggsave("my_plot_filepath_and_name.png") after printing your plot. You can either provide a saved plot object to the plot = argument, or only specify the destination file path (with file extension) to save the most recently-displayed plot. You can also control the width =, height =, units =, and dpi =.
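For example, a minimal sketch (the plot object and file path are hypothetical):

ggsave(here("outputs", "epicurve.png"),
       plot = my_plot,                  # omit this to save the most recently-displayed plot
       width = 8, height = 5,
       units = "in", dpi = 300)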


    How to save a network graph, such as a transmission tree, is addressed in the page on Transmission chains.


    7.15 Resources


R Data Import/Export Manual
R 4 Data Science chapter on data import
ggsave() documentation


    Below is a table, taken from the rio online vignette. For each type of data it shows: the expected file extension, the package rio uses to import or export the data, and whether this functionality is included in the default installed version of rio.

| Format | Typical Extension | Import Package | Export Package | Installed by Default |
|---|---|---|---|---|
| Comma-separated data | .csv | data.table fread() | data.table | Yes |
| Pipe-separated data | .psv | data.table fread() | data.table | Yes |
| Tab-separated data | .tsv | data.table fread() | data.table | Yes |
| SAS | .sas7bdat | haven | haven | Yes |
| SPSS | .sav | haven | haven | Yes |
| Stata | .dta | haven | haven | Yes |
| SAS XPORT | .xpt | haven | haven | Yes |
| SPSS Portable | .por | haven |  | Yes |
| Excel | .xls | readxl |  | Yes |
| Excel | .xlsx | readxl | openxlsx | Yes |
| R syntax | .R | base | base | Yes |
| Saved R objects | .RData, .rda | base | base | Yes |
| Serialized R objects | .rds | base | base | Yes |
| Epiinfo | .rec | foreign |  | Yes |
| Minitab | .mtp | foreign |  | Yes |
| Systat | .syd | foreign |  | Yes |
| “XBASE” database files | .dbf | foreign | foreign | Yes |
| Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes |
| Data Interchange Format | .dif | utils |  | Yes |
| Fortran data | no recognized extension | utils |  | Yes |
| Fixed-width format data | .fwf | utils | utils | Yes |
| gzip comma-separated data | .csv.gz | utils | utils | Yes |
| CSVY (CSV + YAML metadata header) | .csvy | csvy | csvy | No |
| EViews | .wf1 | hexView |  | No |
| Feather R/Python interchange format | .feather | feather | feather | No |
| Fast Storage | .fst | fst | fst | No |
| JSON | .json | jsonlite | jsonlite | No |
| Matlab | .mat | rmatio | rmatio | No |
| OpenDocument Spreadsheet | .ods | readODS | readODS | No |
| HTML Tables | .html | xml2 | xml2 | No |
| Shallow XML documents | .xml | xml2 | xml2 | No |
| YAML | .yml | yaml | yaml | No |
| Clipboard | default is tsv | clipr | clipr | No |

    5  Suggested packages


    Below is a long list of suggested packages for common epidemiological work in R. You can copy this code, run it, and all of these packages will install from CRAN and load for use in the current R session. If a package is already installed, it will be loaded for use only.


    You can modify the code with # symbols to exclude any packages you do not want.


    Of note:

• Install the pacman package first before running the below code. You can do this with install.packages("pacman"). In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use in the current R session. You can also load packages that are already installed with library() from base R.
• In the code below, packages that are included when installing/loading another package are indicated by an indent and hash - for example, see how ggplot2 is listed under tidyverse.
• If multiple packages have functions with the same name, masking can occur when the function from the more recently-loaded package takes precedence. Read more in the R basics page. Consider using the package conflicted to manage such conflicts.
• See the R basics section on packages for more information on pacman and masking.

    To see the versions of R, RStudio, and R packages used during the production of this handbook, see the page on Editorial and technical notes.


    5.1 Packages from CRAN

##########################################
# List of useful epidemiology R packages #
##########################################

# This script uses the p_load() function from pacman R package, 
# which installs if package is absent, and loads for use if already installed


# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")


# Packages available from CRAN
##############################
pacman::p_load(
     
     # learning R
     ############
     learnr,   # interactive tutorials in RStudio Tutorial pane
     swirl,    # interactive tutorials in R console
        
     # project and file management
     #############################
     here,     # file paths relative to R project root folder
     rio,      # import/export of many types of data
     openxlsx, # import/export of multi-sheet Excel workbooks 
     
     # package install and management
     ################################
     pacman,   # package install/load
     renv,     # managing versions of packages when working in collaborative groups
     remotes,  # install from github
     
     # General data management
     #########################
     tidyverse,    # includes many packages for tidy data wrangling and presentation
          #dplyr,      # data management
          #tidyr,      # data management
          #ggplot2,    # data visualization
          #stringr,    # work with strings and characters
          #forcats,    # work with factors 
          #lubridate,  # work with dates
          #purrr       # iteration and working with lists
     linelist,     # cleaning linelists
     naniar,       # assessing missing data
     
     # statistics  
     ############
     janitor,      # tables and data cleaning
     gtsummary,    # making descriptive and statistical tables
     rstatix,      # quickly run statistical tests and summaries
     broom,        # tidy up results from regressions
     lmtest,       # likelihood-ratio tests
     easystats,
          # parameters, # alternative to tidy up results from regressions
          # see,        # alternative to visualise forest plots 
     
     # epidemic modeling
     ###################
     epicontacts,  # Analysing transmission networks
     EpiNow2,      # Rt estimation
     EpiEstim,     # Rt estimation
     projections,  # Incidence projections
     incidence2,   # Make epicurves and handle incidence data
     i2extras,     # Extra functions for the incidence2 package
     epitrix,      # Useful epi functions
     distcrete,    # Discrete delay distributions
     
     
     # plots - general
     #################
     #ggplot2,         # included in tidyverse
     patchwork,        # combining plots
     RColorBrewer,     # color scales
     ggnewscale,       # to add additional layers of color schemes
     
     # plots - specific types
     ########################
     DiagrammeR,       # diagrams using DOT language
     incidence2,       # epidemic curves
     gghighlight,      # highlight a subset
     ggrepel,          # smart labels
     plotly,           # interactive graphics
     gganimate,        # animated graphics 
     
     # gis
     ######
     sf,               # to manage spatial data using a Simple Feature format
     tmap,             # to produce simple maps, works for both interactive and static maps
     OpenStreetMap,    # to add OSM basemap in ggplot map
     spdep,            # spatial statistics 
     
     # routine reports
     #################
     rmarkdown,        # produce PDFs, Word Documents, Powerpoints, and HTML files
     reportfactory,    # auto-organization of R Markdown outputs
     officer,          # powerpoints
     
     # dashboards
     ############
     flexdashboard,    # convert an R Markdown script into a dashboard
     shiny,            # interactive web apps
     
     # tables for presentation
     #########################
     knitr,            # R Markdown report generation and html tables
     flextable,        # HTML tables
     #DT,              # HTML tables (alternative)
     #gt,              # HTML tables (alternative)
     #huxtable,        # HTML tables (alternative) 
     
     # phylogenetics
     ###############
     ggtree,           # visualization and annotation of trees
     ape,              # analysis of phylogenetics and evolution
     treeio            # to visualize phylogenetic files

)

    5.2 Packages from Github


Below are commands to install two packages directly from Github repositories.

• The development version of epicontacts contains the ability to make transmission trees with a temporal x-axis.
• The epirhandbook package contains all the example data for this handbook and can be used to download the offline version of the handbook.
# Packages to download from Github (not available on CRAN)
##########################################################

# Development version of epicontacts (for transmission chains with a time x-axis)
pacman::p_install_gh("reconhub/epicontacts@timeline")

# The package for this handbook, which includes all the example data  
pacman::p_install_gh("appliedepi/epirhandbook")
    4  Transition to R


    Below, we provide some advice and resources if you are transitioning to R.


    R was introduced in the late 1990s and has since grown dramatically in scope. Its capabilities are so extensive that commercial alternatives have reacted to R developments in order to stay competitive! (read this article comparing R, SPSS, SAS, STATA, and Python).


    Moreover, R is much easier to learn than it was 10 years ago. Previously, R had a reputation of being difficult for beginners. It is now much easier with friendly user-interfaces like RStudio, intuitive code like the tidyverse, and many tutorial resources.


    Do not be intimidated - come discover the world of R!


    4.1 From Excel


    Transitioning from Excel directly to R is a very achievable goal. It may seem daunting, but you can do it!


It is true that someone with strong Excel skills can do very advanced work in Excel alone - even using scripting tools like VBA. Excel is used across the world and is an essential tool for an epidemiologist. However, complementing it with R can dramatically improve and expand your workflows.


    Benefits


You will find that R offers immense benefits in time saved, more consistent and accurate analysis, reproducibility, shareability, and faster error-correction. Like any new software, there is a learning curve: time you must invest to become familiar. The dividends are significant, and an immense scope of new possibilities will open to you with R.


Excel is well-known software that a beginner can use to produce simple analyses and visualizations by "point-and-click". In comparison, it can take a couple of weeks to become comfortable with R's functions and interface. However, R has evolved in recent years to become much friendlier to beginners.


    Many Excel workflows rely on memory and on repetition - this means there is much opportunity for error. Furthermore, generally the data cleaning, analysis methodology, and equations used are hidden from view. It can require substantial time for a new colleague to learn what an Excel workbook is doing and how to troubleshoot it. With R, all the steps are explicitly written in the script and can be easily viewed, edited, corrected, and applied to other datasets.


    To begin your transition from Excel to R you must adjust your mindset in a few important ways:


    Tidy data


    Use machine-readable “tidy” data instead of messy “human-readable” data. These are the three main requirements for “tidy” data, as explained in this tutorial on “tidy” data in R:

• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
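
As a minimal sketch (the district and month columns are invented), here is how a messy one-column-per-month layout can be reshaped into tidy format with **tidyr**:

```r
pacman::p_load(tidyr, dplyr)

# A messy, Excel-style layout: one column per month
messy <- tibble::tribble(
  ~district, ~jan_cases, ~feb_cases,
  "North",   12,         19,
  "South",   7,          15
)

# Tidy: one row per district-month observation
tidy <- messy %>%
  pivot_longer(
    cols      = ends_with("_cases"),
    names_to  = "month",
    values_to = "cases"
  )

tidy
```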

    To Excel users - think of the role that Excel “tables” play in standardizing data and making the format more predictable.


An example of "tidy" data would be the case linelist used throughout this handbook - each variable is contained within one column, each observation (one case) has its own row, and every value is in just one cell. Below you can view the first 50 rows of the linelist:

(Interactive table showing the first 50 rows of the linelist omitted.)

    The main reason you might encounter non-tidy data is because many Excel spreadsheets are designed to prioritize easy reading by humans, not easy reading by machines/software.


    To help you see the difference, below are some fictional examples of non-tidy data that prioritize human-readability over machine-readability:

(Screenshot omitted: a non-tidy spreadsheet using merged cells and a color-based legend.)

Problems: In the spreadsheet above, there are merged cells, which are not easily digested by R. It is not clear which row should be considered the "header". A color-based dictionary sits to the right, and cell values are represented by colors - which is also not easily interpreted by R (nor by humans with color-blindness!). Furthermore, different pieces of information are combined into one cell (multiple partner organizations working in one area, or the status "TBC" in the same cell as "Partner D").

(Screenshot omitted: a non-tidy spreadsheet with scattered empty rows and columns, and GPS coordinates split across rows.)

    Problems: In the spreadsheet above, there are numerous extra empty rows and columns within the dataset - this will cause cleaning headaches in R. Furthermore, the GPS coordinates are spread across two rows for a given treatment center. As a side note - the GPS coordinates are in two different formats!


"Tidy" datasets may not be as readable to a human eye, but they make data cleaning and analysis much easier! Tidy data can be stored in various formats, for example "long" or "wide" (see the page on Pivoting data), but the principles above are still observed.


    Functions


    The R word “function” might be new, but the concept exists in Excel too as formulas. Formulas in Excel also require precise syntax (e.g. placement of semicolons and parentheses). All you need to do is learn a few new functions and how they work together in R.
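
For instance, a minimal sketch (with invented values) of how an Excel formula maps onto an R function:

```r
# Excel: =AVERAGE(B2:B6)
ages <- c(12, 45, 23, 67, 34)  # the values that would sit in cells B2:B6
mean(ages, na.rm = TRUE)       # na.rm = TRUE ignores missing values
```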


    Scripts


    Instead of clicking buttons and dragging cells you will be writing every step and procedure into a “script”. Excel users may be familiar with “VBA macros” which also employ a scripting approach.


The R script consists of step-by-step instructions. This allows any colleague to read the script and easily see the steps you took. It also helps to debug errors or inaccurate calculations. See the R basics section on scripts for examples.


    Here is an example of an R script:

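(Original screenshot omitted.) A minimal sketch of such a script, with a hypothetical file path and column names:

```r
# Load packages
pacman::p_load(rio, here, dplyr)

# Import the raw data (hypothetical file)
linelist <- import(here("data", "linelist_raw.xlsx"))

# Clean: convert the onset date and drop rows missing an ID
linelist <- linelist %>%
  mutate(date_onset = as.Date(date_onset)) %>%
  filter(!is.na(case_id))

# Summarise: case counts by hospital
linelist %>%
  count(hospital)
```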

    Excel-to-R resources


There are many online tutorials designed to help you transition to R from Excel.


    R-Excel interaction


    R has robust ways to import Excel workbooks, work with the data, export/save Excel files, and work with the nuances of Excel sheets.


    It is true that some of the more aesthetic Excel formatting can get lost in translation (e.g. italics, sideways text, etc.). If your work flow requires passing documents back-and-forth between R and Excel while retaining the original Excel formatting, try packages such as openxlsx.
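
For instance, a minimal sketch with **openxlsx** (file and sheet names are hypothetical) that updates the data in an existing workbook while keeping its styles:

```r
pacman::p_load(openxlsx)

# Load an existing workbook so its styles and layout are kept
wb <- loadWorkbook("report_template.xlsx")

# Replace the contents of one sheet with a data frame
writeData(wb, sheet = "data", x = linelist)

# Save the updated workbook under a new name
saveWorkbook(wb, "report_filled.xlsx", overwrite = TRUE)
```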


    4.2 From Stata


    Coming to R from Stata


Many epidemiologists are first taught how to use Stata, and it can seem daunting to move into R. However, if you are a comfortable Stata user, the jump into R is more manageable than you might think. While there are some key differences between Stata and R in how data can be created and modified, as well as how analysis functions are implemented, after learning these key differences you will be able to translate your skills.


Below are some key translations between Stata and R, which may be handy as you review this guide.


    General notes

| STATA | R |
|---|---|
| You can only view and manipulate one dataset at a time | You can view and manipulate multiple datasets at the same time, so you will frequently have to specify your dataset within the code |
| Online community available through https://www.statalist.org/ | Online community available through RStudio, StackOverFlow, and R-bloggers |
| Point-and-click functionality as an option | Minimal point-and-click functionality |
| Help for commands available by `help [command]` | Help available by `?[function]` or searching in the Help pane |
| Comment code using `*` or `///` or `/* TEXT */` | Comment code using `#` |
| Almost all commands are built-in to Stata. New/user-written functions can be installed as ado files using `ssc install [package]` | R installs with base functions, but typical use involves installing other packages from CRAN (see the page on R basics) |
| Analysis is usually written in a do file | Analysis is written in an R script in the RStudio source pane. R markdown scripts are an alternative |

    Working directory

| STATA | R |
|---|---|
| Working directories involve absolute filepaths (e.g. "C:/username/documents/projects/data/") | Working directories can be either absolute, or relative to a project root folder by using the **here** package (see Import and export) |
| See the current working directory with `pwd` | Use `getwd()` or `here()` (if using the **here** package), with empty parentheses |
| Set the working directory with `cd "folder location"` | Use `setwd("folder location")`, or `set_here("folder location")` (if using the **here** package) |

    Importing and viewing data

| STATA | R |
|---|---|
| Specific commands per file type | Use `import()` from the **rio** package for almost all file types. Specific functions exist as alternatives (see Import and export) |
| Reading in csv files is done by `import delimited "filename.csv"` | Use `import("filename.csv")` |
| Reading in xlsx files is done by `import excel "filename.xlsx"` | Use `import("filename.xlsx")` |
| Browse your data in a new window using the command `browse` | View a dataset in the RStudio source pane using `View(dataset)`. You need to specify your dataset name to the function because multiple datasets can be held at the same time. Note the capital "V" in this function |
| Get a high-level overview of your dataset using `summarize`, which provides the variable names and basic information | Get a high-level overview of your dataset using `summary(dataset)` |

    Basic data manipulation

| STATA | R |
|---|---|
| Dataset columns are often referred to as "variables" | More often referred to as "columns" or sometimes as "vectors" or "variables" |
| No need to specify the dataset | In each of the commands below, you need to specify the dataset - see the page on Cleaning data and core functions for examples |
| New variables are created using the command `generate varname =` | Generate new variables using the function `mutate(varname = )`. See the page on Cleaning data and core functions for details on all the **dplyr** functions below |
| Variables are renamed using `rename old_name new_name` | Columns can be renamed using the function `rename(new_name = old_name)` |
| Variables are dropped using `drop varname` | Columns can be removed using the function `select()` with the column name in the parentheses following a minus sign |
| Factor variables can be labeled using a series of commands such as `label define` | Labeling values can be done by converting the column to Factor class and specifying levels. See the page on Factors. Column names are not typically labeled as they are in Stata |
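
Putting a few of these translations together, a minimal sketch (the dataset and column names are invented):

```r
pacman::p_load(dplyr)

linelist <- linelist %>%
  mutate(age_months = age_years * 12) %>%  # Stata: generate age_months = age_years * 12
  rename(sex = gender) %>%                 # Stata: rename gender sex
  select(-row_id)                          # Stata: drop row_id
```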

    Descriptive analysis

| STATA | R |
|---|---|
| Tabulate counts of a variable using `tab varname` | Provide the dataset and column name to `table()`, such as `table(dataset$colname)`. Alternatively, use `count(varname)` from the **dplyr** package, as explained in Grouping data |
| Cross-tabulation of two variables in a 2x2 table is done with `tab varname1 varname2` | Use `table(dataset$varname1, dataset$varname2)` or `count(varname1, varname2)` |
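
A short sketch of these translations side by side (column names invented):

```r
pacman::p_load(dplyr)

# Stata: tab outcome
table(linelist$outcome)

# Stata: tab sex outcome
table(linelist$sex, linelist$outcome)

# tidyverse alternative, returning a data frame
linelist %>% count(sex, outcome)
```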

While this list gives an overview of the basics in translating Stata commands into R, it is not exhaustive. There are many other great resources online for Stata users transitioning to R.


    4.3 From SAS


    Coming from SAS to R


SAS is commonly used in public health agencies and academic research. Although transitioning to a new language is rarely a simple process, understanding the key differences between SAS and R can help you start to navigate the new language using your native one. The tables below outline the key translations in data management and descriptive analysis between SAS and R.


    General notes

| SAS | R |
|---|---|
| Online community available through SAS Customer Support | Online community available through RStudio, StackOverFlow, and R-bloggers |
| Help for commands available by `help [command]` | Help available by `?[function]` or searching in the Help pane |
| Comment code using `* TEXT ;` or `/* TEXT */` | Comment code using `#` |
| Almost all commands are built-in. Users can write new functions using SAS macro, SAS/IML, SAS Component Language (SCL), and most recently, the procedures `Proc Fcmp` and `Proc Proto` | R installs with base functions, but typical use involves installing other packages from CRAN (see the page on R basics) |
| Analysis is usually conducted by writing a SAS program in the Editor window | Analysis is written in an R script in the RStudio source pane. R markdown scripts are an alternative |

    Working directory

| SAS | R |
|---|---|
| Working directories can be either absolute, or relative to a project root folder by defining the root folder using `%let rootdir=/root path; %include "&rootdir/subfoldername/filename"` | Working directories can be either absolute, or relative to a project root folder by using the **here** package (see Import and export) |
| See the current working directory with `%put %sysfunc(getoption(work));` | Use `getwd()` or `here()` (if using the **here** package), with empty parentheses |
| Set the working directory with `libname "folder location"` | Use `setwd("folder location")`, or `set_here("folder location")` if using the **here** package |

    Importing and viewing data

| SAS | R |
|---|---|
| Use the `Proc Import` procedure or the Data Step `Infile` statement | Use `import()` from the **rio** package for almost all file types. Specific functions exist as alternatives (see Import and export) |
| Reading in csv files is done by `proc import datafile="filename.csv" out=work.filename dbms=CSV; run;` OR using a Data Step `Infile` statement | Use `import("filename.csv")` |
| Reading in xlsx files is done by `proc import datafile="filename.xlsx" out=work.filename dbms=xlsx; run;` OR using a Data Step `Infile` statement | Use `import("filename.xlsx")` |
| Browse your data in a new window by opening the Explorer window and selecting the desired library and dataset | View a dataset in the RStudio source pane using `View(dataset)`. You need to specify your dataset name to the function because multiple datasets can be held at the same time. Note the capital "V" in this function |

    Basic data manipulation

| SAS | R |
|---|---|
| Dataset columns are often referred to as "variables" | More often referred to as "columns" or sometimes as "vectors" or "variables" |
| No special procedures are needed to create a variable. New variables are created simply by typing the new variable name, followed by an equal sign, and then an expression for the value | Generate new variables using the function `mutate()`. See the page on Cleaning data and core functions for details on all the **dplyr** functions below |
| Variables are renamed using `rename old_name=new_name` | Columns can be renamed using the function `rename(new_name = old_name)` |
| Variables are kept using `keep=varname` | Columns can be selected using the function `select()` with the column name in the parentheses |
| Variables are dropped using `drop=varname` | Columns can be removed using the function `select()` with the column name in the parentheses following a minus sign |
| Factor variables can be labeled in the Data Step using the `Label` statement | Labeling values can be done by converting the column to Factor class and specifying levels. See the page on Factors. Column names are not typically labeled |
| Records are selected using the `Where` or `If` statement in the Data Step. Multiple selection conditions are separated using the "and" command | Records are selected using the function `filter()` with multiple selection conditions separated either by an AND operator (`&`) or a comma |
| Datasets are combined using the `Merge` statement in the Data Step. The datasets to be merged need to be sorted first using the `Proc Sort` procedure | The **dplyr** package offers a few functions for merging datasets. See the Joining data page for details |
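
For instance, a minimal sketch of the Where/If and Merge translations (the datasets and column names are invented):

```r
pacman::p_load(dplyr)

# SAS: data adults_f; set linelist; where age > 18 and sex = "f"; run;
adults_f <- linelist %>%
  filter(age > 18 & sex == "f")

# SAS: proc sort + merge -> a dplyr join; no pre-sorting is required
combined <- left_join(linelist, hospital_info, by = "hospital")
```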

    Descriptive analysis

| SAS | R |
|---|---|
| Get a high-level overview of your dataset using the `Proc Summary` procedure, which provides the variable names and descriptive statistics | Get a high-level overview of your dataset using `summary(dataset)` or `skim(dataset)` from the **skimr** package |
| Tabulate counts of a variable using `proc freq data=Dataset; tables varname; run;` | See the page on Descriptive tables. Options include `table()` from **base** R and `tabyl()` from the **janitor** package, among others. Note that you will need to specify the dataset and column name, as R holds multiple datasets |
| Cross-tabulation of two variables in a 2x2 table is done with `proc freq data=Dataset; tables rowvar*colvar; run;` | Again, you can use `table()`, `tabyl()`, or other options as described in the Descriptive tables page |
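
A short sketch of the tabulation translations (column names invented):

```r
pacman::p_load(janitor, dplyr)

# SAS: proc freq data=linelist; tables outcome; run;
linelist %>% tabyl(outcome)

# SAS: proc freq data=linelist; tables sex*outcome; run;
linelist %>% tabyl(sex, outcome)
```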

    Some useful resources:

• SAS for R Users: A Book for Data Scientists (2019)
• Analyzing Health Data in R for SAS Users (2018)
• R for SAS and SPSS Users (2011)
• SAS and R, Second Edition (2014)

    4.4 Data interoperability


See the Import and export page for details on how the R package **rio** can import and export files such as STATA .dta files, SAS .xpt and .sas7bdat files, SPSS .por and .sav files, and many others.
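
A minimal sketch (file names are hypothetical) - **rio** chooses the reader or writer from the file extension:

```r
pacman::p_load(rio)

stata_data <- import("study.dta")       # Stata
sas_data   <- import("cases.sas7bdat")  # SAS
spss_data  <- import("survey.sav")      # SPSS

export(stata_data, "study.csv")         # write out to any supported format
```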

\ No newline at end of file

diff --git a/new_pages/transition_to_R.qmd b/new_pages/transition_to_R.qmd
index c0adec61..b22eefe1 100644
--- a/new_pages/transition_to_R.qmd
+++ b/new_pages/transition_to_R.qmd
@@ -51,12 +51,12 @@ To Excel users - think of the role that [Excel "tables"](https://exceljet.net/ex
 
 An example of "tidy" data would be the case linelist used throughout this handbook - each variable is contained within one column, each observation (one case) has it's own row, and every value is in just one cell. Below you can view the first 50 rows of the linelist:
 
-```{r, echo=F}
+```{r, echo=F, warning = F, message=F}
 # import the linelist into R
 linelist <- rio::import(here::here("data", "case_linelists", "linelist_cleaned.rds"))
 ```
 
-```{r, message=FALSE, echo=F}
+```{r, echo=F, warning = F, message=F}
 # display the linelist data as a table
 DT::datatable(head(linelist, 50), rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T), class = 'white-space: nowrap' )
 ```
@@ -268,7 +268,7 @@ Cross-tabulation of two variables in a 2x2 table is done with `proc freq data=Da
 
 ## Data interoperability
 
-see the [Import and export](importing.qmd) page for details on how the R package **rio** can import and export files such as STATA .dta files, SAS .xpt and.sas7bdat files, SPSS .por and.sav files, and many others.
+See the [Import and export](importing.qmd) page for details on how the R package **rio** can import and export files such as STATA .dta files, SAS .xpt and.sas7bdat files, SPSS .por and.sav files, and many others.