Commit

Large scale updates to numerous chapters filling in issues raised on github
arranhamlet committed Sep 18, 2024
1 parent 224d203 commit ebfa021
Showing 20 changed files with 1,995 additions and 185 deletions.
13 changes: 7 additions & 6 deletions html_outputs/new_pages/ggplot_tips.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions html_outputs/search.json
@@ -2892,7 +2892,7 @@
"href": "new_pages/ggplot_tips.html#highlighting",
"title": "31  ggplot tips",
"section": "31.8 Highlighting",
"text": "31.8 Highlighting\nHighlighting specific elements in a chart is a useful way to draw attention to a specific instance of a variable while also providing information on the dispersion of the full dataset. While this is not easily done in base ggplot2, there is an external package that can help to do this known as gghighlight. This is easy to use within the ggplot syntax.\nThe gghighlight package uses the gghighlight() function to achieve this effect. To use this function, supply a logical statement to the function - this can have quite flexible outcomes, but here we’ll show an example of the age distribution of cases in our linelist, highlighting them by outcome.\n\n# load gghighlight\npacman::p_load(gghighlight)\n\n# replace NA values with unknown in the outcome variable\nlinelist <- linelist %>%\n mutate(outcome = replace_na(outcome, \"Unknown\"))\n\n# produce a histogram of all cases by age\nggplot(\n data = linelist,\n mapping = aes(x = age_years, fill = outcome)) +\n geom_histogram() + \n gghighlight::gghighlight(outcome == \"Death\") # highlight instances where the patient has died.\n\n\n\n\n\n\n\n\nThis also works well with faceting functions - it allows the user to produce facet plots with the background data highlighted that doesn’t apply to the facet! Below we count cases by week and plot the epidemic curves by hospital (color = and facet_wrap() set to hospital column).\n\n# produce a histogram of all cases by age\nlinelist %>% \n count(week = lubridate::floor_date(date_hospitalisation, \"week\"),\n hospital) %>% \n ggplot() +\n geom_line(mapping = aes(x = week, \n y = n, \n color = hospital)) +\n theme_minimal() +\n gghighlight::gghighlight() + # highlight instances where the patient has died\n facet_wrap(~hospital) # make facets by outcome",
"text": "31.8 Highlighting\nHighlighting specific elements in a chart is a useful way to draw attention to a specific instance of a variable while also providing information on the dispersion of the full dataset. While this is not easily done in base ggplot2, there is an external package that can help to do this known as gghighlight. This is easy to use within the ggplot syntax.\nThe gghighlight package uses the gghighlight() function to achieve this effect. To use this function, supply a logical statement to the function - this can have quite flexible outcomes, but here we’ll show an example of the age distribution of cases in our linelist, highlighting them by outcome.\n\n# load gghighlight\npacman::p_load(gghighlight)\n\n# replace NA values with unknown in the outcome variable\nlinelist <- linelist %>%\n mutate(outcome = replace_na(outcome, \"Unknown\"))\n\n# produce a histogram of all cases by age\nggplot(\n data = linelist,\n mapping = aes(x = age_years, fill = outcome)) +\n geom_histogram() + \n gghighlight::gghighlight(outcome == \"Death\") # highlight instances where the patient has died.\n\n\n\n\n\n\n\n\nThis also works well with faceting functions - it allows the user to produce facet plots with the background data highlighted that doesn’t apply to the facet! Below we count cases by week and plot the epidemic curves by hospital (color = and facet_wrap() set to hospital column).\n\n# produce a linegraph of all cases by age\nlinelist %>% \n count(week = lubridate::floor_date(date_hospitalisation, \"week\"),\n hospital) %>% \n ggplot() +\n geom_line(mapping = aes(x = week, \n y = n, \n color = hospital)) +\n theme_minimal() +\n gghighlight::gghighlight() + # highlight instances where the patient has died\n facet_wrap(~hospital) + # make facets by outcome\n scale_x_date(labels = date_format(\"%m/%y\"))",
"crumbs": [
"Data Visualization",
"<span class='chapter-number'>31</span>  <span class='chapter-title'>ggplot tips</span>"
@@ -2925,7 +2925,7 @@
"href": "new_pages/ggplot_tips.html#dual-axes",
"title": "31  ggplot tips",
"section": "31.11 Dual axes",
"text": "31.11 Dual axes\nA secondary y-axis is often a requested addition to a ggplot2 graph. While there is a robust debate about the validity of such graphs in the data visualization community, and they are often not recommended, your manager may still want them. Below, we present one method to achieve them.\nThis approach involves creating two separate datasets, one for each of the different plots we want to achieve, and then calculating a “scaling factor” required to transform the values onto the same scale.\nThis is because the function we are going to use to add a second y-axis, sec_axis() requires the second axis be directly proportional to the first axis.\nTo demonstrate this technique we will overlay the epidemic curve with a line of the weekly percent of patients who died. We use this example because the alignment of dates on the x-axis is more complex than say, aligning a bar chart with another plot. Some things to note:\n\nThe epicurve and the line are aggregated into weeks prior to plotting and the date_breaks and date_labels are identical - we do this so that the x-axes of the two plots are the same when they are overlaid.\n\nThe y-axis is created to the right-side for plot 2 with the sex_axis = argument of scale_y_continuous().\n\nNote there is another example of this technique in the Epidemic curves page - overlaying cumulative incidence on top of the epicurve.\nMake the datasets for the plot\nHere we will transform linelist into two different datasets linelist_primary_axis and linelist_secondary_axis in order to then create the scaling factor that will allow us to attach a second axis at the correct scale.\n\n#Set up linelist for primary axis - the weekly cases epicurve\nlinelist_primary_axis &lt;- linelist %&gt;% \n count(epiweek = lubridate::floor_date(date_onset, \"week\"))\n\n#Set up linelist for secondary axis - the line graph of the weekly percent of deaths\nlinelist_secondary_axis &lt;- linelist %&gt;% \n group_by(\n epiweek = 
lubridate::floor_date(date_onset, \"week\")) %&gt;% \n summarise(\n n = n(),\n pct_death = 100*sum(outcome == \"Death\", na.rm = T) / n)\n\nCalculate the scaling factor\nNow that we have created the datasets with our variables of interest, we want to extract the columns and calculate the maximum value in each in order to set our scale. We will then divide the secondary axis value by the first axis value in order to create our scaling factor\n\n#Set up scaling factor to transform secondary axis\nlinelist_primary_axis_max &lt;- linelist_primary_axis %&gt;%\n pull(n) %&gt;%\n max()\n\nlinelist_secondary_axis_max &lt;- linelist_secondary_axis %&gt;%\n pull(pct_death) %&gt;%\n max()\n\n#Create our scaling factor, how much the secondary axis value must be divided by to create values on the same scale as the primary axis\nscaling_factor &lt;- linelist_secondary_axis_max/linelist_primary_axis_max\n\nAnd now we are ready to plot! We will be using the argument to create our epicurve, and to create our line graph. Note that we are not specifying a data = argument in our first ggplot(), this is because we are using two separate datasets to create this plot.\n\nggplot() +\n #First create the epicurve\n geom_area(data = linelist_primary_axis,\n mapping = aes(x = epiweek, \n y = n), \n fill = \"grey\"\n ) +\n #Now create the linegraph\n geom_line(data = linelist_secondary_axis,\n mapping = aes(x = epiweek, \n y = pct_death / scaling_factor)\n ) +\n #Now we specify the second axis, and note that we are going to be multiplying the values of the second axis by the scaling factor in order to get the axis to display the correct values\n scale_y_continuous(\n sec.axis = sec_axis(~.*scaling_factor, \n name = \"Weekly percent of deaths\")\n ) +\n scale_x_date(\n date_breaks = \"month\",\n date_labels = \"%b\"\n ) +\n labs(\n x = \"Epiweek of symptom onset\",\n y = \"Weekly cases\",\n title = \"Weekly case incidence and percent deaths\"\n ) +\n theme_bw()",
"text": "31.11 Dual axes\nA secondary y-axis is often a requested addition to a ggplot2 graph. While there is a robust debate about the validity of such graphs in the data visualization community, and they are often not recommended, your manager may still want them. Below, we present one method to achieve them.\nThis approach involves creating two separate datasets, one for each of the different plots we want to achieve, and then calculating a “scaling factor” required to transform the values onto the same scale.\nThis is because the function we are going to use to add a second y-axis, sec_axis() requires the second axis be directly proportional to the first axis.\nTo demonstrate this technique we will overlay the epidemic curve with a line of the weekly percent of patients who died. We use this example because the alignment of dates on the x-axis is more complex than say, aligning a bar chart with another plot. Some things to note:\n\nThe epicurve and the line are aggregated into weeks prior to plotting and the date_breaks and date_labels are identical - we do this so that the x-axes of the two plots are the same when they are overlaid.\n\nThe y-axis is created to the right-side for plot 2 with the sec_axis = argument of scale_y_continuous().\n\nNote there is another example of this technique in the Epidemic curves page - overlaying cumulative incidence on top of the epicurve.\nMake the datasets for the plot\nHere we will transform linelist into two different datasets linelist_primary_axis and linelist_secondary_axis in order to then create the scaling factor that will allow us to attach a second axis at the correct scale.\n\n#Set up linelist for primary axis - the weekly cases epicurve\nlinelist_primary_axis &lt;- linelist %&gt;% \n count(epiweek = lubridate::floor_date(date_onset, \"week\"))\n\n#Set up linelist for secondary axis - the line graph of the weekly percent of deaths\nlinelist_secondary_axis &lt;- linelist %&gt;% \n group_by(\n epiweek = 
lubridate::floor_date(date_onset, \"week\")) %&gt;% \n summarise(\n n = n(),\n pct_death = 100*sum(outcome == \"Death\", na.rm = T) / n)\n\nCalculate the scaling factor\nNow that we have created the datasets with our variables of interest, we want to extract the columns and calculate the maximum value in each in order to set our scale. We will then divide the secondary axis value by the first axis value in order to create our scaling factor\n\n#Set up scaling factor to transform secondary axis\nlinelist_primary_axis_max &lt;- linelist_primary_axis %&gt;%\n pull(n) %&gt;%\n max()\n\nlinelist_secondary_axis_max &lt;- linelist_secondary_axis %&gt;%\n pull(pct_death) %&gt;%\n max()\n\n#Create our scaling factor, how much the secondary axis value must be divided by to create values on the same scale as the primary axis\nscaling_factor &lt;- linelist_secondary_axis_max/linelist_primary_axis_max\n\nAnd now we are ready to plot! We will be using the argument to create our epicurve, and to create our line graph. Note that we are not specifying a data = argument in our first ggplot(), this is because we are using two separate datasets to create this plot.\n\nggplot() +\n #First create the epicurve\n geom_area(data = linelist_primary_axis,\n mapping = aes(x = epiweek, \n y = n), \n fill = \"grey\"\n ) +\n #Now create the linegraph\n geom_line(data = linelist_secondary_axis,\n mapping = aes(x = epiweek, \n y = pct_death / scaling_factor)\n ) +\n #Now we specify the second axis, and note that we are going to be multiplying the values of the second axis by the scaling factor in order to get the axis to display the correct values\n scale_y_continuous(\n sec.axis = sec_axis(~.*scaling_factor, \n name = \"Weekly percent of deaths\")\n ) +\n scale_x_date(\n date_breaks = \"month\",\n date_labels = \"%b\"\n ) +\n labs(\n x = \"Epiweek of symptom onset\",\n y = \"Weekly cases\",\n title = \"Weekly case incidence and percent deaths\"\n ) +\n theme_bw()",
"crumbs": [
"Data Visualization",
"<span class='chapter-number'>31</span>  <span class='chapter-title'>ggplot tips</span>"
9 changes: 8 additions & 1 deletion new_pages/basics.qmd
@@ -1503,6 +1503,13 @@ options(scipen = 999)
round(c(2.5, 3.5))
janitor::round_half_up(c(2.5, 3.5))
```

For rounding from proportion to percentages, you can use the function `percent()` from the **scales** package.

```{r}
scales::percent(c(0.25, 0.35), accuracy = 0.1)
```

#### Statistical functions {.unnumbered}
@@ -1647,4 +1654,4 @@ A few things to remember when writing commands in R, to avoid errors and warning

### Code assists {.unnumbered}

Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parentheses, RStudio will raise a flag on that line, on the right side of the script, to warn you.
Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parentheses, RStudio will raise a flag on that line, on the left hand side of the script, to warn you.