
Commit

Updated the iteration chapter
arranhamlet committed Sep 9, 2024
1 parent 3752295 commit c5f136a
Showing 24 changed files with 2,959 additions and 922 deletions.
280 changes: 145 additions & 135 deletions html_outputs/new_pages/deduplication.html

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions html_outputs/new_pages/factors.html
@@ -1280,8 +1280,7 @@ <h3 class="unnumbered anchored" data-anchor-id="in-tables">In tables</h3>
</section>
<section id="epiweeks" class="level2" data-number="11.8">
<h2 data-number="11.8" class="anchored" data-anchor-id="epiweeks"><span class="header-section-number">11.8</span> Epiweeks</h2>
-<p>Please see the extensive discussion of how to create epidemiological weeks in the <a href="../new_pages/grouping.html">Grouping data</a> page.<br>
-Please also see the <a href="../new_pages/dates.html">Working with dates</a> page for tips on how to create and format epidemiological weeks.</p>
+<p>Please see the extensive discussion of how to create epidemiological weeks in the <a href="../new_pages/grouping.html">Grouping data</a> page. Also see the <a href="../new_pages/dates.html">Working with dates</a> page for tips on how to create and format epidemiological weeks.</p>
<section id="epiweeks-in-a-plot" class="level3 unnumbered">
<h3 class="unnumbered anchored" data-anchor-id="epiweeks-in-a-plot">Epiweeks in a plot</h3>
<p>If your goal is to create epiweeks to display in a plot, you can do this simply with <strong>lubridate</strong>’s <code>floor_date()</code>, as explained in the <a href="../new_pages/grouping.html">Grouping data</a> page. The values returned will be of class Date with format YYYY-MM-DD. If you use this column in a plot, the dates will naturally order correctly, and you do not need to worry about levels or converting to class Factor. See the <code>ggplot()</code> histogram of onset dates below.</p>
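(Not part of the commit: a minimal sketch of the approach described above, assuming a `linelist` data frame with a Date-class `date_onset` column and Monday-start weeks. The collapsed chunk is presumably similar.)

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# floor each onset date to the start of its week; the result is still class Date
linelist <- linelist %>%
  mutate(week = floor_date(date_onset, unit = "week", week_start = 1))

# Date-class weeks order correctly on the x-axis, no factor conversion needed
ggplot(linelist, aes(x = week)) +
  geom_histogram(binwidth = 7)   # one bar per 7-day week
```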
@@ -1305,7 +1304,7 @@ <h3 class="unnumbered anchored" data-anchor-id="epiweeks-in-a-plot">Epiweeks in
<h3 class="unnumbered anchored" data-anchor-id="epiweeks-in-the-data">Epiweeks in the data</h3>
<p>However, if your purpose in factoring is <em>not</em> to plot, you can approach this one of two ways:</p>
<ol type="1">
-<li><em>For fine control over the display</em>, convert the <strong>lubridate</strong> epiweek column (YYYY-MM-DD) to the desired display format (YYYY-WWw) <em>within the data frame itself</em>, and then convert it to class Factor.</li>
+<li><em>For fine control over the display</em>, convert the <strong>lubridate</strong> epiweek column (YYYY-MM-DD) to the desired display format (YYYY-Www) <em>within the data frame itself</em>, and then convert it to class Factor.</li>
</ol>
<p>First, use <code>format()</code> from <strong>base</strong> R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the <a href="../new_pages/dates.html">Working with dates</a> page). In this process the class will be converted to character. Then, convert from character to class Factor with <code>factor()</code>.</p>
<div class="cell">
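(Also not from the commit: a sketch of the two-step conversion just described, assuming an `epiweek` Date column; the exact format string is an assumption.)

```r
library(dplyr)

# format() converts Date to character; factor() then sorts the levels
# alphabetically, which for zero-padded "YYYY-Www" matches chronological order
linelist <- linelist %>%
  mutate(
    epiweek_fmt = format(epiweek, "%Y-W%W"),   # e.g. "2024-W36"
    epiweek_fct = factor(epiweek_fmt))
```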
@@ -1931,7 +1930,7 @@ <h2 data-number="11.9" class="anchored" data-anchor-id="resources"><span class="
</div>
</div>
</footer>
-<script>var lightboxQuarto = GLightbox({"descPosition":"bottom","selector":".lightbox","loop":false,"closeEffect":"zoom","openEffect":"zoom"});
+<script>var lightboxQuarto = GLightbox({"selector":".lightbox","openEffect":"zoom","closeEffect":"zoom","loop":false,"descPosition":"bottom"});
window.onload = () => {
lightboxQuarto.on('slide_before_load', (data) => {
const { slideIndex, slideNode, slideConfig, player, trigger } = data;
80 changes: 40 additions & 40 deletions html_outputs/new_pages/grouping.html

Large diffs are not rendered by default.

179 changes: 124 additions & 55 deletions html_outputs/new_pages/iteration.html

Large diffs are not rendered by default.

587 changes: 267 additions & 320 deletions html_outputs/new_pages/joining_matching.html

Large diffs are not rendered by default.

408 changes: 193 additions & 215 deletions html_outputs/new_pages/pivoting.html

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions html_outputs/search.json

Large diffs are not rendered by default.

57 changes: 32 additions & 25 deletions new_pages/deduplication.qmd
@@ -7,9 +7,9 @@ knitr::include_graphics(here::here("images", "deduplication.png"))

This page covers the following de-duplication techniques:

-1. Identifying and removing duplicate rows
-2. "Slicing" rows to keep only certain rows (e.g. min or max) from each group of rows
-3. "Rolling-up", or combining values from multiple rows into one row
+1. Identifying and removing duplicate rows.
+2. "Slicing" rows to keep only certain rows (e.g. min or max) from each group of rows.
+3. "Rolling-up", or combining values from multiple rows into one row.


<!-- ======================================================= -->
@@ -24,7 +24,8 @@ This code chunk shows the loading of packages required for the analyses. In this
pacman::p_load(
tidyverse, # deduplication, grouping, and slicing functions
janitor, # function for reviewing duplicates
-stringr) # for string searches, can be used in "rolling-up" values
+stringr # for string searches, can be used in "rolling-up" values
+)
```

### Import data {.unnumbered}
@@ -66,9 +67,9 @@ DT::datatable(obs, rownames = FALSE, filter = "top", options = list(pageLength =

A few things to note as you review the data:

-* The first two records are 100% complete duplicates including duplicate `recordID` (must be a computer glitch!)
-* The second two rows are duplicates, in all columns *except for `recordID`*
-* Several people had multiple phone encounters, at various dates and times, and as contacts and/or cases
+* The first two records are 100% complete duplicates including duplicate `recordID` (must be a computer glitch!).
+* The second two rows are duplicates, in all columns *except for `recordID`*.
+* Several people had multiple phone encounters, at various dates and times, and as contacts and/or cases.
* At each encounter, the person was asked if they had **ever** had symptoms, and some of this information is missing.


@@ -91,7 +92,7 @@ This section describes how to review and remove duplicate rows in a data frame.

To quickly review rows that have duplicates, you can use `get_dupes()` from the **janitor** package. *By default*, all columns are considered when duplicates are evaluated - rows returned by the function are 100% duplicates considering the values in *all* columns.

-In the `obs` data frame, the first two rows are *100% duplicates* - they have the same value in every column (including the `recordID` column, which is *supposed* to be unique - it must be some computer glitch). The returned data frame automatically includes a new column `dupe_count` on the right side, showing the number of rows with that combination of duplicate values.
+In the `obs` data frame, the first two rows are *100% duplicates* - they have the same value in every column (including the `recordID` column, which is *supposed* to be unique). The returned data frame automatically includes a new column `dupe_count` on the right side, showing the number of rows with that combination of duplicate values.

```{r, eval=F}
# 100% duplicates across all columns
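# (not in the commit: a sketch of the collapsed call; with no column
# arguments, get_dupes() considers all columns when evaluating duplicates)
obs %>%
  janitor::get_dupes()
```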
Expand Down Expand Up @@ -123,7 +124,7 @@ obs %>%

You can also positively specify the columns to consider. Below, only rows that have the same values in the `name` and `purpose` columns are returned. Notice how "amrish" now has `dupe_count` equal to 3 to reflect his three "contact" encounters.

-*Scroll left for more rows**
+*Scroll left for more rows*

```{r, eval=F}
# duplicates based on name and purpose columns ONLY
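# (not in the commit: a sketch of the collapsed call; only the named
# columns are considered when identifying duplicates)
obs %>%
  janitor::get_dupes(name, purpose)
```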
@@ -265,12 +266,13 @@ To "slice" a data frame to apply a filter on the rows by row number/position. Th
The basic `slice()` function accepts numbers and returns rows in those positions. If the numbers provided are positive, only they are returned. If negative, those rows are *not* returned. Numbers must be either all positive or all negative.

```{r}
-obs %>% slice(4) # return the 4th row
+obs %>%
+  slice(4) # return the 4th row
```

```{r}
-obs %>% slice(c(2,4)) # return rows 2 and 4
-#obs %>% slice(c(2:4)) # return rows 2 through 4
+obs %>%
+  slice(c(2,4)) # return rows 2 and 4
```


@@ -284,17 +286,18 @@ There are several variations: These should be provided with a column and a numb


```{r}
-obs %>% slice_max(encounter, n = 1) # return rows with the largest encounter number
+obs %>%
+  slice_max(encounter, n = 1) # return rows with the largest encounter number
```

Use arguments `n = ` or `prop = ` to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. `slice(data, n = 2)`). See `?slice` for more information.

Other arguments:

-`.order_by = ` used in `slice_min()` and `slice_max()` this is a column to order by before slicing.
-`with_ties = ` TRUE by default, meaning ties are kept.
-`.preserve = ` FALSE by default. If TRUE then the grouping structure is re-calculated after slicing.
-`weight_by = ` Optional, numeric column to weight by (bigger number more likely to get sampled). Also `replace = ` for whether sampling is done with/without replacement.
+* `.order_by = ` used in `slice_min()` and `slice_max()` this is a column to order by before slicing.
+* `with_ties = ` TRUE by default, meaning ties are kept.
+* `.preserve = ` FALSE by default. If TRUE then the grouping structure is re-calculated after slicing.
+* `weight_by = ` Optional, numeric column to weight by (bigger number more likely to get sampled). Also `replace = ` for whether sampling is done with/without replacement.

<span style="color: darkgreen;">**_TIP:_** When using `slice_max()` and `slice_min()`, be sure to specify/write the `n = ` (e.g. `n = 2`, not just `2`). Otherwise you may get an error `Error: `...` is not empty.` </span>
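
(A quick illustration of the TIP above, not from the commit:)

```r
library(dplyr)

obs %>% slice_max(encounter, n = 2)   # works: n is explicitly named
# obs %>% slice_max(encounter, 2)     # error: the unnamed 2 lands in `...`
```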

@@ -310,7 +313,9 @@ The `slice_*()` functions can be very useful if applied to a grouped data frame

This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use `group_by()` with key columns that are the same per person, and then use a slice function on a column that will differ among the grouped rows.

-In the example below, to keep only the *latest* encounter *per person*, we group the rows by `name` and then use `slice_max()` with `n = 1` on the `date` column. Be aware! To apply a function like `slice_max()` on dates, the date column must be class Date.
+In the example below, to keep only the *latest* encounter *per person*, we group the rows by `name` and then use `slice_max()` with `n = 1` on the `date` column.
+
+**Be aware!** To apply a function like `slice_max()` on dates, the date column must be class Date.

By default, "ties" (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. adam). To avoid this we set `with_ties = FALSE`. We get back only one row per person.
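
(A sketch of the grouped slice described above, not the commit's exact code; it assumes `obs` has `name` and Date-class `date` columns:)

```r
library(dplyr)

obs %>%
  group_by(name) %>%                          # one group of rows per person
  slice_max(date, n = 1, with_ties = FALSE)   # keep only the latest row per group
```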

@@ -392,7 +397,9 @@ If you want to keep all records but mark only some for analysis, consider a two-
# 1. Define data frame of rows to keep for analysis
obs_keep <- obs %>%
group_by(name) %>%
-slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person
+slice_max(encounter,
+          n = 1,
+          with_ties = FALSE) # keep only latest encounter per person
# 2. Mark original data frame
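# (not in the commit: a sketch of the collapsed step 2; matching kept rows
# by recordID is an assumption)
obs_marked <- obs %>%
  mutate(dup_record = case_when(
    recordID %in% obs_keep$recordID ~ "For analysis",   # row kept in step 1
    TRUE                            ~ "Duplicate"))     # all other rows
```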
@@ -429,7 +436,7 @@ Then the new column `key_completeness` is created with `mutate()`. The new value

This involves the function `rowSums()` from **base** R. Also used is `.`, which within piping refers to the data frame at that point in the pipe (in this case, it is being subset with brackets `[]`).

-*Scroll to the right to see more rows**
+*Scroll to the right to see more rows*

```{r, eval=F}
# create a "key variable completeness" column
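# (not in the commit: a sketch; the choice of key columns is an assumption.
# `.` refers to the data frame at this point in the pipe)
obs %>%
  mutate(key_completeness = rowSums(!is.na(.[, c("name", "date", "symptoms_ever")])) / 3)
```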
Expand Down Expand Up @@ -460,8 +467,8 @@ See the [original data](#dedup_data).

This section describes:

-1) How to "roll-up" values from multiple rows into just one row, with some variations
-2) Once you have "rolled-up" values, how to overwrite/prioritize the values in each cell
+1) How to "roll-up" values from multiple rows into just one row, with some variations.
+2) Once you have "rolled-up" values, how to overwrite/prioritize the values in each cell.

This tab uses the example dataset from the Preparation tab.

@@ -472,9 +479,9 @@ This tab uses the example dataset from the Preparation tab.

The code example below uses `group_by()` and `summarise()` to group rows by person, and then paste together all unique values within the grouped rows. Thus, you get one summary row per person. A few notes:

-* A suffix is appended to all new columns ("_roll" in this example)
-* If you want to show only unique values per cell, then wrap the `na.omit()` with `unique()`
-* `na.omit()` removes `NA` values, but if this is not desired it can be removed `paste0(.x)`...
+* A suffix is appended to all new columns ("_roll" in this example).
+* If you want to show only unique values per cell, then wrap the `na.omit()` with `unique()`.
+* `na.omit()` removes `NA` values, but if this is not desired it can be removed `paste0(.x)`.
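
(A sketch of the roll-up described above, not the commit's exact code; grouping by `name` is an assumption:)

```r
library(dplyr)

obs %>%
  group_by(name) %>%                                   # one output row per person
  summarise(across(
    everything(),
    ~ paste0(unique(na.omit(.x)), collapse = "; "),    # combine unique non-NA values
    .names = "{.col}_roll"))                           # append the "_roll" suffix
```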



