practice.Rmd

# Practice

```{r setup, include=FALSE}
source("etc/common.R")
```

We have covered a lot in the last few lessons,
so this one presents some practice exercises to ground what we have learned
and introduce a few more commonly-used functions.

## Working with a single tidy table

### Load the tidyverse collection of libraries and the `here` library for constructing paths:

```{r fake-load-libraries, eval=FALSE}
library(tidyverse)
library(here)
```

### Use `here::here` to construct a path to a file and `readr::read_csv` to read that file:

```{r read-survey-data}
path = here::here("data", "person.csv")
person <- readr::read_csv(path)
person
```

*Read `survey/site.csv`.*

### Count rows and columns using `nrow` and `ncol`:

```{r count-rows-and-columns}
nrow(person)
ncol(person)
```

*How many rows and columns are in the site data?*

### Format strings using `glue::glue`:

```{r use-glue}
print(glue::glue("person has {nrow(person)} rows and {ncol(person)} columns"))
```

*Print a nicely-formatted summary of the number of rows and columns in the site data.*

### Use `colnames` to get the names of columns and `paste` to join strings together:

```{r use-colnames-and-paste}
print(glue::glue("person columns are {paste(colnames(person), collapse = ' ')}"))
```

*Print a nicely-formatted summary of the names of the columns in the site data.*

### Use `dplyr::select` to create a new table with a subset of columns by name:

```{r select-by-name}
dplyr::select(person, family_name, personal_name)
```

*Create a table with just the latitudes and longitudes of sites.*

### Use `dplyr::filter` to create a new table with a subset of rows by values:

```{r filter-with-one-condition}
dplyr::filter(person, family_name < "M")
```

*Create a table with only sites south of -48 degrees.*

### Use the pipe operator `%>%` to combine operations:

```{r select-pipe-filter}
person %>%
  dplyr::select(family_name, personal_name) %>%
  dplyr::filter(family_name < "M")
```

*Create a table with only the latitudes and longitudes of sites south of -48 degrees.*

### Use `dplyr::mutate` to create a new column with calculated values and `stringr::str_length` to calculate string length:

```{r mutate-name-length}
person %>%
  dplyr::mutate(name_length = stringr::str_length(family_name))
```

*Use the built-in function `round` to create a table with latitudes and longitudes rounded to integers.*

### Use `dplyr::arrange` to order rows and (optionally) `dplyr::desc` to impose descending order:

```{r mutate-and-arrange}
person %>%
  dplyr::mutate(name_length = stringr::str_length(family_name)) %>%
  dplyr::arrange(dplyr::desc(name_length))
```

*Create a table sorted by decreasing longitude (i.e., most negative longitude last).*

## Working with grouped data

### Read `survey/measurements.csv` and look at the data with `View`:

```{r read-and-view}
measurements <- readr::read_csv(here::here("data", "measurements.csv"))
View(measurements)
```

### Find rows where `reading` is not NA and saved as `cleaned`:

```{r remove-reading-na}
cleaned <- measurements %>%
  dplyr::filter(!is.na(reading))
cleaned
```

*Rewrite the filter expression to select rows where the visitor and quantity are not NA either.*

```{r hidden-cleaned, echo=FALSE}
cleaned <- measurements %>%
  dplyr::filter(!is.na(visitor) & !is.na(quantity) & !is.na(reading))
```

### Group measurements by quantity measured and count the number of each (the column is named `n` automatically):

```{r group-by-quantity}
cleaned %>%
  dplyr::group_by(quantity) %>%
  dplyr::count()
```

*Group by person and quantity measured.*

### Find the minimum, average, and maximum for each quantity:

```{r min-ave-max}
cleaned %>%
  dplyr::group_by(quantity) %>%
  dplyr::summarize(low = min(reading), mid = mean(reading), high = max(reading))
```

*Look at the range for each combination of person and quantity.*

### Rescale salinity measurements that are greater than 1:

```{r rescale-salinity}
cleaned <- cleaned %>%
  dplyr::mutate(reading = ifelse(quantity == 'sal' & reading > 1.0, reading/100, reading))
cleaned
```

*Do the same calculation use `case_when`.*

### Read `visited.csv`, drop the NAs, and join with the cleaned-up table of readings:

```{r join-two-tables}
cleaned <- readr::read_csv(here::here("data", "visited.csv")) %>%
  dplyr::filter(!is.na(visit_date)) %>%
  dplyr::inner_join(cleaned, by = c("visit_id" = "visit_id"))
cleaned
```

*Join `visited.csv` with `site.csv` to get (date, latitude, longitude) triples for site visits.*

### Find the dates of the highest radiation reading at each site:

```{r dates-high-rad}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(max_rad = max(reading)) %>%
  dplyr::filter(reading == max_rad)
```

Another way to do it:

```{r dates-high-rad-2}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::top_n(1, reading) %>%
  dplyr::select(site_id, visit_date, reading)
```

*Explain why this __doesn't__ work.*

```{r error-high-rad, error=TRUE}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::summarize(max_rad = max(reading)) %>%
  dplyr::ungroup() %>%
  dplyr::filter(reading == max_rad)
```

### Normalize radiation against the highest radiation seen per site:

```{r normalize-rad}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(
    max_rad = max(reading),
    frac_rad = reading / max_rad) %>%
  dplyr::select(visit_id, site_id, visit_date, frac_rad)
```

*Normalize salinity against mean salinity by site.*

### Find stepwise change in radiation per site by date:

```{r rad-change}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(delta_rad = reading - dplyr::lag(reading)) %>%
  dplyr::arrange(site_id, visit_date)
```

*Find length of time between visits by site.*

### Find sites that experience any stepwise increase in radiation between visits:

```{r rad-increases}
cleaned %>%
  dplyr::filter(quantity == "rad") %>%
  dplyr::group_by(site_id) %>%
  dplyr::mutate(delta_rad = reading - dplyr::lag(reading)) %>%
  dplyr::filter(!is.na(delta_rad)) %>%
  dplyr::summarize(any_increase = any(delta_rad > 0)) %>%
  dplyr::filter(any_increase)
```

*Find sites with visits more than one year apart.*

## Creating charts

We will use data on the mass and home range area (HRA) of various species from:

> Tamburello N, Côté IM, Dulvy NK (2015) Data from: Energy and the scaling of animal space use. Dryad Digital Repository.
> https://doi.org/10.5061/dryad.q5j65

```{r}
hra <- readr::read_csv(here::here("data", "home-range-database.csv"))
head(hra)
```

### Look at how mass is distributed:

```{r chart-mass}
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = mean.mass.g))
```

Try again with `log10.mass`:

```{r chart-log-mass}
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = log10.mass))
```

*Create histograms showing the distribution of home range area using linear and log scales.*

### Change the visual appearance of a chart:

```{r change-visual}
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = log10.mass), bins = 100) +
  ggplot2::ggtitle("Frequency of Species Masses") + ggplot2::xlab("Log10 of Mass") + ggplot2::ylab("Number of Species") +
  ggplot2::theme_minimal()
```

*Show the distribution of home range areas with a dark background.*

### Create a scatterplot showing the relationship between mass and home range area:

```{r scatterplot}
ggplot2::ggplot(hra) +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra))
```

*Create a similar scatterplot showing the relationship between the raw values rather than the log values.*

### Colorize scatterplot points by class:

```{r colorize-scatterplot}
hra %>%
  dplyr::mutate(class_fct = as.factor(class)) %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra, color = class_fct), alpha = 0.5)
```

**Group by order and experiment with different alpha values.*

### Create a faceted plot:

```{r facet-plot}
hra %>%
  dplyr::mutate(class_fct = as.factor(class)) %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra, color = class_fct), alpha = 0.5) +
  ggplot2::facet_wrap(~class_fct)
```

*Create a plot faceted by order for just the reptiles.*

### Fit a linear regression to the logarithmic data for birds:

```{r fit-line}
hra %>%
  dplyr::filter(class == "aves") %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra), alpha = 0.5) +
  ggplot2::geom_smooth(method = lm, mapping = aes(x = log10.mass, y = log10.hra), color = 'red')
```

*Fit a line to the raw data for birds rather than the logarithmic data.*

### Create a violin plot of mass by order for birds:

```{r violin-plot}
hra %>%
  dplyr::filter(class == "aves") %>%
  dplyr::mutate(order_fct = as.factor(order)) %>%
  ggplot2::ggplot() +
  ggplot2::geom_violin(mapping = aes(x = order_fct, y = log10.mass, color = order_fct))
```

*Rotate the labels on the X axis to make this readable, then explain the gaps.*

### Display the same data as a boxplot:

```{r box-plot}
hra %>%
  dplyr::filter(class == "aves") %>%
  dplyr::mutate(order_fct = as.factor(order)) %>%
  ggplot2::ggplot() +
  ggplot2::geom_boxplot(mapping = aes(x = order_fct, y = log10.mass, color = order_fct))
```

*Fix the labels and remove orders that only contain one species.*

### Save the linear regression plot for birds as a PNG:

```{r save-file}
hra %>%
  dplyr::filter(class == "aves") %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(mapping = aes(x = log10.mass, y = log10.hra), alpha = 0.5) +
  ggplot2::geom_smooth(method = lm, mapping = aes(x = log10.mass, y = log10.hra), color = 'red')
ggsave("/tmp/birds.png")
```

*Save the plot as SVG scaled to be 8cm wide.*

### Create a horizontal histogram with 50 bins:

```{r horizontal}
ggplot2::ggplot(hra) +
  ggplot2::geom_histogram(mapping = aes(x = log10.mass), bins = 50) +
  ggplot2::coord_flip()
```

*Use `stat_summary` to summarize the relationship between mass and home range area by class.*

## Writing functions

### Write and call a function that returns one column from a file:

```{r define-function, message=FALSE}
col_from_file <- function(filename, colname) {
  dat <- readr::read_csv(filename)
  dat[colname]
}

person_filename <- here::here("data", "person.csv")
col_from_file(person_filename, "family_name")
```

*Write a function that reads a file and changes the name of one column.*

### Define a default value for a parameter:

```{r default-value, message=FALSE}
col_from_file <- function(filename, colname, na = c("", "NA")) {
  dat <- readr::read_csv(filename, na = na)
  dat[colname]
}

col_from_file(person_filename, "family_name", c("Dyer"))
```

*Write a function that only keeps the first 100 rows of a table unless another value is passed.*

### Name the value passed for a parameter to make intention clearer:

```{r named-passing}
col_from_file(person_filename, "family_name", na = c("Dyer"))
```

*Call your function with all of the parameters out of order.*

### Make functions slightly more robust:

```{r with-error-check, message=FALSE, error=TRUE}
col_from_file <- function(filename, colname, na = c("", "NA")) {
  dat <- readr::read_csv(filename, na = na)
  stopifnot(colname %in% colnames(dat))
  dat[colname]
}

col_from_file(person_filename, "FAMILY", na = c("Dyer"))
```

*Fail if the number of rows asked for is less than or equal to zero.*

### Use quoting and splicing to avoid needing a string for the column name:

```{r nse-single, message=FALSE}
cols_from_file <- function(filename, colname, na = c("", "NA")) {
  colname = rlang::enquo(colname)
  readr::read_csv(filename, na = na) %>%
    dplyr::select(!!colname)
}

cols_from_file(person_filename, personal_name)
```

*Modify your function so that users can pass `num_rows/2` or a similar expression to specify the number of rows they want to keep.*

### Select any number of columns:

```{r quote-multi-column, message=FALSE}
cols_from_file <- function(filename, ..., na = c("", "NA")) {
  readr::read_csv(filename, na = na) %>%
    dplyr::select(...)
}

cols_from_file(person_filename, personal_name, family_name)
```

*Can you call `cols_from_file` to subtract columns rather than keeping them?*

### Add an optional filter condition:

```{r}
cols_from_file <- function(filename, ..., na = c("", "NA"), filter_by = NULL) {
  temp <- read_csv(filename, na = na) %>%
    dplyr::select(...)
  filter_by <- rlang::enquo(filter_by)
  if (!is.null(filter_by)) {
    temp <- temp %>%
      filter(!!filter_by)
  }
  temp
}

cols_from_file(person_filename, personal_name, family_name, filter_by = family_name > "M")
```

*Why doesn't this work if the call to `rlang::enquo` is moved inside the conditional?*

### Run a function safely:

```{r nonexistent-column, message=FALSE, error=TRUE}
cols_from_file(person_filename, NONEXISTENT)
```

```{r run-safely-fail, message=FALSE}
safe_cols_from_file <- purrr::safely(cols_from_file)
safe_cols_from_file(person_filename, NONEXISTENT)
```

```{r run-safely-succeed, message=FALSE}
safe_cols_from_file(person_filename, person_id)
```

*When is `purrr::safely` safe to use?*

### Unpack multiple values in a single assignment:

```{r unpack, message=FALSE}
library(zeallot)

c(result, error) %<-% safe_cols_from_file(person_filename, person_id)
result
error
```

*What happens if there are more variables on the left of `%<-%` than there are values on the right?*

### Assign out of function scope with `<<-`:

```{r assign-upward, message=FALSE}
call_count = 0

safe_cols_from_file <- purrr::safely(function(filename, ..., na = c("", "NA")) {
  call_count <<- call_count + 1
  readr::read_csv(filename, na = na) %>%
    dplyr::select(...)
})

c(result, error) %<-% safe_cols_from_file(person_filename, family_name)

call_count
```

*What happens if the variable assigned to with `<<-` does not exist before the assignment?*

## Functional programming

### Apply a function to every element of a vector:

```{r purrr-map}
long_name <- function(name){
  stringr::str_length(name) > 4
}

purrr::map(person$family_name, long_name)
```

*Modify this example to return a logical vector.*

### Use an anonymous function:

```{r anonymous-function}
purrr::map_lgl(person$family_name, function(name) stringr::str_length(name) > 4)
```

*Create a character vector with all names in upper case.*

### Use a formula with `.x` as a shorthand for an anonymous function of one argument:

```{r}
purrr::map_chr(person$family_name, ~ stringr::str_to_upper(.x))
```

*Do you actually need `purrr::map_chr` to do this?*

### Map a function of two arguments:

```{r}
purrr::map2_chr(person$personal_name, person$family_name, ~stringr::str_c(.y, .x, sep = '_'))
```

*Calculate a vector of `mean.mass.g` over `mean.hra.m2` for the `hra` table using `purrr::map2_dbl`, then explain how you ought to do it instead.*

### Map a function of three or more arguments:

```{r}
vals = list(first = person$personal_name, last = person$family_name, ident = person$person_id)
purrr::pmap_chr(vals, function(ident, last, first) stringr::str_c(first, last, ident, sep = '_'))
```

*Repeat this without giving names to the elements of `vals`.*

### Flattern one level of a nested structure:

```{r}
purrr::flatten(list(person$personal_name, person$family_name))
```

*Use `purrr::keep` to discard the elements from the flattened list that are greater than "M".*

### Check whether every element passes a test:

```{r}
purrr::every(person$personal_name, ~.x > 'M')
```

*Use `some` to check whether any of the elements pass the same test.*

### Modify specific elements of a list:

```{r}
purrr::modify_at(person$personal_name, c(2, 4), stringr::str_to_upper)
```

*Use `modify_if` to upper-case names that are greater than "M".*

### Create an acronym:

```{r}
purrr::reduce(person$personal_name, ~stringr::str_c(.x, stringr::str_sub(.y, 1, 1)), .init = "")
```

*Explain why using `stringr::str_c(stringr::str_sub(.x, 1, 1), stringr::str_sub(.y, 1, 1))` doesn't work.*

### Create intermediate values:

```{r}
purrr::accumulate(person$personal_name, ~stringr::str_c(.x, stringr::str_sub(.y, 1, 1)), .init = "")
```

*Modify this so that the initial empty string isn't in the final result.*

```{r links, child="etc/links.md"}
```