# Descriptive statistics and data manipulation
Now that we are familiar with some R objects and know how to import data, it is time to write some
code. In this chapter, we are going to compute descriptive statistics for a single dataset, but
also for a list of datasets later in the chapter. However, I will not give a list of functions to
compute descriptive statistics; if you need a specific function, you can easily find it in the *Help*
pane in RStudio or using any modern internet search engine. What I will do is show you a workflow
that allows you to compute the descriptive statistics you need quickly.

R has a lot of built-in functions for descriptive statistics; however, if you want to compute
statistics for different sub-groups, some more complex manipulations are needed. At least this was
true in the past. Nowadays, thanks to the packages from the `{tidyverse}`, it is very easy and fast
to compute descriptive statistics by any stratifying variable(s). The package we are going to use
for this is called `{dplyr}`. `{dplyr}` contains a lot of functions that make manipulating data and
computing descriptive statistics very easy. To make things easier for now, we are going to use
example data included with `{dplyr}`, so there is no need to import an external dataset; this
changes nothing about the example we are going to study here, as the source of the data does not
matter.

Using `{dplyr}` is only possible if the data you are working with is already in a useful shape.
When data is messier, you will first need to manipulate it to bring it into a *tidy* format. For
this, we will use `{tidyr}`, which is a very useful package for reshaping data and doing advanced
cleaning of your data. All these tidyverse functions are also called *verbs*. However, before
getting to know these verbs, let's do an analysis using standard, or *base*, R functions. This will
be the benchmark against which we are going to measure a `{tidyverse}` workflow.
## A data exploration exercise using *base* R
Let's first load the `starwars` data set, included in the `{dplyr}` package:
```{r}
library(dplyr)
data(starwars)
```
Let's first take a look at the data:
```{r}
head(starwars)
```
This data contains information on Star Wars characters. The first question we want to answer is
simple: what is the average height of the characters?
```{r}
mean(starwars$height)
```
As discussed in Chapter 2, `$` allows you to access columns of a `data.frame` object.
Because there are `NA` values in the data, the result is also `NA`. To get the result, you need to
add an option to `mean()`:
```{r}
mean(starwars$height, na.rm = TRUE)
```
Let's also take a look at the standard deviation:
```{r}
sd(starwars$height, na.rm = TRUE)
```
It might be more informative to compute these two statistics by sex, so for this, we are going
to use `aggregate()`:
```{r}
aggregate(starwars$height,
          by = list(sex = starwars$sex),
          mean)
```
Oh, shoot! Most groups have missing values in them, so we get `NA` back. We need to use `na.rm = TRUE`
just like before. Thankfully, it is possible to pass this option to `mean()` inside `aggregate()` as well:
```{r}
aggregate(starwars$height,
          by = list(sex = starwars$sex),
          mean, na.rm = TRUE)
```
Later in the book, we are also going to see how to define our own functions (with the default options that
are useful to us), and this will also help in this sort of situation.
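As a quick preview, and purely as an illustrative sketch (the name `mean_narm` is mine, not a function defined by R or used later in the book), such a function could look like this:
```{r}
# A wrapper around mean() with na.rm = TRUE set by default,
# so we don't have to repeat the option every time
mean_narm <- function(x, ...) {
  mean(x, na.rm = TRUE, ...)
}

mean_narm(starwars$height)
```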
Even though we can use `na.rm = TRUE`, let's also use `subset()` to filter out the `NA` values beforehand:
```{r}
starwars_no_nas <- subset(starwars,
                          !is.na(height))

aggregate(starwars_no_nas$height,
          by = list(sex = starwars_no_nas$sex),
          mean)
```
(`aggregate()` also has a `subset = ` option, but I prefer to explicitly subset the data set with `subset()`.)
Even if you are not familiar with `aggregate()`, I believe the above lines are quite
self-explanatory. You need to provide `aggregate()` with three things: the variable you want to
summarize (or the whole data frame, if you want to summarize all variables), a list of grouping
variables, and the function that will be applied to each subgroup. And by the way, to test for
`NA`, one uses the function `is.na()`, not something like `species == "NA"` or anything like that.
`!is.na()` does the opposite (`!` reverses booleans, so `!TRUE` becomes `FALSE` and vice versa).
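A quick illustration, using a small hypothetical vector:
```{r}
test_vec <- c(1, NA, 3)

is.na(test_vec)   # TRUE only for the missing element
!is.na(test_vec)  # the reverse: TRUE for the non-missing elements
test_vec == "NA"  # not what you want: this compares against the string "NA"
```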
You can easily add another grouping variable:
```{r}
aggregate(starwars_no_nas$height,
          by = list(Sex = starwars_no_nas$sex,
                    `Hair color` = starwars_no_nas$hair_color),
          mean)
```
or use another function:
```{r}
aggregate(starwars_no_nas$height,
          by = list(Sex = starwars_no_nas$sex),
          sd)
```
(let's ignore the `NA`s). It is important to note that `aggregate()` returns a `data.frame` object.
You can only give one function to `aggregate()`, so if you need the mean and the standard deviation of `height`,
you must do it in two steps.
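To make that concrete, here is a minimal sketch of those two steps, reusing `starwars_no_nas` from above and combining the two results with `merge()`:
```{r}
mean_height_by_sex <- aggregate(starwars_no_nas$height,
                                by = list(sex = starwars_no_nas$sex),
                                mean)

sd_height_by_sex <- aggregate(starwars_no_nas$height,
                              by = list(sex = starwars_no_nas$sex),
                              sd)

# merge() combines the two data.frames by their common "sex" column;
# the two "x" columns get suffixes to keep them apart
merge(mean_height_by_sex, sd_height_by_sex, by = "sex")
```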
Since R 4.1, a new infix operator `|>` has been introduced, which is really handy for writing the kind of
code we've been looking at in this chapter. `|>` is also called a pipe, or the *base* pipe to distinguish
it from *another* pipe that we'll discuss in the next section. For now, let's learn about `|>`.
Consider the following:
```{r}
10 |> sqrt()
```
This computes `sqrt(10)`; so what `|>` does is pass the left-hand side (`10`, in the example above) to the
right-hand side (`sqrt()`). Using `|>` might seem more complicated and verbose than not using it, but you
will see in a bit why it can be useful. The next function I would like to introduce at this point is `with()`.
`with()` makes it possible to apply functions on `data.frame` columns without having to write `$` all the time.
For example, consider this:
```{r}
mean(starwars$height, na.rm = TRUE)
with(starwars,
     mean(height, na.rm = TRUE))
```
The advantage of using `with()` is that we can directly reference `height` without using `$`. Here again, this
is more verbose than simply using `$`... so why bother with it? It turns out that by combining `|>` and `with()`,
we can write very clean and concise code. Let's go back to a previous example to illustrate this idea:
```{r}
starwars_no_nas <- subset(starwars,
                          !is.na(height))

aggregate(starwars_no_nas$height,
          by = list(sex = starwars_no_nas$sex),
          mean)
```
First, we created a new dataset where we filtered out the rows where `height` is `NA`. This dataset is not
useful in itself, but we needed it for the next step, where we actually do what we want (computing the average
`height` by `sex`). Using `|>` and `with()`, we can write all of this in one go (this time also filtering out
missing `sex` values and additionally grouping by `species`):
```{r}
starwars |>
  subset(!is.na(sex)) |>
  with(aggregate(height,
                 by = list(Species = species,
                           Sex = sex),
                 mean))
```
So let's unpack this. In the first two rows, using `|>`, we pass the `starwars` `data.frame` to `subset()`:
```{r}
starwars |>
  subset(!is.na(sex))
```
as I explained before, this is exactly the same as `subset(starwars, !is.na(sex))`. Then, we pass the result of
`subset()` to the next function, `with()`. The first argument of `with()` must be a `data.frame`, and this is exactly
what `subset()` returns! So now the output of `subset()` is passed down to `with()`, which makes it now possible
to reference the columns of the `data.frame` in `aggregate()` directly. If you have a hard time understanding what
is going on, you can use `quote()` to inspect the code; `quote()` returns an expression without evaluating it:
```{r}
quote(log(10))
```
Why am I bringing this up? Well, since `a |> f()` is exactly equal to `f(a)`, quoting piped code returns
the equivalent expression, with the pipes resolved into nested calls. For instance:
```{r}
quote(10 |> log())
```
So let's quote the big block of code from above:
```{r}
quote(
  starwars |>
    subset(!is.na(sex)) |>
    with(aggregate(height,
                   by = list(Species = species,
                             Sex = sex),
                   mean))
)
```
I think now you see why using `|>` makes code much clearer; the nested expression you would need to write otherwise
is much less readable, unless you define intermediate objects. And without `with()`, this is what you
would need to write:
```{r, eval = F}
b <- subset(starwars, !is.na(height))
aggregate(b$height, by = list(Species = b$species, Sex = b$sex), mean)
```
To finish this section, let's say that you wanted to have the average `height` and `mass` by sex. In this case
you need to specify the columns in `aggregate()` with `cbind()` (let's use `na.rm = TRUE` again instead of
`subset()`ing the data beforehand):
```{r}
starwars |>
  with(aggregate(cbind(height, mass),
                 by = list(Sex = sex),
                 FUN = mean, na.rm = TRUE))
```
Let's now continue with some more advanced operations using this fake dataset:
```{r}
survey_data_base <- as.data.frame(
  tibble::tribble(
    ~id, ~var1, ~var2, ~var3,
      1,     1,   0.2,   0.3,
      2,   1.4,   1.9,   4.1,
      3,   0.1,   2.8,   8.9,
      4,   1.7,   1.9,   7.6
  )
)
```
```{r}
survey_data_base
```
Depending on what you want to do with this data, it may not be in the right shape. For example, it
would not be possible to simply compute the average of `var1`, `var2` and `var3` for each `id`,
because this would require running `mean()` row by row, and R is not really suited to row-based
workflows. Well, I'm lying a little bit here: it turns out that R comes with a `rowMeans()`
function. So this would work:
```{r}
survey_data_base |>
transform(mean_id = rowMeans(cbind(var1, var2, var3))) #transform adds a column to a data.frame
```
But there is no `rowSD()` or `rowMax()`, etc., so it is much better to reshape the data and put it in a
format that gives us maximum flexibility. To reshape the data, we'll be using the aptly-named `reshape()` function:
```{r}
survey_data_long <- reshape(survey_data_base,
                            varying = list(2:4),
                            v.names = "variable",
                            direction = "long")
```
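Before computing anything, it is worth taking a quick look at the reshaped data; each `id` now has one row per variable:
```{r}
# Ordering by id makes the long format easier to read
survey_data_long[order(survey_data_long$id), ]
```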
We can now easily compute the average of `variable` for each `id`:
```{r}
aggregate(survey_data_long$variable,
          by = list(Id = survey_data_long$id),
          mean)
```
or any other variable:
```{r}
aggregate(survey_data_long$variable,
          by = list(Id = survey_data_long$id),
          max)
```
As you can see, R comes with very powerful functions right out of the box, ready to use. When I was
studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do all of
this using loops (not the reshaping, thankfully), which was not so easy.
Now that we have seen how *base* R works, let's redo the analysis using `{tidyverse}` verbs.
The `{tidyverse}` provides many more functions, each of them doing only one single thing. You will
shortly see why this is quite important; by focusing on just one task, and by focusing on the data frame
as the central object, it becomes possible to build really complex workflows, piece by piece,
very easily.
But before deep diving into the `{tidyverse}`, let's take a moment to discuss another infix
operator, `%>%`.
## Smoking is bad for you, but pipes are your friend
The title of this section might sound weird at first, but by the end of it, you'll get this
(terrible) pun.
You probably know the following painting by René Magritte, *La trahison des images*:
```{r, echo=FALSE}
knitr::include_graphics("assets/pas_une_pipe.png")
```
It turns out there's an R package from the `tidyverse` that is called `magrittr`. What does this
package do? This package introduced *pipes* to R, way before `|>` in R 4.1. Pipes are a concept
from the Unix operating system; if you're using a GNU+Linux distribution or macOS, you're basically
using a *modern* unix (that's an oversimplification, but I'm an economist by training, and
outrageously oversimplifying things is what we do, deal with it). The *magrittr* pipe is written as
`%>%`. Just like `|>`, `%>%` takes the left hand side to feed it as the first argument of the
function in the right hand side. Try the following:
```{r, include = FALSE}
library(magrittr)
```
```{r, eval = FALSE}
library(magrittr)
```
```{r}
16 %>% sqrt
```
You can chain multiple functions, as you can with `|>`:
```{r}
16 %>%
sqrt %>%
log
```
But unlike with `|>`, you can omit the `()`. `%>%` also has other features; for example, you can
pipe things to other infix operators, such as `+`. You can use `+` as usual:
```{r}
2 + 12
```
Or as a prefix operator:
```{r}
`+`(2, 12)
```
You can use this notation with `%>%`:
```{r}
16 %>% sqrt %>% `+`(18)
```
This also works using `|>` since R version 4.2, but only if you use the `_` pipe placeholder:
```{r}
16 |> sqrt() |> `+`(x = _, 18)
```
`16` got fed to `sqrt()`, and the output of `sqrt(16)` (4) got fed to `+(18)`
(so we got `+(4, 18)` = 22). Without `%>%` you'd write the line just above like this:
```{r}
sqrt(16) + 18
```
Just like before, with `|>`, this might seem overly complicated, but using these pipes will
make our code much more readable. I'm sure you'll be convinced by the end of this chapter.
`%>%` is not the only pipe operator in `{magrittr}`. There's also `%T>%`, `%<>%` and `%$%`. All have their
uses, but they are basically shortcuts for common tasks combining `%>%` with another function. This
means that you can live without them, and because of this, I will not discuss them in detail.
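For the curious, here is a very quick, optional glimpse at two of them: `%$%` exposes the columns of a data frame to the right-hand side (much like `with()`), and `%T>%` pipes a value into a function for its side effect (printing, plotting, ...) and then passes the value itself along:
```{r}
library(magrittr)

# %$% exposes columns, so no $ is needed
starwars %$% mean(height, na.rm = TRUE)

# %T>% prints 16 as a side effect; 16 itself then continues down the pipe
16 %T>% print() %>% sqrt()
```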
## The `{tidyverse}`'s *enfant prodige*: `{dplyr}`
The best way to get started with the tidyverse packages is to get to know `{dplyr}`. `{dplyr}`
provides a lot of very useful functions that make it very easy to get descriptive statistics or
add new columns to your data.
### A first taste of data manipulation with `{dplyr}`
This section will walk you through a typical analysis using `{dplyr}` functions. Just go with it; I
will give more details in the next sections.
First, let's load `{dplyr}` and the included `starwars` dataset. Let's also take a look at the
first few lines of the dataset:
```{r}
library(dplyr)
data(starwars)
head(starwars)
```
`data(starwars)` loads the example dataset called `starwars` that is included in the package
`{dplyr}`. As I said earlier, this is just an example; you could have loaded an external dataset,
from a `.csv` file for instance. This does not matter for what comes next.
Like we saw earlier, R includes a lot of functions for descriptive statistics, such as `mean()`,
`sd()`, `cov()`, and many more. What `{dplyr}` brings to the table is a grammar of data
manipulation that makes it very easy to apply descriptive statistics functions, or any other
function, by subgroup.
Just like before, we are going to compute the average height by `sex`:
```{r}
starwars %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
```
The very nice thing about using `%>%` and `{dplyr}` verbs/functions, is that this is really
readable. The above three lines can be translated like so in English:
*Take the starwars dataset, then group by sex, then compute the mean height (for each subgroup) by
omitting missing values.*
`%>%` can be translated by "then". Without `%>%` you would need to change the code to:
```{r}
summarise(group_by(starwars, sex), mean(height, na.rm = TRUE))
```
Unlike the *base* approach, each function does only one thing. With base R, `aggregate()` was
also used to define the subgroups. This is not the case with `{dplyr}`: one function creates the
groups (`group_by()`) and another computes the summaries (`summarise()`). Also, `group_by()`
creates a specific subgroup for individuals where `sex` is missing. This is the last line in the
data frame, where `sex` is `NA`. Another nice thing is that you can name the column containing the
average height; I chose to name it `mean_height`.
Now, let's suppose that we want to filter some data first:
```{r}
starwars %>%
filter(gender == "masculine") %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
```
Again, the `%>%` makes the above lines of code very easy to read. Without it, one would need to
write:
```{r}
summarise(group_by(filter(starwars, gender == "masculine"), sex), mean(height, na.rm = TRUE))
```
I think you agree with me that this is not very readable. One way to make it more readable would
be to save intermediary variables:
```{r}
filtered_data <- filter(starwars, gender == "masculine")
grouped_data <- group_by(filtered_data, sex)
summarise(grouped_data, mean(height, na.rm = TRUE))
```
But this can get very tedious. Once you're used to `%>%`, you won't go back to not using it.
Before continuing, and to make things clearer: `filter()`, `group_by()` and `summarise()` are
functions that are included in `{dplyr}`. `%>%` is actually a function from `{magrittr}`, but this
package gets loaded on the fly when you load `{dplyr}`, so you do not need to worry about it.
The result of all these operations that use `{dplyr}` functions is another dataset, or
`tibble`. This means that you can save it in a variable, or write it to disk, and then work with
it like any other dataset.
```{r}
mean_height <- starwars %>%
group_by(sex) %>%
summarise(mean(height))
class(mean_height)
head(mean_height)
```
You could then write this data to disk using `rio::export()` for instance. If you need more than
the mean of the height, you can keep adding as many functions as needed (another advantage over
`aggregate()`):
```{r}
summary_table <- starwars %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n())
summary_table
```
I've added more functions, namely `var()`, to get the variance of height, and `n()`, which
is a function from `{dplyr}`, not base R, that returns the number of observations. This is quite useful,
because we see that there is a group with only one individual. Let's focus on the
sexes for which we have more than one individual. Since we saved all the previous operations (which
produce a `tibble`) in a variable, we can keep going from there:
```{r}
summary_table2 <- summary_table %>%
filter(n_obs > 1)
summary_table2
```
As mentioned before, there are a lot of `NA`s; this is because, by default, `mean()` and `var()`
return `NA` if even a single observation is `NA`. This is good, because it forces you to look at
the data to see what is going on. If you got a number back even though there were `NA`s, you could
very easily miss these missing values. It is better for functions to fail early and often than the
opposite. This is why we keep using `na.rm = TRUE` for `mean()` and `var()`.
Now let's actually take a look at the rows where `sex` is `NA`:
```{r}
starwars %>%
filter(is.na(sex))
```
There are only 4 rows where `sex` is `NA`. Let's ignore them:
```{r}
starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
```
And why not compute the same table, but first add another stratifying variable?
```{r}
starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex, eye_color) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
```
Ok, that's it for a first taste. We have already discovered some very useful `{dplyr}` functions:
`filter()`, `group_by()` and `summarise()`.
Now, we are going to learn more about these functions in more detail.
### Filter the rows of a dataset with `filter()`
We're going to use the `Gasoline` dataset from the `plm` package, so install that first:
```{r, eval = FALSE}
install.packages("plm")
```
Then load the required data:
```{r}
data(Gasoline, package = "plm")
```
and load dplyr:
```{r}
library(dplyr)
```
This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load
the data like this, it is a standard `data.frame`. `{dplyr}` functions can be used on standard
`data.frame` objects, but also on `tibble`s. `tibble`s are just like data frames, but with a better
print method (and other niceties). I'll discuss the `{tibble}` package later, but for now, let's
convert the data to a `tibble`, change its name, and also transform the `country` column to
lower case:
```{r}
gasoline <- as_tibble(Gasoline)
gasoline <- gasoline %>%
mutate(country = tolower(country))
```
`filter()` is pretty straightforward. What if you would like to subset the data to focus on the
year 1969? Simple:
```{r}
filter(gasoline, year == 1969)
```
Let's use `%>%`, since we're familiar with it now:
```{r}
gasoline %>%
filter(year == 1969)
```
You can also filter more than just one year, by using the `%in%` operator:
```{r}
gasoline %>%
filter(year %in% seq(1969, 1973))
```
It is also possible to use `between()`, a helper function:
```{r}
gasoline %>%
filter(between(year, 1969, 1973))
```
To select non-consecutive years:
```{r}
gasoline %>%
filter(year %in% c(1969, 1973, 1977))
```
`%in%` tests if an object is part of a set.
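A quick standalone illustration:
```{r}
# TRUE for each element on the left that appears in the set on the right
c(1969, 1977, 2023) %in% c(1969, 1973, 1977)
```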
### Select columns with `select()`
While `filter()` allows you to keep or discard rows of data, `select()` allows you to keep or
discard entire columns. To keep columns:
```{r}
gasoline %>%
select(country, year, lrpmg)
```
To discard them:
```{r}
gasoline %>%
select(-country, -year, -lrpmg)
```
To rename them:
```{r}
gasoline %>%
select(country, date = year, lrpmg)
```
There's also `rename()`:
```{r}
gasoline %>%
rename(date = year)
```
`rename()` does not do any kind of selection, but just renames.
You can also use `select()` to re-order columns:
```{r}
gasoline %>%
select(year, country, lrpmg, everything())
```
`everything()` is a helper function, and there's also `starts_with()` and `ends_with()`. For
example, what if we are only interested in columns whose names start with "l"?
```{r}
gasoline %>%
select(starts_with("l"))
```
`ends_with()` works in a similar fashion. There is also `contains()`:
```{r}
gasoline %>%
select(country, year, contains("car"))
```
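And here is a quick sketch with `ends_with()`, keeping the columns whose names end with "p":
```{r}
gasoline %>%
  select(ends_with("p"))
```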
You can read more about these helper functions [here](https://tidyselect.r-lib.org/reference/language.html), but we're going to look more into
them in a coming section.
Another verb, similar to `select()`, is `pull()`. Let's compare the two:
```{r}
gasoline %>%
select(lrpmg)
```
```{r}
gasoline %>%
pull(lrpmg) %>%
head() # using head() because there are 342 elements in total
```
`pull()`, unlike `select()`, does not return a `tibble`, but only the column you want, as a
vector.
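To make the difference explicit, compare the classes of the two results:
```{r}
# select() keeps the tibble structure
gasoline %>%
  select(lrpmg) %>%
  class()

# pull() extracts the column as a plain vector
gasoline %>%
  pull(lrpmg) %>%
  class()
```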
### Group the observations of your dataset with `group_by()`
`group_by()` is a very useful verb; as the name implies, it allows you to create groups and then,
for example, compute descriptive statistics by groups. For example, let's group our data by
country:
```{r}
gasoline %>%
group_by(country)
```
It looks like nothing much happened, but if you look at the second line of the output you can read
the following:
```{r}
## # Groups: country [18]
```
this means that the data is grouped, and every computation you will do now will take these groups
into account. It is also possible to group by more than one variable:
```{r}
gasoline %>%
group_by(country, year)
```
and so on. You can then also ungroup:
```{r}
gasoline %>%
group_by(country, year) %>%
ungroup()
```
Once your data is grouped, the operations that will follow will be executed inside each group.
### Get summary statistics with `summarise()`
Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For
example, one might want to compute the average gasoline consumption in each country, for
the whole period:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean(lgaspcar))
```
`mean()` was given as an argument to `summarise()`, which is a `{dplyr}` verb. What we get is
another `tibble`, that contains the variable we used to group, as well as the average per country.
We can also rename this column:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar))
```
and because the output is a `tibble`, we can continue to use `{dplyr}` verbs on it:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar)) %>%
filter(country == "france")
```
`summarise()` is a very useful verb. For example, we can compute several descriptive statistics at once:
```{r}
gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))
```
Because the output is a `tibble`, you can save it in a variable of course:
```{r}
desc_gasoline <- gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))
```
And then you can answer questions such as, *which country has the maximum average gasoline
consumption?*:
```{r}
desc_gasoline %>%
filter(max(mean_gaspcar) == mean_gaspcar)
```
Turns out it's Turkey. What about the minimum consumption?
```{r}
desc_gasoline %>%
filter(min(mean_gaspcar) == mean_gaspcar)
```
Because the output of `{dplyr}` verbs is a tibble, it is possible to continue working with it. This
is one shortcoming of using the base `summary()` function. The object returned by that function is
not very easy to manipulate.
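To see what is meant here, consider what base `summary()` returns for a single column; the result is a `summaryDefault` object, not a data frame or `tibble`, so you cannot simply `filter()` it or pipe it into further `{dplyr}` verbs:
```{r}
base_summary <- summary(gasoline$lgaspcar)

class(base_summary)
base_summary
```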
### Adding columns with `mutate()` and `transmute()`
`mutate()` adds a column to the `tibble`, which can contain any transformation of any other
variable:
```{r}
gasoline %>%
group_by(country) %>%
mutate(n())
```
Using `mutate()`, I've added a column that counts how many times each country appears in the `tibble`,
using `n()`, another `{dplyr}` function. There are also `count()` and `tally()`, which we are going to
see further down. It is also possible to name the column on the fly:
```{r}
gasoline %>%
group_by(country) %>%
mutate(count = n())
```
It is possible to do any arbitrary operation:
```{r}
gasoline %>%
group_by(country) %>%
mutate(spam = exp(lgaspcar + lincomep))
```
`transmute()` is the same as `mutate()`, but only returns the created variable:
```{r}
gasoline %>%
group_by(country) %>%
transmute(spam = exp(lgaspcar + lincomep))
```
### Joining `tibble`s with `full_join()`, `left_join()`, `right_join()` and all the others
I will end this section on `{dplyr}` with some very useful verbs: the `*_join()` verbs. Let's
start by loading another dataset from the `plm` package, `SumHes`, and let's convert it to a `tibble`
and rename it:
```{r, include = FALSE}
gasoline <- as_tibble(Gasoline) %>%
mutate(country = tolower(country))
```
```{r}
data(SumHes, package = "plm")
pwt <- SumHes %>%
as_tibble() %>%
mutate(country = tolower(country))
```
Let's take a quick look at the data:
```{r}
glimpse(pwt)
```
We can merge both `gasoline` and `pwt` by country and year, as these two variables are common to
both datasets. There are more countries and years in the `pwt` dataset, so when merging the two,
depending on which function you use, you will either have `NA`s for the variables where there is
no match, or rows that will be dropped. Let's start with `full_join()`:
```{r}
gas_pwt_full <- gasoline %>%
full_join(pwt, by = c("country", "year"))
```
Let's see which countries and years are included:
```{r}
gas_pwt_full %>%
count(country, year)
```
As you see, every country and year was included, but what happened for, say, the U.S.S.R? This country
is in `pwt` but not in `gasoline` at all:
```{r}
gas_pwt_full %>%
filter(country == "u.s.s.r.")
```
As you probably guessed, the variables from `gasoline` that are not included in `pwt` are filled with
`NA`s. One could remove all these lines and only keep countries for which these variables are not
`NA` everywhere with `filter()`, but there is a simpler solution:
```{r}
gas_pwt_inner <- gasoline %>%
inner_join(pwt, by = c("country", "year"))
```
Let's use `tabyl()` from the `{janitor}` package, which is a very nice alternative to the `table()`
function from base R:
```{r}
library(janitor)
gas_pwt_inner %>%
tabyl(country)
```
Only countries with values in both datasets were returned. It's almost every country from `gasoline`,
apart from Germany (called "germany west" in `pwt` and "germany" in `gasoline`; I left it as is to
provide an example of a country whose name does not match across datasets). Let's also look at the variables:
```{r}
glimpse(gas_pwt_inner)
```
The variables from both datasets are in the joined data.
Contrast this to `semi_join()`:
```{r}
gas_pwt_semi <- gasoline %>%
semi_join(pwt, by = c("country", "year"))
glimpse(gas_pwt_semi)
gas_pwt_semi %>%
tabyl(country)
```
Only columns of `gasoline` are returned, and only rows of `gasoline` that were matched with rows
from `pwt`. `semi_join()` is not a commutative operation:
```{r}
pwt_gas_semi <- pwt %>%
semi_join(gasoline, by = c("country", "year"))
glimpse(pwt_gas_semi)
pwt_gas_semi %>%
tabyl(country)
```
The rows are the same, but not the columns.
`left_join()` and `right_join()` return all the rows from either the dataset that is on the
"left" (the first argument of the function) or on the "right" (the second argument of the
function), but all columns from both datasets. So depending on which countries you're interested in,
you're going to use either one of these functions:
```{r}
gas_pwt_left <- gasoline %>%
left_join(pwt, by = c("country", "year"))
gas_pwt_left %>%
tabyl(country)
```
```{r}
gas_pwt_right <- gasoline %>%
right_join(pwt, by = c("country", "year"))
gas_pwt_right %>%
tabyl(country) %>%
head()
```
The last merge function is `anti_join()`:
```{r}
gas_pwt_anti <- gasoline %>%
anti_join(pwt, by = c("country", "year"))
glimpse(gas_pwt_anti)
gas_pwt_anti %>%
tabyl(country)
```
`gas_pwt_anti` has the columns of the `gasoline` dataset, and contains only the country from `gasoline`
that is not in `pwt`: "germany".
That was it for the basic `{dplyr}` verbs. Next, we're going to learn about `{tidyr}`.
## Reshaping and sprucing up data with `{tidyr}`
Note: this section is going to be a lot harder than anything you've seen until now. Reshaping
data is tricky, and to really grok it, you need time, and you need to run each line, and see what
happens. Take your time, and don't be discouraged.
Another important package from the `{tidyverse}` that goes hand in hand with `{dplyr}` is `{tidyr}`.
`{tidyr}` is the package you need when it's time to reshape data.
I will start by presenting `pivot_wider()` and `pivot_longer()`.
### `pivot_wider()` and `pivot_longer()`