
7 Defining your own functions

7.1 Control flow

7.1.1 If-else

    Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical problem that requires the if ... else ... construct. For instance:

a <- 4
b <- 5

    Suppose that if a > b then f should be equal to 20, else f should be equal to 10. Using if ... else ... you can achieve this like so:

if (a > b) {
  f <- 20
} else {
  f <- 10
}

    Obviously, here f = 10. Another way to achieve this is by using the ifelse() function:

f <- ifelse(a > b, 20, 10)

if...else... and ifelse() might seem interchangeable, but they’re not. ifelse() is vectorized, while if...else... is not. Let’s try the following:

ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no")
## [1] "no"  "yes" "yes"

    The result is a vector. Now, let’s see what happens if we use if...else... instead of ifelse():

if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no")

> Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") : 
  the condition has length > 1

This results in an error (in previous R versions, only the first element of the vector would get used). We have already discussed this in Chapter 2, remember? If you want to make sure that such an expression evaluates to TRUE, then you need to use all():

ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater")
## [1] "not all elements are greater"

    You may also remember the any() function:

ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater")
## [1] "at least one element is greater"

    These are the basics. But sometimes, you might need to test for more complex conditions, which can lead to using nested if...else... constructs. These, however, can get messy:

if (10 %% 3 == 0) {
  print("10 is divisible by 3")
} else if (10 %% 2 == 0) {
  print("10 is divisible by 2")
}
## [1] "10 is divisible by 2"

10 being obviously divisible by 2 and not 3, it is the second sentence that will be printed. The %% operator is the modulus operator, which gives the remainder of the division of 10 by 2.
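For example:

10 %% 3
## [1] 1
10 %% 2
## [1] 0

In such cases, it is easier to use dplyr::case_when():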

case_when(10 %% 3 == 0 ~ "10 is divisible by 3",
          10 %% 2 == 0 ~ "10 is divisible by 2")
## [1] "10 is divisible by 2"

    We have already encountered this function in Chapter 4, inside a dplyr::mutate() call to create a new column.

    Let’s now discuss loops.


    7.1.2 For loops

    For loops make it possible to repeat a set of instructions i times. For example, try the following:

for (i in 1:10){
  print("hello")
}
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
It is also possible to do computations using for loops. Let’s compute the sum of the first 100 integers:

result <- 0
for (i in 1:100){
  result <- result + i
}

print(result)
    ## [1] 5050

result is equal to 5050, the expected result. What happened in that loop? First, we defined a variable called result and set it to 0. Then, when the loop starts, i equals 1, so we add it to result. At the next iteration, i equals 2, which gets added to result, and so on until i equals 100. Another way of doing this is by using Reduce():

Reduce(`+`, seq(1, 100))
    ## [1] 5050

    We will see how Reduce() works in greater detail in the next chapter, but what happened was something like this:

Reduce(`+`, seq(1, 100)) = 
1 + Reduce(`+`, seq(2, 100)) = 
1 + 2 + Reduce(`+`, seq(3, 100)) = 
1 + 2 + 3 + ... + 100

7.1.3 While loops

    While loops are very similar to for loops. The instructions inside a while loop are repeated while a certain condition holds true. Let’s consider the sum of the first 100 integers again:

result <- 0
i <- 1
while (i <= 100){
  result <- result + i
  i <- i + 1
}

print(result)
    ## [1] 5050

Here, we first set result to 0 and i to 1. Then, while i is less than or equal to 100, we add i to result. Notice that there is one more line than in the for loop version of this code: we need to increment the value of i ourselves; if we did not, i would forever stay at 1 and the loop would never stop running.

7.2 Writing your own functions

7.2.1 Declaring functions in R

    Suppose you want to create the following function: \(f(x) = \dfrac{1}{\sqrt{x}}\). Writing this in R is quite simple:

my_function <- function(x){
  1/sqrt(x)
}

The argument of the function, x, gets passed to function() and the body of the function (more on that in the next chapter) contains the function’s definition. Of course, you could define functions that use more than one input:

my_function <- function(x, y){
  1/sqrt(x + y)
}

    or inputs with names longer than one character:

my_function <- function(argument1, argument2){
  1/sqrt(argument1 + argument2)
}

    Functions written by the user get called just the same way as functions included in R:

my_function(1, 10)
## [1] 0.3015113

    It is also possible to provide default values to the function’s arguments, which are values that are used if the user omits them:

my_function <- function(argument1, argument2 = 10){
  1/sqrt(argument1 + argument2)
}

my_function(1)
## [1] 0.3015113

    This is especially useful for functions with many arguments. Consider also the following example, where the function has a default method:

my_function <- function(argument1, argument2, method = "foo"){
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}

my_function(10, 11)
## [1] 0.2182179

my_function(10, 11, "bar")
## [1] "this is a string"

    As you see, depending on the “method” chosen, the returned result is either a numeric, or a string. What happens if the user provides a “method” that is neither “foo” nor “bar”?

my_function(10, 11, "spam")

As you can see, nothing happens. It is possible to add safeguards to your function to avoid such situations:

my_function <- function(argument1, argument2, method = "foo"){
  
  if(!(method %in% c("foo", "bar"))){
    return("Method must be either 'foo' or 'bar'")
  }
  
  x <- argument1 + argument2
  
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}

my_function(10, 11)
## [1] 0.2182179

my_function(10, 11, "bar")
## [1] "this is a string"

my_function(10, 11, "foobar")
## [1] "Method must be either 'foo' or 'bar'"

Notice that I have used return() inside my first if statement. This is to immediately stop the evaluation of the function and return a value. If I had omitted it, evaluation would have continued with the rest of the body of the function.

They have one limitation though (which is shared with R’s native functions): just like in math, they can only return one value. However, sometimes, you may need to return more than one value. To be able to do this, you must put your values in a list, and return the list of values. For example:

average_and_sd <- function(x){
  c(mean(x), sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

    You’re still returning a single object, but it’s a vector. You can also return a named list:

average_and_sd <- function(x){
  list("mean_x" =  mean(x), "sd_x" = sd(x))
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## $mean_x
## [1] 7.166667
## 
## $sd_x
## [1] 4.262237

    As described before, you can use return() at the end of your functions:

average_and_sd <- function(x){
  result <- c(mean(x), sd(x))
  return(result)
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

    But this is only needed if you need to return a value early:

average_and_sd <- function(x){
  if(any(is.na(x))){
    return(NA)
  } else {
    c(mean(x), sd(x))
  }
}

average_and_sd(c(1, 3, 8, 9, 10, 12))
## [1] 7.166667 4.262237

average_and_sd(c(1, 3, NA, 9, 10, 12))
## [1] NA

If you need to use a function from a package inside your function, use the :: syntax:

my_sum <- function(a_vector){
  purrr::reduce(a_vector, `+`)
}

However, if you need to use more than one function, this can become tedious. A quick and dirty way of doing this is to use library(package_name) inside the function:

my_sum <- function(a_vector){
  library(purrr)
  reduce(a_vector, `+`)
}

Loading the library inside the function has the advantage that you will be sure that the package upon which your function depends will be loaded. If the package is already loaded, it will not be loaded again, and thus will not impact performance; but if you forgot to load it at the beginning of your script, then, no worries, your function will load it the first time you use it! However, you should avoid doing this, because the resulting function is not pure: it has a side effect, which is loading a library. This could result in problems, especially if several functions load several different packages that have functions with the same name. Depending on which function runs first, a function with the same name, but coming from a different package, will be available in the global environment. The very best way would be to write your own package and declare the packages upon which your functions depend as dependencies. This is something we are going to explore in Chapter 9.
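To make the danger concrete, here is a small sketch of my own (it assumes {dplyr} is installed; both {dplyr} and the base {stats} package export a function called filter()):

my_filter <- function(dataset){
  library(dplyr)  # side effect: attaches {dplyr} for the whole session
  filter(dataset, am == 1)
}

my_filter(mtcars)
# From now on, filter() resolves to dplyr::filter() everywhere in the
# session, masking stats::filter(), even outside of my_filter().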

You can put a lot of instructions inside a function, such as loops. Let’s create a function that returns Fibonacci numbers.

7.2.2 Fibonacci numbers

\[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\]

Each subsequent number is composed of the sum of the two preceding ones. In R, it is possible to define a function that returns the \(n^{th}\) Fibonacci number:

my_fibo <- function(n){
  a <- 0
  b <- 1
  for (i in 1:n){
    temp <- b
    b <- a
    a <- a + temp
  }
  a
}

Inside the loop, we defined a variable called temp. Defining temporary variables is usually very useful: here, temp holds the current value of b before b gets overwritten, so that a can be updated to the sum of the two preceding numbers. Let’s try to understand what happens inside this loop:
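Tracing the first iterations by hand, with the values each variable takes (this follows directly from the code above):

# Before the loop: a = 0, b = 1
# i = 1: temp <- 1; b <- 0; a <- 0 + 1 = 1
# i = 2: temp <- 0; b <- 1; a <- 1 + 0 = 1
# i = 3: temp <- 1; b <- 1; a <- 1 + 1 = 2
# i = 4: temp <- 1; b <- 2; a <- 2 + 1 = 3
# a thus walks through the Fibonacci sequence: 1, 1, 2, 3, 5, ...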

It is also possible to define this function recursively:

fibo_recur <- function(n){
  if (n == 0 || n == 1){
    return(n)
  } else {
    fibo_recur(n-1) + fibo_recur(n-2)
  }
}

This algorithm should be easier to understand: if n = 0 or n = 1 the function should return n (0 or 1). If n is strictly bigger than 1, fibo_recur() should return the sum of fibo_recur(n-1) and fibo_recur(n-2). This version of the function is very much the same as the mathematical definition of the Fibonacci sequence. So why not use only recursive algorithms then? Try to run the following:

system.time(my_fibo(30))
##    user  system elapsed 
##   0.007   0.000   0.007

      The result should be printed very fast (the system.time() function returns the time that it took to execute my_fibo(30)). Let’s try with the recursive version:

system.time(fibo_recur(30))
##    user  system elapsed 
##   1.460   0.037   1.498

It takes much longer to execute! Recursive algorithms are very CPU demanding, so if speed is critical, it’s best to avoid them. Also, in fibo_recur(), try to remove the line if (n == 0 || n == 1) and run fibo_recur(5) to see what happens. You should get an error: without this stopping condition, the recursion would never end.
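Here is a sketch of what that looks like (fibo_no_stop is a made-up name for illustration; calling it aborts with R’s infinite recursion error):

fibo_no_stop <- function(n){
  # no stopping condition: the function calls itself forever
  fibo_no_stop(n-1) + fibo_no_stop(n-2)
}

fibo_no_stop(5)
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?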


    7.4 Functions that take functions as arguments: writing your own higher-order functions

Functions that take functions as arguments are very powerful and useful tools. Two very important ones, which we will discuss in Chapter 8, are purrr::map() and purrr::reduce(). But you can also write your own! A very simple example would be the following:

my_func <- function(x, func){
  func(x)
}

my_func() is a very simple function that takes x and func() as arguments and that simply executes func(x). This might not seem very useful (after all, you could simply use func(x)!) but this is just for illustration purposes; in practice, your functions would be more useful than that! Let’s try to use my_func():

my_func(c(1, 8, 1, 0, 8), mean)
## [1] 3.6

    As expected, this returns the mean of the given vector. But now suppose the following:

my_func(c(1, 8, 1, NA, 8), mean)
## [1] NA

Because one element of the vector is NA, the whole mean is NA. mean() has a na.rm argument that you can set to TRUE to ignore the NAs in the vector. However, here, there is no way to provide this argument to the function mean()! Let’s see what happens when we try to:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) :
  unused argument (na.rm = TRUE)

    So what you could do is pass the value TRUE to the na.rm argument of mean() from your own function:

my_func <- function(x, func, remove_na){
  func(x, na.rm = remove_na)
}

my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE)
## [1] 4.5

This is one solution, but mean() also has another argument called trim. What if some other user needs this argument? Should you also add it to your function? Surely there’s a way to avoid this problem? Yes, there is, and it is by using the dots. The ... simply means “any other argument as needed”, and it’s very easy to use:

my_func <- function(x, func, ...){
  func(x, ...)
}

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
## [1] 4.5

    or, now, if you need the trim argument:

my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
## [1] 4.5

The ... are very useful when writing higher-order functions such as my_func(), because they allow you to pass arguments down to the underlying functions.
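The dots can even be forwarded to several underlying functions at once. A small sketch of my own (summary_stats is a made-up name):

summary_stats <- function(x, ...){
  # the same dots are passed down to both mean() and sd()
  c(mean = mean(x, ...), sd = sd(x, ...))
}

summary_stats(c(1, 8, 1, NA, 8), na.rm = TRUE)
##     mean       sd 
## 4.500000 4.041452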

7.5 Functions that return functions

It is sometimes useful to write a function that returns a modified function. This type of function is called a function factory, as it builds functions. For instance, suppose that we want to time how long functions take to run. An idea would be to proceed like this:

tic <- Sys.time()
very_slow_function(x)
toc <- Sys.time()

running_time <- toc - tic

but if you want to time several functions, this gets very tedious. It would be much easier if functions timed themselves. We could achieve this by writing a wrapper, like this:

timed_very_slow_function <- function(...){

  tic <- Sys.time()
  result <- very_slow_function(...)
  toc <- Sys.time()

  running_time <- toc - tic

  list("result" = result,
       "running_time" = running_time)

}

    The problem here is that we have to change each function we need to time. But thanks to the concept of function factories, we can write a function that does this for us:

time_f <- function(.f, ...){

  function(...){

    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()

    running_time <- toc - tic

    list("result" = result,
         "running_time" = running_time)

  }
}

    time_f() is a function that returns a function, a function factory. Calling it on a function returns, as expected, a function:

t_mean <- time_f(mean)

t_mean
## function(...){
## 
##     tic <- Sys.time()
##     result <- .f(...)
##     toc <- Sys.time()
## 
##     running_time <- toc - tic
## 
##     list("result" = result,
##          "running_time" = running_time)
## 
##   }
## <environment: 0x5572990788f8>

    This function can now be used like any other function:

output <- t_mean(seq(-500000, 500000))

    output is a list of two elements, the first being simply the result of mean(seq(-500000, 500000)), and the other being the running time.
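You can check the result element directly (the mean of a sequence symmetric around zero is exactly 0; the running_time element will of course vary between machines and runs):

output$result
## [1] 0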

    This approach is super flexible. For instance, imagine that there is an NA in the vector. This would result in the mean of this vector being NA:

t_mean(c(NA, seq(-500000, 500000)))
## $result
## [1] NA
## 
## $running_time
## Time difference of 0.006829977 secs

But because we use the ... in the definition of time_f(), we can now simply pass mean()’s na.rm option down to it:

t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE)
## $result
## [1] 0
## 
## $running_time
## Time difference of 0.01427937 secs

    7.6 Functions that take columns of data as arguments

    7.6.1 The enquo() - !!() approach

    In many situations, you will want to write functions that look similar to this:

my_function(my_data, one_column_inside_data)

Such a function would be useful in situations where you have to apply a certain number of operations to columns of different data frames. For example, if you need to create tables of descriptive statistics or graphs periodically, it might be very interesting to put these operations inside a function and then call the function whenever you need it, on a fresh batch of data.

However, if you try to write something like that, something that might seem unexpected at first will happen:

data(mtcars)

simple_function <- function(dataset, col_name){
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

    The variable col_name is passed to simple_function() as a string, but group_by() requires a variable name. So why not try to convert col_name to a name?

simple_function <- function(dataset, col_name){
  col_name <- as.name(col_name)
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}


simple_function(cars, "dist")
Error: unknown variable to group by : col_name

This is because R is literally looking for the variable "dist" somewhere in the global environment, and not as a column of the data. R does not understand that you are referring to the column "dist" that is inside the dataset.

To solve this issue, we need functions from {rlang}. As you will see, knowing some of the capabilities {rlang} provides can be incredibly useful. Take a look at the code below:

simple_function <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg))
}


simple_function(mtcars, cyl)
## # A tibble: 3 × 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

As you can see, the trick is to use enquo() on your column names and then !!() inside the {dplyr} function you want to use.

    Let’s see some other examples:

simple_function <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl))
}


simple_function(mtcars, am, 1)
##   mean_cyl
## 1 5.076923

    Notice that I’ve written:

filter((!!col_name) == value)

    and not:

filter(!!col_name == value)

I have enclosed !!col_name inside parentheses. This is because operators such as == have precedence over !!, so you have to be explicit. Also, notice that I didn’t have to quote 1. This is because it’s a standard value, not a column inside the dataset. Let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable?

simple_function <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(mean((!!mean_col)))
}


simple_function(mtcars, am, cyl, 1)
##   mean(cyl)
## 1  5.076923

    Notice that I had to quote mean_col too.

    Using the ... that we discovered in the previous section, we can pass more than one column:

simple_function <- function(dataset, ...){
  col_vars <- quos(...)
  dataset %>%
    summarise_at(vars(!!!col_vars), funs(mean, sd))
}

Because these dots contain more than one variable, you have to use quos() instead of enquo(). This will put the arguments provided via the dots in a list. Then, because we have a list of columns, we have to use summarise_at(), which you should know if you did the exercises of Chapter 4.

There you also learned what vars() and funs() are. The last thing you have to pay attention to is to use !!!() if you used quos(). So three ! instead of only two. This allows you to then do things like this:

simple_function(mtcars, am, cyl, mpg)

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948
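As the warning says, funs() has been deprecated in recent versions of {dplyr}. Here is a sketch of how the same function could be written with across() instead (my adaptation, assuming dplyr 1.0 or later; the dots are forwarded to the selection with c(...)):

simple_function <- function(dataset, ...){
  dataset %>%
    # across() selects the forwarded columns and applies a named
    # list of functions to each of them
    summarise(across(c(...), list(mean = mean, sd = sd)))
}

simple_function(mtcars, am, cyl, mpg)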

    Using ... with !!!() allows you to write very flexible functions.

    If you need to be even more general, you can also provide the summary functions as arguments of your function, but you have to rewrite your function a little bit:

simple_function <- function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}

You might be wondering where the quos() went. Well, because now we are passing two lists (a list of columns that we have to quote, and a list of functions that we also have to quote) we need to use quos() when calling the function:

simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum))
##   am_mean cyl_mean mpg_mean     am_sd   cyl_sd   mpg_sd am_sum cyl_sum mpg_sum
## 1 0.40625   6.1875 20.09062 0.4989909 1.785922 6.026948     13     198   642.9

This works, but I don’t think you’ll need to have that much flexibility; either the columns are variables, or the functions, but rarely both at the same time.

To conclude this section, I should also talk about as_label(), which allows you to change the name of a variable, for instance if you want to call the resulting column mean_mpg when you compute the mean of the mpg column:

simple_function <- function(dataset, filter_col, mean_col, value){

  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", as_label(mean_col))
  
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!(mean_name) := mean((!!mean_col)))
}

    Pay attention to the := operator in the last line. This is needed when using as_label().
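As a quick check of my own (with mtcars, the mean mpg of the cars where am == 1 is 24.39231):

simple_function(mtcars, am, mpg, 1)
##   mean_mpg
## 1 24.39231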

7.6.2 Curly Curly, a simplified approach to enquo() and !!()

    Let’s suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments, just like before:

how_many_na <- function(dataframe, column_name){
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}

    Let’s try this function out on the starwars data:

data(starwars)

head(starwars)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld

    As you can see, there are missing values in the hair_color column. Let’s try to count how many missing values are in this column:

how_many_na(starwars, hair_color)
Error: object 'hair_color' not found

    Just as expected, this does not work. The issue is that the column is inside the dataframe, but when calling the function with hair_color as the second argument, R is looking for a variable called hair_color that does not exist. What about trying with "hair_color"?

how_many_na(starwars, "hair_color")
## # A tibble: 1 × 1
##       n
##   <int>
## 1     0

    Now we get something, but something wrong!

One way to solve this issue is to not use the filter() function, and instead rely on base R:

how_many_na_base <- function(dataframe, column_name){
  na_index <- is.na(dataframe[, column_name])
  nrow(dataframe[na_index, column_name])
}

how_many_na_base(starwars, "hair_color")
## [1] 5

    This works, but not using the {tidyverse} at all is not always an option. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}:

summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by(grouping_var) %>%
    summarise(mean(column_name, na.rm = TRUE))
}

    Calling this function results in the following error message, as expected:

    Error: Column `grouping_var` is unknown

    In the previous section, we solved the issue like so:

summarise_groups <- function(dataframe, grouping_var, column_name){

  grouping_var <- enquo(grouping_var)
  column_name <- enquo(column_name)
  mean_name <- paste0("mean_", as_label(column_name))

  dataframe %>%
    group_by(!!grouping_var) %>%
    summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE))
}

    The core of the function remained very similar to the version from before, but now one has to use the enquo()-!! syntax.

    Now this can be simplified using the new {{}} syntax:

summarise_groups <- function(dataframe, grouping_var, column_name){

  dataframe %>%
    group_by({{grouping_var}}) %>%
    summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE))
}

Much easier and cleaner! You still have to use the := operator instead of = for the column name, however, and if you want to modify the column names, for instance in this case returning "mean_height" instead of height, you have to keep using the enquo()-!! syntax.
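Trying the curly curly version out on the starwars data (a call of my own, for illustration):

summarise_groups(starwars, sex, height)
# one row per value of sex, with height now holding the group averages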


    7.7 Functions that use loops

It is entirely possible to put a loop inside a function. For example, consider the following function that returns the square root of a number using Newton’s algorithm:

sqrt_newton <- function(a, init = 1, eps = 0.01){
    stopifnot(a >= 0)
    while(abs(init**2 - a) > eps){
        init <- 1/2 * (init + a/init)
    }
    init
}

This function contains a while loop inside its body. Let’s see if it works:

sqrt_newton(16)
## [1] 4.000001

In the definition of the function, I wrote init = 1 and eps = 0.01, which means that these arguments can be omitted and will take the provided values (1 and 0.01) as defaults. You can then use this function as any other, for example with map():

map(c(16, 7, 8, 9, 12), sqrt_newton)
## [[1]]
## [1] 4.000001
## 
## [[2]]
## [1] 2.645767
## 
## [[3]]
## [1] 2.828469
## 
## [[4]]
## [1] 3.000092
## 
## [[5]]
## [1] 3.464625

7.8 Anonymous functions

    As the name implies, anonymous functions are functions that do not have a name. These are useful inside functions that have functions as arguments, such as purrr::map() or purrr::reduce():

map(c(1,2,3,4), function(x){1/sqrt(x)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

Anonymous functions can also be written using the tilde, ~, with . standing for the function’s argument:

map(c(1,2,3,4), ~{1/sqrt(.)})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

Anonymous functions of two arguments work the same way, for example with map2():

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), function(x, y){(x**2)/y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

With the tilde notation, the two arguments are .x and .y:

map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y})
## [[1]]
## [1] 0.1111111
## 
## [[2]]
## [1] 0.5
## 
## [[3]]
## [1] 1.285714
## 
## [[4]]
## [1] 2.666667
## 
## [[5]]
## [1] 5

Since R 4.1, there is also a shorter built-in syntax for anonymous functions, \(x):

map(c(1,2,3,4), \(x)(1/sqrt(x)))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.7071068
## 
## [[3]]
## [1] 0.5773503
## 
## [[4]]
## [1] 0.5

Exercise 3

• Write a function that takes an integer as argument, and prints "Fizz" or "Buzz" up to that integer.

Exercise 4

    • Fizz Buzz 2: Same as above, but now add this third condition: if a number is both divisible by 3 and 5, print "FizzBuzz".

    • Write a function that takes an integer as argument, and prints Fizz, Buzz or FizzBuzz up to that integer.


4.7.2.1 Defining dates, the tidy way

{lubridate} provides several helper functions, and they can handle a lot of different formats for dates. In our case, having the name of the month instead of the number might seem quite problematic, but it turns out that this is a case that {lubridate} handles painlessly:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

independence <- independence %>%
  mutate(independence_date = dmy(independence_date))
## Warning: 5 failed to parse.

Some dates failed to parse, for instance for Morocco. This is because these countries have several independence dates, which means that the string to convert contains more than one date, and dmy() cannot know which one to use.


    4.7.2.2 Data manipulation with dates

    Let’s take a look at the data now:

independence
## # A tibble: 54 × 6
##    country      colonial_name                 colon…¹ independ…² first…³ indep…⁴
##    <chr>        <chr>                         <chr>   <date>     <chr>   <chr>  

    As you can see, we now have a date column in the right format. We can now answer questions such as Which countries gained independence before 1960? quite easily, by using the functions year(), month() and day(). Let’s see which countries gained independence before 1960:

independence %>%
  filter(year(independence_date) <= 1960) %>%
  pull(country)
##  [1] "Liberia"                          "South Africa"                    
##  [3] "Egypt"                            "Eritrea"                         
##  [5] "Libya"                            "Sudan"                           

You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work on it. This is the same for month() or day(). Let’s try to see if countries gained their independence on Christmas Eve:

independence %>%
  filter(month(independence_date) == 12,
         day(independence_date) == 24) %>%
  pull(country)
## [1] "Libya"

Seems like Libya was the only one! You can also operate on dates. For instance, let’s compute the difference between two dates, using the interval() function:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since)
## # A tibble: 54 × 2
##    country      independent_since             
##    <chr>        <Interval>                    
##  1 Liberia      1847-07-26 UTC--2022-10-16 UTC
##  2 South Africa 1910-05-31 UTC--2022-10-16 UTC
##  3 Egypt        1922-02-28 UTC--2022-10-16 UTC
##  4 Eritrea      1947-02-10 UTC--2022-10-16 UTC
##  5 Libya        1951-12-24 UTC--2022-10-16 UTC
##  6 Sudan        1956-01-01 UTC--2022-10-16 UTC
##  7 Tunisia      1956-03-20 UTC--2022-10-16 UTC
##  8 Morocco      NA--NA                        
##  9 Ghana        1957-03-06 UTC--2022-10-16 UTC
## 10 Guinea       1958-10-02 UTC--2022-10-16 UTC
     ## # … with 44 more rows

    The independent_since column now contains an interval object that we can convert to years:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since) %>%
  mutate(years_independent = as.numeric(independent_since, "years"))
## # A tibble: 54 × 3
##    country      independent_since              years_independent
##    <chr>        <Interval>                                 <dbl>
##  1 Liberia      1847-07-26 UTC--2022-10-16 UTC             175. 
##  2 South Africa 1910-05-31 UTC--2022-10-16 UTC             112. 
##  3 Egypt        1922-02-28 UTC--2022-10-16 UTC             101. 
##  4 Eritrea      1947-02-10 UTC--2022-10-16 UTC              75.7
##  5 Libya        1951-12-24 UTC--2022-10-16 UTC              70.8
##  6 Sudan        1956-01-01 UTC--2022-10-16 UTC              66.8
##  7 Tunisia      1956-03-20 UTC--2022-10-16 UTC              66.6
##  8 Morocco      NA--NA                                      NA  
##  9 Ghana        1957-03-06 UTC--2022-10-16 UTC              65.6
## 10 Guinea       1958-10-02 UTC--2022-10-16 UTC              64.0

    We can now see for how long the last country to gain independence has been independent. Because the data is not tidy (in some cases, an African country was colonized by two powers, see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

independence %>%
  filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  mutate(years_independent = as.numeric(independent_since, "years")) %>%
  group_by(colonial_power) %>%
  summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
## # A tibble: 4 × 2
##   colonial_power last_colony_independent_for
##   <chr>                                <dbl>

    4.7.2.3 Arithmetic with dates

Adding or subtracting days to dates is quite easy:

ymd("2018-12-31") + 16
## [1] "2019-01-16"

    It is also possible to be more explicit and use days():

ymd("2018-12-31") + days(16)
## [1] "2019-01-16"

    To add years, you can use years():

ymd("2018-12-31") + years(1)
## [1] "2019-12-31"

    But you have to be careful with leap years:

ymd("2016-02-29") + years(1)
## [1] NA

    Because 2017 is not a leap year, the above computation returns NA. The same goes for months with a different number of days:

ymd("2018-12-31") + months(2)
## [1] NA

    The way to solve these issues is to use the special %m+% infix operator:

ymd("2016-02-29") %m+% years(1)
## [1] "2017-02-28"

    and for months:

ymd("2018-12-31") %m+% months(2)
## [1] "2019-02-28"

    {lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate} is a package that you have to add to your toolbox.

4.7.3 Manipulate strings with {stringr}

4.7.3.1 Getting text data into Rstudio

    First of all, let us read in the file:

winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt")

    Even though the file is an XML file, I still read it in using read_lines() and not read_xml() from the {xml2} package. This is for the purposes of the current exercise, and also because I always have trouble with XML files, and prefer to treat them as simple text files, and use regular expressions to get what I need.

    Now that the ALTO file is read in and saved in the winchester variable, you might want to print the whole thing in the console. Before that, take a look at the structure:

str(winchester)
##  chr [1:43] "" ...

    So the winchester variable is a character atomic vector with 43 elements. So first, we need to understand what these elements are. Let’s start with the first one:

winchester[1]
## [1] ""

    Ok, so it seems like the first element is part of the header of the file. What about the second one?

winchester[2]
    ## [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"><base href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id=\"bN015htcoyT__google-cache-hdr\"><div><span>This is Google's cache of <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml</a>.</span>&nbsp;<span>It is a snapshot of the page as it appeared on 21 Jan 2019 05:18:18 GMT.</span>&nbsp;<span>The <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">current page</a> could have changed in the meantime.</span>&nbsp;<a href=\"http://support.google.com/websearch/bin/answer.py?hl=en&amp;p=cached&amp;answer=1687222\"><span>Learn more</span>.</a></div><div><span style=\"display:inline-block;margin-top:8px;margin-right:104px;white-space:nowrap\"><span style=\"margin-right:28px\"><span style=\"font-weight:bold\">Full version</span></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=1&amp;vwsrc=0\"><span>Text-only version</span></a></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=0&amp;vwsrc=1\"><span>View source</span></a></span></span></div><span style=\"display:inline-block;margin-top:8px;color:#717171\"><span>Tip: To quickly find your search term on this page, press <b>Ctrl+F</b> or <b>⌘-F</b> (Mac) and use the find bar.</span></span></div><div style=\"position:relative;\"><?xml version=\"1.0\" encoding=\"UTF-8\"?>"

Same. So where is the content? The file is very large, so if you print it in the console, it will take quite some time to print, and you will not really be able to make out anything. The best way is to first detect in which elements of the vector the content we are interested in can be found.

4.7.3.2 Detecting, getting the position and locating strings

When confronted with an atomic vector of strings, you might want to know inside which elements you can find certain strings. For example, to know which elements of winchester contain the string CONTENT, use str_detect():

winchester %>%
  str_detect("CONTENT")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This is fine here, but imagine that the vector had 24192 elements? And hundreds would contain the string CONTENT? It would be easier to instead have the indices of the vector where one can find the word CONTENT. This is possible with str_which():

winchester %>%
  str_which("CONTENT")
## [1] 43

    Here, the result is 43, meaning that the 43rd element of winchester contains the string CONTENT somewhere. If we need more precision, we can use str_locate() and str_locate_all(). To explain how both these functions work, let’s create a very small example:

ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius")

    Now suppose I am interested in philosophers whose name ends in us. Let us use str_locate() first:

ancient_philosophers %>%
  str_locate("us")
    ##      start end
     ## [1,]    NA  NA
     ## [2,]    NA  NA
For epictetus, for instance, the match starts at position 8 and ends at position 9. Same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both ending with us, yet str_locate() only shows the position of the us in Marcus.

    To get both us strings, you need to use str_locate_all():

ancient_philosophers %>%
  str_locate_all("us")
    ## [[1]]
     ##      start end
     ## 
I will not go through the whole output; what matters is that you know how str_locate() and str_locate_all() work.
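As an aside: if the goal is really to keep only the philosophers whose names end in us, anchoring the regular expression with $ does the trick. This is a small sketch that is not part of the original text, but it only uses str_detect(), which we have already met:

ancient_philosophers %>%
  str_detect("us$") # TRUE only for epictetus, epicurus and marcus aurelius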

So now that we know what interests us in the 43rd element of winchester, let's take a closer look at it:

winchester[43]

    As you can see, it’s a mess:

    <TextLine HEIGHT=\"126.0\" WIDTH=\"1731.0\" HPOS=\"17160.0\" VPOS=\"21252.0\"><String HEIGHT=\"114.0\" WIDTH=\"354.0\" HPOS=\"17160.0\" VPOS=\"21264.0\" CONTENT=\"0tV\" WC=\"0.8095238\"/><SP WIDTH=\"131.0\" HPOS=\"17514.0\" VPOS=\"21264.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"111.0\" WIDTH=\"474.0\" HPOS=\"17646.0\" VPOS=\"21258.0\" CONTENT=\"BATES\" WC=\"1.0\"/><SP WIDTH=\"140.0\" HPOS=\"18120.0\" VPOS=\"21258.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"114.0\" WIDTH=\"630.0\" HPOS=\"18261.0\" VPOS=\"21252.0\" CONTENT=\"President\" WC=\"1.0\"><ALTERNATIVE>Prcideht</ALTERNATIVE><ALTERNATIVE>Pride</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"153.0\" WIDTH=\"1689.0\" HPOS=\"17145.0\" VPOS=\"21417.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"258.0\" HPOS=\"17145.0\" VPOS=\"21439.0\" CONTENT=\"WM\" WC=\"0.82539684\"><TextLine HEIGHT=\"120.0\" WIDTH=\"2211.0\" HPOS=\"16788.0\" VPOS=\"21870.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"102.0\" HPOS=\"16788.0\" VPOS=\"21894.0\" CONTENT=\"It\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"16890.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"93.0\" HPOS=\"16962.0\" VPOS=\"21885.0\" CONTENT=\"is\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17055.0\" VPOS=\"21885.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"102.0\" WIDTH=\"417.0\" HPOS=\"17136.0\" VPOS=\"21879.0\" CONTENT=\"seldom\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17553.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"267.0\" HPOS=\"17634.0\" VPOS=\"21873.0\" CONTENT=\"hard\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"17901.0\" VPOS=\"21873.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"87.0\" WIDTH=\"111.0\" HPOS=\"17982.0\" VPOS=\"21879.0\" CONTENT=\"to\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"18093.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"219.0\" HPOS=\"18174.0\" VPOS=\"21870.0\" CONTENT=\"find\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18393.0\" VPOS=\"21870.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"66.0\" HPOS=\"18471.0\" VPOS=\"21894.0\" CONTENT=\"a\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18537.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"78.0\" WIDTH=\"384.0\" HPOS=\"18615.0\" VPOS=\"21888.0\" CONTENT=\"succes\" WC=\"0.82539684\"><ALTERNATIVE>success</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"126.0\" WIDTH=\"2316.0\" HPOS=\"16662.0\" VPOS=\"22008.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"75.0\" WIDTH=\"183.0\" HPOS=\"16662.0\" VPOS=\"22059.0\" CONTENT=\"sor\" WC=\"1.0\"><ALTERNATIVE>soar</ALTERNATIVE></String><SP WIDTH=\"72.0\" HPOS=\"16845.0\" VPOS=\"22059.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"90.0\" WIDTH=\"168.0\" HPOS=\"16917.0\" VPOS=\"22035.0\" CONTENT=\"for\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"17085.0\" VPOS=\"22035.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"267.0\" HPOS=\"17157.0\" VPOS=\"22050.0\" CONTENT=\"even\" WC=\"1.0\"><ALTERNATIVE>cen</ALTERNATIVE><ALTERNATIVE>cent</ALTERNATIVE></String><SP WIDTH=\"77.0\" HPOS=\"17434.0\" VPOS=\"22050.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"66.0\" WIDTH=\"63.0\" HPOS=\"17502.0\" VPOS=\"22044.0\"

The file was imported without any newlines. So we need to insert them ourselves, by splitting the string in a smart way.

4.7.3.3 Splitting strings

You can split strings with str_split() like this:

ancient_philosophers %>%
  str_split(" ")
    ## [[1]]
     ## [1] "aristotle"
     ## 

    str_split() also has a simplify = TRUE option:

ancient_philosophers %>%
  str_split(" ", simplify = TRUE)
    ##      [,1]        [,2]       [,3]     
     ## [1,] "aristotle" ""         ""       
     ## [2,] "plato"     ""         ""       
If you want to split into a fixed number of pieces, use str_split_fixed():

ancient_philosophers %>%
  str_split_fixed(" ", 2)
    ##      [,1]        [,2]         
     ## [1,] "aristotle" ""           
     ## [2,] "plato"     ""           
So how does this help in our case? Well, if you look at what the ALTO file looks like, at the beginning of this section, you will notice that every line ends with the ">" character. So let's split at that character!

winchester_text <- winchester[43] %>%
  str_split(">")

    Let’s take a closer look at winchester_text:

str(winchester_text)
    ## List of 1
     ##  $ : chr [1:19706] "</processingStepSettings" "<processingSoftware" "<softwareCreator" "iArchives</softwareCreator" ...

So this is a list of length one, and the first and only element of that list is an atomic vector with 19706 elements. Since this is a list of only one element, we can simplify it by saving the atomic vector in a variable:

winchester_text <- winchester_text[[1]]

    Let’s now look at some lines:

winchester_text[1232:1245]
    ##  [1] "<SP WIDTH=\"66.0\" HPOS=\"5763.0\" VPOS=\"9696.0\"/"                                                                         
     ##  [2] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"612.0\" HPOS=\"5829.0\" VPOS=\"9693.0\" CONTENT=\"Louisville\" WC=\"1.0\""
     ##  [3] "<ALTERNATIVE"                                                                                                                

    This now looks easier to handle. We can narrow it down to the lines that only contain the string we are interested in, “CONTENT”. First, let’s get the indices:

content_winchester_index <- winchester_text %>%
  str_which("CONTENT")

    How many lines contain the string “CONTENT”?

length(content_winchester_index)
    ## [1] 4462

As you can see, this reduces the amount of data we have to work with. Let us save this in a new variable:

content_winchester <- winchester_text[content_winchester_index]

    4.7.3.4 Matching strings

str_match() returns the part of each string that matches a pattern, or NA if there is no match:

ancient_philosophers %>%
  str_match("us")
    ##      [,1]
     ## [1,] NA  
     ## [2,] NA  
To match more than the literal us, we need regular expressions; .* means "any character, repeated zero or more times":

ancient_philosophers %>%
  str_match(".*us")
    ##      [,1]             
     ## [1,] NA               
     ## [2,] NA               
A single . matches exactly one character, so .us matches one character followed by us:

ancient_philosophers %>%
  str_match(".us")
    ##      [,1] 
     ## [1,] NA   
     ## [2,] NA   
And ..us matches two characters followed by us:

ancient_philosophers %>%
  str_match("..us")
    ##      [,1]  
     ## [1,] NA    
     ## [2,] NA    
Just like str_locate(), str_match() only returns the first match. To get every match, use str_match_all():

ancient_philosophers %>%
  str_match_all(".*us")
    ## [[1]]
     ##      [,1]
     ## 
To see the difference between str_match() and str_match_all() more clearly, compare:

c("haha", "huhu") %>%
  str_match("ha")
    ##      [,1]
     ## [1,] "ha"
     ## [2,] NA

    and:

c("haha", "huhu") %>%
  str_match_all("ha")
    ## [[1]]
     ##      [,1]
     ## [1,] "ha"
Regular expressions can be combined in many ways; for instance, .*t.* matches the names that contain the letter t anywhere:

ancient_philosophers %>%
  str_match(".*t.*")
    ##      [,1]                
     ## [1,] "aristotle"         
     ## [2,] "plato"             
Back to our example: to extract the string CONTENT and everything that follows it from each line, we can write:

winchester_content <- winchester_text %>%
  str_match("CONTENT.*")

    Let’s use our faithful str() function to take a look:

winchester_content %>%
  str
    ##  chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...

    Hum, there’s a lot of NA values! This is because a lot of the lines from the file did not have the string “CONTENT”, so there is no match possible. Let’s us remove all these NAs. Because the result is a matrix, we cannot use the filter() function from {dplyr}. So we need to convert it to a tibble first:

winchester_content <- winchester_content %>%
  as.tibble() %>%
  filter(!is.na(V1))

## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.

Because matrix columns do not have names, when a matrix gets converted into a tibble, the first column gets automatically called V1. This is why I filter on this column. Let's take a look at the data:

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   V1                                  
     ##   <chr>                               
4.7.3.5 Searching and replacing strings

    We are getting close to the final result. We still need to do some cleaning however. Since our data is inside a nice tibble, we might as well stick with it. So let’s first rename the column and change all the strings to lowercase:

winchester_content <- winchester_content %>%
  mutate(content = tolower(V1)) %>%
  select(-V1)

    Let’s take a look at the result:

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content                             
     ##   <chr>                               
## 6 "content=\"te1r\" wc=\"0.8095238\"/"

    The second part of the string, “wc=….” is not really interesting. Let’s search and replace this with an empty string, using str_replace():

winchester_content <- winchester_content %>%
  mutate(content = str_replace(content, "wc.*", ""))

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content            
     ##   <chr>              
## 6 "content=\"te1r\" "

We need to use the regular expression from before to replace "wc" and every character that follows. The same can be used to remove "content=":

winchester_content <- winchester_content %>%
  mutate(content = str_replace(content, "content=", ""))

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content    
     ##   <chr>      

4.7.3.6 Extracting or removing strings

Now, because I know the ALTO spec, I know how to find words that are split between two sentences:

winchester_content %>%
  filter(str_detect(content, "hyppart"))
    ## # A tibble: 64 × 1
     ##    content                                                               
     ##    <chr>                                                                 
We first need to detect the lines where these split words are located, and only then can we extract what comes after "subs_content". Thus, we need to combine str_detect() to first detect the string, and then str_extract() to extract what comes after "subs_content":

winchester_content <- winchester_content %>%
  mutate(content = if_else(str_detect(content, "hyppart1"),
                           str_extract_all(content, "content=.*", simplify = TRUE),
                           content))

    Let’s take a look at the result:

winchester_content %>%
  filter(str_detect(content, "content"))
    ## # A tibble: 64 × 1
     ##    content                                                          
     ##    <chr>                                                            
## # … with 54 more rows

    We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”, which are not needed now:

winchester_content <- winchester_content %>%
  mutate(content = str_replace(content, "content=", "")) %>%
  mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content))

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content    
     ##   <chr>      
## 5 "\"ii\" "
## 6 "\"te1r\" "

    Almost done! We only need to remove the " characters:

winchester_content <- winchester_content %>%
  mutate(content = str_replace_all(content, "\"", ""))

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content
     ##   <chr>  
## 5 "ii "
## 6 "te1r "

    Let’s remove space characters with str_trim():

winchester_content <- winchester_content %>%
  mutate(content = str_trim(content))

head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content
     ##   <chr>  

To finish off this section, let's remove stop words (words that do not add any meaning to a sentence, such as "as", "and"…) and words that are composed of 3 characters or fewer. You can find a dataset with stopwords inside the {stopwords} package:

library(stopwords)

data(data_stopwords_stopwordsiso)

eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en)

winchester_content <- winchester_content %>%
  anti_join(eng_stopwords) %>%
  filter(nchar(content) > 3)
    ## Joining, by = "content"
head(winchester_content)
    ## # A tibble: 6 × 1
     ##   content   
     ##   <chr>     
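A natural next step — this sketch is not part of the original text, but it only uses count() from {dplyr}, which we have already met — would be to look at the most frequent words:

winchester_content %>%
  count(content, sort = TRUE) %>%
  head()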
4.7.4 Tidy data frames with {tibble}

4.7.4.1 Creating tibbles

tribble() makes it easy to create a tibble row by row, manually:

    It is also possible to create a tibble from a named list:

as_tibble(list("combustion" = c("oil", "diesel", "oil", "electric"),
               "doors" = c(3, 5, 5, 5)))
    ## # A tibble: 4 × 2
     ##   combustion doors
     ##   <chr>      <dbl>
If the elements of the list have different lengths, you can use enframe(), which puts the list's elements into a list-column:

enframe(list("combustion" = c(1,2), "doors" = c(1,2,4), "cylinders" = c(1,8,9,10)))
    ## # A tibble: 3 × 2
     ##   name       value    
     ##   <chr>      <list>   

    4.8 List-columns

    To learn about list-columns, let’s first focus on a single character of the starwars dataset:

data(starwars)

starwars %>%
  filter(name == "Luke Skywalker") %>%
  glimpse()
    ## Rows: 1
     ## Columns: 14
     ## $ name       <chr> "Luke Skywalker"
The films column is a list-column. Let's look at its contents for Luke Skywalker using pull():

starwars %>%
  filter(name == "Luke Skywalker") %>%
  pull(films)
    ## [[1]]
     ## [1] "The Empire Strikes Back" "Revenge of the Sith"    
     ## [3] "Return of the Jedi"      "A New Hope"             
We can do the same for several characters at once:

starwars %>%
  head() %>%  # let's just look at the first six rows
  pull(films)
    ## [[1]]
     ## [1] "The Empire Strikes Back" "Revenge of the Sith"    
     ## [3] "Return of the Jedi"      "A New Hope"             
What is the length of the films column for Luke Skywalker?

starwars %>%
  filter(name == "Luke Skywalker") %>%
  pull(films) %>%
  length()
    ## [1] 1

This might be surprising, but remember that a list with only one element has a length of 1:

length(
  list(words) # this creates a list with one element. This element is a vector of 980 words.
)
    ## [1] 1

Even though words contains a vector of 980 words, if we put this very long vector inside the first element of a list, length(list(words)) will compute the length of the list, not of the vector. Let's see what happens if we create a more complex list:

numbers <- seq(1, 5)

length(
  list(words,   # words is a vector of 980 words
       numbers) # numbers contains numbers 1 through 5
)
    ## [1] 2

list(words, numbers) is now a list of two elements, words and numbers. If we want to compute the length of words and numbers, we need to learn about another powerful concept called higher-order functions. We are going to learn about this in greater detail in Chapter 8. For now, let's use the fact that our list films is contained inside a data frame, and use a convenience function included in {dplyr} to handle situations like this:

starwars <- starwars %>%
  rowwise() %>% # <- Apply the next steps for each row individually
  mutate(n_films = length(films))

dplyr::rowwise() is useful when working with list-columns because whatever instructions follow get run on the single element contained in the list.
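As an aside — a sketch that is not in the original text — base R's lengths() computes the length of each element of a list directly, so the same information can be obtained without rowwise():

dplyr::starwars %>% # the original, ungrouped data
  mutate(n_films_alt = lengths(films)) %>% # lengths() is vectorised over the list-column
  select(name, n_films_alt)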

    Let’s take a look at the characters and the number of films they have appeared in:

starwars %>%
  select(name, films, n_films)
    ## # A tibble: 87 × 3
     ## # Rowwise: 
     ##    name               films     n_films
The new column can be used just like any other column, for instance with case_when():

starwars <- starwars %>%
  mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie",
                            n_films >= 1 ~ "More than 1 movie"))

    You can also create list-columns with your own datasets, by using tidyr::nest(). Remember the fake survey_data I created to illustrate pivot_longer() and pivot_wider()? Let’s go back to that dataset again:

survey_data <- tribble(
  ~id, ~variable, ~value,
  1, "var1", 1,
  1, "var2", 0.2,
  NA, "var3", 0.3,
  2, "var1", 1.4,
  2, "var2", 1.9,
  2, "var3", 4.1,
  3, "var1", 0.1,
  3, "var2", 2.8,
  3, "var3", 8.9,
  4, "var1", 1.7,
  NA, "var2", 1.9,
  4, "var3", 7.6
)

print(survey_data)
    ## # A tibble: 12 × 3
     ##       id variable value
     ##    <dbl> <chr>    <dbl>
Let's now nest this data by id with tidyr::nest():

nested_data <- survey_data %>%
  group_by(id) %>%
  nest()

glimpse(nested_data)
    ## Rows: 5
     ## Columns: 2
     ## Groups: id [5]
Let's take a look at what is inside the data column for the individual with id 1:

nested_data %>%
  filter(id == "1") %>%
  pull(data)
    ## [[1]]
     ## # A tibble: 2 × 2
     ##   variable value
A more compact alternative is dplyr::group_nest(), which groups and nests in one step:

survey_data %>%
  group_nest(id)
    ## # A tibble: 5 × 2
     ##      id               data
     ##   <dbl> <list<tibble[,2]>>
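To undo the nesting — a short sketch that is not in the original text, assuming {tidyr} is loaded — unnest() restores the long shape of the data:

nested_data %>%
  unnest(cols = data)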
4.9 Going beyond descriptive statistics

The more arrows N you throw at the square, the better your approximation of \(\pi\) will be. Let's try to do this with a tidy Monte Carlo simulation. First, let's randomly pick some points inside the unit square:

library(tidyverse)

n <- 5000

set.seed(2019)
points <- tibble("x" = runif(n), "y" = runif(n))

Now, to know if a point is inside the unit circle, we need to check whether \(x^2 + y^2 < 1\). Let's add a new column to the points tibble, called inside, equal to 1 if the point is inside the unit circle and 0 if not:

points <- points %>%
    mutate(inside = map2_dbl(.x = x, .y = y, ~ifelse(.x**2 + .y**2 < 1, 1, 0))) %>%
    rowid_to_column("N")
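Since arithmetic and comparisons are vectorised in R, the same column could also be created without map2_dbl(). This alternative is a sketch that is not in the original text:

points_alt <- points %>%
    mutate(inside = as.numeric(x**2 + y**2 < 1)) # TRUE/FALSE coerce to 1/0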

    Let’s take a look at points:

points
    ## # A tibble: 5,000 × 4
     ##        N       x      y inside
     ##    <int>   <dbl>  <dbl>  <dbl>

Now, I can compute the estimate of \(\pi\) at each row, by computing the cumulative sum of the 1's in the inside column and dividing that by the current value of the N column:

points <- points %>%
    mutate(estimate = 4*cumsum(inside)/N)

    cumsum(inside) is the M from the formula. Now, we can finish by plotting the result:

ggplot(points) +
    geom_line(aes(y = estimate, x = N)) +
    geom_hline(yintercept = pi)
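As a quick numeric sanity check — a sketch that is not in the original text — we can also look at the very last estimate, which should be close to \(\pi\) (the exact value depends on the seed):

tail(points$estimate, 1) # should be close to 3.1415...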

    In the next chapter, we are going to learn all about {ggplot2}, the package I used in the lines above to create this plot.


    Exercise 1
data(Gasoline, package = "plm")

gasoline <- as_tibble(Gasoline)

gasoline <- gasoline %>%
  mutate(country = tolower(country))
• Exponentiate columns starting with the character "l" of the gasoline dataset.

    • Convert all columns’ classes into the character class.


      Exercise 3
LaborSupply %>%
  group_by(id) %>%
  mutate(across(starts_with("l"), tibble::lst(lag, lead)))
      • Using summarise() and across(), compute the mean, standard deviation and number of individuals of lnhr and lnwg for each individual.

      Exercise 4

• In the dataset folder you downloaded at the beginning of the chapter, there is a folder called "unemployment". I used the data in the section about working with lists of datasets. Using rio::import_list(), read the 4 datasets into R.

• Using map(), map the janitor::clean_names() function to each dataset (just like in the example in the section on working with lists of datasets). Then, still with map() and mutate(), convert all commune names in the commune column with the function tolower(), in a new column called lcommune. This is not an easy exercise; so here are some hints:

  • Remember that all_datasets is a list of datasets. Which function do you use when you want to map a function to each element of a list?

  • Each element of all_datasets is a data.frame object. Which function do you use to add a column to a data.frame?

  • What symbol can you use to access a column of a data.frame?

diff --git a/docs/functional-programming.html b/docs/functional-programming.html

Chapter 8 Functional programming

    8.1 Function definitions

    You should now be familiar with function definitions in R. Let’s suppose you want to write a function to compute the square root of a number and want to do so using Newton’s algorithm:

sqrt_newton <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    init
}

    You can then use this function to get the square root of a number:

sqrt_newton(16, 2)
    ## [1] 4.00122

We are using a while loop inside the body of the function. The body of a function is the set of instructions that define the function. You can get the body of a function with body(some_func).
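For instance — a quick sketch, not in the original text:

body(sqrt_newton) # prints the while loop and the final init that make up the function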

It is also possible to define this function recursively, without a while loop:

sqrt_newton_recur <- function(a, init, eps = 0.01){
    if(abs(init**2 - a) < eps){
        result <- init
    } else {
        init <- 1/2 * (init + a/init)
        result <- sqrt_newton_recur(a, init, eps)
    }
    result
}

sqrt_newton_recur(16, 2)
    ## [1] 4.00122

R is not a pure functional programming language though, so we can still use loops (be it while or for loops) in the bodies of our functions. As discussed in the previous chapter, it is actually better to avoid recursion in R if speed is a concern, since R is not optimised for recursive functions.

8.2 Properties of functions

Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way. For example, the following function:

increment <- function(x){
    x + 1
}

is a referentially transparent function. We always get the same result for any x that we give to this function.

    This:

increment(10)
    ## [1] 11

    will always produce 11.

    However, this one:

increment_opaque <- function(x){
    x + spam
}

is not a referentially transparent function, because its value depends on the global variable spam.

spam <- 1

increment_opaque(10)
    ## [1] 11

    will produce 11 if spam = 1. But what if spam = 19?

spam <- 19

increment_opaque(10)
    ## [1] 29

To make increment_opaque() a referentially transparent function, it is enough to make spam an argument:

increment_not_opaque <- function(x, spam){
    x + spam
}

    Now even if there is a global variable called spam, this will not influence our function:

spam <- 19

increment_not_opaque(10, 34)
    ## [1] 44

This is because the spam used inside the function is the function's argument, a local variable. It could have been called anything else, really. Avoiding opaque functions makes our life easier.

    Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program.

For example, this function (which is not a referentially transparent function):

count_iter <- 0

sqrt_newton_side_effect <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the
    }                                 # RHS value in a variable inside the global environment
    init
}

    If you look in the environment pane, you will see that count_iter equals 0. Now call this function with the following arguments:

sqrt_newton_side_effect(16000, 2)
    ## [1] 126.4911
print(count_iter)
    ## [1] 9

    If you check the value of count_iter now, you will see that it increased! This is a side effect, because the function changed something outside of its scope. It changed a value in the global environment. In general, it is good practice to avoid side-effects. For example, we could make the above function not have any side effects like this:

sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <- count_iter + 1
    }
    c(init, count_iter)
}

Now, this function returns a vector with two elements: the result, and the number of iterations it took to get the result:

sqrt_newton_count(16000, 2)
    ## [1] 126.4911   9.0000

Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided since you want to write to disk.

8.3.1 Doing away with loops: the map*() family of functions

    \(X\) is a vector composed of the following scalars: \((0, 5, 8, 3, 2, 1)\). The function we want to map to each element of \(X\) is \(f(x) = x + 1\). \(X'\) is the result of this operation. Using R, we would do the following:

library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map(numbers, plus_one)

my_results
    ## [[1]]
     ## [1] 1
     ## 
## [[6]]
## [1] 2

    Using a loop, you would write:

numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- vector("list", 6)

for(number in seq_along(numbers)){
  my_results[[number]] <- plus_one(numbers[number]) # index into numbers, not the index itself
}

my_results
    ## [[1]]
## [1] 1
     ## 
Believe it or not, I made mistakes when writing this loop the first time (and I've been using R for almost 10 years now). Why? Well, first of all I used %in% instead of in. Then, I forgot about seq_along(). After that, I made a typo, plos_one() instead of plus_one() (ok, that one is unrelated to the loop). Let's also see how this works using base R:

numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- lapply(numbers, plus_one)

my_results
    ## [[1]]
     ## [1] 1
     ## 
## [1] 2

So what is the added value of using {purrr}, you might ask. Well, imagine that instead of a list, I need an atomic vector of numerics. This is fairly easy with {purrr}:

library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map_dbl(numbers, plus_one)

my_results
    ## [1] 1 6 9 4 3 2

    We’re going to discuss these functions below, but know that in base R, outputting something else involves more effort.

    Let’s go back to our sqrt_newton() function. This function has more than one parameter. Often, we would like to map functions with more than one parameter to a list, while holding constant some of the functions parameters. This is easily achieved like so:

library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, sqrt_newton, init = 1)
    ## [[1]]
     ## [1] 2.645767
     ## 
## [[4]]
## [1] 8.000002

    It is also possible to use a formula:

library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, ~sqrt_newton(., init = 1))
    ## [[1]]
     ## [1] 2.645767
     ## 
## [1] 8.000002

    Another function that is similar to map() is rerun(). You guessed it, this one simply reruns an expression:

rerun(10, "hello")
    ## [[1]]
     ## [1] "hello"
     ## 

    rerun() simply runs an expression (which can be arbitrarily complex) n times, whereas map() maps a function to a list of inputs, so to achieve the same with map(), you need to map the print() function to a vector of characters:

map(rep("hello", 10), print)
    ## [1] "hello"
     ## [1] "hello"
     ## [1] "hello"
We see this side effect 10 times, plus the list created with map().

rerun() is useful if you want to run simulations. For instance, let's suppose that I perform a simulation where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance:

mean_var_throws <- function(n){
  throws <- sample(1:6, n, replace = TRUE)

  mean_throws <- mean(throws)
  var_throws <- var(throws)

  tibble::tribble(~mean_throws, ~var_throws,
                   mean_throws, var_throws)
}

mean_var_throws(5)
    ## # A tibble: 1 × 2
     ##   mean_throws var_throws
     ##         <dbl>      <dbl>
Now suppose I want to compute the expected value of the distribution of throwing dice. We know from theory that it should be equal to \(3.5 (= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6)\).

    Let’s rerun the simulation 50 times:

simulations <- rerun(50, mean_var_throws(5))

    Let’s see what the simulations object is made of:

str(simulations)
    ## List of 50
     ##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
     ##   ..$ mean_throws: num 2
## .....

    simulations is a list of 50 data frames. We can easily combine them into a single data frame, and compute the mean of the means, which should return something close to the expected value of 3.5:

bind_rows(simulations) %>%
  summarise(expected_value = mean(mean_throws))
    ## # A tibble: 1 × 1
     ##   expected_value
     ##            <dbl>
     ## 1           3.44

    Pretty close! Now of course, one could have simply done something like this:

mean(sample(1:6, 1000, replace = TRUE))
    ## [1] 3.481

    but the point was to illustrate that rerun() can run any arbitrarily complex expression, and that it is good practice to put the result in a data frame or list, for easier further manipulation.
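A side note that is not in the original text: in more recent releases of {purrr} (1.0.0 and later), rerun() is deprecated. If you are on such a version, the same simulation can be written with map():

simulations <- map(seq_len(50), ~ mean_var_throws(5)) # one tibble per iteration, as before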

You now know the standard map() function, and also rerun(), which return lists, but there are a number of variants of this function. map_dbl() returns an atomic vector of doubles, as we've seen before. A little reminder below:

map_dbl(numbers, sqrt_newton, init = 1)
    ## [1] 2.645767 2.828469 4.358902 8.000002

    In a similar fashion, map_chr() returns an atomic vector of strings:

map_chr(numbers, sqrt_newton, init = 1)
    ## [1] "2.645767" "2.828469" "4.358902" "8.000002"

    map_lgl() returns an atomic vector of TRUE or FALSE:

divisible <- function(x, y){
  if_else(x %% y == 0, TRUE, FALSE)
}

map_lgl(seq(1:100), divisible, 3)
    ##   [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
     ##  [13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
     ##  [25] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [85] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [97] FALSE FALSE  TRUE FALSE

    There are also other interesting variants, such as map_if():

a <- seq(1,10)

map_if(a, (function(x) divisible(x, 2)), sqrt)
    ## [[1]]
     ## [1] 1
     ## 

I used map_if() to take the square root of only those numbers in vector a that are divisible by 2, by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()).
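The same can be written more compactly with {purrr}'s formula shorthand, which we used earlier with sqrt_newton(). This variant is a sketch that is not in the original text:

map_if(a, ~divisible(., 2), sqrt) # ~ builds the anonymous function, . stands for the current element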

    map_at() is similar to map_if() but maps the function at a position specified by the user:

map_at(numbers, c(1, 3), sqrt)
    ## [[1]]
     ## [1] 2.645751
     ## 
## [[4]]
## [1] 64

    or if you have a named list:

recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)

map_at(recipe, "bacon", `*`, 2)
    ## $spam
     ## [1] 1
     ## 
Here, the function `*` multiplies the element called "bacon" by its second argument, 2. (If this looks strange, try the following in the command prompt: `*`(3, 4).)

    map2() is the equivalent of mapply() and pmap() is the generalisation of map2() for more than 2 arguments:

print(a)
    ##  [1]  1  2  3  4  5  6  7  8  9 10
b <- seq(1, 2, length.out = 10)

print(b)
    ##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778
     ##  [9] 1.888889 2.000000
map2(a, b, `*`)
    ## [[1]]
     ## [1] 1
     ## 

    Each element of a gets multiplied by the element of b that is in the same position. Let’s see what pmap() does. Can you guess from the code below what is going on? I will print a and b again for clarity:

a
    ##  [1]  1  2  3  4  5  6  7  8  9 10
b
    ##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778
     ##  [9] 1.888889 2.000000
n <- seq(1:10)

pmap(list(a, b, n), rnorm)
    ## [[1]]
     ## [1] -0.1758315
     ## 
## [1]  -9.101480   4.404571 -16.071437   1.110689   7.168097  15.848579
## [7]  16.710863   1.998482 -17.856521  -2.021087

    Let’s take a closer look at what a, b and n look like, when they are place next to each other:

cbind(a, b, n)
    ##        a        b  n
     ##  [1,]  1 1.000000  1
     ##  [2,]  2 1.111111  2
8.3.2 Reducing with purrr

Reducing is another important concept in functional programming. It allows going from a list of elements to a single element, by somehow combining the elements into one. For instance, using the base R Reduce() function, you can sum the elements of a list like so:

Reduce(`+`, seq(1:100))
    ## [1] 5050

    using purrr::reduce(), this becomes:

reduce(seq(1:100), `+`)
    ## [1] 5050

    If you don’t really get what happening, don’t worry. Things should get clearer once I’ll introduce another version of reduce(), called accumulate(), which we will see below.

    Sometimes, the direction from which we start to reduce is quite important. You can “start from the end” of the list by using the .dir argument:

reduce(seq(1:100), `+`, .dir = "backward")
    ## [1] 5050

    Of course, for commutative operations, direction does not matter. But it does matter for non-commutative operations:

reduce(seq(1:100), `-`)
    ## [1] -5048
reduce(seq(1:100), `-`, .dir = "backward")
    ## [1] -50

    Let’s now take a look at accumulate(). accumulate() is very similar to map(), but keeps the intermediary results. Which intermediary results? Let’s try and see what happens:

a <- seq(1, 10)

accumulate(a, `-`)
    ##  [1]   1  -1  -4  -8 -13 -19 -26 -34 -43 -53

accumulate() illustrates pretty well what is happening; the first element, 1, is simply the first element of seq(1, 10). The second element of the result however, is the difference between 1 and 2, namely -1; the third is the difference between -1 and 3, namely -4; and so on. reduce() only shows the final result of all these operations. accumulate() and reduce() also have an .init argument, which makes it possible to start the reducing procedure from an initial value that is different from the first element of the vector:

reduce(a, `+`, .init = 1000)

accumulate(a, `-`, .init = 1000, .dir = "backward")
    ## [1] 1055
    ##  [1]  995 -994  996 -993  997 -992  998 -991  999 -990 1000
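As an aside — a sketch that is not in the original text — with + instead of -, accumulate() produces a running total, which is exactly what base R's cumsum() computes:

accumulate(a, `+`) # 1 3 6 10 15 21 28 36 45 55, the same as cumsum(a)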

    reduce() generalizes functions that only take two arguments. If you were to write a function that returns the minimum between two numbers:

my_min <- function(a, b){
    if(a < b){
        return(a)
    } else {
        return(b)
    }
}

    You could use reduce() to get the minimum of a list of numbers:

numbers2 <- c(3, 1, -8, 9)

reduce(numbers2, my_min)
    ## [1] -8
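A small extension — again a sketch that is not in the original text — is to provide .init here as well; the reduction is then well-defined even for an empty vector, for which it simply returns the initial value:

reduce(numbers2, my_min, .init = Inf) # still -8; on an empty input, reduce() would return Inf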

map() and reduce() are arguably the most useful higher-order functions, and perhaps also the most famous ones, true ambassadors of functional programming. You might have read about MapReduce, the programming model for processing large datasets whose very name comes from these two operations.

    8.3.3 Error handling with safely() and possibly()

    safely() and possibly() are very useful functions. Consider the following situation:

a <- list("a", 4, 5)

sqrt(a)

Error in sqrt(a) : non-numeric argument to mathematical function

Using map() or Map() will result in a similar error. safely() is a higher-order function that takes one function as an argument and executes it… safely, meaning the execution of the function will not stop if there is an error. The error message gets captured alongside valid results.

a <- list("a", 4, 5)

safe_sqrt <- safely(sqrt)

map(a, safe_sqrt)
    ## [[1]]
     ## [[1]]$result
     ## NULL
## [[3]]$error
## NULL

    possibly() works similarly, but also allows you to specify a return value in case of an error:

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

map(a, possible_sqrt)
    ## [[1]]
     ## [1] NA
     ## 
## [[3]]
## [1] 2.236068

    Of course, in this particular example, the same effect could be obtained way more easily:

sqrt(as.numeric(a))
    ## Warning: NAs introduced by coercion
    ## [1]       NA 2.000000 2.236068

    However, in some situations, this trick does not work as intended (or at all). possibly() and safely() allow the programmer to model errors explicitly, and to then provide a consistent way of dealing with them. For instance, consider the following example:

data(mtcars)

write.csv(mtcars, "my_data/mtcars.csv")
    Error in file(file, ifelse(append, "a", "w")) : 
       cannot open the connection
     In addition: Warning message:
cannot open file 'my_data/mtcars.csv': No such file or directory

The folder my_data/ does not exist, and as such this code produces an error. You might want to catch this error, and create the directory for instance:

possibly_write.csv <- possibly(write.csv, otherwise = "error")

# write.csv() invisibly returns NULL even on success, so we use a distinct
# sentinel value ("error") to reliably detect failure:
if(identical(possibly_write.csv(mtcars, "my_data/mtcars.csv"), "error")) {
  print("Creating folder...")
  dir.create("my_data/")
  print("Saving file...")
  write.csv(mtcars, "my_data/mtcars.csv")
}
    [1] "Creating folder..."
     [1] "Saving file..."
     Warning message:

    8.3.4 Partial applications with partial()

    Consider the following simple function:

add <- function(a, b) a+b

    It is possible to create a new function, where one of the parameters is fixed, for instance, where a = 10:

add_to_10 <- partial(add, a = 10)

add_to_10(12)
    ## [1] 22

    This is equivalent to the following:

add_to_10_2 <- function(b){
  add(a = 10, b)
}

Using partial() is much less verbose however, and allows you to define new functions very quickly:

head10 <- partial(head, n = 10)

head10(mtcars)
    ##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
     ## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
     ## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

    8.3.5 Function composition using compose

Function composition is another handy tool, which makes chaining functions much more elegant:

compose(sqrt, log10, exp)(10)
    ## [1] 2.083973

You can read this expression as sqrt() after log10() after exp(), and it is equivalent to:

sqrt(log10(exp(10)))
    ## [1] 2.083973

It is also possible to reverse the order in which the functions get called, using the .dir = option:

compose(sqrt, log10, exp, .dir = "forward")(10)
    ## [1] 1.648721

    One could also use the %>% operator to achieve the same result:

10 %>%
  sqrt %>%
  log10 %>%
  exp
    ## [1] 1.648721

    but strictly speaking, this is not function composition.
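The practical difference — a sketch that is not in the original text — is that compose() returns a function, which can be named, reused and passed around, whereas the pipe above computed a single value:

f <- compose(sqrt, log10, exp, .dir = "forward") # a reusable composed function

f(10) # same result as the pipe: 1.648721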


8.3.6 «Transposing lists»

Another interesting function is transpose(). It is not an alternative to the function t() from base R, but it has a similar effect. transpose() works on lists. Let's take a look at the example from before:

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(a, safe_sqrt)
    ## [[1]]
     ## [[1]]$result
     ## [1] NA
The output is a list with the first element being a list with a result and an error message. One might want to have all the results in a single list, and all the error messages in another list. This is possible with transpose():

purrr::transpose(map(a, safe_sqrt))
    ## $result
     ## $result[[1]]
     ## [1] NA

    8.4 List-based workflows for efficiency

    You can use your own functions in pipe workflows:

double_number <- function(x){
  x+x
}

mtcars %>%
  head() %>%
  mutate(double_mpg = double_number(mpg))
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb double_mpg
     ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4       42.0
     ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4       42.0
Your own functions behave like any other function, no matter where they come from. The consequence of functions being first-class objects is that functions can take functions as arguments, functions can return functions (the function factories from the previous chapter) and can be assigned to any variable:

plop <- sqrt

plop(4)
    ## [1] 2
bacon <- function(.f){

  message("Bacon is tasty")

  .f

}

bacon(sqrt) # `bacon` is a function factory, as it returns a function (alongside an informative message)
    ## Bacon is tasty
    ## function (x)  .Primitive("sqrt")
# To actually call it:
bacon(sqrt)(4)
    ## Bacon is tasty
    ## [1] 2
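Since function factories will come up again, here is a minimal sketch of a more typical one (the names are my own): a function that builds power functions.

make_power <- function(p){
  # return a new function that raises its argument to the power p
  function(x) x^p
}

cube <- make_power(3)

cube(2)
## [1] 8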

    Now, let’s step back for a bit and think about what we learned up until now, and especially the map() family of functions.

    Let’s read the list of datasets from the previous chapter:

paths <- Sys.glob("datasets/unemployment/*.csv")

all_datasets <- import_list(paths)

str(all_datasets)
    ## List of 4
     ##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
     ##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
The column names are not very easy to work with; there are spaces, and it would be better if the names of the columns were all lowercase. For this we are going to use the function clean_names() from the janitor package. For a single dataset, I would write this:

library(janitor)

one_dataset <- one_dataset %>%
  clean_names()

    and I would get a dataset with column names in lowercase and spaces replaced by _ (and other corrections). How can I apply, or map, this function to each dataset in the list? To do this I need to use purrr::map(), which we’ve seen in the previous section:

library(purrr)

all_datasets <- all_datasets %>%
  map(clean_names)

all_datasets %>%
  glimpse()
    ## List of 4
     ##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
     ##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...

    Remember that map(list, function) simply evaluates function to each element of list.
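As a quick illustration (a toy example of my own):

map(list(1, 4, 9), sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3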

    So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this:

one_dataset %>%
  filter(unemployment_rate_in_percent < 3)

    but since we’re dealing with a list of data sets, we cannot simply use filter() on it. This is because filter() expects a data frame, not a list of data frames. The way around this is to use map().

all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3))
    ## $unemp_2013
     ##      commune total_employed_population of_which_wage_earners
     ## 1    Garnich                       844                   750

    map() needs a function to map to each element of the list. all_datasets is the list to which I want to map the function. But what function? filter() is the function I need, so why doesn’t:

all_datasets %>%
  map(filter(unemployment_rate_in_percent < 3))

    work? This is what happens if we try it:

    Error in filter(unemployment_rate_in_percent < 3) :
       object 'unemployment_rate_in_percent' not found
There are several ways to pass filter() correctly to map():
    • Using a formula (only works within {tidyverse} functions):
all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3)) %>% 
  glimpse()
    ## List of 4
     ##  $ unemp_2013:'data.frame':  3 obs. of  8 variables:
     ##   ..$ commune                     : chr [1:3] "Garnich" "Leudelange" "Bech"
    • using an anonymous function (using the function(x) keyword):
all_datasets %>%
  map(function(x)filter(x, unemployment_rate_in_percent < 3)) %>%
  glimpse()
    ## List of 4
     ##  $ unemp_2013:'data.frame':  3 obs. of  8 variables:
     ##   ..$ commune                     : chr [1:3] "Garnich" "Leudelange" "Bech"
    • or, since R 4.1, using the shorthand \(x):
all_datasets %>%
  map(\(x)filter(x, unemployment_rate_in_percent < 3)) %>%
  glimpse()
    ## List of 4
     ##  $ unemp_2013:'data.frame':  3 obs. of  8 variables:
     ##   ..$ commune                     : chr [1:3] "Garnich" "Leudelange" "Bech"

Before merging these datasets together, we would need them to have a year column indicating the year the data was measured in each data frame. It would also be helpful if we gave names to these datasets, meaning converting the list to a named list. For this task, we can use purrr::set_names():

all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016)))

    Let’s take a look at the list now:

str(all_datasets)

    As you can see, each data.frame object contained in the list has been renamed. You can thus access them with the $ operator:
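For instance (a minimal sketch, using the names we just set; backticks are needed because the names start with a digit):

all_datasets$`2016` %>%
  glimpse()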

So how could reduce() help us with merging all the datasets that are in the list? dplyr comes with a lot of functions to merge two datasets. Remember that I said before that reduce() allows you to generalize a function of two arguments? Let's try it with our list of datasets:

unemp_lux <- reduce(all_datasets, full_join)

## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")

glimpse(unemp_lux)
    ## Rows: 472
     ## Columns: 8
     ## $ commune                      <chr> "Grand-Duche de Luxembourg", "Canton Cape…
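To see what reduce() is doing here, a toy illustration of my own: reduce() folds a function of two arguments over a list, so reducing with `+` sums the elements, just like reducing with full_join() merges all the datasets pairwise.

reduce(list(1, 2, 3, 4), `+`)
## [1] 10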
full_join() is one of the dplyr functions that merge data. There are others that might be useful depending on the kind of join operation you need. Let's write this data to disk as we're going to keep using it for the next chapters:

export(unemp_lux, "datasets/unemp_lux.csv")

    8.4.1 Functional programming and plotting

In this section, we are going to learn how to use the possibilities offered by the purrr package, but what comes next is also what makes R, and the functional programming paradigm, so powerful.

For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot per commune:

unemp_lux_data %>%
  filter(division == "Luxembourg") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") +
  geom_line()

and then you would write the same for "Esch-sur-Alzette" and also for "Wiltz". If you only have to make these 3 plots, copying and pasting the above lines is no big deal:

unemp_lux_data %>%
  filter(division == "Esch-sur-Alzette") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

unemp_lux_data %>%
  filter(division == "Wiltz") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

But copying and pasting is error prone. Can you spot the copy-paste mistake I made? And what if you have to create the above plots for all 108 Luxembourgish communes? That's a lot of copy-pasting. You could use the search and replace function of RStudio, true, but sometimes search and replace can also introduce bugs and typos. You can avoid all these issues by using purrr::map(). What do you need to map over? The commune names. So let's create a vector of commune names:

communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz")

    Now we can create the graphs using map(), or map2() to be exact:

plots_tibble <- unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest() %>%
  mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) +
       theme_minimal() +
       geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
       labs(title = paste("Unemployment in", .y))))

    Let’s study this line by line: the first line is easy, we simply use filter() to keep only the communes we are interested in. Then we group by division and use tidyr::nest(). As a refresher, let’s take a look at what this does:

unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest()
    ## # A tibble: 3 × 2
     ## # Groups:   division [3]
     ##   division         data             
This is useful because now we can pass these tibbles to map2(), to generate the plots. But why map2(), and what's the difference with map()? map2() works the same way as map(), but maps over two inputs:

numbers1 <- list(1, 2, 3, 4, 5)

numbers2 <- list(9, 8, 7, 6, 5)

map2(numbers1, numbers2, `*`)
    ## [[1]]
     ## [1] 9
     ## 
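A side note of my own: if you ever need to map over more than two inputs, pmap() generalizes map2() to a list of any number of lists; with two lists it gives the same result as the map2() call above.

pmap(list(numbers1, numbers2), `*`)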
In the plots_tibble code above, map2() is what makes it possible to create the title with labs(title = paste("Unemployment in", .y)), where .y is the second input of map2(): the commune names contained in the variable division.

    So what happened? We now have a tibble called plots_tibble that looks like this:

print(plots_tibble)
    ## # A tibble: 3 × 3
     ## # Groups:   division [3]
     ##   division         data              plot  
data is a list-column of tibbles, and plot is a list-column whose elements… are plots! Yes, you read that right: the elements of the column plot are literally plots. This is what I meant by list-columns. Let's see what is inside the data and the plot columns exactly:

plots_tibble %>%
  pull(data)
    ## [[1]]
     ## # A tibble: 15 × 7
     ##     year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵
(Remember: if you want to plot multiple lines in the same graph, you need to write group = division.)

    But more interestingly, how can you actually see the plots? If you want to simply look at them, it is enough to use pull():

plots_tibble %>%
  pull(plot)
## [[1]]
## [[2]]
## [[3]]

    And if we want to save these plots, we can do so using map2():

map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)
    Saving 7 x 5 in image
     Saving 6.01 x 3.94 in image
     Saving 6.01 x 3.94 in image
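A small aside (my own): since we only care about the side effect of saving files, and not about the values returned by ggsave(), purrr::walk2() is arguably more idiomatic here; it works exactly like map2() but returns its input invisibly.

walk2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)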
8.4.2 Modeling with functional programming

…so LPMs are still used for estimating marginal effects.

Let us check this assessment with one example. First, we simulate some data, then run a logistic regression and compute the marginal effects, and then compare with an LPM:

set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)

z <- .5 + 2*x1 + 4*x2

p <- 1/(1 + exp(-z))

y <- rbinom(100, 1, p)

df <- tibble(y = y, x1 = x1, x2 = x2)

    This data generating process generates data from a binary choice model. Fitting the model using a logistic regression allows us to recover the structural parameters:

logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit"))

    Let’s see a summary of the model fit:

summary(logistic_regression)
    ## 
     ## Call:
     ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
## Number of Fisher Scoring iterations: 7

    We do recover the parameters that generated the data, but what about the marginal effects? We can get the marginal effects easily using the {margins} package:

library(margins)

margins(logistic_regression)
    ## Average marginal effects
    ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
    ##      x1     x2
     ##  0.1598 0.3516

    Or, even better, we can compute the true marginal effects, since we know the data generating process:

meffects <- function(dataset, coefs){
  X <- dataset %>% 
    select(-y) %>% 
    as.matrix()

  dydx_x1 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[2])
  dydx_x2 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[3])

  tribble(~term, ~true_effect,
          "x1", dydx_x1,
          "x2", dydx_x2)
}

(true_meffects <- meffects(df, c(0.5, 2, 4)))
    ## # A tibble: 2 × 2
     ##   term  true_effect
     ##   <chr>       <dbl>
     ## 1 x1          0.175
     ## 2 x2          0.350
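For reference, here is my own summary of the computation above (it is not spelled out in the text): for a logit model with linear index $z_i$, the average marginal effect of $x_j$ is

$$\frac{1}{n}\sum_{i=1}^{n} \lambda(z_i)\,\beta_j, \qquad \lambda(z) = \frac{e^{z}}{(1 + e^{z})^{2}},$$

where $\lambda$ is the logistic density; this is exactly the quantity dlogis() returns, which meffects() then averages over the observations.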

    Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects?

lpm <- lm(y ~ ., data = df)

summary(lpm)
    ## 
     ## Call:
     ## lm(formula = y ~ ., data = df)
Let's now check whether this result holds more generally, on many simulated datasets of different sizes, generated using different structural parameters.

    First, let’s write a function that generates data. The function below generates 10 datasets of size 100 (the code is inspired by this StackExchange answer):

generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10){

  generate_one_dataset <- function(coefs, sample_size){
    x1 <- rnorm(sample_size)
    x2 <- rnorm(sample_size)

    z <- coefs[1] + coefs[2]*x1 + coefs[3]*x2

    p <- 1/(1 + exp(-z))

    y <- rbinom(sample_size, 1, p)

    df <- tibble(y = y, x1 = x1, x2 = x2)
  }

  simulations <- rerun(.n = repeats, generate_one_dataset(coefs, sample_size))

  tibble("coefs" = list(coefs), "sample_size" = sample_size, "repeats" = repeats, "simulations" = list(simulations))
}
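One caveat of my own: rerun() has since been deprecated in recent versions of purrr. If you run this code with a recent purrr and hit a deprecation error, an equivalent replacement for that line is:

# drop-in replacement for the rerun() call above (my suggestion)
simulations <- map(seq_len(repeats), ~generate_one_dataset(coefs, sample_size))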

    Let’s first generate one dataset:

one_dataset <- generate_datasets(repeats = 1)

    Let’s take a look at one_dataset:

one_dataset
    ## # A tibble: 1 × 4
     ##   coefs     sample_size repeats simulations
     ##   <list>          <dbl>   <dbl> <list>     
     ## 1 <dbl [3]>         100       1 <list [1]>

    As you can see, the tibble with the simulated data is inside a list-column called simulations. Let’s take a closer look:

str(one_dataset$simulations)
    ## List of 1
     ##  $ :List of 1
     ##   ..$ : tibble [100 × 3] (S3: tbl_df/tbl/data.frame)
The idea is to simulate first tens, then hundreds, then thousands and tens of thousands of data sets, get the marginal effects and compare them to the true ones (but here I won't simulate more than 500 datasets).

    Let’s first generate 10 datasets:

many_datasets <- generate_datasets()

Now comes the tricky part. I have this object, many_datasets, looking like this:

many_datasets
    ## # A tibble: 1 × 4
     ##   coefs     sample_size repeats simulations
     ##   <list>          <dbl>   <dbl> <list>     

I highly suggest that you run the following lines, one after another. It is complicated to understand what's going on if you are not used to such workflows. However, I hope to convince you that once it clicks, it'll be much more intuitive than doing all this inside a loop. Here's the code:

results <- many_datasets %>% 
  mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
  mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
  mutate(lpm = map(lpm, bind_rows)) %>% 
  mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
  mutate(true_effect = map(true_effect, bind_rows))

This is what results looks like:

results
    ## # A tibble: 1 × 6
     ##   coefs     sample_size repeats simulations lpm               true_effect      
     ##   <list>          <dbl>   <dbl> <list>      <list>            <list>           
     ## 1 <dbl [3]>         100      10 <list [10]> <tibble [20 × 2]> <tibble [20 × 2]>

Let's take a closer look at the lpm and true_effect columns:

results$lpm
    ## [[1]]
     ## # A tibble: 20 × 2
     ##    term  estimate
    @@ -2195,7 +2191,7 @@ 

    8.4.2 Modeling with functional pr ## 18 x2 0.374 ## 19 x1 0.176 ## 20 x2 0.410

results$true_effect
    ## [[1]]
     ## # A tibble: 20 × 2
     ##    term  true_effect
## 20 x2          0.321

    Let’s bind the columns, and compute the difference between the true and estimated marginal effects:

simulation_results <- results %>% 
  mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>%  
  mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
  mutate(difference = map(difference, ~select(., term, difference))) %>% 
  pull(difference) %>% 
  .[[1]]
    ## Joining, by = "term"

    Let’s take a look at the simulation results:

simulation_results %>% 
  group_by(term) %>% 
  summarise(mean = mean(difference), 
            sd = sd(difference))
    ## # A tibble: 2 × 3
     ##   term     mean     sd
     ##   <chr>   <dbl>  <dbl>

Already with only 10 simulated datasets, the difference in means is not significant. Let's rerun the analysis, but for different sample sizes. In order to make things easier, we can put all the code into a nifty function:

monte_carlo <- function(coefs, sample_size, repeats){
  many_datasets <- generate_datasets(coefs, sample_size, repeats)

  results <- many_datasets %>% 
    mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
    mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
    mutate(lpm = map(lpm, bind_rows)) %>% 
    mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
    mutate(true_effect = map(true_effect, bind_rows))

  simulation_results <- results %>% 
    mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% 
    mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
    mutate(difference = map(difference, ~select(., term, difference))) %>% 
    pull(difference) %>% 
    .[[1]]

  simulation_results %>% 
    group_by(term) %>% 
    summarise(mean = mean(difference), 
              sd = sd(difference))
}

    And now, let’s run the simulation for different parameters and sizes:

monte_carlo(c(.5, 2, 4), 100, 10)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term      mean     sd
     ##   <chr>    <dbl>  <dbl>
     ## 1 x1    -0.00826 0.0318
     ## 2 x2    -0.00732 0.0421
monte_carlo(c(.5, 2, 4), 100, 100)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term     mean     sd
     ##   <chr>   <dbl>  <dbl>
     ## 1 x1    0.00360 0.0408
     ## 2 x2    0.00517 0.0459
monte_carlo(c(.5, 2, 4), 100, 500)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term       mean     sd
     ##   <chr>     <dbl>  <dbl>
     ## 1 x1    -0.00152  0.0388
     ## 2 x2    -0.000701 0.0462
monte_carlo(c(pi, 6, 9), 100, 10)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term      mean     sd
     ##   <chr>    <dbl>  <dbl>
     ## 1 x1    -0.00829 0.0421
     ## 2 x2     0.00178 0.0397
monte_carlo(c(pi, 6, 9), 100, 100)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term     mean     sd
     ##   <chr>   <dbl>  <dbl>
     ## 1 x1    0.0107  0.0576
     ## 2 x2    0.00831 0.0772
monte_carlo(c(pi, 6, 9), 100, 500)
    ## Joining, by = "term"
    ## # A tibble: 2 × 3
     ##   term     mean     sd

    Exercise 2

    Use one of the map() functions to combine two lists into one. Consider the following two lists:

mediterranean <- list("starters" = list("humous", "lasagna"), "dishes" = list("sardines", "olives"))

continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut"))

    The result we’d like to have would look like this:

$starters
$starters[[1]]
[1] "humous"

$starters[[2]]
[1] "olives"

$starters[[3]]
[1] "pea soup"

$starters[[4]]
[1] "terrine"


$dishes
$dishes[[1]]
[1] "sardines"

$dishes[[2]]
[1] "lasagna"

$dishes[[3]]
[1] "frikadelle"

$dishes[[4]]
[1] "sauerkraut"


    5.2 Examples

    5.2.1 Barplots

    To follow the examples below, load the following libraries:

library(ggplot2)
library(ggthemes)

    {ggplot2} is an implementation of the Grammar of Graphics by Wilkinson (2006), but you don’t need to read the books to start using it. If we go back to the Star Wars data (contained in dplyr), and wish to draw a barplot of the gender, the following lines are enough:

ggplot(starwars, aes(gender)) +
  geom_bar()

The first argument of the function is the data (called starwars in this example), and then the function aes(). This function is where you list the variables that you want to map to the aesthetics of the geoms. It is also possible to specify the aesthetics inside the geom_*() function instead:

ggplot(starwars) +
  geom_bar(aes(gender))

The difference between these two approaches is that when you specify the aesthetics in the ggplot() function, all the geom_*() functions that follow will inherit these aesthetics. This is useful if you want to avoid writing the same code over and over again, but can be problematic if you need to specify different aesthetics to different geom_*() functions. This will become clear in a later example.

    You can add options to your plots, for instance, you can change the coordinate system in your barplot:

ggplot(starwars, aes(gender)) +
  geom_bar() +
  coord_flip()

This is the basic recipe to create plots using {ggplot2}: start with a call to ggplot() where you specify the data you want to plot, and optionally the aesthetics. Then, use the geom_*() function you need; if you did not specify the aesthetics in the call to the ggplot() function, do it inside the geom_*() function.

5.2.2 Scatter plots

    Scatter plots are very useful, especially if you are trying to figure out the relationship between two variables. For instance, let’s make a scatter plot of height vs weight of Star Wars characters:

ggplot(starwars) +
  geom_point(aes(height, mass))

As you can see there is an outlier; a very heavy character! Star Wars fans already guessed it: it's Jabba the Hutt. To make the plot easier to read, let's remove this outlier:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass))

There is a positive correlation between height and mass, which we can visualize by adding geom_smooth() with the option method = "lm":

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot(aes(height, mass)) +
  geom_point(aes(height, mass)) +
  geom_smooth(method = "lm")
    ## `geom_smooth()` using formula 'y ~ x'

I've moved the aes(height, mass) up to the ggplot() function because both geom_point() and geom_smooth() need them, and as explained in the beginning of this section, the aesthetics listed in ggplot() get passed down to the other geoms.

If you omit method = "lm", you get a non-parametric curve:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot(aes(height, mass)) +
  geom_point(aes(height, mass)) +
  geom_smooth()
    ## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

    5.2.3 Density

    Use geom_density() to get density plots:

ggplot(starwars, aes(height)) +
  geom_density()
    ## Warning: Removed 6 rows containing non-finite values (stat_density).

    Let’s go into more detail now; what if you would like to plot the densities for feminines and masculines only (removing the droids from the data first)? This can be done by first filtering the data using dplyr and then separating the dataset by gender:

starwars %>%
  filter(gender %in% c("feminine", "masculine"))

The above lines do the filtering: only keep rows where gender is in the vector c("feminine", "masculine"). This is much easier than having to write gender == "feminine" | gender == "masculine". Then, we pipe this dataset to ggplot:

starwars %>%
  filter(gender %in% c("feminine", "masculine")) %>%
  ggplot(aes(height, fill = gender)) +
  geom_density()
    ## Warning: Removed 5 rows containing non-finite values (stat_density).

Let's take a closer look at the aes() function: I've added fill = gender. This means that there will be one density per gender, each filled with its own colour. The same plot can also be written by filtering first and then mapping the aesthetics inside the geom:

filtered_data <- starwars %>%
  filter(gender %in% c("feminine", "masculine"))

ggplot(filtered_data) +
  geom_density(aes(height, fill = gender))
    ## Warning: Removed 5 rows containing non-finite values (stat_density).
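A possible refinement, which is my own suggestion and not part of the original example: when densities overlap like this, making them translucent with a fixed alpha keeps both visible.

ggplot(filtered_data) +
  geom_density(aes(height, fill = gender), alpha = 0.5)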

5.2.4 Line plots

For this example, we are going to use unemployment data, which you can find here (downloaded from the website of the Luxembourgish national statistical institute).

    Let’s plot the unemployment for the canton of Luxembourg only:

unemp_lux_data <- import("datasets/unemployment/all/unemployment_lux_all.csv")

unemp_lux_data %>%
  filter(division == "Luxembourg") %>%
  ggplot(aes(x = year, y = unemployment_rate_in_percent, group = 1)) +
  geom_line()

    Because line plots are 2D, you need to specify the y and x axes. There is also another option you need to add, group = 1. This is to tell aes() that the dots have to be connected with a single line. What if you want to plot more than one commune?

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette")) %>%
  ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) +
  geom_line()

This time, I've specified group = division, which means that there will be one line per commune contained in the variable division. I do the same for colours. I think the next example illustrates how {ggplot2} is actually brilliant; if you need to add a third commune, there is no need to specify anything else; no need to add anything to the legend, no need to specify a third colour etc:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) +
  geom_line()

    The three communes get mapped to the colour aesthetic so whatever the number of communes, as long as there are enough colours, the communes will each get mapped to one of these colours.

    5.2.5 Facets
Facets allow you to split a plot into several panels, one per value of a grouping variable:

starwars %>%
  mutate(human = case_when(species == "Human" ~ "Human",
                           species != "Human" ~ "Not Human")) %>%
  filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>%
  ggplot(aes(height, fill = gender)) +
  facet_grid(. ~ human) + #<--- this is a formula
  geom_density()
    ## Warning: Removed 5 rows containing non-finite values (stat_density).

I first created a factor variable that specifies if a Star Wars character is human or not, and then used it for faceting. By changing the formula, you change how the faceting is done:

starwars %>%
  mutate(human = case_when(species == "Human" ~ "Human",
                           species != "Human" ~ "Not Human")) %>%
  filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>%
  ggplot(aes(height, fill = gender)) +
  facet_grid(human ~ .) +
  geom_density()
    ## Warning: Removed 5 rows containing non-finite values (stat_density).

    Recall the categorical variable more_1 that we computed in the previous chapter? Let’s use it as a faceting variable:

starwars %>%
  rowwise() %>%
  mutate(n_films = length(films)) %>%
  mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie",
                            n_films != 1 ~ "More than 1 movie")) %>%
  mutate(human = case_when(species == "Human" ~ "Human",
                           species != "Human" ~ "Not Human")) %>%
  filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>%
  ggplot(aes(height, fill = gender)) +
  facet_grid(human ~ more_1) +
  geom_density()
    ## Warning: Removed 5 rows containing non-finite values (stat_density).

    5.2.6 Pie Charts
Let's first create some data to illustrate:

test_data <- tribble(
  ~id, ~var1, ~var2,  ~var3, ~var4, ~var5,
  "a", 26.5, 38, 30, 32, 34,
  "b", 30, 30, 28, 32, 30,
  "c", 34, 32, 30, 28, 26.5
)

This data is in the wide format though; we need to have it in the long format for it to work with {ggplot2}. For this, let's use tidyr::gather() as seen in the previous chapter:

test_data_long = test_data %>%
  gather(variable, value, starts_with("var"))

    Now, let’s plot this data, first by creating 3 bar plots:

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity")

In the code above, I introduce a new option, called stat = "identity". By default, geom_bar() counts the number of observations of each category that is plotted, which is a statistical transformation. By specifying stat = "identity", I tell geom_bar() to use the values in the data as they are. Now, to prepare the pie charts, let's compute the total per id and the share of each variable:

test_data_long <- test_data_long %>%
  group_by(id) %>%
  mutate(total = sum(value)) %>%
  ungroup() %>%
  mutate(share = value/total)

    Let’s take a look to see if this is what we wanted:

print(test_data_long)
    ## # A tibble: 15 × 5
     ##    id    variable value total share
     ##    <chr> <chr>    <dbl> <dbl> <dbl>
Now we can draw the pie charts, by mapping the share to y and switching to polar coordinates with coord_polar():

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(y = share, x = "", fill = variable), stat = "identity") +
  theme() +
  coord_polar("y", start = 0)

    As you can see, this typical pie chart is not very easy to read; compared to the barplots above it is not easy to distinguish if a has a higher share than b or c. You can change the look of the pie chart, for example by specifying variable as the x:

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(y = share, x = variable, fill = variable), stat = "identity") +
  theme() +
  coord_polar("x", start = 0)

But as a general rule, avoid pie charts if possible. I find that pie charts are only interesting if you need to show proportions that are hugely unequal, to really emphasize the difference between said proportions.

    5.2.7 Adding text to plots

    Sometimes you might want to add some text to your plots. This is possible with geom_text():

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity") +
  geom_text(aes(variable, value + 1.5, label = value))

You can put anything after label =, but in general what you want are the values, so that's what I put there. But you can also refine it; imagine the values are actually in euros:

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity") +
  geom_text(aes(variable, value + 1.5, label = paste(value, "€")))

    You can also achieve something similar with geom_label():

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity") +
  geom_label(aes(variable, value + 1.5, label = paste(value, "€")))

5.3.1 Changing titles, axes labels, legends and themes

    The name of this subsection is quite long, but this is because everything is kind of linked. Let’s start by learning what the labs() function does. To change the title of the plot, and of the axes, you need to pass the names to the labs() function:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

    What if you want to make the lines thicker?

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line(size = 2)

Each geom_*() function has its own options. Notice that the size = 2 argument is not inside an aes() function. This is because I do not want to map a variable of the data to the size of the line: the line should simply be thicker, regardless of any variable in the data. Recall the scatter plot we did earlier, where we showed that height and mass of Star Wars characters increased together? Let's take this plot again, but make the size of the dots proportional to the birth year of the character:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass, size = birth_year))

    Making the size proportional to the birth year (the age would have been more informative) allows us to see a third dimension. It is also possible to “see” a fourth dimension, the gender for instance, by changing the colour of the dots:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass, size = birth_year, colour = gender))

    As I promised above, we are now going to learn how to add a regression line to this scatter plot:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass, size = birth_year, colour = gender)) +
    geom_smooth(aes(height, mass), method = "lm")
    ## `geom_smooth()` using formula 'y ~ x'

    geom_smooth() adds a regression line, but only if you specify method = "lm" (“lm” stands for “linear model”). What happens if you remove this option?

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass, size = birth_year, colour = gender)) +
    geom_smooth(aes(height, mass))
    ## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

    By default, geom_smooth() does a non-parametric regression called LOESS (locally estimated scatterplot smoothing), which is more flexible. It is also possible to have one regression line by gender:

starwars %>%
  filter(!str_detect(name, "Jabba")) %>%
  ggplot() +
    geom_point(aes(height, mass, size = birth_year, colour = gender)) +
    geom_smooth(aes(height, mass, colour = gender))
    ## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Because there are only a few observations for feminines and NAs, the regression lines are not very informative. Every {ggplot2} plot uses a theme, and it is possible to modify it a bit. For example, the legend placement is actually a feature of the theme. This means that if you want to change where the legend is placed, you need to modify this feature from the theme. This is done with the function theme():

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme(legend.position = "bottom") +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

    What I also like to do is remove the title of the legend, because it is often superfluous:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

The legend title has to be an element_text object. element_text objects are used with theme() to specify how text should be displayed. element_blank() draws nothing and assigns no space (not even blank space). If you want to keep the legend title but change it, you need to use element_text():

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme(legend.position = "bottom", legend.title = element_text(colour = "red")) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

    If you want to change the word “division” to something else, you can do so by providing the colour argument to the labs() function:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme(legend.position = "bottom") +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate", colour = "Administrative division") +
  geom_line()

    You could modify every feature of the theme like that, but there are built-in themes that you can use:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

For example, in the code above I have used theme_minimal(), which I like quite a lot. You can also use themes from the {ggthemes} package, which even contains a STATA theme, if you like it:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_stata() +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

As you can see, theme_stata() has the legend on the bottom by default, because this is how the legend position is defined within the theme. However, the legend title is still there. Let’s remove it:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_stata() +
  theme(legend.title = element_blank()) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

    ggthemes even features an Excel 2003 theme (don’t use it though):

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_excel() +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

You can create your own theme by using a simple theme, such as theme_minimal(), as a base and then adding your options. We are going to create one theme after we learn how to create our own functions.
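As a preview, here is a minimal sketch of what such a theme could look like once wrapped in a function (the name my_theme() is hypothetical):

my_theme <- function(){
  theme_minimal() +
    theme(legend.position = "bottom",
          legend.title = element_blank())
}

You would then simply add + my_theme() at the end of your plotting code.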

5.3.2 Colour schemes

You can also change the colour scheme of your plots; for example, the {ggthemes} package provides scale_colour_hc(), which uses the Highcharts colour scheme.

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_minimal() +
  scale_colour_hc() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

    An example with a barplot:

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity") +
  geom_text(aes(variable, value + 1.5, label = value)) +
  theme_minimal() +
  scale_fill_hc()

    It is also possible to define and use your own palette.

To use your own colours, you can use scale_colour_manual() and scale_fill_manual() and specify the hexadecimal codes of the colours you want to use:

unemp_lux_data %>%
  filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) +
  theme_minimal() +
  scale_colour_manual(values = c("#FF336C", "#334BFF", "#2CAE00")) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") +
  geom_line()

To get the hexadecimal code of any colour you want, you can use this online tool.
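If you prefer to stay inside R, the base function rgb() builds such codes for you:

rgb(0.2, 0.4, 0.6)                       # "#336699"
rgb(255, 51, 108, maxColorValue = 255)   # "#FF336C"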


    For a barplot you would do the same:

ggplot(test_data_long) +
  facet_wrap(~id) +
  geom_bar(aes(variable, value, fill = variable), stat = "identity") +
  geom_text(aes(variable, value + 1.5, label = value)) +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_fill_manual(values = c("#FF336C", "#334BFF", "#2CAE00", "#B3C9C6", "#765234"))

For continuous variables, things are a bit different. Let’s first create a plot where we map a continuous variable to the colour argument of aes():

ggplot(diamonds) +
  geom_point(aes(carat, price, colour = depth))

    To change the colour, we need to use scale_color_gradient() and specify a value for low values of the variable, and a value for high values of the variable. For example, using the colours of the theme I made for my blog:

ggplot(diamonds) +
  geom_point(aes(carat, price, colour = depth)) +
  scale_color_gradient(low = "#bec3b8", high = "#ad2c6c")
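ggplot2 also ships with the perceptually uniform viridis scales for continuous variables, which are a safe default; a minimal sketch:

ggplot(diamonds) +
  geom_point(aes(carat, price, colour = depth)) +
  scale_colour_viridis_c()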

    @@ -984,10 +983,10 @@

    5.4 Saving plots to disk

This is fine if you only generate one or two plots, but if you generate a large number of them, it is less tedious to use the ggsave() function:

my_plot1 <- ggplot(my_data) +
  geom_bar(aes(variable))

ggsave("path/you/want/to/save/the/plot/to/my_plot1.pdf", my_plot1)

    There are other options that you can specify such as the width and height, resolution, units, etc…
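For example, a sketch with explicit dimensions and resolution (the path and file name are hypothetical):

ggsave("figures/my_plot1.png", my_plot1,
       width = 8, height = 6, units = "in", dpi = 300)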

diff --git a/docs/index.html b/docs/index.html
index f9de285..7afeaf3 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -443,7 +442,7 @@

    Preface

    @@ -590,7 +589,10 @@

Prerequisites
this link. For macOS, follow this one. If you run a GNU+Linux
-distribution, you can install R using the system’s package manager. On Ubuntu, install r-base.
+distribution, you can install R using the system’s package manager. If you’re running Ubuntu, you
+might want to take a look at r2u, which provides very fast installation of packages, full
+integration with apt (so dependencies get solved automatically) and covers the entirety of CRAN.

    For RStudio, look for your operating system here.

diff --git a/docs/objects-their-classes-and-types-and-useful-r-functions-to-get-you-started.html b/docs/objects-their-classes-and-types-and-useful-r-functions-to-get-you-started.html
index 9fecec3..4907ffc 100644
diff --git a/docs/package-development.html b/docs/package-development.html
index 587ee6f..88f9310 100644
--- a/docs/package-development.html
+++ b/docs/package-development.html
@@ -544,17 +543,17 @@

    9.2.2 Starting your package

Let’s start by adding a README file. This is easily achieved by using the following function from {usethis}:

usethis::use_readme_md()

    This creates a template README.md file in the root directory of your package. You can now edit this file accordingly, and that’s it.

    The next step could be setting up your package to work with {roxygen2}, which will help write the documentation of your package:

usethis::use_roxygen_md()

The output tells you to run devtools::document(); we will do this later.

Since you have learned about the tidyverse by reading this book, I am willing to bet that you will want to use the %>% operator inside the functions contained in your package. To do this without issues, which will become apparent later, use the following command:

usethis::use_pipe()

    This will make the %>% operator available internally to your package’s functions, but also to the user that will load the package.
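For reference, use_pipe() drops a small file, R/utils-pipe.R, into your package; it looks roughly like this:

#' Pipe operator
#'
#' @name %>%
#' @rdname pipe
#' @keywords internal
#' @export
#' @importFrom magrittr %>%
NULL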

We are almost done setting up the package. If you plan on distributing data with your package, you will also need a folder called data-raw:

usethis::use_data_raw()

One final folder is inst. You can add files to this folder, and they will be available to the users that install the package. Users can find the files in the folder where packages get installed. On GNU+Linux systems, that would be somewhere like /home/user/R/amd64-linux-gnu-library/3.6. There, you will find the installation folders of all the packages. If the package you make is called {spam}, you will find the files you put inside the inst folder at the root of the installation folder of {spam}. You can simply create the inst folder yourself, or use the following command:

usethis::use_directory("inst")
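Once the package is installed, such files can be located with the base function system.file(); a minimal sketch, assuming a hypothetical file inst/extdata/mydata.csv inside {spam}:

system.file("extdata", "mydata.csv", package = "spam")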

    Finally, the last step is to give your package a license; this again is only useful if you plan on distributing it to the world. If you are writing your own package for yourself, or for purposes internal to your company, this is probably superfluous. I won’t discuss the particularities of licenses, so let’s just say that for the sake of this example package we are writing, we are going to use the MIT license:

usethis::use_mit_license()

    This again creates the right file at the right spot. There are other interesting functions inside the {usethis} package, and we will come back to it later.

    @@ -586,27 +585,27 @@

9.3 Including data inside the package

Many packages include data and we are going to learn how to do it. I’ll assume that we already have a dataset on hand that we have to share. This is quite simple to do; first, let’s load the data:

arcade <- readr::read_csv("~/path/to/data/arcade.csv")

and then, once again, {usethis} comes to our rescue:

usethis::use_data(arcade, compress = "xz")

and that’s it! Well, almost. We still need to write a little script that will allow users of your package to load the data. This script is simply called data.R and contains the following lines:

#' List of highest-grossing games
#'
#' Source: https://en.wikipedia.org/wiki/Arcade_game#List_of_highest-grossing_games
#'
#' @format A data frame with 6 variables: \code{game}, \code{release_year},
#'   \code{hardware_units_sold}, \code{comment_hardware}, \code{estimated_gross_revenue},
#'   \code{comment_revenue}
#' \describe{
#' \item{game}{The name of the game}
#' \item{release_year}{The year the game was released}
#' \item{hardware_units_sold}{The amount of hardware units sold}
#' \item{comment_hardware}{Comment accompanying the amount of hardware units sold}
#' \item{estimated_gross_revenue}{Estimated gross revenue in US$ with 2019 inflation}
#' \item{comment_revenue}{Comment accompanying the estimated gross revenue}
#' }
"arcade"

Basically, this is a description of the data, and the name with which the user will invoke the data. To conclude this part: remember the data-raw folder? If you used a script to scrape or download the data from somewhere, or if you had to write code to prepare the data to make it fit for sharing, this is where that script should go.
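Once the (hypothetical) {spam} package is installed, users can then access the dataset like any other packaged data:

library(spam)
data("arcade")
head(arcade)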

9.4 Adding functions to your package

    9.4.1 One function inside one script

    Create a new R script, or edit the hello.R file, and add in the following code:

#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame
#' @return A data frame with descriptive statistics. If you are only interested in certain columns
#' you can add these columns.
#' @import dplyr
#' @importFrom tidyr gather
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' describe(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){

    if (nargs() > 1) df <- select(df, ...)

    df %>%
        select_if(is.numeric) %>%
        gather(variable, value) %>%
        group_by(variable) %>%
        summarise_all(list(mean = ~mean(., na.rm = TRUE),
                           sd = ~sd(., na.rm = TRUE),
                           nobs = ~length(.),
                           min = ~min(., na.rm = TRUE),
                           max = ~max(., na.rm = TRUE),
                           q05 = ~quantile(., 0.05, na.rm = TRUE),
                           q25 = ~quantile(., 0.25, na.rm = TRUE),
                           mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)),
                           median = ~quantile(., 0.5, na.rm = TRUE),
                           q75 = ~quantile(., 0.75, na.rm = TRUE),
                           q95 = ~quantile(., 0.95, na.rm = TRUE),
                           n_missing = ~sum(is.na(.)))) %>%
        mutate(type = "Numeric")
}

    Save the script under the name describe.R.

This function shows you pretty much everything you need to know when writing functions for packages. First, there are the comment lines, which start with #' and not with #. These lines will be converted into your package’s documentation when running devtools::document().
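To actually render these comments into help files (the .Rd files inside the man/ folder), you run:

devtools::document()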


As explained before, if the function depends on functions from other packages, then @import or @importFrom must be used. But it is also possible to use the package::function() syntax, like I did on the following line:

mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)),

This function uses the sample_mode() function from my {brotools} package. Since it is the only function that I am using, I don’t import the whole package with @import. I could have done the same for gather() from {tidyr} instead of using @importFrom, but I wanted to showcase the @importFrom syntax as well.

What I will cover is how to declare dependencies on other CRAN packages. These dependencies also get declared inside the DESCRIPTION file, which we will cover in the next section.
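As a preview, {usethis} provides a helper that adds a CRAN package to the Imports field of the DESCRIPTION file:

usethis::use_package("dplyr")
usethis::use_package("tidyr")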

    Because I’m doing that in this hacky way, my {brotools} package should be installed:

devtools::install_github("b-rodrigues/brotools")

    Again, I want to emphasize that this is not the best way of doing it. However, using the “REMOTES” field as described in the document I linked above is not complicated.

Now comes the function itself. The function is written in pretty much the same way as usual, but with one particularity: because I cannot know how many columns the user wants to summarize beforehand, and also because I do not want to limit the user to 2 or 3 columns, I use the ... argument. But what if the user wants to summarize all the columns? This is taken care of in this line:

if (nargs() > 1) df <- select(df, ...)

nargs() counts the number of arguments supplied in the call to the function. If the user calls the function like so:

describe_numeric(mtcars)

    nargs() will return 1. If, instead, the user calls the function with one or more columns:

describe_numeric(mtcars, hp, mpg)

then nargs() will return 3 (in this case). And thus, this piece of code will be executed:

df <- select(df, ...)

which selects the columns hp and mpg from the mtcars dataset. This reduced dataset is then the one that gets summarized.
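A quick sketch to convince yourself of what nargs() returns; because the arguments are never evaluated, bare column names are fine here:

f <- function(df, ...) nargs()
f(mtcars)          # 1
f(mtcars, hp, mpg) # 3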

    @@ -722,76 +721,76 @@

9.4.2 Many functions inside a script

The advantage of putting several functions inside the same script is that you can keep functions that are conceptually similar in the same place. For instance, if you want to add a function called describe_character() to your package, adding it to the same script where describe_numeric() is might be a good idea, so let’s do just that:

#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame
#' @return A data frame with descriptive statistics. If you are only interested in certain columns
#' you can add these columns.
#' @import dplyr
#' @importFrom tidyr pivot_longer
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' describe(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){

  if (nargs() > 1) df <- select(df, ...)

  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(across(everything(),
                     tibble::lst(mean = ~mean(., na.rm = TRUE),
                       sd = ~sd(., na.rm = TRUE),
                       nobs = ~length(.),
                       min = ~min(., na.rm = TRUE),
                       max = ~max(., na.rm = TRUE),
                       q05 = ~quantile(., 0.05, na.rm = TRUE),
                       q25 = ~quantile(., 0.25, na.rm = TRUE),
                       mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)),
                       median = ~quantile(., 0.5, na.rm = TRUE),
                       q75 = ~quantile(., 0.75, na.rm = TRUE),
                       q95 = ~quantile(., 0.95, na.rm = TRUE),
                       n_missing = ~sum(is.na(.))))) %>%
    mutate(type = "Numeric")
}

#' Compute descriptive statistics for the character or factor columns of a data frame.
#' @param df The data frame to summarise.
#' @param type The type label to attach ("Character" or "Factor").
#' @return A data frame with a description of the character or factor columns.
#' @import dplyr
#' @importFrom tidyr pivot_longer
describe_character_or_factors <- function(df, type){
  df %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(across(everything(),
                     tibble::lst(mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)),
                                 nobs = ~length(.),
                                 n_missing = ~sum(is.na(.)),
                                 n_unique = ~length(unique(.))))) %>%
    mutate(type = type)
}

#' Compute descriptive statistics for the character columns of a data frame.
#' @param df The data frame to summarise.
#' @return A data frame with a description of the character columns.
#' @import dplyr
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' }
describe_character <- function(df){
  df %>%
    select(where(is.character)) %>%
    describe_character_or_factors(type = "Character")
}

    Let’s now continue on to the next section, where we will learn to document the package.

    diff --git a/docs/reading-and-writing-data.html b/docs/reading-and-writing-data.html index b142de0..f9a7d3a 100644 --- a/docs/reading-and-writing-data.html +++ b/docs/reading-and-writing-data.html @@ -23,7 +23,7 @@ - + @@ -282,7 +282,6 @@
  • Exercise 1
  • Exercise 2
  • Exercise 3
  • -
  • Exercise 4
  • 5 Graphs @@ -365,7 +364,7 @@
  • Exercise 1
  • Exercise 2
  • Exercise 3
  • -
  • Exercise 4
  • +
  • Exercise 4
  • 8 Functional programming diff --git a/docs/references.html b/docs/references.html index 712d7aa..3f1675b 100644 --- a/docs/references.html +++ b/docs/references.html @@ -23,7 +23,7 @@ - + @@ -282,7 +282,6 @@
  • Exercise 1
  • Exercise 2
  • Exercise 3
  • -
  • Exercise 4
  • 5 Graphs @@ -365,7 +364,7 @@
  • Exercise 1
  • Exercise 2
  • Exercise 3
  • -
  • Exercise 4
  • +
  • Exercise 4
  • 8 Functional programming diff --git a/docs/search_index.json b/docs/search_index.json index d444aca..c812990 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -1 +1 @@ -[["index.html", "Modern R with the tidyverse Preface Note to the reader What is R? Who is this book for? Why this book? Why modern R? What is RStudio? What to expect from this book? Prerequisites What are packages? The author", " Modern R with the tidyverse Bruno Rodrigues 2022-10-13 Preface Note to the reader I have been working on this on and off for the past 4 years or so. In 2022, I have updated the contents of the book to reflect updates introduced with R 4.1 and in several packages (especially those from the {tidyverse}). I have also cut some content that I think is not that useful, especially in later chapters. This book is still being written. Chapters 1 to 8 are almost ready, but more content is being added (especially to chapter 8). 9 and 10 are empty for now. Some exercises might be at the wrong place too and more are coming. You can purchase an ebook version of this book on leanpub. The version on leanpub is quite out of date, so if you buy it, it’s really just to send some money my money, so many thanks for that! You can also support me by buying me a coffee or paypal.me. What is R? Read R’s official answer to this question here. To make it short: R is a multi-paradigm (procedural, imperative, object-oriented and functional)1 programming language that focuses on applications in statistics. By statistics I mean any field that uses statistics such as official statistics, economics, finance, data science, machine learning, etc. For the sake of simplicity, I will use the word “statistics” as a general term that encompasses all these fields and disciplines for the remainder of this book. Who is this book for? This book can be useful to different audiences. If you have never used R in your life, and want to start, start with Chapter 1 of this book. Chapter 1 to 3 are the very basics, and should be easy to follow up to Chapter 7. Starting with Chapter 7, it gets more technical, and will be harder to follow. But I suggest you keep on going, and do not hesitate to contact me for help if you struggle! Chapter 7 is also where you can start if you are already familiar with R and the {tidyverse}, but not functional programming. If you are familiar with R but not the {tidyverse} (or have no clue what the {tidyverse} is), then you can start with Chapter 4. If you are familiar with R, the {tidyverse} and functional programming, you might still be interested in this book, especially Chapter 9 and 10, which deal with package development and further advanced topics respectively. Why this book? This book is first and foremost for myself. This book is the result of years of using and teaching R at university and then at my jobs. During my university time, I wrote some notes to help me teach R and which I shared with my students. These are still the basis of Chapter 2. Then, once I had left university, and continued using R at my first “real” job, I wrote another book that dealt mostly with package development and functional programming. This book is now merged to this one and is the basis of Chapters 9 and 10. During these years at my first job, I was also tasked with teaching R. By that time, I was already quite familiar with the {tidyverse} so I wrote a lot of notes that were internal and adapted for the audience of my first job. These are now the basis of Chapters 3 to 8. 
Then, during all these years, I kept blogging about R, and reading blogs and further books. All this knowledge is condensed here, so if you are familiar with my blog, you’ll definitely recognize a lot of my blog posts in here. So this book is first and foremost for me, because I need to write all of this down in a central place. So because my target audience is myself, this book is free. If you find it useful, and are in the mood of buying me a coffee, you can, but if this book is not useful to you, no harm done (unless you paid for it before reading it, in which case, I am sorry to have wasted your time). But I am quite sure you’ll find some of the things written here useful, regardless of your current experience level with R. Why modern R? Modern R instead of “just” R because we are going to learn how to use modern packages (mostly those from the tidyverse) and concepts, such as functional programming (which is quite an old concept actually, but one that came into fashion recently). R is derived from S, which is a programming language that has roots in FORTRAN and other languages too. If you learned R at university, you’ve probably learned to use it as you would have used FORTRAN; very long scripts where data are represented as matrices and where row-wise (or column-wise) operations are implemented with for loops. There’s nothing wrong with that, mind you, but R was also influenced by Scheme and Common Lisp, which are functional programming languages. In my opinion, functional programming is a programming paradigm that works really well when dealing with statistical problems. This is because programming in a functional style is just like writing math. For instance, suppose you want to sum all the elements of a vector. In mathematical notation, you would write something like: \\[ \\sum_{i = 1}^{100} x_{i} \\] where \\(x\\) is a vector of length 100. Solving this using a loop would look something like this: res <- 0 for(i in 1:length(x)){ res <- x[i] + res } This does not look like the math notation at all! You have to define a variable that will hold the result outside of the loop, and then you have to define res as something plus res inside the body of the loop. This is really unnatural. The functional programming approach is much easier: Reduce(`+`, x) We will learn about Reduce() later (to be more precise, we will learn about purrr::reduce(), the “tidy” version of Reduce()), but already you see that the notation looks a lot more like the mathematical notation. At its core, functional programming uses functions, and functions are so-called first class objects in R, which means that there is nothing special about them… you can pass them to other functions, create functions that return functions and do any kind of operation on them just as with any other object. This means that functions in R are extremely powerful and flexible tools. In the first part of the book, we are going to use functions that are already available in R, and then use those available in packages, mostly those from the tidyverse. The tidyverse is a collection of packages developed by Hadley Wickham, and several of his colleagues at RStudio, Inc. By using the packages from the tidyverse and R’s built-in functional programming capabilities, we can write code that is faster and easier to explain to colleagues, and also easier to maintain. This also means that you might have to change your expectations and what you know already from R, if you learned it at University but haven’t touched it in a long time. 
For example for and while loops, are relegated to chapter 8. This does not mean that you will have to wait for 8 chapter to know how to repeat instructions N times, but that for and while loops are tools that are very useful for very specific situations that will be discussed at that point. In the second part of the book, we are going to move from using R to solve statistical problems to developing with R. We are going to learn about creating your own package. If you do not know what packages are, don’t worry, this will be discussed just below. What is RStudio? RStudio is a modern IDE that makes writing R code easier. The first thing we are going to learn is how to use it. R and RStudio are both open source: this means that the source code is freely available on the internet and contributions by anyone are welcome and integrated; provided they are meaningful and useful. What to expect from this book? The idea of Chapters 1 to 7 is to make you efficient with R as quickly as possible, especially if you already have prior programming knowledge. Starting with Chapter 8 you will learn more advanced topics, especially programming with R. R is a programming language, and you can’t write “programming language” without “language”. And just as you wouldn’t expect to learn French, Portuguese or Icelandic by reading a single book, you shouldn’t expect to become fluent in R by reading a single book, not even by reading 10 books. Programming is an art which requires a lot of practice. Teach yourself programming in 10 years is a blog post written by Peter Norvig which explains that just as with any craft, mastering programming takes time. And even if you don’t need or want to become an expert in R, if you wish to use R effectively and in a way that ultimately saves you time, you need to have some fluency in it, and this only comes by continuing to learn about the language, and most importantly practicing. If you keep using R every day, you’ll definitely become very fluent. To stay informed about developments of the language, and the latest news, I advise you read blogs, especially R-bloggers which aggregates blog posts by more than 750 blogs discussing R. So what you can expect from this book is that this book is not the only one you should read. Prerequisites R and RStudio are the two main pieces of software that we are going to use. R is the programming language and RStudio is a modern IDE for it. You can use R without RStudio; but you cannot use RStudio without R. If you wish to install R and RStudio at home to follow the examples in this book you can do it as both pieces of software are available free of charge (paid options for RStudio exist, for companies that need technical support). Installation is simple, but operating system dependent. To download and install R for Windows, follow this link. For macOS, follow this one. If you run a GNU+Linux distribution, you can install R using the system’s package manager. On Ubuntu, install r-base. For RStudio, look for your operating system here. What are packages? There is one more step; we are going to install some packages. Packages are additional pieces of code that can be installed from within R with the following function: install.packages(). These packages extend R’s capabilities significantly, and are probably one of the main reasons R is so popular. As of November 2018, R has over 13000 packages. 
To install the packages we need, first open RStudio and then copy and paste this line in the console: install.packages(c("tidyverse", "rsample", "recipes", "blogdown" ,"yardstick", "parsnip", "plm", "pwt9", "checkpoint", "Ecdat", "ggthemes", "ggfortify", "margins", "janitor", "rio", "stopwords", "colourpicker", "glmnet", "lhs", "mlrMBO", "mlbench", "ranger")) or go to the Packages pane and then click on Install: The author My name is Bruno Rodrigues and I program almost exclusively in R and have been teaching some R courses for a few years now. I first started teaching for students at the University of Strasbourg while working on my PhD. I hold a PhD in economics, with a focus on quantitative methods. I’m currently head of the statistics department of the Ministry of Higher education and Research in Luxembourg, and before that worked as a manager in the data science team of PWC Luxembourg. This book is an adaptation of notes I’ve used in the past during my time as a teacher, but also a lot of things I’ve learned about R since I left academia. In my free time I like cooking, working out and blogging, while listening to Fip or Chillsky Radio. I also like to get my butt handed to me by playing roguelikes such as NetHack, for which I wrote a package that contains functions to analyze the data that is saved on your computer after you win or lose (it will be lose 99% of the time) the game. You can follow me on twitter, I tweet mostly about R or what’s happening in Luxembourg. In this book we are going to focus on R’s functional programming capabilities↩︎ "],["getting-to-know-rstudio.html", "Chapter 1 Getting to know RStudio 1.1 Panes 1.2 Console 1.3 Scripts 1.4 Options 1.5 Keyboard shortcuts 1.6 Projects 1.7 History 1.8 Plots 1.9 Addins 1.10 Packages 1.11 Exercises", " Chapter 1 Getting to know RStudio RStudio is a company that develops and maintains several products. Their best-known product is an IDE (Integrated development environment) for the R programming language, also called RStudio. You can install RStudio by visiting this link. There is also a server version that can be used to have a centralized version of R within, say, a company. RStudio, the company, also develops Shiny, a package to create full-fledged web-apps. I am not going to cover Shiny in this book, since there’s already a lot of material that you can learn from. Once you have installed RStudio, launch it and let’s go through the interface together. 1.1 Panes RStudio is divided into different panes. Each pane has a specific function. The gif below shows some of these panes: Take some time to look around what each pane shows you. Some panes are empty; for example the Plots pane or the Viewer pane. Plots shows you the plots you make. You can browse the plots and save them. We will see this in more detail in a later chapter. Viewer shows you previews of documents that you generate with R. More on this later. 1.2 Console The Console pane is where you can execute R code. Write the following in the console: 2 + 3 and you’ll get the answer, 5. However, do not write a lot of lines in the console. It is better write your code inside a script. Output is also shown inside the console. 1.3 Scripts Look at the gif below: In this gif, we see the user creating a new R script. R scripts are simple text files that hold R code. Think of .do files in STATA or .c files for C. R scripts have the extension .r or .R. It is possible to create a lot of other files. We’ll take a look at R Markdown files in Chapter 11. 
1.3.1 The help pane The Help pane allows you to consult documentation for functions or packages. The gif below shows how it works: you can also access help using the following syntax: ?lm. This will bring up the documentation for the function lm(). You can also type ??lm which will look for the string lm in every package. 1.3.2 The Environment pane The Environment pane shows every object created in the current section. It is especially useful if you have defined lists or have loaded data into R as it makes it easy to explore these more complex objects. 1.4 Options It is also possible to customize RStudio’s look and feel: Take some time to go through the options. 1.5 Keyboard shortcuts It is a good idea to familiarize yourself with at least some keyboard shortcuts. This is more convenient than having to move the mouse around: If there is only one keyboard shortcut you need to know, it’s Ctrl-Enter that executes a line of code from your script. However, these other shortcuts are also worth knowing: CTRL-ALT-R: run entire script CTRL-ALT-UP or DOWN: make cursor taller or shorter, allowing you to edit multiple lines at the same time CTRL-F: Search and replace ALT-UP or DOWN: Move line up or down CTRL-SHIFT-C: Comment/uncomment line ALT-SHIFT-K: Bring up the list of keyboard shortcuts CTRL-SHIFT-M: Insert the pipe operator (%>%, more on this later) CTRL-S: Save script This is just a few keyboard shortcuts that I personally find useful. However, I strongly advise you to learn and use whatever shortcuts are useful and feel natural to you! 1.6 Projects One of the best features of RStudio are projects. Creating a project is simple; the gif below shows how you can create a project and how you can switch between projects. Projects make a lot of things easier, such as managing paths. More on this in the chapter about reading data. Another useful feature of projects is that the scripts you open in project A will stay open even if you switch to another project B, and then switch back to the project A again. You can also use version control (with git) inside a project. Version control is very useful, but I won’t discuss it here. You can find a lot of resources online to get you started with git. 1.7 History The History pane saves all the previous lines you executed. You can then select these lines and send them back to the console or the script. 1.8 Plots All the plots you make during a session are visible in the Plots pane. From there, you can export them in different formats. The plots shown in the gif are made using basic R functions. Later, we will learn how to make nicer looking plots using the package ggplot2. 1.9 Addins Some packages install addins, which are accessible through the addins button: These addins make it easier to use some functions and you can read more about them here. My favorite addins are the ones you get when installing the {datapasta} package. Read more about it here. There are other panes that I will not discuss here, but you will naturally discover their use as you go. For example, we will discuss the Build pane in Chapter 11. 1.10 Packages You can think of packages as addons that extend R’s core functionality. You can browse all available packages on CRAN. To make it easier to find what you might be interested in, you can also browse the CRAN Task Views. Each package has a landing page that summarises its dependencies, version number etc. For example, for the dplyr package: https://cran.r-project.org/web/packages/dplyr/index.html. 
Take a look at the Downloads section, and especially at the Reference Manual and Vignettes: Vignettes are valuable documents; inside vignettes, the purpose of the package is explained in plain English, usually with accompanying examples. The reference manuals list the available functions inside the packages. You can also find vignettes from within Rstudio: Go to the Packages pane and click on the package you’re interested in. Then you can consult the help for the functions that come with the package as well as the package’s vignettes. Once you installed a package, you have to load it before you can use it. To load packages you use the library() function: library(dplyr) library(janitor) # and so on... If you only need to use one single function once, you don’t need to load an entire package. You can write the following: dplyr::full_join(A, B) using the :: operator, you can access functions from packages without having to load the whole package beforehand. It is possible and easy to create your own packages. This is useful if you have to write a lot of functions that you use daily. We will lean about that, in Chapter 10. 1.11 Exercises Exercise 1 Change the look and feel of RStudio to suit your tastes! I personally like to move the console to the right and use a dark theme. Take some 5 minutes to customize it and browse through all the options. "],["objects-their-classes-and-types-and-useful-r-functions-to-get-you-started.html", "Chapter 2 Objects, their classes and types, and useful R functions to get you started 2.1 The numeric class 2.2 The character class 2.3 The factor class 2.4 The Date class 2.5 The logical class 2.6 Vectors and matrices 2.7 The list class 2.8 The data.frame and tibble classes 2.9 Formulas 2.10 Models 2.11 NULL, NA and NaN 2.12 Useful functions to get you started 2.13 Exercises", " Chapter 2 Objects, their classes and types, and useful R functions to get you started All objects in R have a given type. You already know most of them, as these types are also used in mathematics. Integers, floating point numbers (floats), matrices, etc, are all objects you are already familiar with. But R has other, maybe lesser known data types (that you can find in a lot of other programming languages) that you need to become familiar with. But first, we need to learn how to assign a value to a variable. This can be done in two ways: a <- 3 or a = 3 in very practical terms, there is no difference between the two. I prefer using <- for assigning values to variables and reserve = for passing arguments to functions, for example: spam <- mean(x = c(1,2,3)) I think this is less confusing than: spam = mean(x = c(1,2,3)) but as I explained above you can use whatever you feel most comfortable with. 2.1 The numeric class To define single numbers, you can do the following: a <- 3 The class() function allows you to check the class of an object: class(a) ## [1] "numeric" Decimals are defined with the character .: a <- 3.14 R also supports integers. If you find yourself in a situation where you explicitly need an integer and not a floating point number, you can use the following: a <- as.integer(3) class(a) ## [1] "integer" The as.integer() function is very useful, because it converts its argument into an integer. There is a whole family of as.*() functions. 
To convert a into a floating point number again: class(as.numeric(a)) ## [1] "numeric" There is also is.numeric() which tests whether a number is of the numeric class: is.numeric(a) ## [1] TRUE It is also possible to create an integer using L: a <- 5L class(a) ## [1] "integer" Another way to convert this integer back to a floating point number is to use as.double() instead of as numeric: class(as.double(a)) ## [1] "numeric" The functions prefixed with is.* and as.* are quite useful, there is one for any of the supported types in R, such as as/is.character(), as/is.factor(), etc… 2.2 The character class Use \" \" to define characters (called strings in other programming languages): a <- "this is a string" class(a) ## [1] "character" To convert something to a character you can use the as.character() function: a <- 4.392 class(a) ## [1] "numeric" Now let’s convert it: class(as.character(a)) ## [1] "character" It is also possible to convert a character to a numeric: a <- "4.392" class(a) ## [1] "character" class(as.numeric(a)) ## [1] "numeric" But this only works if it makes sense: a <- "this won't work, chief" class(a) ## [1] "character" as.numeric(a) ## Warning: NAs introduced by coercion ## [1] NA A very nice package to work with characters is {stringr}, which is also part of the {tidyverse}. 2.3 The factor class Factors look like characters, but are very different. They are the representation of categorical variables. A {tidyverse} package to work with factors is {forcats}. You would rarely use factor variables outside of datasets, so for now, it is enough to know that this class exists. We are going to learn more about factor variables in Chapter 4, by using the {forcats} package. 2.4 The Date class Dates also look like characters, but are very different too: as.Date("2019/03/19") ## [1] "2019-03-19" class(as.Date("2019/03/19")) ## [1] "Date" Manipulating dates and time can be tricky, but thankfully there’s a {tidyverse} package for that, called {lubridate}. We are going to go over this package in Chapter 4. 2.5 The logical class This is the class of predicates, expressions that evaluate to true or false. For example, if you type: 4 > 3 ## [1] TRUE R returns TRUE, which is an object of class logical: k <- 4 > 3 class(k) ## [1] "logical" In other programming languages, logicals are often called bools. A logical variable can only have two values, either TRUE or FALSE. You can test the truthiness of a variable with isTRUE(): k <- 4 > 3 isTRUE(k) ## [1] TRUE How can you test if a variable is false? There is not a isFALSE() function (at least not without having to load a package containing this function), but there is way to do it: k <- 4 > 3 !isTRUE(k) ## [1] FALSE The ! operator indicates negation, so the above expression could be translated as is k not TRUE?. There are other operators for boolean algebra, namely &, &&, |, ||. & means and and | stands for or. You might be wondering what the difference between & and && is? Or between | and ||? 
& and | work on vectors, doing pairwise comparisons: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) one & two ## [1] FALSE FALSE TRUE FALSE Compare this to the && operator: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) one && two ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] FALSE The && and || operators only compare the first element of the vectors and stop as soon as a the return value can be safely determined. This is called short-circuiting. Consider the following: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) three <- c(TRUE, TRUE, FALSE, FALSE) one && two && three ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] FALSE one || two || three ## Warning in one || two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] TRUE The || operator stops as soon it evaluates to TRUE whereas the && stops as soon as it evaluates to FALSE. Personally, I rarely use || or && because I get confused. I find using | or & in combination with the all() or any() functions much more useful: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) any(one & two) ## [1] TRUE all(one & two) ## [1] FALSE any() checks whether any of the vector’s elements are TRUE and all() checks if all elements of the vector are TRUE. As a final note, you should know that is possible to use T for TRUE and F for FALSE but I would advise against doing this, because it is not very explicit. 2.6 Vectors and matrices You can create a vector in different ways. But first of all, it is important to understand that a vector in most programming languages is nothing more than a list of things. These things can be numbers (either integers or floats), strings, or even other vectors. A vector in R can only contain elements of one single type. This is not the case for a list, which is much more flexible. We will talk about lists shortly, but let’s first focus on vectors and matrices. 2.6.1 The c() function A very important function that allows you to build a vector is c(): a <- c(1,2,3,4,5) This creates a vector with elements 1, 2, 3, 4, 5. If you check its class: class(a) ## [1] "numeric" This can be confusing: you where probably expecting a to be of class vector or something similar. This is not the case if you use c() to create the vector, because c() doesn’t build a vector in the mathematical sense, but a so-called atomic vector. Checking its dimension: dim(a) ## NULL returns NULL because an atomic vector doesn’t have a dimension. If you want to create a true vector, you need to use cbind() or rbind(). But before continuing, be aware that atomic vectors can only contain elements of the same type: c(1, 2, "3") ## [1] "1" "2" "3" because “3” is a character, all the other values get implicitly converted to characters. You have to be very careful about this, and if you use atomic vectors in your programming, you have to make absolutely sure that no characters or logicals or whatever else are going to convert your atomic vector to something you were not expecting. 2.6.2 cbind() and rbind() You can create a true vector with cbind(): a <- cbind(1, 2, 3, 4, 5) Check its class now: class(a) ## [1] "matrix" "array" This is exactly what we expected. 
Let's check its dimension:

dim(a)
## [1] 1 5

This returns the dimension of a using the LICO notation (the number of LInes first, then the number of COlumns). It is also possible to bind vectors together to create a matrix:

b <- cbind(6, 7, 8, 9, 10)

Now let's put vectors a and b into a matrix called matrix_c using rbind(). rbind() works the same way as cbind(), but glues the vectors together by rows instead of by columns:

matrix_c <- rbind(a, b)
print(matrix_c)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10

2.6.3 The matrix class

R also has support for matrices. For example, you can create a matrix of dimension (5,5) filled with 0's with the matrix() function:

matrix_a <- matrix(0, nrow = 5, ncol = 5)

If you want to create the following matrix:

\[
B = \left(
\begin{array}{ccc}
2 & 4 & 3 \\
1 & 5 & 7
\end{array}
\right)
\]

you would do it like this:

B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)

The option byrow = TRUE means that the rows of the matrix will be filled first. You can access individual elements of matrix_a like so:

matrix_a[2, 3]
## [1] 0

and R returns its value, 0. We can assign a new value to this element if we want. Try:

matrix_a[2, 3] <- 7

and now take a look at matrix_a again:

print(matrix_a)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 7 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0

Recall our vector b:

b <- cbind(6, 7, 8, 9, 10)

To access its third element, you can simply write:

b[3]
## [1] 8

I have heard many people praising R for being a matrix-based language. Matrices are indeed useful, and statisticians are used to working with them. However, I very rarely use matrices in my day-to-day work, and prefer an approach based on data frames (which will be discussed below). This is because working with data frames makes it easier to use R's advanced functional programming capabilities, and this is where R really shines in my opinion. Working with matrices almost automatically implies using loops and all the iterative programming techniques, à la Fortran, which I personally believe are ill-suited for interactive statistical programming (as discussed in the introduction).

2.7 The list class

The list class is a very flexible class, and thus, very useful. You can put anything inside a list, such as numbers:

list1 <- list(3, 2)

or other lists constructed with c():

list2 <- list(c(1, 2), c(3, 4))

You can also put objects of different classes in the same list:

list3 <- list(3, c(1, 2), "lists are amazing!")

and of course create lists of lists:

my_lists <- list(list1, list2, list3)

To check the contents of a list, you can use the structure function str():

str(my_lists)
## List of 3
## $ :List of 2
## ..$ : num 3
## ..$ : num 2
## $ :List of 2
## ..$ : num [1:2] 1 2
## ..$ : num [1:2] 3 4
## $ :List of 3
## ..$ : num 3
## ..$ : num [1:2] 1 2
## ..$ : chr "lists are amazing!"

or you can use RStudio's Environment pane. You can also create named lists:

list4 <- list("name_1" = 2, "name_2" = 8, "name_3" = "this is a named list")

and you can access the elements in two ways:

list4[[1]]
## [1] 2

or, for named lists:

list4$name_3
## [1] "this is a named list"

Take note of the $ operator, because it is going to be quite useful for data.frames as well, which we are going to get to know in the next section. Lists are used extensively because they are so flexible.
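Because lists are so central, it is worth knowing early on that base R's lapply() applies a function to every element of a list. A minimal sketch, using the list1 defined above:

lapply(list1, sqrt)
## [[1]]
## [1] 1.732051
##
## [[2]]
## [1] 1.414214

Functions like this are at the heart of the functional programming workflow mentioned below.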
You can build lists of datasets and apply functions to all the datasets at once, build lists of models, lists of plots, etc… In later chapters we are going to learn all about them. Lists are central objects in a functional programming workflow for interactive statistical analysis.

2.8 The data.frame and tibble classes

In the next chapter we are going to learn how to import datasets into R. Once you import data, the resulting object is either a data.frame or a tibble, depending on which package you used to import the data. tibbles extend data.frames, so if you already know about data.frame objects, working with tibbles will be very easy. tibbles have a better print() method, and some other niceties. However, I want to stress that these objects are central to R and are thus very important; they are actually special cases of lists, discussed above. There are different ways to print a data.frame or a tibble if you wish to inspect it. You can use View(my_data) to show the my_data data.frame in the View pane of RStudio. You can also use the str() function:

str(my_data)

And if you need to access an individual column, you can use the $ sign, the same as for a list:

my_data$col1

2.9 Formulas

We will learn more about formulas later, but because they are important objects, it is useful to know about them early on. A formula is defined in the following way:

my_formula <- ~x
class(my_formula)
## [1] "formula"

Formula objects are defined using the ~ symbol. Formulas are useful for defining statistical models, for example for a linear regression:

lm(y ~ x)

or also for defining anonymous functions, but more on this later.

2.10 Models

A statistical model is an object like any other in R. Here, I already have a model that I ran on some test data:

class(my_model)
## [1] "lm"

my_model is an object of class lm, for linear model. You can apply different functions to a model object:

summary(my_model)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07

This class will be explored in later chapters.

2.11 NULL, NA and NaN

The NULL, NA and NaN classes are pretty special. NULL is returned when the result of a function is undetermined. For example, consider list4:

list4
## $name_1
## [1] 2
##
## $name_2
## [1] 8
##
## $name_3
## [1] "this is a named list"

If you try to access an element that does not exist, such as d, you will get NULL back:

list4$d
## NULL

NaN means "Not a Number" and is returned when a function returns something that is not a number:

sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN

or:

0/0
## [1] NaN

Basically, numbers that cannot be represented as floating point numbers are NaN. Finally, there's NA, which is closely related to NaN but is used for missing values. NA stands for Not Available. There are several types of NAs: NA_integer_, NA_real_, NA_complex_ and NA_character_, but these are in principle only used when you need to program your own functions and need to explicitly test for the missingness of, say, a character value. To test whether a value is NA, use the is.na() function.
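As a quick sketch of how these special values behave: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN:

x <- c(1, NA, 3, NaN)
is.na(x)
## [1] FALSE TRUE FALSE TRUE
is.nan(x)
## [1] FALSE FALSE FALSE TRUE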
2.12 Useful functions to get you started

This section lists several basic R functions that are very useful and should be part of your toolbox.

2.12.1 Sequences

There are several functions that create sequences: seq(), seq_along() and rep(). rep() is easy enough:

rep(1, 10)
## [1] 1 1 1 1 1 1 1 1 1 1

This simply repeats 1 ten times. You can repeat other objects too:

rep("HAHA", 10)
## [1] "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA"

Creating a sequence is not quite as straightforward. There is seq():

seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(70, 80)
## [1] 70 71 72 73 74 75 76 77 78 79 80

It is also possible to provide a by argument:

seq(1, 10, by = 2)
## [1] 1 3 5 7 9

seq_along() behaves similarly, but returns a sequence as long as the object passed to it. So if you pass list4 to seq_along(), it will return a sequence from 1 to 3:

seq_along(list4)
## [1] 1 2 3

which is also true for seq(), actually:

seq(list4)
## [1] 1 2 3

but these two functions behave differently for arguments of length 1:

seq(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq_along(10)
## [1] 1

So be quite careful about that. I would advise you not to use seq(), but only seq_along() and seq_len(). seq_len() only takes arguments of length 1:

seq_len(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq_along(10)
## [1] 1

The problem with seq() is that it is unpredictable: depending on its input, the output is either a sequence up to that number or a sequence along the input. When programming, it is better to have functions that are stricter and fail when confronted with special cases, instead of silently returning some result. This is a bit of a recurrent issue with R, and the functions from the {tidyverse} mitigate it by being stricter than their base R counterparts. For example, consider the ifelse() function from base R:

ifelse(3 > 5, 1, "this is false")
## [1] "this is false"

and compare it to {dplyr}'s implementation, if_else():

if_else(3 > 5, 1, "this is false")
Error: `false` must be type double, not character
Call `rlang::last_error()` to see a backtrace

if_else() fails because the return value when FALSE is not a double (a real number) but a character. This might seem unnecessarily strict, but at least it is predictable. This makes debugging easier when used inside functions. In Chapter 8 we are going to learn how to write our own functions, and being strict makes programming easier.

2.12.2 Basic string manipulation

So far, we have not closely studied character objects; we have only learned how to define them. Later, in Chapter 5, we will learn about the {stringr} package, which provides useful functions for working with strings.
However, there are several very useful base R functions that you might want to know nonetheless, such as paste() and paste0():

paste("Hello", "amigo")
## [1] "Hello amigo"

but you can also change the separator if needed:

paste("Hello", "amigo", sep = "--")
## [1] "Hello--amigo"

paste0() is the same as paste(), but does not have any sep argument:

paste0("Hello", "amigo")
## [1] "Helloamigo"

If you provide a vector of characters, you can also use the collapse argument, which places whatever you provide for collapse between the elements of the vector:

paste0(c("Joseph", "Mary", "Jesus"), collapse = ", and ")
## [1] "Joseph, and Mary, and Jesus"

To change the case of characters, you can use toupper() and tolower():

tolower("HAHAHAHAH")
## [1] "hahahahah"
toupper("hueuehuehuheuhe")
## [1] "HUEUEHUEHUHEUHE"

Finally, there are the classical mathematical functions that you know and love: sqrt(), exp(), log(), abs(), sin(), cos(), tan(), sum(), cumsum(), prod(), cumprod(), max(), min() and many others…

2.13 Exercises

Exercise 1

Try to create the following vector:

\[a = (6,3,8,9)\]

and add it to this other vector:

\[b = (9,1,3,5)\]

and save the result to a new variable called result.

Exercise 2

Using a and b from before, try to get their dot product. Try with a * b in the R console. What happened? Try to find the right function to get the dot product. Don't hesitate to google the answer!

Exercise 3

How can you create a matrix of dimension (30,30) filled with 2's by only using the function matrix()?

Exercise 4

Save your first name in a variable a and your surname in a variable b. What does the function paste(a, b) do? Look at the help for paste() with ?paste or using the Help pane in RStudio. What does the optional argument sep do?

Exercise 5

Define the following variables: a <- 8, b <- 3, c <- 19. What do the following lines check? What do they return?

a > b
a == b
a != b
a < b
(a > b) && (a < c)
(a > b) && (a > c)
(a > b) || (a < b)

Exercise 6

Define the following matrix:

\[
\text{matrix_a} = \left(
\begin{array}{ccc}
9 & 4 & 12 \\
5 & 0 & 7 \\
2 & 6 & 8 \\
9 & 2 & 9
\end{array}
\right)
\]

What does matrix_a >= 5 do? What does matrix_a[ , 2] do? Can you find which function gives you the transpose of this matrix?

Exercise 7

Solve the following system of equations using the solve() function:

\[
\left(
\begin{array}{cccc}
9 & 4 & 12 & 2 \\
5 & 0 & 7 & 9 \\
2 & 6 & 8 & 0 \\
9 & 2 & 9 & 11
\end{array}
\right) \times \left(
\begin{array}{c}
x \\
y \\
z \\
t
\end{array}
\right) = \left(
\begin{array}{c}
7 \\
18 \\
1 \\
0
\end{array}
\right)
\]

Exercise 8

Load the mtcars data (mtcars is included in R, so you only need to use the data() function to load it):

data(mtcars)

If you run class(mtcars), you get "data.frame". Try now with typeof(mtcars). The answer is now "list"! This is because the class of an object is an attribute of that object, which can even be assigned by the user:

class(mtcars) <- "don't do this"
class(mtcars)
## [1] "don't do this"

The type of an object is R's internal type of that object, which cannot be manipulated by the user. It is always useful to know the type of an object (not just its class). For example, in the particular case of data frames, because the type of a data frame is a list, you can use all that you learned about lists to manipulate data frames!
Recall that $ allows you to select an element of a list, for instance:

my_list <- list("one" = 1, "two" = 2, "three" = 3)
my_list$one
## [1] 1

Because data frames are nothing but fancy lists, you can access columns the same way:

mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Chapter 3 Reading and writing data

In this chapter, we are going to import example datasets that are available in R, mtcars and iris. I have converted these datasets into several formats. Download those datasets here if you want to follow the examples below. R can import some formats without the need for external packages, such as the .csv format. However, for other formats, you will need to use different packages. Because there are a lot of different formats available, I suggest you use the {rio} package. {rio} is a wrapper around different packages that import/export data in different formats. This package is nice because you don't need to remember which package to use to import, say, STATA datasets, which one to use for SAS datasets, and so on. Read {rio}'s vignette for more details. Below I show some of {rio}'s functions presented in the vignette. It is also possible to import data from other, less "traditional" sources, such as your clipboard. Also note that it is possible to import more than one dataset at once. There are two ways of doing that: either import all the datasets, bind their rows together and add a new variable with the name of the data, or import all the datasets into a list, where each element of that list is a data frame. We are going to explore this second option later.

3.1 The swiss army knife of data import and export: {rio}

To import data with {rio}, import() is all you need:

library(rio)
mtcars <- import("datasets/mtcars.csv")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

import() needs the path to the data, and you can specify additional options if needed. On a Windows computer, you have to pay attention to the path; you cannot simply copy and paste it, because paths in Windows use the \ symbol whereas R uses / (just like on Linux or macOS).
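To make the path issue concrete, here is a small sketch; the path below is made up for illustration. Windows paths should be written with forward slashes, and file.path() can assemble them for you:

# a hypothetical Windows path, written the way R expects it:
my_data <- import("C:/Users/Bruno/datasets/mtcars.csv")
# file.path() glues path components together with /:
file.path("C:", "Users", "Bruno", "datasets", "mtcars.csv")
## [1] "C:/Users/Bruno/datasets/mtcars.csv"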
Importing a STATA or a SAS file is done just the same:

mtcars_stata <- import("datasets/mtcars.dta")
head(mtcars_stata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

mtcars_sas <- import("datasets/mtcars.sas7bdat")
head(mtcars_sas)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

It is also possible to import Excel files where each sheet is a single table, but you will need import_list() for that. The file multi.xlsx has two sheets, each with a table in it:

multi <- import_list("datasets/multi.xlsx")
str(multi)
## List of 2
## $ mtcars:'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ iris :'data.frame': 150 obs. of 5 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...

As you can see, multi is a list of datasets. Told you lists were very flexible! It is also possible to import all the datasets in a single directory at once. For this, you first need a vector of paths:

paths <- Sys.glob("datasets/unemployment/*.csv")

Sys.glob() allows you to find files using wildcard ("globbing") patterns, not regular expressions: "datasets/unemployment/*.csv" matches all the .csv files inside the "datasets/unemployment/" folder.

all_data <- import_list(paths)
str(all_data)
## List of 4
## $ unemp_2013:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv"
## $ unemp_2014:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv"
## $ unemp_2015:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv"
## $ unemp_2016:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv"

In a subsequent chapter we will learn how to actually use these lists of datasets. If you know that each dataset in each file has the same columns, you can also import them directly into a single dataset by binding each dataset together using rbind = TRUE:

bind_data <- import_list(paths, rbind = TRUE)
str(bind_data)
## 'data.frame': 472 obs. of 9 variables:
## $ Commune : chr "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## $ Total employed population : int 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## $ of which: Wage-earners : int 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## $ of which: Non-wage-earners: int 19872 1809 168 94 116 294 272 113 189 338 ...
## $ Unemployed : int 19287 1071 114 25 74 261 98 45 66 207 ...
## $ Active population : int 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## $ Unemployment rate (in %) : num 7.95 5.67 6.27 2.88 4.92 ...
## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ _file : chr "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" ...
## - attr(*, ".internal.selfref")=<externalptr>

This also adds a further column called _file indicating the name of the file that contained the original data. If something goes wrong, you might need to take a look at the underlying function {rio} is actually using to import the file. Let's look at the following example:

testdata <- import("datasets/problems/mtcars.csv")
head(testdata)
## mpg&cyl&disp&hp&drat&wt&qsec&vs&am&gear&carb
## 1 21&6&160&110&3.9&2.62&16.46&0&1&4&4
## 2 21&6&160&110&3.9&2.875&17.02&0&1&4&4
## 3 22.8&4&108&93&3.85&2.32&18.61&1&1&4&1
## 4 21.4&6&258&110&3.08&3.215&19.44&1&0&3&1
## 5 18.7&8&360&175&3.15&3.44&17.02&0&0&3&2
## 6 18.1&6&225&105&2.76&3.46&20.22&1&0&3&1

As you can see, the import didn't go too well! This is because the separator is the & for some reason. Because we are trying to read a .csv file, rio::import() is using data.table::fread() under the hood (you can read this in import()'s help). If you then read data.table::fread()'s help, you see that the fread() function has an optional sep = argument that you can use to specify the separator. You can use this argument in import() too, and it will be passed down to data.table::fread():

testdata <- import("datasets/problems/mtcars.csv", sep = "&")
head(testdata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1

export() allows you to write data to disk, by simply providing the path and name of the file you wish to save:

export(testdata, "path/where/to/save/testdata.csv")

If you end the name with .csv, the file is exported to the csv format; if instead you write .dta, the data will be exported to the STATA format, and so on. If you wish to export to Excel, this is possible, but it may require that you change a file on your computer (you only have to do this once). Try running:

export(testdata, "path/where/to/save/testdata.xlsx")

If this results in an error, try the following. Run these lines in RStudio:

if(!file.exists("~/.Rprofile")) # only create if not already there
  file.create("~/.Rprofile")    # (don't overwrite it)
file.edit("~/.Rprofile")

These lines, taken shamelessly from Efficient R programming (go read it, it's a great resource), look for and open the .Rprofile file, which is a file that is run every time you open RStudio. This means that you can put any line of code there and it will be executed whenever you launch RStudio. Add this line to the file:

Sys.setenv("R_ZIPCMD" = "C:/Program Files (x86)/Rtools/zip.exe")

This tells RStudio to use zip.exe as the default zip tool, which is needed to export files to the Excel format. Try it out by restarting RStudio, and then running the following lines:

library(rio)
data(mtcars)
export(mtcars, "mtcars.xlsx")

You should find mtcars.xlsx inside your working directory. You can check what your working directory is with getwd(). {rio} should cover all your needs; if not, there is very likely a package out there that will import the data you need.

3.2 Writing any object to disk

{rio} is an amazing package, but it is only able to write tabular representations of data.
What if you would like to save, say, a list containing any arbitrary object? This is possible with the saveRDS() function. Literally anything can be saved with saveRDS():

my_list <- list("this is a list",
                list("which contains a list", 12),
                c(1, 2, 3, 4),
                matrix(c(2, 4, 3, 1, 5, 7), nrow = 2))
str(my_list)
## List of 4
## $ : chr "this is a list"
## $ :List of 2
## ..$ : chr "which contains a list"
## ..$ : num 12
## $ : num [1:4] 1 2 3 4
## $ : num [1:2, 1:3] 2 4 3 1 5 7

my_list is a list containing a string, a list which contains a string and a number, a vector and a matrix… Now suppose that computing this list takes a very long time. For example, imagine that each element of the list is the result of estimating a very complex model on a simulated dataset, which takes hours to run. Because this takes so long to compute, you'd want to save it to disk. This is possible with saveRDS():

saveRDS(my_list, "my_list.RDS")

The next day, after having freshly started your computer and launched RStudio, it is possible to retrieve the object exactly as it was using readRDS():

my_list <- readRDS("my_list.RDS")
str(my_list)
## List of 4
## $ : chr "this is a list"
## $ :List of 2
## ..$ : chr "which contains a list"
## ..$ : num 12
## $ : num [1:4] 1 2 3 4
## $ : num [1:2, 1:3] 2 4 3 1 5 7

Even if you want to save a regular dataset, using saveRDS() might be a good idea, because the data gets compressed (the compress argument of saveRDS() is TRUE by default). However, keep in mind that the result will only be readable by R, so if you need to share this data with colleagues who use another tool, save it in another format.

3.3 Using RStudio projects to manage paths

Managing paths can be painful, especially if you're collaborating with a colleague and both of you saved the data in different paths. Whenever one of you wants to work on the script, the path will need to be adapted first. The best way to avoid that is to use projects with RStudio. Imagine that you are working on a project entitled "housing". You will create a folder called "housing" somewhere on your computer and, inside this folder, another folder called "data", then a bunch of other folders containing different files or the outputs of your analysis. What matters here is that you have a folder called "data" which contains the datasets you will analyze. When you are inside an RStudio project, granted that you chose your "housing" folder as the folder to host the project, you can read the data by simply specifying a relative path like so:

my_data <- import("data/data.csv")

Contrast this to what you would need to write if you were not using a project:

my_data <- import("C:/My Documents/Castor/Work/Projects/Housing/data/data.csv")

Not only is that longer, but if Castor is working on this project with Pollux, Pollux would need to change the above line to this:

my_data <- import("C:/My Documents/Pollux/Work/Projects/Housing/data/data.csv")

whenever Pollux needs to work on it. Another, similar issue is that if you need to write something to disk, such as a dataset or a plot, you would also need to specify the whole path:

export(my_data, "C:/My Documents/Pollux/Work/Projects/Housing/data/data.csv")

If you forget to write the whole path, then the dataset will be saved in the standard working directory, which is your "My Documents" folder on Windows, and "Home" on GNU+Linux or macOS.
You can check what the working directory is with the getwd() function:

getwd()

On a fresh session on my computer this returns:

"/home/bruno"

or, on Windows:

"C:/Users/Bruno/Documents"

but if you call this function inside a project, it will return the path to your project. It is also possible to set the working directory with setwd(), so you don't always need to write the full path, meaning that you can write this:

setwd("the/path/I/want/")
import("data/my_data.csv")
export(processed_data, "processed_data.xlsx")

instead of:

import("the/path/I/want/data/my_data.csv")
export(processed_data, "the/path/I/want/processed_data.xlsx")

However, I really, really, really urge you never to use setwd(). Use projects instead! Using projects saves a lot of pain in the long run.

Chapter 4 Descriptive statistics and data manipulation

Now that we are familiar with some R objects and know how to import data, it is time to write some code. In this chapter, we are going to compute descriptive statistics for a single dataset, but also for a list of datasets later in the chapter. However, I will not give a list of functions to compute descriptive statistics; if you need a specific function, you can find it easily in the Help pane in RStudio or using any modern internet search engine. What I will do is show you a workflow that allows you to compute the descriptive statistics you need fast. R has a lot of built-in functions for descriptive statistics; however, if you want to compute statistics for different sub-groups, some more complex manipulations are needed. At least this was true in the past. Nowadays, thanks to the packages from the {tidyverse}, it is very easy and fast to compute descriptive statistics by any stratifying variable(s). The package we are going to use for this is called {dplyr}. {dplyr} contains a lot of functions that make manipulating data and computing descriptive statistics very easy. To make things easier for now, we are going to use example data included with {dplyr}, so there is no need to import an external dataset; this does not change anything about the example we are going to study here, as the source of the data does not matter. Using {dplyr} is possible only if the data you are working with is already in a useful shape. When data is messier, you will need to first manipulate it to bring it into a tidy format. For this, we will use {tidyr}, which is a very useful package for reshaping data and doing advanced cleaning of your data. All these tidyverse functions are also called verbs. However, before getting to know these verbs, let's do an analysis using standard, or base, R functions. This will be the benchmark against which we are going to measure a {tidyverse} workflow.
4.1 A data exploration exercise using base R

Let's first load the starwars dataset, included in the {dplyr} package:

library(dplyr)
data(starwars)

Let's take a look at the data:

head(starwars)
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld

This data contains information on Star Wars characters. The first question you have to answer is to find the average height of the characters:

mean(starwars$height)
## [1] NA

As discussed in Chapter 2, $ allows you to access columns of a data.frame object. Because there are NA values in the data, the result is also NA. To get the result, you need to add an option to mean():

mean(starwars$height, na.rm = TRUE)
## [1] 174.358

Let's also take a look at the standard deviation:

sd(starwars$height, na.rm = TRUE)
## [1] 34.77043

It might be more informative to compute these two statistics by sex, so for this, we are going to use aggregate():

aggregate(starwars$height, by = list(sex = starwars$sex), mean)
## sex x
## 1 female NA
## 2 hermaphroditic 175
## 3 male NA
## 4 none NA

Oh, shoot! Most groups have missing values in them, so we get NA back. We need to use na.rm = TRUE just like before. Thankfully, it is possible to pass this option to mean() inside aggregate() as well:

aggregate(starwars$height, by = list(sex = starwars$sex), mean, na.rm = TRUE)
## sex x
## 1 female 169.2667
## 2 hermaphroditic 175.0000
## 3 male 179.1053
## 4 none 131.2000

Later in the book, we are also going to see how to define our own functions (with the default options that are useful to us), and this will also help in this sort of situation. Even though we can use na.rm = TRUE, let's also use subset() to filter out the NA values beforehand:

starwars_no_nas <- subset(starwars, !is.na(height))
aggregate(starwars_no_nas$height, by = list(sex = starwars_no_nas$sex), mean)
## sex x
## 1 female 169.2667
## 2 hermaphroditic 175.0000
## 3 male 179.1053
## 4 none 131.2000

(aggregate() also has a subset = option, but I prefer to explicitly subset the dataset with subset().) Even if you are not familiar with aggregate(), I believe the above lines are quite self-explanatory. You need to provide aggregate() with three things: the variable you want to summarize (or only the data frame, if you want to summarize all variables), a list of grouping variables, and then the function that will be applied to each subgroup. And by the way, to test for NA, one uses the function is.na(), not something like species == "NA" or anything like that. !is.na() does the opposite (! reverses booleans, so !TRUE becomes FALSE and vice versa).
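To see why a test like species == NA cannot work, here is a minimal sketch: comparing anything to NA yields NA, while is.na() returns proper logicals:

c(1, NA, 3) == NA
## [1] NA NA NA
is.na(c(1, NA, 3))
## [1] FALSE TRUE FALSE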
You can easily add another grouping variable:

aggregate(starwars_no_nas$height,
          by = list(Sex = starwars_no_nas$sex,
                    `Hair color` = starwars_no_nas$hair_color),
          mean)
## Sex Hair color x
## 1 female auburn 150.0000
## 2 male auburn, grey 180.0000
## 3 male auburn, white 182.0000
## 4 female black 166.3333
## 5 male black 176.2500
## 6 male blond 176.6667
## 7 female blonde 168.0000
## 8 female brown 160.4000
## 9 male brown 182.6667
## 10 male brown, grey 178.0000
## 11 male grey 170.0000
## 12 female none 188.2500
## 13 male none 182.2414
## 14 none none 148.0000
## 15 female white 167.0000
## 16 male white 152.3333

or use another function:

aggregate(starwars_no_nas$height, by = list(Sex = starwars_no_nas$sex), sd)
## Sex x
## 1 female 15.32256
## 2 hermaphroditic NA
## 3 male 36.01075
## 4 none 49.14977

(let's ignore the NAs). It is important to note that aggregate() returns a data.frame object. You can only give one function to aggregate(), so if you need the mean and the standard deviation of height, you must do it in two steps. Since R 4.1, a new infix operator |> has been introduced, which is really handy for writing the kind of code we've been looking at in this chapter. |> is also called a pipe, or the base pipe to distinguish it from another pipe that we'll discuss in the next section. For now, let's learn about |>. Consider the following:

10 |> sqrt()
## [1] 3.162278

This computes sqrt(10); so what |> does is pass the left-hand side (10, in the example above) to the right-hand side (sqrt()). Using |> might seem more complicated and verbose than not using it, but you will see in a bit why it can be useful. The next function I would like to introduce at this point is with(). with() makes it possible to apply functions on data.frame columns without having to write $ all the time. For example, consider this:

mean(starwars$height, na.rm = TRUE)
## [1] 174.358
with(starwars, mean(height, na.rm = TRUE))
## [1] 174.358

The advantage of using with() is that we can directly reference height without using $. Here again, this is more verbose than simply using $… so why bother with it? It turns out that by combining |> and with(), we can write very clean and concise code. Let's go back to a previous example to illustrate this idea:

starwars_no_nas <- subset(starwars, !is.na(height))
aggregate(starwars_no_nas$height, by = list(sex = starwars_no_nas$sex), mean)
## sex x
## 1 female 169.2667
## 2 hermaphroditic 175.0000
## 3 male 179.1053
## 4 none 131.2000

First, we created a new dataset where we filtered out rows where height is NA. This dataset is useless otherwise, but we need it for the next part, where we actually do what we want (computing the average height by sex).
Using |> and with(), we can write this in one go:

starwars |>
  subset(!is.na(sex)) |>
  with(aggregate(height, by = list(Species = species, Sex = sex), mean))
## Species Sex x
## 1 Clawdite female 168.0000
## 2 Human female NA
## 3 Kaminoan female 213.0000
## 4 Mirialan female 168.0000
## 5 Tholothian female 184.0000
## 6 Togruta female 178.0000
## 7 Twi'lek female 178.0000
## 8 Hutt hermaphroditic 175.0000
## 9 Aleena male 79.0000
## 10 Besalisk male 198.0000
## 11 Cerean male 198.0000
## 12 Chagrian male 196.0000
## 13 Dug male 112.0000
## 14 Ewok male 88.0000
## 15 Geonosian male 183.0000
## 16 Gungan male 208.6667
## 17 Human male NA
## 18 Iktotchi male 188.0000
## 19 Kaleesh male 216.0000
## 20 Kaminoan male 229.0000
## 21 Kel Dor male 188.0000
## 22 Mon Calamari male 180.0000
## 23 Muun male 191.0000
## 24 Nautolan male 196.0000
## 25 Neimodian male 191.0000
## 26 Pau'an male 206.0000
## 27 Quermian male 264.0000
## 28 Rodian male 173.0000
## 29 Skakoan male 193.0000
## 30 Sullustan male 160.0000
## 31 Toong male 163.0000
## 32 Toydarian male 137.0000
## 33 Trandoshan male 190.0000
## 34 Twi'lek male 180.0000
## 35 Vulptereen male 94.0000
## 36 Wookiee male 231.0000
## 37 Xexto male 122.0000
## 38 Yoda's species male 66.0000
## 39 Zabrak male 173.0000
## 40 Droid none NA

So let's unpack this. In the first two lines, using |>, we pass the starwars data.frame to subset():

starwars |>
  subset(!is.na(sex))
## # A tibble: 83 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## # … with 73 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

As I explained before, this is exactly the same as subset(starwars, !is.na(sex)). Then, we pass the result of subset() to the next function, with(). The first argument of with() must be a data.frame, and this is exactly what subset() returns! So the output of subset() is passed down to with(), which makes it possible to reference the columns of the data.frame in aggregate() directly. If you have a hard time understanding what is going on, you can use quote() to inspect it. quote() returns an expression without evaluating it:

quote(log(10))
## log(10)

Why am I bringing this up? Well, since a |> f() is exactly equal to f(a), quoting code written with |> will show you the equivalent nested expression, without the pipe.
For instance:

quote(10 |> log())
## log(10)

So let's quote the big block of code from above:

quote(
  starwars |>
    subset(!is.na(sex)) |>
    with(aggregate(height, by = list(Species = species, Sex = sex), mean))
)
## with(subset(starwars, !is.na(sex)), aggregate(height, by = list(Species = species,
## Sex = sex), mean))

I think now you see why using |> makes code much clearer; the nested expression you would need to write otherwise is much less readable, unless you define intermediate objects. And without with(), this is what you would need to write:

b <- subset(starwars, !is.na(height))
aggregate(b$height, by = list(Species = b$species, Sex = b$sex), mean)

To finish this section, let's say that you wanted to have the average height and mass by sex. In this case you need to specify the columns in aggregate() with cbind() (let's use na.rm = TRUE again instead of subset()ing the data beforehand):

starwars |>
  with(aggregate(cbind(height, mass),
                 by = list(Sex = sex),
                 FUN = mean, na.rm = TRUE))
## Sex height mass
## 1 female 169.2667 54.68889
## 2 hermaphroditic 175.0000 1358.00000
## 3 male 179.1053 81.00455
## 4 none 131.2000 69.75000

Let's now continue with some more advanced operations using this fake dataset:

survey_data_base <- as.data.frame(
  tibble::tribble(
    ~id, ~var1, ~var2, ~var3,
    1,   1,     0.2,   0.3,
    2,   1.4,   1.9,   4.1,
    3,   0.1,   2.8,   8.9,
    4,   1.7,   1.9,   7.6
  )
)
survey_data_base
## id var1 var2 var3
## 1 1 1.0 0.2 0.3
## 2 2 1.4 1.9 4.1
## 3 3 0.1 2.8 8.9
## 4 4 1.7 1.9 7.6

Depending on what you want to do with this data, it is not in the right shape. For example, it would not be possible to simply compute the average of var1, var2 and var3 for each id, because this would require running mean() by row, and R is not really suited to row-based workflows. Well, I'm lying a little bit here; it turns out that R comes with a rowMeans() function. So this would work:

survey_data_base |>
  transform(mean_id = rowMeans(cbind(var1, var2, var3))) # transform adds a column to a data.frame
## id var1 var2 var3 mean_id
## 1 1 1.0 0.2 0.3 0.500000
## 2 2 1.4 1.9 4.1 2.466667
## 3 3 0.1 2.8 8.9 3.933333
## 4 4 1.7 1.9 7.6 3.733333

But there is no rowSD() or rowMax(), etc…, so it is much better to reshape the data and put it in a format that gives us maximum flexibility. To reshape the data, we'll be using the aptly named reshape() command:

survey_data_long <- reshape(survey_data_base,
                            varying = list(2:4),
                            v.names = "variable",
                            direction = "long")

We can now easily compute the average of variable for each id:

aggregate(survey_data_long$variable, by = list(Id = survey_data_long$id), mean)
## Id x
## 1 1 0.500000
## 2 2 2.466667
## 3 3 3.933333
## 4 4 3.733333

or apply any other function:

aggregate(survey_data_long$variable, by = list(Id = survey_data_long$id), max)
## Id x
## 1 1 1.0
## 2 2 4.1
## 3 3 8.9
## 4 4 7.6

As you can see, R comes with very powerful functions right out of the box, ready to use. When I was studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do all of this using loops (not reshaping, thankfully), which was not so easy. Now that we have seen how base R works, let's redo the analysis using {tidyverse} verbs. The {tidyverse} provides many more functions, each of them doing only one single thing. You will shortly see why this is quite important; by focusing on just one task, and by focusing on the data frame as the central object, it becomes possible to build really complex workflows, piece by piece, very easily.
But before deep diving into the {tidyverse}, let's take a moment to discuss another infix operator, %>%.

4.2 Smoking is bad for you, but pipes are your friend

The title of this section might sound weird at first, but by the end of it, you'll get this (terrible) pun. You probably know the following painting by René Magritte, La trahison des images. It turns out there's an R package from the tidyverse that is called {magrittr}. What does this package do? This package introduced pipes to R, way before |> in R 4.1. Pipes are a concept from the Unix operating system; if you're using a GNU+Linux distribution or macOS, you're basically using a modern unix (that's an oversimplification, but I'm an economist by training, and outrageously oversimplifying things is what we do, deal with it). The magrittr pipe is written as %>%. Just like |>, %>% takes the left-hand side and feeds it as the first argument of the function on the right-hand side. Try the following:

library(magrittr)
16 %>% sqrt
## [1] 4

You can chain multiple functions, as you can with |>:

16 %>% sqrt %>% log
## [1] 1.386294

But unlike with |>, you can omit (). %>% also has other features. For example, you can pipe things to other infix operators, such as +. You can use + as usual:

2 + 12
## [1] 14

Or as a prefix operator:

`+`(2, 12)
## [1] 14

You can use this notation with %>%:

16 %>% sqrt %>% `+`(18)
## [1] 22

This also works using |> since R version 4.2, but only if you use the _ pipe placeholder:

16 |> sqrt() |> `+`(x = _, 18)
## [1] 22

The output of 16 (16) got fed to sqrt(), and the output of sqrt(16) (4) got fed to +(18) (so we got +(4, 18) = 22). Without %>% you'd write the line just above like this:

sqrt(16) + 18
## [1] 22

Just like before with |>, this might seem overly complicated, but using these pipes will make our code much more readable. I'm sure you'll be convinced by the end of this chapter. %>% is not the only pipe operator in magrittr. There are also %T>%, %<>% and %$%. All have their uses, but they are basically shortcuts for some common tasks with %>% plus another function, which means that you can live without them; because of this, I will not discuss them.

4.3 The {tidyverse}'s enfant prodige: {dplyr}

The best way to get started with the tidyverse packages is to get to know {dplyr}. {dplyr} provides a lot of very useful functions that make it very easy to get descriptive statistics or add new columns to your data.

4.3.1 A first taste of data manipulation with {dplyr}

This section will walk you through a typical analysis using {dplyr} functions. Just go with it; I will give more details in the next sections. First, let's load {dplyr} and the included starwars dataset.
Let's also take a look at the first few lines of the dataset:

library(dplyr)
data(starwars)
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld

data(starwars) loads the example dataset called starwars that is included in the package {dplyr}. As I said earlier, this is just an example; you could have loaded an external dataset, from a .csv file for instance. This does not matter for what comes next. As we saw earlier, R includes a lot of functions for descriptive statistics, such as mean(), sd(), cov(), and many more. What {dplyr} brings to the table is a grammar of data manipulation that makes it very easy to apply descriptive statistics functions, or any other function for that matter. Just like before, we are going to compute the average height by sex:

starwars %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 5 × 2
## sex mean_height
## <chr> <dbl>
## 1 female 169.
## 2 hermaphroditic 175
## 3 male 179.
## 4 none 131.
## 5 <NA> 181.

The very nice thing about using %>% and {dplyr} verbs/functions is that this is really readable. The above three lines can be translated like so in English: take the starwars dataset, then group by sex, then compute the mean height (for each subgroup), omitting missing values. %>% can be translated by "then". Without %>% you would need to change the code to:

summarise(group_by(starwars, sex), mean(height, na.rm = TRUE))
## # A tibble: 5 × 2
## sex `mean(height, na.rm = TRUE)`
## <chr> <dbl>
## 1 female 169.
## 2 hermaphroditic 175
## 3 male 179.
## 4 none 131.
## 5 <NA> 181.

Unlike with the base approach, each function does only one thing. With the base approach, aggregate() was also used to define the subgroups. This is not the case with {dplyr}: one function creates the groups (group_by()) and another one computes the summaries (summarise()). Also, group_by() creates a specific subgroup for individuals where sex is missing. This is the last line in the data frame, where sex is NA. Another nice thing is that you can name the column containing the average height; I chose to name it mean_height. Now, let's suppose that we want to filter some data first:

starwars %>%
  filter(gender == "masculine") %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 3 × 2
## sex mean_height
## <chr> <dbl>
## 1 hermaphroditic 175
## 2 male 179.
## 3 none 140

Again, %>% makes the above lines of code very easy to read. Without it, one would need to write:

summarise(group_by(filter(starwars, gender == "masculine"), sex), mean(height, na.rm = TRUE))
## # A tibble: 3 × 2
## sex `mean(height, na.rm = TRUE)`
## <chr> <dbl>
## 1 hermaphroditic 175
## 2 male 179.
## 3 none 140

I think you agree with me that this is not very readable.
One way to make it more readable would be to save intermediary variables:

filtered_data <- filter(starwars, gender == "masculine")
grouped_data <- group_by(filtered_data, sex)
summarise(grouped_data, mean(height))
## # A tibble: 3 × 2
## sex `mean(height)`
## <chr> <dbl>
## 1 hermaphroditic 175
## 2 male NA
## 3 none NA

But this can get very tedious. Once you're used to %>%, you won't go back to not using it. Before continuing, and to make things clearer: filter(), group_by() and summarise() are functions that are included in {dplyr}. %>% is actually a function from {magrittr}, but this package gets loaded on the fly when you load {dplyr}, so you do not need to worry about it. The results of all these operations that use {dplyr} functions are actually other datasets, or tibbles. This means that you can save them in variables, or write them to disk, and then work with them like any other datasets:

mean_height <- starwars %>%
  group_by(sex) %>%
  summarise(mean(height))
class(mean_height)
## [1] "tbl_df" "tbl" "data.frame"
head(mean_height)
## # A tibble: 5 × 2
## sex `mean(height)`
## <chr> <dbl>
## 1 female NA
## 2 hermaphroditic 175
## 3 male NA
## 4 none NA
## 5 <NA> NA

You could then write this data to disk using rio::export(), for instance. If you need more than the mean of the height, you can keep adding as many functions as needed (another advantage over aggregate()):

summary_table <- starwars %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n())
summary_table
## # A tibble: 5 × 4
## sex mean_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 female 169. 235. 16
## 2 hermaphroditic 175 NA 1
## 3 male 179. 1297. 60
## 4 none 131. 2416. 6
## 5 <NA> 181. 8.33 4

I've added more functions, namely var(), to get the variance of height, and n(), which is a function from {dplyr}, not base R, to get the number of observations. This is quite useful, because we see that there is a group with only one individual. Let's focus on the sexes for which we have more than one individual. Since we saved all the previous operations (which produce a tibble) in a variable, we can keep going from there:

summary_table2 <- summary_table %>%
  filter(n_obs > 1)
summary_table2
## # A tibble: 4 × 4
## sex mean_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 female 169. 235. 16
## 2 male 179. 1297. 60
## 3 none 131. 2416. 6
## 4 <NA> 181. 8.33 4

As mentioned before, there are a lot of NAs; this is because by default, mean() and var() return NA if even one single observation is NA. This is good, because it forces you to look at the data to see what is going on. If you got a number back even when there were NAs, you could very easily miss these missing values. It is better for functions to fail early and often than the opposite. This is why we keep using na.rm = TRUE for mean() and var().
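As a minimal sketch of this behaviour, outside of any data frame:

mean(c(1, 2, NA))
## [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5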
Now let's actually take a look at the rows where sex is NA:

starwars %>%
  filter(is.na(sex))
## # A tibble: 4 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Ric Olié 183 NA brown fair blue NA <NA> <NA> Naboo
## 2 Quarsh Pana… 183 NA black dark brown 62 <NA> <NA> Naboo
## 3 Sly Moore 178 48 none pale white NA <NA> <NA> Umbara
## 4 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld

There are only 4 rows where sex is NA. Let's ignore them:

starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
## # A tibble: 3 × 4
## sex ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 female 169. 235. 16
## 2 male 179. 1297. 60
## 3 none 131. 2416. 6

And why not compute the same table, but first add another stratifying variable?

starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex, eye_color) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 12 × 5
## # Groups: sex [3]
## sex eye_color ave_height var_height n_obs
## <chr> <chr> <dbl> <dbl> <int>
## 1 female black 196. 612. 2
## 2 female blue 167 118. 6
## 3 female brown 160 42 5
## 4 female hazel 178 NA 2
## 5 male black 182 1197 7
## 6 male blue 190. 434. 12
## 7 male brown 167. 1663. 15
## 8 male orange 181. 1306. 7
## 9 male red 190. 0.5 2
## 10 male unknown 136 6498 2
## 11 male yellow 180. 2196. 9
## 12 none red 131 3571 3

Ok, that's it for a first taste. We have already discovered some very useful {dplyr} functions: filter(), group_by() and summarise(). Now we are going to study these functions in more detail.

4.3.2 Filter the rows of a dataset with filter()

We're going to use the Gasoline dataset from the plm package, so install that first:

install.packages("plm")

Then load the required data:

data(Gasoline, package = "plm")

and load dplyr:

library(dplyr)

This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load the data like this, it is a standard data.frame. {dplyr} functions can be used on standard data.frame objects, but also on tibbles. tibbles are just like data frames, but with a better print method (and other niceties). I'll discuss the {tibble} package later, but for now, let's convert the data to a tibble and change its name, and also transform the country column to lower case:

gasoline <- as_tibble(Gasoline)
gasoline <- gasoline %>%
  mutate(country = tolower(country))

filter() is pretty straightforward. What if you would like to subset the data to focus on the year 1969?
Simple: filter(gasoline, year == 1969) ## # A tibble: 18 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 3 canada 1969 4.86 -5.56 -1.04 -8.10 ## 4 denmark 1969 4.17 -5.72 -0.407 -8.47 ## 5 france 1969 3.77 -5.84 -0.315 -8.37 ## 6 germany 1969 3.90 -5.83 -0.589 -8.44 ## 7 greece 1969 4.89 -6.59 -0.180 -10.7 ## 8 ireland 1969 4.21 -6.38 -0.272 -8.95 ## 9 italy 1969 3.74 -6.28 -0.248 -8.67 ## 10 japan 1969 4.52 -6.16 -0.417 -9.61 ## 11 netherla 1969 3.99 -5.88 -0.417 -8.63 ## 12 norway 1969 4.09 -5.74 -0.338 -8.69 ## 13 spain 1969 3.99 -5.60 0.669 -9.72 ## 14 sweden 1969 3.99 -7.77 -2.73 -8.20 ## 15 switzerl 1969 4.21 -5.91 -0.918 -8.47 ## 16 turkey 1969 5.72 -7.39 -0.298 -12.5 ## 17 u.k. 1969 3.95 -6.03 -0.383 -8.47 ## 18 u.s.a. 1969 4.84 -5.41 -1.22 -7.79 Let’s use %>%, since we’re familiar with it now: gasoline %>% filter(year == 1969) ## # A tibble: 18 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 3 canada 1969 4.86 -5.56 -1.04 -8.10 ## 4 denmark 1969 4.17 -5.72 -0.407 -8.47 ## 5 france 1969 3.77 -5.84 -0.315 -8.37 ## 6 germany 1969 3.90 -5.83 -0.589 -8.44 ## 7 greece 1969 4.89 -6.59 -0.180 -10.7 ## 8 ireland 1969 4.21 -6.38 -0.272 -8.95 ## 9 italy 1969 3.74 -6.28 -0.248 -8.67 ## 10 japan 1969 4.52 -6.16 -0.417 -9.61 ## 11 netherla 1969 3.99 -5.88 -0.417 -8.63 ## 12 norway 1969 4.09 -5.74 -0.338 -8.69 ## 13 spain 1969 3.99 -5.60 0.669 -9.72 ## 14 sweden 1969 3.99 -7.77 -2.73 -8.20 ## 15 switzerl 1969 4.21 -5.91 -0.918 -8.47 ## 16 turkey 1969 5.72 -7.39 -0.298 -12.5 ## 17 u.k. 1969 3.95 -6.03 -0.383 -8.47 ## 18 u.s.a. 
1969 4.84 -5.41 -1.22 -7.79 You can also filter more than just one year, by using the %in% operator: gasoline %>% filter(year %in% seq(1969, 1973)) ## # A tibble: 90 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1970 4.08 -6.08 -0.597 -8.73 ## 3 austria 1971 4.11 -6.04 -0.654 -8.64 ## 4 austria 1972 4.13 -5.98 -0.596 -8.54 ## 5 austria 1973 4.20 -5.90 -0.594 -8.49 ## 6 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 7 belgium 1970 3.87 -5.80 -0.378 -8.45 ## 8 belgium 1971 3.87 -5.76 -0.399 -8.41 ## 9 belgium 1972 3.91 -5.71 -0.311 -8.36 ## 10 belgium 1973 3.90 -5.64 -0.373 -8.31 ## # … with 80 more rows It is also possible to use between(), a helper function: gasoline %>% filter(between(year, 1969, 1973)) ## # A tibble: 90 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1970 4.08 -6.08 -0.597 -8.73 ## 3 austria 1971 4.11 -6.04 -0.654 -8.64 ## 4 austria 1972 4.13 -5.98 -0.596 -8.54 ## 5 austria 1973 4.20 -5.90 -0.594 -8.49 ## 6 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 7 belgium 1970 3.87 -5.80 -0.378 -8.45 ## 8 belgium 1971 3.87 -5.76 -0.399 -8.41 ## 9 belgium 1972 3.91 -5.71 -0.311 -8.36 ## 10 belgium 1973 3.90 -5.64 -0.373 -8.31 ## # … with 80 more rows To select non-consecutive years: gasoline %>% filter(year %in% c(1969, 1973, 1977)) ## # A tibble: 54 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1973 4.20 -5.90 -0.594 -8.49 ## 3 austria 1977 3.93 -5.83 -0.422 -8.25 ## 4 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 5 belgium 1973 3.90 -5.64 -0.373 -8.31 ## 6 belgium 1977 3.85 -5.56 -0.432 -8.14 ## 7 canada 1969 4.86 -5.56 -1.04 -8.10 ## 8 canada 1973 4.90 -5.41 -1.13 -7.94 ## 9 canada 1977 4.81 -5.34 -1.07 -7.77 ## 10 denmark 1969 4.17 -5.72 -0.407 -8.47 ## # … with 44 more rows %in% tests if an object is part of a set. 4.3.3 Select columns with select() While filter() allows you to keep or discard rows of data, select() allows you to keep or discard entire columns.
To keep columns: gasoline %>% select(country, year, lrpmg) ## # A tibble: 342 × 3 ## country year lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows To discard them: gasoline %>% select(-country, -year, -lrpmg) ## # A tibble: 342 × 3 ## lgaspcar lincomep lcarpcap ## <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -9.77 ## 2 4.10 -6.43 -9.61 ## 3 4.07 -6.41 -9.46 ## 4 4.06 -6.37 -9.34 ## 5 4.04 -6.32 -9.24 ## 6 4.03 -6.29 -9.12 ## 7 4.05 -6.25 -9.02 ## 8 4.05 -6.23 -8.93 ## 9 4.05 -6.21 -8.85 ## 10 4.05 -6.15 -8.79 ## # … with 332 more rows To rename them: gasoline %>% select(country, date = year, lrpmg) ## # A tibble: 342 × 3 ## country date lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows There’s also rename(): gasoline %>% rename(date = year) ## # A tibble: 342 × 6 ## country date lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows rename() does not do any kind of selection, but just renames. You can also use select() to re-order columns: gasoline %>% select(year, country, lrpmg, everything()) ## # A tibble: 342 × 6 ## year country lrpmg lgaspcar lincomep lcarpcap ## <int> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1960 austria -0.335 4.17 -6.47 -9.77 ## 2 1961 austria -0.351 4.10 -6.43 -9.61 ## 3 1962 austria -0.380 4.07 -6.41 -9.46 ## 4 1963 austria -0.414 4.06 -6.37 -9.34 ## 5 1964 austria -0.445 4.04 -6.32 -9.24 ## 6 1965 austria -0.497 4.03 -6.29 -9.12 ## 7 1966 austria -0.467 4.05 -6.25 -9.02 ## 8 1967 austria -0.506 4.05 -6.23 -8.93 ## 9 1968 austria -0.522 4.05 -6.21 -8.85 ## 10 1969 austria -0.559 4.05 -6.15 -8.79 ## # … with 332 more rows everything() is a helper function, and there are also starts_with() and ends_with(). For example, what if we are only interested in columns whose names start with “l”? gasoline %>% select(starts_with("l")) ## # A tibble: 342 × 4 ## lgaspcar lincomep lrpmg lcarpcap ## <dbl> <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -0.335 -9.77 ## 2 4.10 -6.43 -0.351 -9.61 ## 3 4.07 -6.41 -0.380 -9.46 ## 4 4.06 -6.37 -0.414 -9.34 ## 5 4.04 -6.32 -0.445 -9.24 ## 6 4.03 -6.29 -0.497 -9.12 ## 7 4.05 -6.25 -0.467 -9.02 ## 8 4.05 -6.23 -0.506 -8.93 ## 9 4.05 -6.21 -0.522 -8.85 ## 10 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows ends_with() works in a similar fashion.
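Since ends_with() was just mentioned, here is a quick sketch of it in action (assuming gasoline is still loaded): it keeps only the columns whose names end with “p”.

gasoline %>%
  select(ends_with("p")) # keeps lincomep and lcarpcap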
There is also contains(): gasoline %>% select(country, year, contains("car")) ## # A tibble: 342 × 4 ## country year lgaspcar lcarpcap ## <chr> <int> <dbl> <dbl> ## 1 austria 1960 4.17 -9.77 ## 2 austria 1961 4.10 -9.61 ## 3 austria 1962 4.07 -9.46 ## 4 austria 1963 4.06 -9.34 ## 5 austria 1964 4.04 -9.24 ## 6 austria 1965 4.03 -9.12 ## 7 austria 1966 4.05 -9.02 ## 8 austria 1967 4.05 -8.93 ## 9 austria 1968 4.05 -8.85 ## 10 austria 1969 4.05 -8.79 ## # … with 332 more rows You can read more about these helper functions here, but we’re going to look more into them in a coming section. Another verb, similar to select(), is pull(). Let’s compare the two: gasoline %>% select(lrpmg) ## # A tibble: 342 × 1 ## lrpmg ## <dbl> ## 1 -0.335 ## 2 -0.351 ## 3 -0.380 ## 4 -0.414 ## 5 -0.445 ## 6 -0.497 ## 7 -0.467 ## 8 -0.506 ## 9 -0.522 ## 10 -0.559 ## # … with 332 more rows gasoline %>% pull(lrpmg) %>% head() # using head() because there are 342 elements in total ## [1] -0.3345476 -0.3513276 -0.3795177 -0.4142514 -0.4453354 -0.4970607 pull(), unlike select(), does not return a tibble, but only the column you want, as a vector. 4.3.4 Group the observations of your dataset with group_by() group_by() is a very useful verb; as the name implies, it allows you to create groups and then, for example, compute descriptive statistics by groups. For example, let’s group our data by country: gasoline %>% group_by(country) ## # A tibble: 342 × 6 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows It looks like nothing much happened, but if you look at the second line of the output you can read the following: ## # Groups: country [18] This means that the data is grouped, and every computation you will do now will take these groups into account. It is also possible to group by more than one variable: gasoline %>% group_by(country, year) ## # A tibble: 342 × 6 ## # Groups: country, year [342] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows and so on.
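A quick way to check the effect of grouping is n_groups(), another {dplyr} function; a small sketch:

gasoline %>%
  group_by(country, year) %>%
  n_groups() # 342, matching the header above: every country-year pair is its own group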
You can then also ungroup: gasoline %>% group_by(country, year) %>% ungroup() ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Once your data is grouped, the operations that will follow will be executed inside each group. 4.3.5 Get summary statistics with summarise() Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For example, one might want to compute the average gasoline consumption in each country, for the whole period: gasoline %>% group_by(country) %>% summarise(mean(lgaspcar)) ## # A tibble: 18 × 2 ## country `mean(lgaspcar)` ## <chr> <dbl> ## 1 austria 4.06 ## 2 belgium 3.92 ## 3 canada 4.86 ## 4 denmark 4.19 ## 5 france 3.82 ## 6 germany 3.89 ## 7 greece 4.88 ## 8 ireland 4.23 ## 9 italy 3.73 ## 10 japan 4.70 ## 11 netherla 4.08 ## 12 norway 4.11 ## 13 spain 4.06 ## 14 sweden 4.01 ## 15 switzerl 4.24 ## 16 turkey 5.77 ## 17 u.k. 3.98 ## 18 u.s.a. 4.82 mean() was given as an argument to summarise(), which is a {dplyr} verb. What we get is another tibble that contains the variable we used to group, as well as the average per country. We can also rename this column: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar)) ## # A tibble: 18 × 2 ## country mean_gaspcar ## <chr> <dbl> ## 1 austria 4.06 ## 2 belgium 3.92 ## 3 canada 4.86 ## 4 denmark 4.19 ## 5 france 3.82 ## 6 germany 3.89 ## 7 greece 4.88 ## 8 ireland 4.23 ## 9 italy 3.73 ## 10 japan 4.70 ## 11 netherla 4.08 ## 12 norway 4.11 ## 13 spain 4.06 ## 14 sweden 4.01 ## 15 switzerl 4.24 ## 16 turkey 5.77 ## 17 u.k. 3.98 ## 18 u.s.a. 4.82 and because the output is a tibble, we can continue to use {dplyr} verbs on it: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar)) %>% filter(country == "france") ## # A tibble: 1 × 2 ## country mean_gaspcar ## <chr> <dbl> ## 1 france 3.82 summarise() is a very useful verb. For example, we can compute several descriptive statistics at once: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar), sd_gaspcar = sd(lgaspcar), max_gaspcar = max(lgaspcar), min_gaspcar = min(lgaspcar)) ## # A tibble: 18 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 4.20 3.92 ## 2 belgium 3.92 0.103 4.16 3.82 ## 3 canada 4.86 0.0262 4.90 4.81 ## 4 denmark 4.19 0.158 4.50 4.00 ## 5 france 3.82 0.0499 3.91 3.75 ## 6 germany 3.89 0.0239 3.93 3.85 ## 7 greece 4.88 0.255 5.38 4.48 ## 8 ireland 4.23 0.0437 4.33 4.16 ## 9 italy 3.73 0.220 4.05 3.38 ## 10 japan 4.70 0.684 6.00 3.95 ## 11 netherla 4.08 0.286 4.65 3.71 ## 12 norway 4.11 0.123 4.44 3.96 ## 13 spain 4.06 0.317 4.75 3.62 ## 14 sweden 4.01 0.0364 4.07 3.91 ## 15 switzerl 4.24 0.102 4.44 4.05 ## 16 turkey 5.77 0.329 6.16 5.14 ## 17 u.k. 3.98 0.0479 4.10 3.91 ## 18 u.s.a.
4.82 0.0219 4.86 4.79 Because the output is a tibble, you can save it in a variable of course: desc_gasoline <- gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar), sd_gaspcar = sd(lgaspcar), max_gaspcar = max(lgaspcar), min_gaspcar = min(lgaspcar)) And then you can answer questions such as: which country has the maximum average gasoline consumption? desc_gasoline %>% filter(max(mean_gaspcar) == mean_gaspcar) ## # A tibble: 1 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 turkey 5.77 0.329 6.16 5.14 Turns out it’s Turkey. What about the minimum consumption? desc_gasoline %>% filter(min(mean_gaspcar) == mean_gaspcar) ## # A tibble: 1 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 italy 3.73 0.220 4.05 3.38 Because the output of {dplyr} verbs is a tibble, it is possible to continue working with it. This is one shortcoming of the base summary() function: the object it returns is not very easy to manipulate. 4.3.6 Adding columns with mutate() and transmute() mutate() adds a column to the tibble, which can contain any transformation of any other variable: gasoline %>% group_by(country) %>% mutate(n()) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap `n()` ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows Using mutate(), I’ve added a column that counts how many times the country appears in the tibble, using n(), another {dplyr} function. There’s also count() and tally(), which we are going to see further down.
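It is worth pausing on the difference between summarise() and mutate() on grouped data: summarise() collapses each group into a single row, while mutate() keeps every row and recycles the group-level result within the group. A minimal sketch:

gasoline %>% group_by(country) %>% summarise(n_obs = n()) # 18 rows: one per country
gasoline %>% group_by(country) %>% mutate(n_obs = n())    # 342 rows: n_obs repeated within each country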
It is also possible to rename the column on the fly: gasoline %>% group_by(country) %>% mutate(count = n()) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap count ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows It is possible to do any arbitrary operation: gasoline %>% group_by(country) %>% mutate(spam = exp(lgaspcar + lincomep)) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap spam ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 0.100 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 0.0978 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 0.0969 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 0.0991 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 0.102 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 0.104 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 0.110 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 0.113 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 0.115 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 0.122 ## # … with 332 more rows transmute() is the same as mutate(), but only returns the created variable (along with any grouping variables, as you can see below): gasoline %>% group_by(country) %>% transmute(spam = exp(lgaspcar + lincomep)) ## # A tibble: 342 × 2 ## # Groups: country [18] ## country spam ## <chr> <dbl> ## 1 austria 0.100 ## 2 austria 0.0978 ## 3 austria 0.0969 ## 4 austria 0.0991 ## 5 austria 0.102 ## 6 austria 0.104 ## 7 austria 0.110 ## 8 austria 0.113 ## 9 austria 0.115 ## 10 austria 0.122 ## # … with 332 more rows 4.3.7 Joining tibbles with full_join(), left_join(), right_join() and all the others I will end this section on {dplyr} with a very useful family of verbs: the *_join() verbs. Let’s start by loading another dataset from the plm package, SumHes, converting it to a tibble and renaming it: data(SumHes, package = "plm") pwt <- SumHes %>% as_tibble() %>% mutate(country = tolower(country)) Let’s take a quick look at the data: glimpse(pwt) ## Rows: 3,250 ## Columns: 7 ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19… ## $ country <chr> "algeria", "algeria", "algeria", "algeria", "algeria", "algeri… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ pop <int> 10800, 11016, 11236, 11460, 11690, 11923, 12267, 12622, 12986,… ## $ gdp <int> 1723, 1599, 1275, 1517, 1589, 1584, 1548, 1600, 1758, 1835, 18… ## $ sr <dbl> 19.9, 21.1, 15.0, 13.9, 10.6, 11.0, 8.3, 11.3, 15.1, 18.2, 19.… We can merge both gasoline and pwt by country and year, as these two variables are common to both datasets. There are more countries and years in the pwt dataset, so when merging both, and depending on which function you use, you will either have NA’s for the variables where there is no match, or rows that will be dropped.
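To see the different join behaviours in miniature before applying them to the real data, here is a toy sketch; the tibbles x and y below are made up for illustration:

library(tibble)

x <- tribble(~key, ~a,
             1, "x1",
             2, "x2")

y <- tribble(~key, ~b,
             2, "y2",
             3, "y3")

full_join(x, y, by = "key")  # keys 1, 2 and 3; NAs where there is no match
inner_join(x, y, by = "key") # key 2 only; unmatched rows are dropped
left_join(x, y, by = "key")  # keys 1 and 2; b is NA for key 1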
Let’s start with full_join(): gas_pwt_full <- gasoline %>% full_join(pwt, by = c("country", "year")) Let’s see which countries and years are included: gas_pwt_full %>% count(country, year) ## # A tibble: 3,307 × 3 ## country year n ## <chr> <int> <int> ## 1 algeria 1960 1 ## 2 algeria 1961 1 ## 3 algeria 1962 1 ## 4 algeria 1963 1 ## 5 algeria 1964 1 ## 6 algeria 1965 1 ## 7 algeria 1966 1 ## 8 algeria 1967 1 ## 9 algeria 1968 1 ## 10 algeria 1969 1 ## # … with 3,297 more rows As you see, every country and year was included, but what happened for, say, the U.S.S.R.? This country is in pwt but not in gasoline at all: gas_pwt_full %>% filter(country == "u.s.s.r.") ## # A tibble: 26 × 11 ## country year lgaspcar lincomep lrpmg lcarp…¹ opec com pop gdp sr ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <int> <int> <dbl> ## 1 u.s.s.r. 1960 NA NA NA NA no yes 214400 2397 37.9 ## 2 u.s.s.r. 1961 NA NA NA NA no yes 217896 2542 39.4 ## 3 u.s.s.r. 1962 NA NA NA NA no yes 221449 2656 38.4 ## 4 u.s.s.r. 1963 NA NA NA NA no yes 225060 2681 38.4 ## 5 u.s.s.r. 1964 NA NA NA NA no yes 227571 2854 39.5 ## 6 u.s.s.r. 1965 NA NA NA NA no yes 230109 3049 39.9 ## 7 u.s.s.r. 1966 NA NA NA NA no yes 232676 3247 39.9 ## 8 u.s.s.r. 1967 NA NA NA NA no yes 235272 3454 40.2 ## 9 u.s.s.r. 1968 NA NA NA NA no yes 237896 3730 40.6 ## 10 u.s.s.r. 1969 NA NA NA NA no yes 240550 3808 37.9 ## # … with 16 more rows, and abbreviated variable name ¹​lcarpcap As you probably guessed, for countries that only appear in pwt, the variables coming from gasoline are filled with NAs. One could remove all these lines and only keep countries for which these variables are not NA everywhere with filter(), but there is a simpler solution: gas_pwt_inner <- gasoline %>% inner_join(pwt, by = c("country", "year")) Let’s use tabyl() from the {janitor} package, which is a very nice alternative to the table() function from base R: library(janitor) gas_pwt_inner %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 Only countries with values in both datasets were returned. It’s almost every country from gasoline, apart from Germany, the Netherlands and Switzerland, whose names are spelled differently across the two datasets (for instance, Germany is called “germany west” in pwt but “germany” in gasoline). I left them as is to provide examples of countries that do not match.
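If you did want these countries to match, you could harmonise the names before joining. Here is a hedged sketch using recode() from {dplyr}; the replacement spellings are assumptions about how pwt names these countries, so check them against your data first:

gasoline %>%
  mutate(country = recode(country,
                          "germany"  = "germany west", # assumed pwt spellings
                          "netherla" = "netherlands",
                          "switzerl" = "switzerland")) %>%
  inner_join(pwt, by = c("country", "year"))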
Let’s also look at the variables: glimpse(gas_pwt_inner) ## Rows: 285 ## Columns: 11 ## $ country <chr> "austria", "austria", "austria", "austria", "austria", "austr… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.033983, 4… ## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247, -6.294… ## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.4453354, -… ## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739, -9.123… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ pop <int> 7048, 7087, 7130, 7172, 7215, 7255, 7308, 7338, 7362, 7384, 7… ## $ gdp <int> 5143, 5388, 5481, 5688, 5978, 6144, 6437, 6596, 6847, 7162, 7… ## $ sr <dbl> 24.3, 24.5, 23.3, 22.9, 25.2, 25.2, 26.7, 25.6, 25.7, 26.1, 2… The variables from both datasets are in the joined data. Contrast this to semi_join(): gas_pwt_semi <- gasoline %>% semi_join(pwt, by = c("country", "year")) glimpse(gas_pwt_semi) ## Rows: 285 ## Columns: 6 ## $ country <chr> "austria", "austria", "austria", "austria", "austria", "austr… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.033983, 4… ## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247, -6.294… ## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.4453354, -… ## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739, -9.123… gas_pwt_semi %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 Only columns of gasoline are returned, and only rows of gasoline that were matched with rows from pwt. semi_join() is not a commutative operation: pwt_gas_semi <- pwt %>% semi_join(gasoline, by = c("country", "year")) glimpse(pwt_gas_semi) ## Rows: 285 ## Columns: 7 ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19… ## $ country <chr> "canada", "canada", "canada", "canada", "canada", "canada", "c… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ pop <int> 17910, 18270, 18614, 18963, 19326, 19678, 20049, 20411, 20744,… ## $ gdp <int> 7258, 7261, 7605, 7876, 8244, 8664, 9093, 9231, 9582, 9975, 10… ## $ sr <dbl> 22.7, 21.5, 22.1, 21.9, 22.9, 24.8, 25.4, 23.1, 22.6, 23.4, 21… pwt_gas_semi %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 The rows are the same, but not the columns.
left_join() and right_join() return all the rows from either the dataset that is on the “left” (the first argument of the function) or on the “right” (the second argument of the function), but all columns from both datasets. So depending on which countries you’re interested in, you’re going to use either one of these functions: gas_pwt_left <- gasoline %>% left_join(pwt, by = c("country", "year")) gas_pwt_left %>% tabyl(country) ## country n percent ## austria 19 0.05555556 ## belgium 19 0.05555556 ## canada 19 0.05555556 ## denmark 19 0.05555556 ## france 19 0.05555556 ## germany 19 0.05555556 ## greece 19 0.05555556 ## ireland 19 0.05555556 ## italy 19 0.05555556 ## japan 19 0.05555556 ## netherla 19 0.05555556 ## norway 19 0.05555556 ## spain 19 0.05555556 ## sweden 19 0.05555556 ## switzerl 19 0.05555556 ## turkey 19 0.05555556 ## u.k. 19 0.05555556 ## u.s.a. 19 0.05555556 gas_pwt_right <- gasoline %>% right_join(pwt, by = c("country", "year")) gas_pwt_right %>% tabyl(country) %>% head() ## country n percent ## algeria 26 0.008 ## angola 26 0.008 ## argentina 26 0.008 ## australia 26 0.008 ## austria 26 0.008 ## bangladesh 26 0.008 The last merge function is anti_join(): gas_pwt_anti <- gasoline %>% anti_join(pwt, by = c("country", "year")) glimpse(gas_pwt_anti) ## Rows: 57 ## Columns: 6 ## $ country <chr> "germany", "germany", "germany", "germany", "germany", "germa… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 3.916953, 3.885345, 3.871484, 3.848782, 3.868993, 3.861049, 3… ## $ lincomep <dbl> -6.159837, -6.120923, -6.094258, -6.068361, -6.013442, -5.966… ## $ lrpmg <dbl> -0.1859108, -0.2309538, -0.3438417, -0.3746467, -0.3996526, -… ## $ lcarpcap <dbl> -9.342481, -9.183841, -9.037280, -8.913630, -8.811013, -8.711… gas_pwt_anti %>% tabyl(country) ## country n percent ## germany 19 0.3333333 ## netherla 19 0.3333333 ## switzerl 19 0.3333333 gas_pwt_anti has the columns of the gasoline dataset, and only the rows of gasoline that have no match in pwt: those for “germany”, “netherla” and “switzerl”. That was it for the basic {dplyr} verbs. Next, we’re going to learn about {tidyr}. 4.4 Reshaping and sprucing up data with {tidyr} Note: this section is going to be a lot harder than anything you’ve seen until now. Reshaping data is tricky, and to really grok it, you need time, and you need to run each line, and see what happens. Take your time, and don’t be discouraged. Another important package from the {tidyverse} that goes hand in hand with {dplyr} is {tidyr}. {tidyr} is the package you need when it’s time to reshape data. I will start by presenting pivot_wider() and pivot_longer(). 4.4.1 pivot_wider() and pivot_longer() Let’s first create a fake dataset: library(tidyr) survey_data <- tribble( ~id, ~variable, ~value, 1, "var1", 1, 1, "var2", 0.2, NA, "var3", 0.3, 2, "var1", 1.4, 2, "var2", 1.9, 2, "var3", 4.1, 3, "var1", 0.1, 3, "var2", 2.8, 3, "var3", 8.9, 4, "var1", 1.7, NA, "var2", 1.9, 4, "var3", 7.6 ) head(survey_data) ## # A tibble: 6 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 I used the tribble() function from the {tibble} package to create this fake dataset. I’ll discuss this package later, for now, let’s focus on {tidyr}. Let’s suppose that we need the data to be in the wide format, which means var1, var2 and var3 need to be their own columns. To do this, we need to use the pivot_wider() function. Why wide?
Because the dataset will be wide, meaning it will have more columns than rows. survey_data %>% pivot_wider(id_cols = id, names_from = variable, values_from = value) ## # A tibble: 5 × 4 ## id var1 var2 var3 ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 0.2 NA ## 2 NA NA 1.9 0.3 ## 3 2 1.4 1.9 4.1 ## 4 3 0.1 2.8 8.9 ## 5 4 1.7 NA 7.6 Let’s go through pivot_wider()’s arguments: the first is id_cols =, which requires the variable that uniquely identifies the rows to be supplied. names_from = is where you input the variable that will generate the names of the new columns. In our case, the variable column has three values, var1, var2 and var3, and these are now the names of the new columns. Finally, values_from = is where you can specify the column containing the values that will fill the data frame. I find the argument names names_from = and values_from = quite explicit. As you can see, there are some missing values. Let’s suppose that we know that these missing values are true 0’s. pivot_wider() has an argument called values_fill = that makes it easy to replace the missing values: survey_data %>% pivot_wider(id_cols = id, names_from = variable, values_from = value, values_fill = list(value = 0)) ## # A tibble: 5 × 4 ## id var1 var2 var3 ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 0.2 0 ## 2 NA 0 1.9 0.3 ## 3 2 1.4 1.9 4.1 ## 4 3 0.1 2.8 8.9 ## 5 4 1.7 0 7.6 A list of variables and their respective values to replace NA’s with must be supplied to values_fill. Let’s now use another dataset, which you can get from here (downloaded from: http://www.statistiques.public.lu/stat/TableViewer/tableView.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91). This data set gives the unemployment rate for each Luxembourgish canton from 2001 to 2015. We will come back to this data later on to learn how to plot it. For now, let’s use it to learn more about {tidyr}. unemp_lux_data <- rio::import( "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/all/unemployment_lux_all.csv" ) head(unemp_lux_data) ## division year active_population of_which_non_wage_earners ## 1 Beaufort 2001 688 85 ## 2 Beaufort 2002 742 85 ## 3 Beaufort 2003 773 85 ## 4 Beaufort 2004 828 80 ## 5 Beaufort 2005 866 96 ## 6 Beaufort 2006 893 87 ## of_which_wage_earners total_employed_population unemployed ## 1 568 653 35 ## 2 631 716 26 ## 3 648 733 40 ## 4 706 786 42 ## 5 719 815 51 ## 6 746 833 60 ## unemployment_rate_in_percent ## 1 5.09 ## 2 3.50 ## 3 5.17 ## 4 5.07 ## 5 5.89 ## 6 6.72 Now, let’s suppose that for our purposes, it would make more sense to have the data in a wide format, where columns are “division times year” and the value is the unemployment rate. This can be easily done by providing more columns to names_from =.
unemp_lux_data2 <- unemp_lux_data %>% filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>% select(division, year, unemployment_rate_in_percent) %>% rowid_to_column() unemp_lux_data2 %>% pivot_wider(names_from = c(division, year), values_from = unemployment_rate_in_percent) ## # A tibble: 48 × 49 ## rowid Bertr…¹ Bertr…² Bertr…³ Diffe…⁴ Diffe…⁵ Diffe…⁶ Dudel…⁷ Dudel…⁸ Dudel…⁹ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 5.69 NA NA NA NA NA NA NA NA ## 2 2 NA 5.65 NA NA NA NA NA NA NA ## 3 3 NA NA 5.35 NA NA NA NA NA NA ## 4 4 NA NA NA 13.2 NA NA NA NA NA ## 5 5 NA NA NA NA 12.6 NA NA NA NA ## 6 6 NA NA NA NA NA 11.4 NA NA NA ## 7 7 NA NA NA NA NA NA 9.35 NA NA ## 8 8 NA NA NA NA NA NA NA 9.37 NA ## 9 9 NA NA NA NA NA NA NA NA 8.53 ## 10 10 NA NA NA NA NA NA NA NA NA ## # … with 38 more rows, 39 more variables: Frisange_2013 <dbl>, ## # Frisange_2014 <dbl>, Frisange_2015 <dbl>, Hesperange_2013 <dbl>, ## # Hesperange_2014 <dbl>, Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, ## # Leudelange_2014 <dbl>, Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, ## # Mondercange_2014 <dbl>, Mondercange_2015 <dbl>, Pétange_2013 <dbl>, ## # Pétange_2014 <dbl>, Pétange_2015 <dbl>, Rumelange_2013 <dbl>, ## # Rumelange_2014 <dbl>, Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, … In the filter() statement, I only kept data from 2013 to 2017, “division”s ending with the string “ange” (“division” can be a canton or a commune, for example “Canton Redange”, a canton, or “Hesperange”, a commune), and removed the cantons as I’m only interested in communes. If you don’t understand this filter() statement, don’t fret; this is not important for what follows. I then only kept the columns I’m interested in and pivoted the data to a wide format. Also, I needed to add a unique identifier to the data frame. For this, I used the rowid_to_column() function, from the {tibble} package, which adds a new column to the data frame with an id, going from 1 to the number of rows in the data frame. If I did not add this identifier, the statement would still work: unemp_lux_data3 <- unemp_lux_data %>% filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>% select(division, year, unemployment_rate_in_percent) unemp_lux_data3 %>% pivot_wider(names_from = c(division, year), values_from = unemployment_rate_in_percent) ## # A tibble: 1 × 48 ## Bertrange_2013 Bertr…¹ Bertr…² Diffe…³ Diffe…⁴ Diffe…⁵ Dudel…⁶ Dudel…⁷ Dudel…⁸ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5.69 5.65 5.35 13.2 12.6 11.4 9.35 9.37 8.53 ## # … with 39 more variables: Frisange_2013 <dbl>, Frisange_2014 <dbl>, ## # Frisange_2015 <dbl>, Hesperange_2013 <dbl>, Hesperange_2014 <dbl>, ## # Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, Leudelange_2014 <dbl>, ## # Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, Mondercange_2014 <dbl>, ## # Mondercange_2015 <dbl>, Pétange_2013 <dbl>, Pétange_2014 <dbl>, ## # Pétange_2015 <dbl>, Rumelange_2013 <dbl>, Rumelange_2014 <dbl>, ## # Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, Schifflange_2014 <dbl>, … and actually looks even better, but only because there are no repeated values; there is only one unemployment rate for each “commune times year”. I will come back to this later on, with another example that might be clearer. These last two code blocks are intense; make sure you go through each line step by step and understand what is going on.
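One way to convince yourself that nothing was lost in the reshape is to pivot back to the long format. A sketch of my own (the names_sep and column names below are my choices, and note that year comes back as a character column):

unemp_lux_data3 %>%
  pivot_wider(names_from = c(division, year),
              values_from = unemployment_rate_in_percent) %>%
  pivot_longer(cols = everything(),
               names_to = c("division", "year"), # split the "division_year" names back apart
               names_sep = "_",
               values_to = "unemployment_rate_in_percent")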
You might have noticed that because there is no data for the years 2016 and 2017, these columns do not appear in the data. But suppose that we need to have these columns, so that a colleague from another department can fill in the values. This is possible by providing a data frame with the detailed specifications of the result data frame. This optional data frame must have at least two columns: .name, which contains the column names you want, and .value, which contains the name of the column the values come from. Also, the function that uses this spec is pivot_wider_spec(), not pivot_wider(). unemp_spec <- unemp_lux_data %>% tidyr::expand(division, year = c(year, 2016, 2017), .value = "unemployment_rate_in_percent") %>% unite(".name", division, year, remove = FALSE) unemp_spec Here, I use another function, tidyr::expand(), which returns every combination (Cartesian product) of the supplied variables from a dataset. To make it work, we still need to create a column that uniquely identifies each row in the data: unemp_lux_data4 <- unemp_lux_data %>% select(division, year, unemployment_rate_in_percent) %>% rowid_to_column() %>% pivot_wider_spec(spec = unemp_spec) unemp_lux_data4 ## # A tibble: 1,770 × 2,007 ## rowid Beauf…¹ Beauf…² Beauf…³ Beauf…⁴ Beauf…⁵ Beauf…⁶ Beauf…⁷ Beauf…⁸ Beauf…⁹ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 5.09 NA NA NA NA NA NA NA NA ## 2 2 NA 3.5 NA NA NA NA NA NA NA ## 3 3 NA NA 5.17 NA NA NA NA NA NA ## 4 4 NA NA NA 5.07 NA NA NA NA NA ## 5 5 NA NA NA NA 5.89 NA NA NA NA ## 6 6 NA NA NA NA NA 6.72 NA NA NA ## 7 7 NA NA NA NA NA NA 4.3 NA NA ## 8 8 NA NA NA NA NA NA NA 7.08 NA ## 9 9 NA NA NA NA NA NA NA NA 8.52 ## 10 10 NA NA NA NA NA NA NA NA NA ## # … with 1,760 more rows, 1,997 more variables: Beaufort_2010 <dbl>, ## # Beaufort_2011 <dbl>, Beaufort_2012 <dbl>, Beaufort_2013 <dbl>, ## # Beaufort_2014 <dbl>, Beaufort_2015 <dbl>, Beaufort_2016 <dbl>, ## # Beaufort_2017 <dbl>, Bech_2001 <dbl>, Bech_2002 <dbl>, Bech_2003 <dbl>, ## # Bech_2004 <dbl>, Bech_2005 <dbl>, Bech_2006 <dbl>, Bech_2007 <dbl>, ## # Bech_2008 <dbl>, Bech_2009 <dbl>, Bech_2010 <dbl>, Bech_2011 <dbl>, ## # Bech_2012 <dbl>, Bech_2013 <dbl>, Bech_2014 <dbl>, Bech_2015 <dbl>, … You can notice that now we have columns for 2016 and 2017 too. Let’s clean the data a little bit more: unemp_lux_data4 %>% select(-rowid) %>% fill(matches(".*"), .direction = "down") %>% slice(n()) ## # A tibble: 1 × 2,006 ## Beaufort_2001 Beaufo…¹ Beauf…² Beauf…³ Beauf…⁴ Beauf…⁵ Beauf…⁶ Beauf…⁷ Beauf…⁸ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5.09 3.5 5.17 5.07 5.89 6.72 4.3 7.08 8.52 ## # … with 1,997 more variables: Beaufort_2010 <dbl>, Beaufort_2011 <dbl>, ## # Beaufort_2012 <dbl>, Beaufort_2013 <dbl>, Beaufort_2014 <dbl>, ## # Beaufort_2015 <dbl>, Beaufort_2016 <dbl>, Beaufort_2017 <dbl>, ## # Bech_2001 <dbl>, Bech_2002 <dbl>, Bech_2003 <dbl>, Bech_2004 <dbl>, ## # Bech_2005 <dbl>, Bech_2006 <dbl>, Bech_2007 <dbl>, Bech_2008 <dbl>, ## # Bech_2009 <dbl>, Bech_2010 <dbl>, Bech_2011 <dbl>, Bech_2012 <dbl>, ## # Bech_2013 <dbl>, Bech_2014 <dbl>, Bech_2015 <dbl>, Bech_2016 <dbl>, … We will learn about fill(), another {tidyr} function, a bit later in this chapter, but its basic purpose is to fill rows with whatever value comes before or after the missing values. slice(n()) then only keeps the last row of the data frame, which is the row that contains all the values (except for 2016 and 2017, which have missing values, as we wanted).
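To see what fill() does in isolation before we meet it again, here is a toy sketch; the one-column tibble is made up for illustration:

library(tidyr)
library(tibble)

tibble(x = c(1, NA, NA, 2)) %>%
  fill(x, .direction = "down") # x becomes 1, 1, 1, 2: each NA takes the last non-missing value above it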
Here is another example of the importance of having an identifier column when using a spec: data(mtcars) mtcars_spec <- mtcars %>% tidyr::expand(am, cyl, .value = "mpg") %>% unite(".name", am, cyl, remove = FALSE) mtcars_spec We can now transform the data: mtcars %>% pivot_wider_spec(spec = mtcars_spec) ## # A tibble: 32 × 14 ## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4` `1_6` ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA 21 ## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA 21 ## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8 NA ## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA NA ## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA NA ## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA NA ## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA NA ## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA NA ## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA NA ## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA NA ## # … with 22 more rows, and 1 more variable: `1_8` <dbl> As you can see, there are several values of “mpg” for some combinations of “am” times “cyl”. If we remove the other columns, each row will not be uniquely identified anymore. This results in a warning message, and a tibble that contains list-columns: mtcars %>% select(am, cyl, mpg) %>% pivot_wider_spec(spec = mtcars_spec) ## Warning: Values from `mpg` are not uniquely identified; output will contain list-cols. ## * Use `values_fn = list` to suppress this warning. ## * Use `values_fn = {summary_fun}` to summarise duplicates. ## * Use the following dplyr code to identify duplicates. ## {data} %>% ## dplyr::group_by(am, cyl) %>% ## dplyr::summarise(n = dplyr::n(), .groups = "drop") %>% ## dplyr::filter(n > 1L) ## # A tibble: 1 × 6 ## `0_4` `0_6` `0_8` `1_4` `1_6` `1_8` ## <list> <list> <list> <list> <list> <list> ## 1 <dbl [3]> <dbl [4]> <dbl [12]> <dbl [8]> <dbl [3]> <dbl [2]> We are going to learn about list-columns in the next section. List-columns are very powerful, and mastering them will be important. But generally speaking, when reshaping data, if you get list-columns back it often means that something went wrong. So you have to be careful with this. pivot_longer() is used when you need to go from a wide to a long dataset, meaning, a dataset where there are some columns that should not be columns, but rather, the levels of a factor variable. Let’s suppose that the “am” column is split into two columns, 0 for automatic and 1 for manual transmissions, and that the values filling these columns are miles per gallon, “mpg”: mtcars_wide_am <- mtcars %>% pivot_wider(names_from = am, values_from = mpg) mtcars_wide_am %>% select(`0`, `1`, everything()) ## # A tibble: 32 × 11 ## `0` `1` cyl disp hp drat wt qsec vs gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 NA 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 NA 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 4 21.4 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 5 18.7 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 6 18.1 NA 6 225 105 2.76 3.46 20.2 1 3 1 ## 7 14.3 NA 8 360 245 3.21 3.57 15.8 0 3 4 ## 8 24.4 NA 4 147. 62 3.69 3.19 20 1 4 2 ## 9 22.8 NA 4 141. 95 3.92 3.15 22.9 1 4 2 ## 10 19.2 NA 6 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows As you can see, the “0” and “1” columns should not be their own columns, unless there is a very specific and good reason they should… but rather, they should be the levels of another column (in our case, “am”).
We can go back to a long dataset like so: mtcars_wide_am %>% pivot_longer(cols = c(`1`, `0`), names_to = "am", values_to = "mpg") %>% select(am, mpg, everything()) ## # A tibble: 64 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4 ## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4 ## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1 ## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## # … with 54 more rows In the cols argument, you need to list all the variables that need to be transformed. Only 1 and 0 must be pivoted, so I list them. Just for illustration purposes, imagine that we needed to pivot 50 columns. It would be faster to list the columns that do not need to be pivoted. This can be achieved by listing the columns that must be excluded with - in front, and maybe using matches() with a regular expression: mtcars_wide_am %>% pivot_longer(cols = -matches("^[[:alpha:]]"), names_to = "am", values_to = "mpg") %>% select(am, mpg, everything()) ## # A tibble: 64 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4 ## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4 ## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1 ## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## # … with 54 more rows Every column that starts with a letter is ok, so there is no need to pivot them. I use the matches() function with a regular expression so that I don’t have to type the names of all the columns. select() is used to re-order the columns, only for viewing purposes. names_to = takes a string as argument, which will be the name of the column containing the levels 0 and 1, and values_to = also takes a string as argument, which will be the name of the column containing the values. Finally, you can see that there are a lot of NAs in the output. These can be removed easily: mtcars_wide_am %>% pivot_longer(cols = c(`1`, `0`), names_to = "am", values_to = "mpg", values_drop_na = TRUE) %>% select(am, mpg, everything()) ## # A tibble: 32 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 18.1 6 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 14.3 8 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 24.4 4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 19.2 6 168.
123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Now for a more advanced example, let’s suppose that we are dealing with the following wide dataset: mtcars_wide <- mtcars %>% pivot_wider_spec(spec = mtcars_spec) mtcars_wide ## # A tibble: 32 × 14 ## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4` `1_6` ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA 21 ## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA 21 ## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8 NA ## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA NA ## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA NA ## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA NA ## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA NA ## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA NA ## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA NA ## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA NA ## # … with 22 more rows, and 1 more variable: `1_8` <dbl> The difficulty here is that we have columns with two levels of information. For instance, the column “0_4” contains the miles per gallon values for automatic cars (0) with 4 cylinders. The first step is to pivot the columns: mtcars_wide %>% pivot_longer(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", values_drop_na = TRUE) %>% select(am_cyl, mpg, everything()) ## # A tibble: 32 × 10 ## am_cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1_6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1_6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1_4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0_6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0_8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0_6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0_8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0_4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0_4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0_6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Now we only need to separate the “am_cyl” column into two new columns, “am” and “cyl”: mtcars_wide %>% pivot_longer(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", values_drop_na = TRUE) %>% separate(am_cyl, into = c("am", "cyl"), sep = "_") %>% select(am, cyl, mpg, everything()) ## # A tibble: 32 × 11 ## am cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows It is also possible to construct a specification data frame, just like for pivot_wider_spec().
This time, I’m using the build_longer_spec() function that makes it easy to build specifications: mtcars_spec_long <- mtcars_wide %>% build_longer_spec(matches("0|1"), values_to = "mpg") %>% separate(name, c("am", "cyl"), sep = "_") mtcars_spec_long ## # A tibble: 6 × 4 ## .name .value am cyl ## <chr> <chr> <chr> <chr> ## 1 0_4 mpg 0 4 ## 2 0_6 mpg 0 6 ## 3 0_8 mpg 0 8 ## 4 1_4 mpg 1 4 ## 5 1_6 mpg 1 6 ## 6 1_8 mpg 1 8 This spec can now be supplied to pivot_longer_spec(): mtcars_wide %>% pivot_longer_spec(spec = mtcars_spec_long, values_drop_na = TRUE) %>% select(am, cyl, mpg, everything()) ## # A tibble: 32 × 11 ## am cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Defining specifications gives a lot of flexibility, and in some complicated cases it is the way to go. 4.4.2 fill() and full_seq() fill() is pretty useful to… fill in missing values. For instance, in survey_data, some “id”s are missing: survey_data ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 It seems pretty obvious that the first NA is supposed to be 1 and the second one is supposed to be 4. With fill(), this is pretty easy to achieve: survey_data %>% fill(.direction = "down", id) full_seq() is another useful function; it generates the complete sequence between the values of a vector: full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1) ## [1] "2018-08-01" "2018-08-02" "2018-08-03" We can add this as the date column to our survey data: survey_data %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) ## # A tibble: 12 × 4 ## id variable value date ## <dbl> <chr> <dbl> <date> ## 1 1 var1 1 2018-08-01 ## 2 1 var2 0.2 2018-08-02 ## 3 NA var3 0.3 2018-08-03 ## 4 2 var1 1.4 2018-08-01 ## 5 2 var2 1.9 2018-08-02 ## 6 2 var3 4.1 2018-08-03 ## 7 3 var1 0.1 2018-08-01 ## 8 3 var2 2.8 2018-08-02 ## 9 3 var3 8.9 2018-08-03 ## 10 4 var1 1.7 2018-08-01 ## 11 NA var2 1.9 2018-08-02 ## 12 4 var3 7.6 2018-08-03 I use the base rep() function to repeat the date 4 times and then, using mutate(), I added it to the data frame. Putting all these operations together: survey_data %>% fill(.direction = "down", id) %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) ## # A tibble: 12 × 4 ## id variable value date ## <dbl> <chr> <dbl> <date> ## 1 1 var1 1 2018-08-01 ## 2 1 var2 0.2 2018-08-02 ## 3 1 var3 0.3 2018-08-03 ## 4 2 var1 1.4 2018-08-01 ## 5 2 var2 1.9 2018-08-02 ## 6 2 var3 4.1 2018-08-03 ## 7 3 var1 0.1 2018-08-01 ## 8 3 var2 2.8 2018-08-02 ## 9 3 var3 8.9 2018-08-03 ## 10 4 var1 1.7 2018-08-01 ## 11 4 var2 1.9 2018-08-02 ## 12 4 var3 7.6 2018-08-03 You should be careful when imputing missing values though. The method described above is called Last Observation Carried Forward, and sometimes it makes sense, like here, but sometimes it doesn’t and doing this will introduce bias in your analysis.
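fill() can also carry values backwards (Next Observation Carried Backward) with .direction = "up"; a quick sketch:

survey_data %>%
  fill(id, .direction = "up") # each NA takes the next non-missing value below it; here the first NA would become 2, not 1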
Discussing how to handle missing values in your analysis is outside of the scope of this book, but there are many resources available. You may want to check out the vignettes of the {mice} package, which list many resources to get you started. 4.4.3 Put order in your columns with separate(), unite(), and in your rows with separate_rows() Sometimes, data can be in a format that makes working with it needlessly painful. For example, you get this: survey_data_not_tidy ## # A tibble: 12 × 3 ## id variable_date value ## <dbl> <chr> <dbl> ## 1 1 var1/2018-08-01 1 ## 2 1 var2/2018-08-02 0.2 ## 3 1 var3/2018-08-03 0.3 ## 4 2 var1/2018-08-01 1.4 ## 5 2 var2/2018-08-02 1.9 ## 6 2 var3/2018-08-03 4.1 ## 7 3 var1/2018-08-01 0.1 ## 8 3 var2/2018-08-02 2.8 ## 9 3 var3/2018-08-03 8.9 ## 10 4 var1/2018-08-01 1.7 ## 11 4 var2/2018-08-02 1.9 ## 12 4 var3/2018-08-03 7.6 Dealing with this is simple, thanks to separate(): survey_data_not_tidy %>% separate(variable_date, into = c("variable", "date"), sep = "/") ## # A tibble: 12 × 4 ## id variable date value ## <dbl> <chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 The variable_date column gets separated into two columns, variable and date. One also needs to specify the separator, in this case “/”. unite() is the reverse operation, which can be useful when you are confronted with this situation: survey_data2 ## # A tibble: 12 × 6 ## id variable year month day value ## <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 1 var1 2018 08 01 1 ## 2 1 var2 2018 08 02 0.2 ## 3 1 var3 2018 08 03 0.3 ## 4 2 var1 2018 08 01 1.4 ## 5 2 var2 2018 08 02 1.9 ## 6 2 var3 2018 08 03 4.1 ## 7 3 var1 2018 08 01 0.1 ## 8 3 var2 2018 08 02 2.8 ## 9 3 var3 2018 08 03 8.9 ## 10 4 var1 2018 08 01 1.7 ## 11 4 var2 2018 08 02 1.9 ## 12 4 var3 2018 08 03 7.6 In some situations, it is better to have the date as a single column: survey_data2 %>% unite(date, year, month, day, sep = "-") ## # A tibble: 12 × 4 ## id variable date value ## <dbl> <chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 Another awful situation is the following: survey_data_from_hell ## id variable value ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1, var2, var3 1.4, 1.9, 4.1 ## 5 3 var1, var2 0.1, 2.8 ## 6 3 var3 8.9 ## 7 4 var1 1.7 ## 8 NA var2 1.9 ## 9 4 var3 7.6 separate_rows() saves the day: survey_data_from_hell %>% separate_rows(variable, value) ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <chr> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 So to summarise… you can go from this: survey_data_from_hell ## id variable value ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1, var2, var3 1.4, 1.9, 4.1 ## 5 3 var1, var2 0.1, 2.8 ## 6 3 var3 8.9 ## 7 4 var1 1.7 ## 8 NA var2 1.9 ## 9 4 var3 7.6 to this: survey_data_clean ## # A tibble: 12 × 4 ## id variable date value ## <dbl>
<chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 quite easily: survey_data_from_hell %>% separate_rows(variable, value, convert = TRUE) %>% fill(.direction = "down", id) %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) 4.5 Working on many columns with if_any(), if_all() and across() 4.5.1 Filtering rows where several columns verify a condition Let’s go back to the gasoline data from the {plm} package. When using filter(), it is only possible to filter one column at a time. For example, you can only filter rows where a column equals “France” for instance. But suppose that we have a condition that we want to apply to a lot of columns at once. For example, for every column that is of type numeric, keep only the lines where the condition value > -8 is satisfied. The next line does that: gasoline %>% filter(if_any(where(is.numeric), \(x)(`>`(x, -8)))) ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows The above code is using the if_any() function, included in {dplyr}. It also uses where(), which must be used for predicate functions like is.numeric(), or is.character(), etc. You can think of if_any() as a function that helps you select the columns to which to apply the function. You can read the code above like this: Start with the gasoline data, then keep the rows where at least one of the numeric columns has a value greater than -8. if_any(), if_all() and across() make operations like these very easy to achieve. Sometimes, you’d want to filter rows using columns whose names end with a particular letter, for instance “p”. This can again be achieved using another helper, ends_with(), instead of where(): gasoline %>% filter(if_any(ends_with("p"), \(x)(`>`(x, -8)))) ## # A tibble: 340 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 330 more rows We already know about ends_with() and starts_with(). So the above line means “for the columns whose names end with a ‘p’, only keep the lines where at least one of the selected columns has a value strictly superior to -8”. if_all() works exactly the same way, but think of the if in if_all() as having the conditions separated by and, while the if of if_any() has them separated by or.
So for example, the code above, where if_any() is replaced by if_all(), results in a much smaller data frame: gasoline %>% filter(if_all(ends_with("p"), \(x)(`>`(x, -8)))) ## # A tibble: 30 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 canada 1972 4.89 -5.44 -1.10 -7.99 ## 2 canada 1973 4.90 -5.41 -1.13 -7.94 ## 3 canada 1974 4.89 -5.42 -1.12 -7.90 ## 4 canada 1975 4.89 -5.38 -1.19 -7.87 ## 5 canada 1976 4.84 -5.36 -1.06 -7.81 ## 6 canada 1977 4.81 -5.34 -1.07 -7.77 ## 7 canada 1978 4.86 -5.31 -1.07 -7.79 ## 8 germany 1978 3.88 -5.56 -0.628 -7.95 ## 9 sweden 1975 3.97 -7.68 -2.77 -7.99 ## 10 sweden 1976 3.98 -7.67 -2.82 -7.96 ## # … with 20 more rows because here, we only keep the rows where ALL the columns ending with “p” are simultaneously greater than -8. 4.5.2 Selecting several columns at once In a previous section, we already played around a little bit with select() and some helpers, everything(), starts_with() and ends_with(). But there are many ways that you can use helper functions to select several columns easily: gasoline %>% select(where(is.numeric)) ## # A tibble: 342 × 5 ## year lgaspcar lincomep lrpmg lcarpcap ## <int> <dbl> <dbl> <dbl> <dbl> ## 1 1960 4.17 -6.47 -0.335 -9.77 ## 2 1961 4.10 -6.43 -0.351 -9.61 ## 3 1962 4.07 -6.41 -0.380 -9.46 ## 4 1963 4.06 -6.37 -0.414 -9.34 ## 5 1964 4.04 -6.32 -0.445 -9.24 ## 6 1965 4.03 -6.29 -0.497 -9.12 ## 7 1966 4.05 -6.25 -0.467 -9.02 ## 8 1967 4.05 -6.23 -0.506 -8.93 ## 9 1968 4.05 -6.21 -0.522 -8.85 ## 10 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Selecting by column position is also possible: gasoline %>% select(c(1, 2, 5)) ## # A tibble: 342 × 3 ## country year lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows As is selecting columns starting or ending with a certain string of characters, as discussed previously: gasoline %>% select(starts_with("l")) ## # A tibble: 342 × 4 ## lgaspcar lincomep lrpmg lcarpcap ## <dbl> <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -0.335 -9.77 ## 2 4.10 -6.43 -0.351 -9.61 ## 3 4.07 -6.41 -0.380 -9.46 ## 4 4.06 -6.37 -0.414 -9.34 ## 5 4.04 -6.32 -0.445 -9.24 ## 6 4.03 -6.29 -0.497 -9.12 ## 7 4.05 -6.25 -0.467 -9.02 ## 8 4.05 -6.23 -0.506 -8.93 ## 9 4.05 -6.21 -0.522 -8.85 ## 10 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Another very neat trick is selecting columns that may or may not exist in your data frame.
For these quick examples, let's use the mtcars dataset: sort(colnames(mtcars)) ## [1] "am" "carb" "cyl" "disp" "drat" "gear" "hp" "mpg" "qsec" "vs" ## [11] "wt" Let's create a vector with some column names: cols_to_select <- c("mpg", "cyl", "am", "nonsense") The following selects the columns that exist in the data frame and silently ignores the one that does not exist: mtcars %>% select(any_of(cols_to_select)) ## mpg cyl am ## Mazda RX4 21.0 6 1 ## Mazda RX4 Wag 21.0 6 1 ## Datsun 710 22.8 4 1 ## Hornet 4 Drive 21.4 6 0 ## Hornet Sportabout 18.7 8 0 ## Valiant 18.1 6 0 ## Duster 360 14.3 8 0 ## Merc 240D 24.4 4 0 ## Merc 230 22.8 4 0 ## Merc 280 19.2 6 0 ## Merc 280C 17.8 6 0 ## Merc 450SE 16.4 8 0 ## Merc 450SL 17.3 8 0 ## Merc 450SLC 15.2 8 0 ## Cadillac Fleetwood 10.4 8 0 ## Lincoln Continental 10.4 8 0 ## Chrysler Imperial 14.7 8 0 ## Fiat 128 32.4 4 1 ## Honda Civic 30.4 4 1 ## Toyota Corolla 33.9 4 1 ## Toyota Corona 21.5 4 0 ## Dodge Challenger 15.5 8 0 ## AMC Javelin 15.2 8 0 ## Camaro Z28 13.3 8 0 ## Pontiac Firebird 19.2 8 0 ## Fiat X1-9 27.3 4 1 ## Porsche 914-2 26.0 4 1 ## Lotus Europa 30.4 4 1 ## Ford Pantera L 15.8 8 1 ## Ferrari Dino 19.7 6 1 ## Maserati Bora 15.0 8 1 ## Volvo 142E 21.4 4 1 And finally, if you want it to fail, don't use any helper: mtcars %>% select(cols_to_select) Error: Can't subset columns that don't exist. The column `nonsense` doesn't exist. or use all_of(): mtcars %>% select(all_of(cols_to_select)) ✖ Column `nonsense` doesn't exist. Bulk-renaming can be achieved using rename_with(): gasoline %>% rename_with(toupper, is.numeric) ## Warning: Predicate functions must be wrapped in `where()`. ## ## # Bad ## data %>% select(is.numeric) ## ## # Good ## data %>% select(where(is.numeric)) ## ## ℹ Please update your code. ## This message is displayed once per session. ## # A tibble: 342 × 6 ## country YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows You can also pass your own functions to rename_with(): gasoline %>% rename_with(\(x)(paste0("new_", x))) ## # A tibble: 342 × 6 ## new_country new_year new_lgaspcar new_lincomep new_lrpmg new_lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows The reason I'm talking about renaming in a section about selecting is that you can also rename with select(): gasoline %>% select(YEAR = year) ## # A tibble: 342 × 1 ## YEAR ## <int> ## 1 1960 ## 2 1961 ## 3 1962 ## 4 1963 ## 5 1964 ## 6 1965 ## 7 1966 ## 8 1967 ## 9 1968 ## 10 1969 ## # … with 332 more rows but of course here, you only keep that one column, and you can't rename with a function. 4.5.3 Summarising with across() across() is used for summarising data.
It allows you to compute aggregations… across several columns. It is especially useful with group_by(). To illustrate how group_by() works with across(), I have to first modify the gasoline data a little bit. As you can see below, the year column is of type integer: gasoline %>% lapply(typeof) ## $country ## [1] "character" ## ## $year ## [1] "integer" ## ## $lgaspcar ## [1] "double" ## ## $lincomep ## [1] "double" ## ## $lrpmg ## [1] "double" ## ## $lcarpcap ## [1] "double" (we'll discuss lapply() in a later chapter, but to give you a little taste: lapply() applies a function to each element of a list or of a data frame; in this case, it applied the typeof() function to each column of the gasoline data set, returning the type of each column) Let's change that to character: gasoline <- gasoline %>% mutate(year = as.character(year), country = as.character(country)) This now allows me to group by the type of the columns, for instance: gasoline %>% group_by(across(where(is.character))) %>% summarise(mean_lincomep = mean(lincomep)) ## `summarise()` has grouped output by 'country'. You can override using the ## `.groups` argument. ## # A tibble: 342 × 3 ## # Groups: country [18] ## country year mean_lincomep ## <chr> <chr> <dbl> ## 1 austria 1960 -6.47 ## 2 austria 1961 -6.43 ## 3 austria 1962 -6.41 ## 4 austria 1963 -6.37 ## 5 austria 1964 -6.32 ## 6 austria 1965 -6.29 ## 7 austria 1966 -6.25 ## 8 austria 1967 -6.23 ## 9 austria 1968 -6.21 ## 10 austria 1969 -6.15 ## # … with 332 more rows This is faster than having to write: gasoline %>% group_by(country, year) %>% summarise(mean_lincomep = mean(lincomep)) ## `summarise()` has grouped output by 'country'. You can override using the ## `.groups` argument. ## # A tibble: 342 × 3 ## # Groups: country [18] ## country year mean_lincomep ## <chr> <chr> <dbl> ## 1 austria 1960 -6.47 ## 2 austria 1961 -6.43 ## 3 austria 1962 -6.41 ## 4 austria 1963 -6.37 ## 5 austria 1964 -6.32 ## 6 austria 1965 -6.29 ## 7 austria 1966 -6.25 ## 8 austria 1967 -6.23 ## 9 austria 1968 -6.21 ## 10 austria 1969 -6.15 ## # … with 332 more rows You may think that having to write the names of two variables is not a huge deal, which is true. But imagine that you have dozens of character columns that you want to group by. With across() and the helper functions, it doesn't matter if the data frame has 2 columns you need to group by or 2000. All that matters is that you can find some commonalities between all these columns that make it easy to select them. It can be their type, as we have seen before, or their label: gasoline %>% group_by(across(contains("y"))) %>% summarise(mean_licomep = mean(lincomep)) ## `summarise()` has grouped output by 'country'. You can override using the ## `.groups` argument. ## # A tibble: 342 × 3 ## # Groups: country [18] ## country year mean_licomep ## <chr> <chr> <dbl> ## 1 austria 1960 -6.47 ## 2 austria 1961 -6.43 ## 3 austria 1962 -6.41 ## 4 austria 1963 -6.37 ## 5 austria 1964 -6.32 ## 6 austria 1965 -6.29 ## 7 austria 1966 -6.25 ## 8 austria 1967 -6.23 ## 9 austria 1968 -6.21 ## 10 austria 1969 -6.15 ## # … with 332 more rows But it's also possible to group_by() position: gasoline %>% group_by(across(c(1, 2))) %>% summarise(mean_licomep = mean(lincomep)) ## `summarise()` has grouped output by 'country'. You can override using the ## `.groups` argument.
## # A tibble: 342 × 3 ## # Groups: country [18] ## country year mean_licomep ## <chr> <chr> <dbl> ## 1 austria 1960 -6.47 ## 2 austria 1961 -6.43 ## 3 austria 1962 -6.41 ## 4 austria 1963 -6.37 ## 5 austria 1964 -6.32 ## 6 austria 1965 -6.29 ## 7 austria 1966 -6.25 ## 8 austria 1967 -6.23 ## 9 austria 1968 -6.21 ## 10 austria 1969 -6.15 ## # … with 332 more rows Using a sequence is also possible: gasoline %>% group_by(across(seq(1:2))) %>% summarise(mean_lincomep = mean(lincomep)) ## `summarise()` has grouped output by 'country'. You can override using the ## `.groups` argument. ## # A tibble: 342 × 3 ## # Groups: country [18] ## country year mean_lincomep ## <chr> <chr> <dbl> ## 1 austria 1960 -6.47 ## 2 austria 1961 -6.43 ## 3 austria 1962 -6.41 ## 4 austria 1963 -6.37 ## 5 austria 1964 -6.32 ## 6 austria 1965 -6.29 ## 7 austria 1966 -6.25 ## 8 austria 1967 -6.23 ## 9 austria 1968 -6.21 ## 10 austria 1969 -6.15 ## # … with 332 more rows but be careful, selecting by position is dangerous. If the position of columns changes, your code will fail. Selecting by type or label is much more robust, especially by label, since types can change as well (for example a date column can easily be exported as character column, etc). 4.5.4 summarise() across many columns Summarising across many columns is really incredibly useful and in my opinion one of the best arguments in favour of switching to a {tidyverse} only workflow: gasoline %>% group_by(country) %>% summarise(across(starts_with("l"), mean)) ## # A tibble: 18 × 5 ## country lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 -6.12 -0.486 -8.85 ## 2 belgium 3.92 -5.85 -0.326 -8.63 ## 3 canada 4.86 -5.58 -1.05 -8.08 ## 4 denmark 4.19 -5.76 -0.358 -8.58 ## 5 france 3.82 -5.87 -0.253 -8.45 ## 6 germany 3.89 -5.85 -0.517 -8.51 ## 7 greece 4.88 -6.61 -0.0339 -10.8 ## 8 ireland 4.23 -6.44 -0.348 -9.04 ## 9 italy 3.73 -6.35 -0.152 -8.83 ## 10 japan 4.70 -6.25 -0.287 -9.95 ## 11 netherla 4.08 -5.92 -0.370 -8.82 ## 12 norway 4.11 -5.75 -0.278 -8.77 ## 13 spain 4.06 -5.63 0.739 -9.90 ## 14 sweden 4.01 -7.82 -2.71 -8.25 ## 15 switzerl 4.24 -5.93 -0.902 -8.54 ## 16 turkey 5.77 -7.34 -0.422 -12.5 ## 17 u.k. 3.98 -6.02 -0.459 -8.55 ## 18 u.s.a. 
4.82 -5.45 -1.21 -7.78 But where summarise() and across() really shine is when you want to apply several functions to many columns at once: gasoline %>% group_by(country) %>% summarise(across(starts_with("l"), tibble::lst(mean, sd, max, min), .names = "{fn}_{col}")) ## # A tibble: 18 × 17 ## country mean_lgasp…¹ sd_lg…² max_l…³ min_l…⁴ mean_…⁵ sd_li…⁶ max_l…⁷ min_l…⁸ ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 4.20 3.92 -6.12 0.235 -5.76 -6.47 ## 2 belgium 3.92 0.103 4.16 3.82 -5.85 0.227 -5.53 -6.22 ## 3 canada 4.86 0.0262 4.90 4.81 -5.58 0.193 -5.31 -5.89 ## 4 denmark 4.19 0.158 4.50 4.00 -5.76 0.176 -5.48 -6.06 ## 5 france 3.82 0.0499 3.91 3.75 -5.87 0.241 -5.53 -6.26 ## 6 germany 3.89 0.0239 3.93 3.85 -5.85 0.193 -5.56 -6.16 ## 7 greece 4.88 0.255 5.38 4.48 -6.61 0.331 -6.15 -7.16 ## 8 ireland 4.23 0.0437 4.33 4.16 -6.44 0.162 -6.19 -6.72 ## 9 italy 3.73 0.220 4.05 3.38 -6.35 0.217 -6.08 -6.73 ## 10 japan 4.70 0.684 6.00 3.95 -6.25 0.425 -5.71 -6.99 ## 11 netherla 4.08 0.286 4.65 3.71 -5.92 0.193 -5.66 -6.22 ## 12 norway 4.11 0.123 4.44 3.96 -5.75 0.201 -5.42 -6.09 ## 13 spain 4.06 0.317 4.75 3.62 -5.63 0.278 -5.29 -6.17 ## 14 sweden 4.01 0.0364 4.07 3.91 -7.82 0.126 -7.67 -8.07 ## 15 switzerl 4.24 0.102 4.44 4.05 -5.93 0.124 -5.75 -6.16 ## 16 turkey 5.77 0.329 6.16 5.14 -7.34 0.331 -6.89 -7.84 ## 17 u.k. 3.98 0.0479 4.10 3.91 -6.02 0.107 -5.84 -6.19 ## 18 u.s.a. 4.82 0.0219 4.86 4.79 -5.45 0.148 -5.22 -5.70 ## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, max_lrpmg <dbl>, ## # min_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>, ## # max_lcarpcap <dbl>, min_lcarpcap <dbl>, and abbreviated variable names ## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​max_lgaspcar, ⁴​min_lgaspcar, ⁵​mean_lincomep, ## # ⁶​sd_lincomep, ⁷​max_lincomep, ⁸​min_lincomep Here, I first started by grouping by country, then I applied the mean(), sd(), max() and min() functions to every column starting with the character \"l\". tibble::lst() allows you to create a list just like with list() but names its arguments automatically. So the mean() function gets name \"mean\", and so on. Finally, I use the .names = argument to create the template for the new column names. {fn}_{col} creates new column names of the form function name _ column name. 
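As a side note, tibble::lst() is convenient but not required for this pattern: a plain named list of functions works the same way with the .names = template. Here is a minimal sketch on a made-up toy tibble (the data and names are invented purely for illustration):

```r
library(dplyr)

toy <- tibble(g = c("a", "a", "b"),
              x = c(1, 2, 3),
              y = c(4, 5, 6))

# naming the list elements by hand fills the {fn} placeholder,
# just like tibble::lst() does automatically
toy %>%
  group_by(g) %>%
  summarise(across(c(x, y),
                   list(mean = mean, sd = sd),
                   .names = "{fn}_{col}"))
```

This should return the columns mean_x, sd_x, mean_y and sd_y, with one row per group.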
As mentioned before, across() works with other helper functions: gasoline %>% group_by(country) %>% summarise(across(contains("car"), tibble::lst(mean, sd, max, min), .names = "{fn}_{col}")) ## # A tibble: 18 × 9 ## country mean_lgasp…¹ sd_lg…² max_l…³ min_l…⁴ mean_…⁵ sd_lc…⁶ max_l…⁷ min_l…⁸ ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 4.20 3.92 -8.85 0.473 -8.21 -9.77 ## 2 belgium 3.92 0.103 4.16 3.82 -8.63 0.417 -8.10 -9.41 ## 3 canada 4.86 0.0262 4.90 4.81 -8.08 0.195 -7.77 -8.38 ## 4 denmark 4.19 0.158 4.50 4.00 -8.58 0.349 -8.20 -9.33 ## 5 france 3.82 0.0499 3.91 3.75 -8.45 0.344 -8.01 -9.15 ## 6 germany 3.89 0.0239 3.93 3.85 -8.51 0.406 -7.95 -9.34 ## 7 greece 4.88 0.255 5.38 4.48 -10.8 0.839 -9.57 -12.2 ## 8 ireland 4.23 0.0437 4.33 4.16 -9.04 0.345 -8.55 -9.70 ## 9 italy 3.73 0.220 4.05 3.38 -8.83 0.639 -8.11 -10.1 ## 10 japan 4.70 0.684 6.00 3.95 -9.95 1.20 -8.59 -12.2 ## 11 netherla 4.08 0.286 4.65 3.71 -8.82 0.617 -8.16 -10.0 ## 12 norway 4.11 0.123 4.44 3.96 -8.77 0.438 -8.17 -9.68 ## 13 spain 4.06 0.317 4.75 3.62 -9.90 0.960 -8.63 -11.6 ## 14 sweden 4.01 0.0364 4.07 3.91 -8.25 0.242 -7.96 -8.74 ## 15 switzerl 4.24 0.102 4.44 4.05 -8.54 0.378 -8.03 -9.26 ## 16 turkey 5.77 0.329 6.16 5.14 -12.5 0.751 -11.2 -13.5 ## 17 u.k. 3.98 0.0479 4.10 3.91 -8.55 0.281 -8.26 -9.12 ## 18 u.s.a. 4.82 0.0219 4.86 4.79 -7.78 0.162 -7.54 -8.02 ## # … with abbreviated variable names ¹​mean_lgaspcar, ²​sd_lgaspcar, ## # ³​max_lgaspcar, ⁴​min_lgaspcar, ⁵​mean_lcarpcap, ⁶​sd_lcarpcap, ⁷​max_lcarpcap, ## # ⁸​min_lcarpcap This is very likely the quickest, most elegant way to summarise that many columns. There’s also a way to summarise where: gasoline %>% group_by(country) %>% summarise(across(where(is.numeric), tibble::lst(mean, sd, min, max), .names = "{fn}_{col}")) ## # A tibble: 18 × 17 ## country mean_lgasp…¹ sd_lg…² min_l…³ max_l…⁴ mean_…⁵ sd_li…⁶ min_l…⁷ max_l…⁸ ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 3.92 4.20 -6.12 0.235 -6.47 -5.76 ## 2 belgium 3.92 0.103 3.82 4.16 -5.85 0.227 -6.22 -5.53 ## 3 canada 4.86 0.0262 4.81 4.90 -5.58 0.193 -5.89 -5.31 ## 4 denmark 4.19 0.158 4.00 4.50 -5.76 0.176 -6.06 -5.48 ## 5 france 3.82 0.0499 3.75 3.91 -5.87 0.241 -6.26 -5.53 ## 6 germany 3.89 0.0239 3.85 3.93 -5.85 0.193 -6.16 -5.56 ## 7 greece 4.88 0.255 4.48 5.38 -6.61 0.331 -7.16 -6.15 ## 8 ireland 4.23 0.0437 4.16 4.33 -6.44 0.162 -6.72 -6.19 ## 9 italy 3.73 0.220 3.38 4.05 -6.35 0.217 -6.73 -6.08 ## 10 japan 4.70 0.684 3.95 6.00 -6.25 0.425 -6.99 -5.71 ## 11 netherla 4.08 0.286 3.71 4.65 -5.92 0.193 -6.22 -5.66 ## 12 norway 4.11 0.123 3.96 4.44 -5.75 0.201 -6.09 -5.42 ## 13 spain 4.06 0.317 3.62 4.75 -5.63 0.278 -6.17 -5.29 ## 14 sweden 4.01 0.0364 3.91 4.07 -7.82 0.126 -8.07 -7.67 ## 15 switzerl 4.24 0.102 4.05 4.44 -5.93 0.124 -6.16 -5.75 ## 16 turkey 5.77 0.329 5.14 6.16 -7.34 0.331 -7.84 -6.89 ## 17 u.k. 3.98 0.0479 3.91 4.10 -6.02 0.107 -6.19 -5.84 ## 18 u.s.a. 4.82 0.0219 4.79 4.86 -5.45 0.148 -5.70 -5.22 ## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, min_lrpmg <dbl>, ## # max_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>, ## # min_lcarpcap <dbl>, max_lcarpcap <dbl>, and abbreviated variable names ## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​min_lgaspcar, ⁴​max_lgaspcar, ⁵​mean_lincomep, ## # ⁶​sd_lincomep, ⁷​min_lincomep, ⁸​max_lincomep This allows you to summarise every column that contains real numbers. 
The difference between is.double() and is.numeric() is that is.numeric() returns TRUE for integers too, whereas is.double() returns TRUE for floating point numbers only (integers are real numbers too, but you know what I mean). It is also possible to summarise every column at once: gasoline %>% select(-year) %>% group_by(country) %>% summarise(across(everything(), tibble::lst(mean, sd, min, max), .names = "{fn}_{col}")) ## # A tibble: 18 × 17 ## country mean_lgasp…¹ sd_lg…² min_l…³ max_l…⁴ mean_…⁵ sd_li…⁶ min_l…⁷ max_l…⁸ ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 3.92 4.20 -6.12 0.235 -6.47 -5.76 ## 2 belgium 3.92 0.103 3.82 4.16 -5.85 0.227 -6.22 -5.53 ## 3 canada 4.86 0.0262 4.81 4.90 -5.58 0.193 -5.89 -5.31 ## 4 denmark 4.19 0.158 4.00 4.50 -5.76 0.176 -6.06 -5.48 ## 5 france 3.82 0.0499 3.75 3.91 -5.87 0.241 -6.26 -5.53 ## 6 germany 3.89 0.0239 3.85 3.93 -5.85 0.193 -6.16 -5.56 ## 7 greece 4.88 0.255 4.48 5.38 -6.61 0.331 -7.16 -6.15 ## 8 ireland 4.23 0.0437 4.16 4.33 -6.44 0.162 -6.72 -6.19 ## 9 italy 3.73 0.220 3.38 4.05 -6.35 0.217 -6.73 -6.08 ## 10 japan 4.70 0.684 3.95 6.00 -6.25 0.425 -6.99 -5.71 ## 11 netherla 4.08 0.286 3.71 4.65 -5.92 0.193 -6.22 -5.66 ## 12 norway 4.11 0.123 3.96 4.44 -5.75 0.201 -6.09 -5.42 ## 13 spain 4.06 0.317 3.62 4.75 -5.63 0.278 -6.17 -5.29 ## 14 sweden 4.01 0.0364 3.91 4.07 -7.82 0.126 -8.07 -7.67 ## 15 switzerl 4.24 0.102 4.05 4.44 -5.93 0.124 -6.16 -5.75 ## 16 turkey 5.77 0.329 5.14 6.16 -7.34 0.331 -7.84 -6.89 ## 17 u.k. 3.98 0.0479 3.91 4.10 -6.02 0.107 -6.19 -5.84 ## 18 u.s.a. 4.82 0.0219 4.79 4.86 -5.45 0.148 -5.70 -5.22 ## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, min_lrpmg <dbl>, ## # max_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>, ## # min_lcarpcap <dbl>, max_lcarpcap <dbl>, and abbreviated variable names ## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​min_lgaspcar, ⁴​max_lgaspcar, ⁵​mean_lincomep, ## # ⁶​sd_lincomep, ⁷​min_lincomep, ⁸​max_lincomep I removed the year variable because it's not a variable for which we want to have descriptive statistics. 4.6 Other useful {tidyverse} functions 4.6.1 if_else(), case_when() and recode() Some other very useful {tidyverse} functions are if_else() and case_when(). These two functions, combined with mutate(), make it easy to create a new variable whose values must respect certain conditions. For instance, we might want to have a dummy that equals 1 if a country is in the European Union (to simplify, say as of 2017) and 0 if not. First, let's create a list of countries that are in the EU: eu_countries <- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus", "czech republic", "denmark", "estonia", "finland", "france", "germany", "greece", "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg", "malta", "netherla", "poland", "portugal", "romania", "slovakia", "slovenia", "spain", "sweden", "u.k.") I've had to change “netherlands” to “netherla” because that's how the country is called in the gasoline data.
Now let’s create a dummy variable that equals 1 for EU countries, and 0 for the others: gasoline %>% mutate(country = tolower(country)) %>% mutate(in_eu = if_else(country %in% eu_countries, 1, 0)) ## # A tibble: 342 × 7 ## country year lgaspcar lincomep lrpmg lcarpcap in_eu ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 1 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 1 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 1 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 1 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 1 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 1 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 1 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 1 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 1 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 1 ## # … with 332 more rows Instead of 1 and 0, we can of course use strings (I add filter(year == 1960) at the end to have a better view of what happened): gasoline %>% mutate(country = tolower(country)) %>% mutate(in_eu = if_else(country %in% eu_countries, "yes", "no")) %>% filter(year == 1960) ## # A tibble: 18 × 7 ## country year lgaspcar lincomep lrpmg lcarpcap in_eu ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 yes ## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 yes ## 3 canada 1960 4.86 -5.89 -0.972 -8.38 no ## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 yes ## 5 france 1960 3.91 -6.26 -0.0196 -9.15 yes ## 6 germany 1960 3.92 -6.16 -0.186 -9.34 yes ## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 yes ## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 yes ## 9 italy 1960 4.05 -6.73 0.165 -10.1 yes ## 10 japan 1960 6.00 -6.99 -0.145 -12.2 no ## 11 netherla 1960 4.65 -6.22 -0.201 -10.0 yes ## 12 norway 1960 4.44 -6.09 -0.140 -9.68 no ## 13 spain 1960 4.75 -6.17 1.13 -11.6 yes ## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 yes ## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 no ## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 no ## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 yes ## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 no I think that if_else() is fairly straightforward, especially if you know ifelse() already. You might be wondering what is the difference between these two. if_else() is stricter than ifelse() and does not do type conversion. Compare the two next lines: ifelse(1 == 1, "0", 1) ## [1] "0" if_else(1 == 1, "0", 1) Error: `false` must be type string, not double Type conversion, especially without a warning is very dangerous. if_else()’s behaviour which consists in failing as soon as possble avoids a lot of pain and suffering, especially when programming non-interactively. if_else() also accepts an optional argument, that allows you to specify what should be returned in case of NA: if_else(1 <= NA, 0, 1, 999) ## [1] 999 # Or if_else(1 <= NA, 0, 1, NA_real_) ## [1] NA case_when() can be seen as a generalization of if_else(). 
Whenever you want to use multiple if_else()s, that’s when you know you should use case_when() (I’m adding the filter at the end for the same reason as before, to see the output better): gasoline %>% mutate(country = tolower(country)) %>% mutate(region = case_when( country %in% c("france", "italy", "turkey", "greece", "spain") ~ "mediterranean", country %in% c("germany", "austria", "switzerl", "belgium", "netherla") ~ "central europe", country %in% c("canada", "u.s.a.", "u.k.", "ireland") ~ "anglosphere", country %in% c("denmark", "norway", "sweden") ~ "nordic", country %in% c("japan") ~ "asia")) %>% filter(year == 1960) ## # A tibble: 18 × 7 ## country year lgaspcar lincomep lrpmg lcarpcap region ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 central europe ## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 central europe ## 3 canada 1960 4.86 -5.89 -0.972 -8.38 anglosphere ## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 nordic ## 5 france 1960 3.91 -6.26 -0.0196 -9.15 mediterranean ## 6 germany 1960 3.92 -6.16 -0.186 -9.34 central europe ## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 mediterranean ## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 anglosphere ## 9 italy 1960 4.05 -6.73 0.165 -10.1 mediterranean ## 10 japan 1960 6.00 -6.99 -0.145 -12.2 asia ## 11 netherla 1960 4.65 -6.22 -0.201 -10.0 central europe ## 12 norway 1960 4.44 -6.09 -0.140 -9.68 nordic ## 13 spain 1960 4.75 -6.17 1.13 -11.6 mediterranean ## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 nordic ## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 central europe ## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 mediterranean ## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 anglosphere ## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 anglosphere If all you want is to recode values, you can use recode(). For example, the Netherlands is written as “NETHERLA” in the gasoline data, which is quite ugly. Same for Switzerland: gasoline <- gasoline %>% mutate(country = tolower(country)) %>% mutate(country = recode(country, "netherla" = "netherlands", "switzerl" = "switzerland")) I saved the data with these changes as they will become useful in the future. Let’s take a look at the data: gasoline %>% filter(country %in% c("netherlands", "switzerland"), year == 1960) ## # A tibble: 2 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 netherlands 1960 4.65 -6.22 -0.201 -10.0 ## 2 switzerland 1960 4.40 -6.16 -0.823 -9.26 4.6.2 lead() and lag() lead() and lag() are especially useful in econometrics. When I was doing my masters, in 4 B.d. (Before dplyr) lagging variables in panel data was quite tricky. 
Now, with {dplyr} it's really very easy: gasoline %>% group_by(country) %>% mutate(lag_lgaspcar = lag(lgaspcar)) %>% mutate(lead_lgaspcar = lead(lgaspcar)) %>% filter(year %in% seq(1960, 1963)) ## # A tibble: 72 × 8 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap lag_lgaspcar lead_lgaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 NA 4.10 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 4.17 4.07 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 4.10 4.06 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 4.07 4.04 ## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 NA 4.12 ## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 4.16 4.08 ## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 4.12 4.00 ## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 4.08 3.99 ## 9 canada 1960 4.86 -5.89 -0.972 -8.38 NA 4.83 ## 10 canada 1961 4.83 -5.88 -0.972 -8.35 4.86 4.85 ## # … with 62 more rows To lag every variable, remember that you can use mutate_if(): gasoline %>% group_by(country) %>% mutate_if(is.double, lag) %>% filter(year %in% seq(1960, 1963)) ## `mutate_if()` ignored the following grouping variables: ## • Column `country` ## # A tibble: 72 × 6 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 ## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 ## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 ## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 ## 9 canada 1960 4.86 -5.89 -0.972 -8.38 ## 10 canada 1961 4.83 -5.88 -0.972 -8.35 ## # … with 62 more rows You can replace lag() with lead(), but just keep in mind that the columns get transformed in place. 4.6.3 ntile() The last helper function I will discuss is ntile(). There are some others, so do read mutate()'s documentation with help(mutate)! If you need quantiles, you need ntile(). Let's see how it works: gasoline %>% mutate(quintile = ntile(lgaspcar, 5)) %>% mutate(decile = ntile(lgaspcar, 10)) %>% select(country, year, lgaspcar, quintile, decile) ## # A tibble: 342 × 5 ## country year lgaspcar quintile decile ## <chr> <dbl> <dbl> <int> <int> ## 1 austria 1960 4.17 3 6 ## 2 austria 1961 4.10 3 6 ## 3 austria 1962 4.07 3 5 ## 4 austria 1963 4.06 3 5 ## 5 austria 1964 4.04 3 5 ## 6 austria 1965 4.03 3 5 ## 7 austria 1966 4.05 3 5 ## 8 austria 1967 4.05 3 5 ## 9 austria 1968 4.05 3 5 ## 10 austria 1969 4.05 3 5 ## # … with 332 more rows The quintile and decile columns do not hold the values themselves, but the quantile each value lies in. If you want to have a column that contains the median, for instance, you can use good ol' quantile(): gasoline %>% group_by(country) %>% mutate(median = quantile(lgaspcar, 0.5)) %>% # quantile(x, 0.5) is equivalent to median(x) filter(year == 1960) %>% select(country, year, median) ## # A tibble: 18 × 3 ## # Groups: country [18] ## country year median ## <chr> <dbl> <dbl> ## 1 austria 1960 4.05 ## 2 belgium 1960 3.88 ## 3 canada 1960 4.86 ## 4 denmark 1960 4.16 ## 5 france 1960 3.81 ## 6 germany 1960 3.89 ## 7 greece 1960 4.89 ## 8 ireland 1960 4.22 ## 9 italy 1960 3.74 ## 10 japan 1960 4.52 ## 11 netherlands 1960 3.99 ## 12 norway 1960 4.08 ## 13 spain 1960 3.99 ## 14 sweden 1960 4.00 ## 15 switzerland 1960 4.26 ## 16 turkey 1960 5.72 ## 17 u.k. 1960 3.98 ## 18 u.s.a.
1960 4.81 4.6.4 arrange() arrange() re-orders the whole tibble according to the values of the supplied variable: gasoline %>% arrange(lgaspcar) ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 italy 1977 3.38 -6.10 0.164 -8.15 ## 2 italy 1978 3.39 -6.08 0.0348 -8.11 ## 3 italy 1976 3.43 -6.12 0.103 -8.17 ## 4 italy 1974 3.50 -6.13 -0.223 -8.26 ## 5 italy 1975 3.52 -6.17 -0.0327 -8.22 ## 6 spain 1978 3.62 -5.29 0.621 -8.63 ## 7 italy 1972 3.63 -6.21 -0.215 -8.38 ## 8 italy 1971 3.65 -6.22 -0.148 -8.47 ## 9 spain 1977 3.65 -5.30 0.526 -8.73 ## 10 italy 1973 3.65 -6.16 -0.325 -8.32 ## # … with 332 more rows If you want to re-order the tibble in descending order of the variable: gasoline %>% arrange(desc(lgaspcar)) ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 turkey 1966 6.16 -7.51 -0.356 -13.0 ## 2 turkey 1960 6.13 -7.80 -0.253 -13.5 ## 3 turkey 1961 6.11 -7.79 -0.343 -13.4 ## 4 turkey 1962 6.08 -7.84 -0.408 -13.2 ## 5 turkey 1968 6.08 -7.42 -0.365 -12.8 ## 6 turkey 1963 6.08 -7.63 -0.225 -13.3 ## 7 turkey 1964 6.06 -7.63 -0.252 -13.2 ## 8 turkey 1967 6.04 -7.46 -0.335 -12.8 ## 9 japan 1960 6.00 -6.99 -0.145 -12.2 ## 10 turkey 1965 5.82 -7.62 -0.293 -12.9 ## # … with 332 more rows arrange()'s documentation alerts the user that re-ordering by group is only possible by explicitly specifying an option: gasoline %>% filter(year %in% seq(1960, 1963)) %>% group_by(country) %>% arrange(desc(lgaspcar), .by_group = TRUE) ## # A tibble: 72 × 6 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 ## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 ## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 ## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 ## 9 canada 1960 4.86 -5.89 -0.972 -8.38 ## 10 canada 1962 4.85 -5.84 -0.979 -8.32 ## # … with 62 more rows This is especially useful for plotting. We'll see this in Chapter 6. 4.6.5 tally() and count() tally() and count() count the number of observations in your data. I believe count() is the more useful of the two, as it counts the number of observations within a group that you can provide: gasoline %>% count(country) ## # A tibble: 18 × 2 ## country n ## <chr> <int> ## 1 austria 19 ## 2 belgium 19 ## 3 canada 19 ## 4 denmark 19 ## 5 france 19 ## 6 germany 19 ## 7 greece 19 ## 8 ireland 19 ## 9 italy 19 ## 10 japan 19 ## 11 netherlands 19 ## 12 norway 19 ## 13 spain 19 ## 14 sweden 19 ## 15 switzerland 19 ## 16 turkey 19 ## 17 u.k. 19 ## 18 u.s.a.
19 There’s also add_count() which adds the column to the data: gasoline %>% add_count(country) ## # A tibble: 342 × 7 ## country year lgaspcar lincomep lrpmg lcarpcap n ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows add_count() is a shortcut for the following code: gasoline %>% group_by(country) %>% mutate(n = n()) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap n ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows where n() is a {dplyr} function that can only be used within summarise(), mutate() and filter(). 4.7 Special packages for special kinds of data: {forcats}, {lubridate}, and {stringr} 4.7.1 🐱🐱🐱🐱 Factor variables are very useful but not very easy to manipulate. forcats contains very useful functions that make working on factor variables painless. In my opinion, the four following functions, fct_recode(), fct_relevel(), fct_reorder() and fct_relabel(), are the ones you must know, so that’s what I’ll be showing. Remember in chapter 3 when I very quickly explained what were factor variables? In this section, we are going to work a little bit with these type of variable. factors are very useful, and the forcats package includes some handy functions to work with them. First, let’s load the forcats package: library(forcats) as an example, we are going to work with the gss_cat dataset that is included in forcats. Let’s load the data: data(gss_cat) head(gss_cat) ## # A tibble: 6 × 9 ## year marital age race rincome partyid relig denom tvhours ## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int> ## 1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12 ## 2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA ## 3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2 ## 4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4 ## 5 2000 Divorced 25 White Not applicable Not str de… None Not … 1 ## 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA as you can see, marital, race, rincome and partyid are all factor variables. Let’s take a closer look at marital: str(gss_cat$marital) ## Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ... and let’s see rincome: str(gss_cat$rincome) ## Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ... factor variables have different levels and the forcats package includes functions that allow you to recode, collapse and do all sorts of things on these levels. 
For example, using forcats::fct_recode(), you can recode levels: gss_cat <- gss_cat %>% mutate(marital = fct_recode(marital, refuse = "No answer", never_married = "Never married", divorced = "Separated", divorced = "Divorced", widowed = "Widowed", married = "Married")) gss_cat %>% tabyl(marital) ## marital n percent ## refuse 17 0.0007913234 ## never_married 5416 0.2521063166 ## divorced 4126 0.1920588372 ## widowed 1807 0.0841130196 ## married 10117 0.4709305032 Using fct_recode(), I was able to recode the levels and collapse Separated and Divorced to a single category called divorced. As you can see, refuse and widowed are less than 10%, so maybe you'd want to lump these categories together: gss_cat <- gss_cat %>% mutate(marital = fct_lump(marital, prop = 0.10, other_level = "other")) gss_cat %>% tabyl(marital) ## marital n percent ## never_married 5416 0.25210632 ## divorced 4126 0.19205884 ## married 10117 0.47093050 ## other 1824 0.08490434 fct_reorder() is especially useful for plotting. We will explore plotting in the next chapter, but to show you why fct_reorder() is so useful, I will create a barplot, first without using fct_reorder() to re-order the factors, then with reordering. Do not worry if you don't understand all the code for now: gss_cat %>% tabyl(marital) %>% ggplot() + geom_col(aes(y = n, x = marital)) + coord_flip() It would be much better if the categories were ordered by frequency. This is easy to do with fct_reorder(): gss_cat %>% tabyl(marital) %>% mutate(marital = fct_reorder(marital, n, .desc = FALSE)) %>% ggplot() + geom_col(aes(y = n, x = marital)) + coord_flip() Much better! In Chapter 6, we are going to learn about {ggplot2}. The last family of functions I'd like to mention are the fct_lump*() functions. These make it possible to lump several levels of a factor into a new “other” level: gss_cat %>% mutate( # Description of the different functions taken from help(fct_lump) denom_lowfreq = fct_lump_lowfreq(denom), # lumps together the least frequent levels, ensuring that "other" is still the smallest level. denom_min = fct_lump_min(denom, min = 10), # lumps levels that appear fewer than min times. denom_n = fct_lump_n(denom, n = 3), # lumps all levels except for the n most frequent (or least frequent if n < 0) denom_prop = fct_lump_prop(denom, prop = 0.10) # lumps levels that appear fewer than prop * n times. ) ## # A tibble: 21,483 × 13 ## year marital age race rincome partyid relig denom tvhours denom…¹ denom…² ## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int> <fct> <fct> ## 1 2000 never_… 26 White $8000 … Ind,ne… Prot… Sout… 12 Southe… Southe… ## 2 2000 divorc… 48 White $8000 … Not st… Prot… Bapt… NA Baptis… Baptis… ## 3 2000 other 67 White Not ap… Indepe… Prot… No d… 2 No den… No den… ## 4 2000 never_… 39 White Not ap… Ind,ne… Orth… Not … 4 Not ap… Not ap… ## 5 2000 divorc… 25 White Not ap… Not st… None Not … 1 Not ap… Not ap… ## 6 2000 married 25 White $20000… Strong… Prot… Sout… NA Southe… Southe… ## 7 2000 never_… 36 White $25000… Not st… Chri… Not … 3 Not ap… Not ap… ## 8 2000 divorc… 44 White $7000 … Ind,ne… Prot… Luth… NA Luther… Luther… ## 9 2000 married 44 White $25000… Not st… Prot… Other 0 Other Other ## 10 2000 married 47 White $25000… Strong… Prot… Sout… 3 Southe… Southe… ## # … with 21,473 more rows, 2 more variables: denom_n <fct>, denom_prop <fct>, ## # and abbreviated variable names ¹​denom_lowfreq, ²​denom_min There are many others, so I'd advise you to go through the package's function reference.
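One function from the list at the beginning of this section, fct_relevel(), was not illustrated above. Here is a minimal sketch of what it does on the marital column as transformed earlier: by default, the levels you name are moved to the front of the level order, which matters for modelling (the first level is the reference category) and for plotting:

```r
# move "married" to the front of the level order
gss_cat %>%
  mutate(marital = fct_relevel(marital, "married")) %>%
  pull(marital) %>%
  levels()
```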
4.7.2 Get your dates right with {lubridate} {lubridate} is yet another tidyverse package that makes dealing with dates or durations (and intervals) as painless as possible. I do not use every function contained in the package daily, and as such will only focus on some of the functions. However, if you have to deal with dates often, you might want to explore the package thoroughly. 4.7.2.1 Defining dates, the tidy way Let's load a new dataset, called independence, from the datasets folder: independence <- readRDS("datasets/independence.rds") This dataset was scraped from the following Wikipedia page. It shows when African countries gained independence and from which colonial powers. In Chapter 10, I will show you how to scrape Wikipedia pages using R. For now, let's take a look at the contents of the dataset: independence ## # A tibble: 54 × 6 ## country colonial_name colon…¹ indep…² first…³ indep…⁴ ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Liberia Liberia United… 26 Jul… Joseph… Liberi… ## 2 South Africa Cape Colony Colony of Natal O… United… 31 May… Louis … South … ## 3 Egypt Sultanate of Egypt United… 28 Feb… Fuad I Egypti… ## 4 Eritrea Italian Eritrea Italy 10 Feb… Haile … - ## 5 Libya British Military Administration… United… 24 Dec… Idris - ## 6 Sudan Anglo-Egyptian Sudan United… 1 Janu… Ismail… - ## 7 Tunisia French Protectorate of Tunisia France 20 Mar… Muhamm… - ## 8 Morocco French Protectorate in Morocco … France… 2 Marc… Mohamm… Ifni W… ## 9 Ghana Gold Coast United… 6 Marc… Kwame … Gold C… ## 10 Guinea French West Africa France 2 Octo… Ahmed … Guinea… ## # … with 44 more rows, and abbreviated variable names ¹​colonial_power, ## # ²​independence_date, ³​first_head_of_state, ⁴​independence_won_through As you can see, the date of independence is in a format that might make it difficult to answer questions such as Which African countries gained independence before 1960? for two reasons. First of all, the date uses the name of the month instead of the number of the month, and second of all, the type of the independence_date column is character and not “date”. So our first task is to correctly define the column as being of type date, while making sure that R understands that January is supposed to be “01”, and so on. There are several helpful functions included in {lubridate} to convert columns to dates. For instance, if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd(), for “year-month-day”. If, however, the column is “2012-21-11”, then you would use ydm(). There are a few of these helper functions, and they can handle a lot of different formats for dates. In our case, having the name of the month instead of the number might seem quite problematic, but it turns out that this is a case that {lubridate} handles painlessly: library(lubridate) independence <- independence %>% mutate(independence_date = dmy(independence_date)) ## Warning: 5 failed to parse. Some dates failed to parse, for instance for Morocco. This is because these countries have several independence dates; this means that the string to convert looks like: "2 March 1956 7 April 1956 10 April 1958 4 January 1969" which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for simplicity's sake.
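To make the ymd()/dmy()/ydm() family a bit more concrete, here is a quick sketch on made-up strings; all three calls parse to the same date, and month names are handled just as well as month numbers:

```r
library(lubridate)

ymd("2012-11-21")        # year-month-day
dmy("21/11/2012")        # day-month-year
mdy("November 21, 2012") # month-day-year, with a month name
```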
4.7.2.2 Data manipulation with dates Let's take a look at the data now: independence ## # A tibble: 54 × 6 ## country colonial_name colon…¹ independ…² first…³ indep…⁴ ## <chr> <chr> <chr> <date> <chr> <chr> ## 1 Liberia Liberia United… 1847-07-26 Joseph… Liberi… ## 2 South Africa Cape Colony Colony of Natal… United… 1910-05-31 Louis … South … ## 3 Egypt Sultanate of Egypt United… 1922-02-28 Fuad I Egypti… ## 4 Eritrea Italian Eritrea Italy 1947-02-10 Haile … - ## 5 Libya British Military Administrat… United… 1951-12-24 Idris - ## 6 Sudan Anglo-Egyptian Sudan United… 1956-01-01 Ismail… - ## 7 Tunisia French Protectorate of Tunis… France 1956-03-20 Muhamm… - ## 8 Morocco French Protectorate in Moroc… France… NA Mohamm… Ifni W… ## 9 Ghana Gold Coast United… 1957-03-06 Kwame … Gold C… ## 10 Guinea French West Africa France 1958-10-02 Ahmed … Guinea… ## # … with 44 more rows, and abbreviated variable names ¹​colonial_power, ## # ²​independence_date, ³​first_head_of_state, ⁴​independence_won_through As you can see, we now have a date column in the right format. We can now answer questions such as Which countries gained independence before 1960? quite easily, by using the functions year(), month() and day(). Let's see which countries gained independence before 1960: independence %>% filter(year(independence_date) <= 1960) %>% pull(country) ## [1] "Liberia" "South Africa" ## [3] "Egypt" "Eritrea" ## [5] "Libya" "Sudan" ## [7] "Tunisia" "Ghana" ## [9] "Guinea" "Cameroon" ## [11] "Togo" "Mali" ## [13] "Madagascar" "Democratic Republic of the Congo" ## [15] "Benin" "Niger" ## [17] "Burkina Faso" "Ivory Coast" ## [19] "Chad" "Central African Republic" ## [21] "Republic of the Congo" "Gabon" ## [23] "Mauritania" You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work on it. This is the same for month() or day(). Let's try to see if countries gained their independence on Christmas Eve: independence %>% filter(month(independence_date) == 12, day(independence_date) == 24) %>% pull(country) ## [1] "Libya" Seems like Libya was the only one! You can also operate on dates. For instance, let's compute the difference between two dates, using the interval() function: independence %>% mutate(today = lubridate::today()) %>% mutate(independent_since = interval(independence_date, today)) %>% select(country, independent_since) ## # A tibble: 54 × 2 ## country independent_since ## <chr> <Interval> ## 1 Liberia 1847-07-26 UTC--2022-10-13 UTC ## 2 South Africa 1910-05-31 UTC--2022-10-13 UTC ## 3 Egypt 1922-02-28 UTC--2022-10-13 UTC ## 4 Eritrea 1947-02-10 UTC--2022-10-13 UTC ## 5 Libya 1951-12-24 UTC--2022-10-13 UTC ## 6 Sudan 1956-01-01 UTC--2022-10-13 UTC ## 7 Tunisia 1956-03-20 UTC--2022-10-13 UTC ## 8 Morocco NA--NA ## 9 Ghana 1957-03-06 UTC--2022-10-13 UTC ## 10 Guinea 1958-10-02 UTC--2022-10-13 UTC ## # … with 44 more rows The independent_since column now contains an interval object that we can convert to years: independence %>% mutate(today = lubridate::today()) %>% mutate(independent_since = interval(independence_date, today)) %>% select(country, independent_since) %>% mutate(years_independent = as.numeric(independent_since, "years")) ## # A tibble: 54 × 3 ## country independent_since years_independent ## <chr> <Interval> <dbl> ## 1 Liberia 1847-07-26 UTC--2022-10-13 UTC 175. ## 2 South Africa 1910-05-31 UTC--2022-10-13 UTC 112. ## 3 Egypt 1922-02-28 UTC--2022-10-13 UTC 101.
## 4 Eritrea 1947-02-10 UTC--2022-10-13 UTC 75.7 ## 5 Libya 1951-12-24 UTC--2022-10-13 UTC 70.8 ## 6 Sudan 1956-01-01 UTC--2022-10-13 UTC 66.8 ## 7 Tunisia 1956-03-20 UTC--2022-10-13 UTC 66.6 ## 8 Morocco NA--NA NA ## 9 Ghana 1957-03-06 UTC--2022-10-13 UTC 65.6 ## 10 Guinea 1958-10-02 UTC--2022-10-13 UTC 64.0 ## # … with 44 more rows We can now see for how long the last country to gain independence has been independent. Because the data is not tidy (in some cases, an African country was colonized by two powers, see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom: independence %>% filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>% mutate(today = lubridate::today()) %>% mutate(independent_since = interval(independence_date, today)) %>% mutate(years_independent = as.numeric(independent_since, "years")) %>% group_by(colonial_power) %>% summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE)) ## # A tibble: 4 × 2 ## colonial_power last_colony_independent_for ## <chr> <dbl> ## 1 Belgium 60.3 ## 2 France 45.3 ## 3 Portugal 46.9 ## 4 United Kingdom 46.3 4.7.2.3 Arithmetic with dates Adding or subtracting days to dates is quite easy: ymd("2018-12-31") + 16 ## [1] "2019-01-16" It is also possible to be more explicit and use days(): ymd("2018-12-31") + days(16) ## [1] "2019-01-16" To add years, you can use years(): ymd("2018-12-31") + years(1) ## [1] "2019-12-31" But you have to be careful with leap years: ymd("2016-02-29") + years(1) ## [1] NA Because 2017 is not a leap year, the above computation returns NA. The same goes for months with a different number of days: ymd("2018-12-31") + months(2) ## [1] NA The way to solve these issues is to use the special %m+% infix operator: ymd("2016-02-29") %m+% years(1) ## [1] "2017-02-28" and for months: ymd("2018-12-31") %m+% months(2) ## [1] "2019-02-28" {lubridate} contains many more functions. If you often work with dates, durations or intervals, {lubridate} is a package that you have to add to your toolbox. 4.7.3 Manipulate strings with {stringr} {stringr} contains functions to manipulate strings. In Chapter 10, I will teach you about regular expressions, but the functions contained in {stringr} allow you to already do a lot of work on strings, without needing to be a regular expression expert. I will discuss the most common string operations: detecting, locating, matching, searching and replacing, and extracting/removing strings. To introduce these operations, let us use an ALTO file of an issue of The Winchester News from October 31, 1910, which you can find on this link (to see what the newspaper looked like, click here). I re-hosted the file on a public gist for archiving purposes. While working on the book, the original site went down several times… ALTO is an XML schema for the description of text OCR and layout information of pages for digitized material, such as newspapers (source: ALTO Wikipedia page). For more details, you can read my blogpost on the matter, but for our current purposes, it is enough to know that the file contains the text of newspaper articles.
The file looks like this: <TextLine HEIGHT="138.0" WIDTH="2434.0" HPOS="4056.0" VPOS="5814.0"> <String STYLEREFS="ID7" HEIGHT="108.0" WIDTH="393.0" HPOS="4056.0" VPOS="5838.0" CONTENT="timore" WC="0.82539684"> <ALTERNATIVE>timole</ALTERNATIVE> <ALTERNATIVE>tlnldre</ALTERNATIVE> <ALTERNATIVE>timor</ALTERNATIVE> <ALTERNATIVE>insole</ALTERNATIVE> <ALTERNATIVE>landed</ALTERNATIVE> </String> <SP WIDTH="74.0" HPOS="4449.0" VPOS="5838.0"/> <String STYLEREFS="ID7" HEIGHT="105.0" WIDTH="432.0" HPOS="4524.0" VPOS="5847.0" CONTENT="market" WC="0.95238096"/> <SP WIDTH="116.0" HPOS="4956.0" VPOS="5847.0"/> <String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="138.0" HPOS="5073.0" VPOS="5883.0" CONTENT="as" WC="0.96825397"/> <SP WIDTH="74.0" HPOS="5211.0" VPOS="5883.0"/> <String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="285.0" HPOS="5286.0" VPOS="5877.0" CONTENT="were" WC="1.0"> <ALTERNATIVE>verc</ALTERNATIVE> <ALTERNATIVE>veer</ALTERNATIVE> </String> <SP WIDTH="68.0" HPOS="5571.0" VPOS="5877.0"/> <String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="147.0" HPOS="5640.0" VPOS="5838.0" CONTENT="all" WC="1.0"/> <SP WIDTH="83.0" HPOS="5787.0" VPOS="5838.0"/> <String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="183.0" HPOS="5871.0" VPOS="5835.0" CONTENT="the" WC="0.95238096"> <ALTERNATIVE>tll</ALTERNATIVE> <ALTERNATIVE>Cu</ALTERNATIVE> <ALTERNATIVE>tall</ALTERNATIVE> </String> <SP WIDTH="75.0" HPOS="6054.0" VPOS="5835.0"/> <String STYLEREFS="ID3" HEIGHT="132.0" WIDTH="351.0" HPOS="6129.0" VPOS="5814.0" CONTENT="cattle" WC="0.95238096"/> </TextLine> We are interested in the strings after CONTENT=. We are going to use functions from the {stringr} package to get the strings after CONTENT=. In Chapter 10, we are going to explore this file again, but using complex regular expressions to get all the content in one go. 4.7.3.1 Getting text data into Rstudio First of all, let us read in the file: winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt") Even though the file is an XML file, I still read it in using read_lines() and not read_xml() from the {xml2} package. This is for the purposes of the current exercise, and also because I always have trouble with XML files, and prefer to treat them as simple text files, and use regular expressions to get what I need. Now that the ALTO file is read in and saved in the winchester variable, you might want to print the whole thing in the console. Before that, take a look at the structure: str(winchester) ## chr [1:43] "" ... So the winchester variable is a character atomic vector with 43 elements. So first, we need to understand what these elements are. Let’s start with the first one: winchester[1] ## [1] "" Ok, so it seems like the first element is part of the header of the file. What about the second one? 
winchester[2] ## [1] "<meta http-equiv=\\"Content-Type\\" content=\\"text/html; charset=UTF-8\\"><base href=\\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\\"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id=\\"bN015htcoyT__google-cache-hdr\\"><div><span>This is Google's cache of <a href=\\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\\">https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml</a>.</span>&nbsp;<span>It is a snapshot of the page as it appeared on 21 Jan 2019 05:18:18 GMT.</span>&nbsp;<span>The <a href=\\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\\">current page</a> could have changed in the meantime.</span>&nbsp;<a href=\\"http://support.google.com/websearch/bin/answer.py?hl=en&amp;p=cached&amp;answer=1687222\\"><span>Learn more</span>.</a></div><div><span style=\\"display:inline-block;margin-top:8px;margin-right:104px;white-space:nowrap\\"><span style=\\"margin-right:28px\\"><span style=\\"font-weight:bold\\">Full version</span></span><span style=\\"margin-right:28px\\"><a href=\\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=1&amp;vwsrc=0\\"><span>Text-only version</span></a></span><span style=\\"margin-right:28px\\"><a href=\\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=0&amp;vwsrc=1\\"><span>View source</span></a></span></span></div><span style=\\"display:inline-block;margin-top:8px;color:#717171\\"><span>Tip: To quickly find your search term on this page, press <b>Ctrl+F</b> or <b>⌘-F</b> (Mac) and use the find bar.</span></span></div><div style=\\"position:relative;\\"><?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?>" Same. So where is the content? The file is very large, so if you print it in the console, it will take quite some time to print, and you will not really be able to make out anything. The best way would be to try to detect the string CONTENT and work from there. 4.7.3.2 Detecting, getting the position and locating strings When confronted to an atomic vector of strings, you might want to know inside which elements you can find certain strings. 
For example, to know which elements of winchester contain the string CONTENT, use str_detect(): winchester %>% str_detect("CONTENT") ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE This returns a boolean atomic vector of the same length as winchester. If the string CONTENT is nowhere to be found, the result will equal FALSE; if not, it will equal TRUE. Here it is easy to see that the last element contains the string CONTENT. But what if instead of having 43 elements, the vector had 24192 elements? And hundreds of them contained the string CONTENT? It would be easier to instead have the indices of the vector where one can find the word CONTENT. This is possible with str_which(): winchester %>% str_which("CONTENT") ## [1] 43 Here, the result is 43, meaning that the 43rd element of winchester contains the string CONTENT somewhere. If we need more precision, we can use str_locate() and str_locate_all(). To explain how both these functions work, let's create a very small example: ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius") Now suppose I am interested in philosophers whose name ends in us. Let us use str_locate() first: ancient_philosophers %>% str_locate("us") ## start end ## [1,] NA NA ## [2,] NA NA ## [3,] 8 9 ## [4,] NA NA ## [5,] 7 8 ## [6,] 5 6 You can interpret the result as follows: in the rows, the index of the vector where the string us is found. So the 3rd, 5th and 6th philosophers have us somewhere in their name. The result also has two columns: start and end. These give the position of the string. So the string us can be found starting at position 8 of the 3rd element of the vector, and ends at position 9. Same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both ending with us. But str_locate() only shows the position of the us in Marcus. To get both us strings, you need to use str_locate_all(): ancient_philosophers %>% str_locate_all("us") ## [[1]] ## start end ## ## [[2]] ## start end ## ## [[3]] ## start end ## [1,] 8 9 ## ## [[4]] ## start end ## ## [[5]] ## start end ## [1,] 7 8 ## ## [[6]] ## start end ## [1,] 5 6 ## [2,] 14 15 Now we get the position of the two us in Marcus Aurelius. Doing this on the winchester vector will give us the position of the CONTENT string, but this is not really important right now. What matters is that you know how str_locate() and str_locate_all() work.
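As a small aside, the start and end positions returned by str_locate() can be fed directly to str_sub() to pull out the matched substrings; elements without a match come back as NA. A minimal sketch on the same toy vector:

```r
positions <- str_locate(ancient_philosophers, "us")

# extract the characters between each start and end position
str_sub(ancient_philosophers,
        start = positions[, "start"],
        end   = positions[, "end"])
## [1] NA   NA   "us" NA   "us" "us"
```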
So now that we know what interests us in the 43rd element of winchester, let’s take a closer look at it: winchester[43] As you can see, it’s a mess: <TextLine HEIGHT=\\"126.0\\" WIDTH=\\"1731.0\\" HPOS=\\"17160.0\\" VPOS=\\"21252.0\\"><String HEIGHT=\\"114.0\\" WIDTH=\\"354.0\\" HPOS=\\"17160.0\\" VPOS=\\"21264.0\\" CONTENT=\\"0tV\\" WC=\\"0.8095238\\"/><SP WIDTH=\\"131.0\\" HPOS=\\"17514.0\\" VPOS=\\"21264.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"111.0\\" WIDTH=\\"474.0\\" HPOS=\\"17646.0\\" VPOS=\\"21258.0\\" CONTENT=\\"BATES\\" WC=\\"1.0\\"/><SP WIDTH=\\"140.0\\" HPOS=\\"18120.0\\" VPOS=\\"21258.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"114.0\\" WIDTH=\\"630.0\\" HPOS=\\"18261.0\\" VPOS=\\"21252.0\\" CONTENT=\\"President\\" WC=\\"1.0\\"><ALTERNATIVE>Prcideht</ALTERNATIVE><ALTERNATIVE>Pride</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\\"153.0\\" WIDTH=\\"1689.0\\" HPOS=\\"17145.0\\" VPOS=\\"21417.0\\"><String STYLEREFS=\\"ID7\\" HEIGHT=\\"105.0\\" WIDTH=\\"258.0\\" HPOS=\\"17145.0\\" VPOS=\\"21439.0\\" CONTENT=\\"WM\\" WC=\\"0.82539684\\"><TextLine HEIGHT=\\"120.0\\" WIDTH=\\"2211.0\\" HPOS=\\"16788.0\\" VPOS=\\"21870.0\\"><String STYLEREFS=\\"ID7\\" HEIGHT=\\"96.0\\" WIDTH=\\"102.0\\" HPOS=\\"16788.0\\" VPOS=\\"21894.0\\" CONTENT=\\"It\\" WC=\\"1.0\\"/><SP WIDTH=\\"72.0\\" HPOS=\\"16890.0\\" VPOS=\\"21894.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"96.0\\" WIDTH=\\"93.0\\" HPOS=\\"16962.0\\" VPOS=\\"21885.0\\" CONTENT=\\"is\\" WC=\\"1.0\\"/><SP WIDTH=\\"80.0\\" HPOS=\\"17055.0\\" VPOS=\\"21885.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"102.0\\" WIDTH=\\"417.0\\" HPOS=\\"17136.0\\" VPOS=\\"21879.0\\" CONTENT=\\"seldom\\" WC=\\"1.0\\"/><SP WIDTH=\\"80.0\\" HPOS=\\"17553.0\\" VPOS=\\"21879.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"96.0\\" WIDTH=\\"267.0\\" HPOS=\\"17634.0\\" VPOS=\\"21873.0\\" CONTENT=\\"hard\\" WC=\\"1.0\\"/><SP WIDTH=\\"81.0\\" HPOS=\\"17901.0\\" VPOS=\\"21873.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"87.0\\" WIDTH=\\"111.0\\" HPOS=\\"17982.0\\" VPOS=\\"21879.0\\" CONTENT=\\"to\\" WC=\\"1.0\\"/><SP WIDTH=\\"81.0\\" HPOS=\\"18093.0\\" VPOS=\\"21879.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"96.0\\" WIDTH=\\"219.0\\" HPOS=\\"18174.0\\" VPOS=\\"21870.0\\" CONTENT=\\"find\\" WC=\\"1.0\\"/><SP WIDTH=\\"77.0\\" HPOS=\\"18393.0\\" VPOS=\\"21870.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"69.0\\" WIDTH=\\"66.0\\" HPOS=\\"18471.0\\" VPOS=\\"21894.0\\" CONTENT=\\"a\\" WC=\\"1.0\\"/><SP WIDTH=\\"77.0\\" HPOS=\\"18537.0\\" VPOS=\\"21894.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"78.0\\" WIDTH=\\"384.0\\" HPOS=\\"18615.0\\" VPOS=\\"21888.0\\" CONTENT=\\"succes\\" WC=\\"0.82539684\\"><ALTERNATIVE>success</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\\"126.0\\" WIDTH=\\"2316.0\\" HPOS=\\"16662.0\\" VPOS=\\"22008.0\\"><String STYLEREFS=\\"ID7\\" HEIGHT=\\"75.0\\" WIDTH=\\"183.0\\" HPOS=\\"16662.0\\" VPOS=\\"22059.0\\" CONTENT=\\"sor\\" WC=\\"1.0\\"><ALTERNATIVE>soar</ALTERNATIVE></String><SP WIDTH=\\"72.0\\" HPOS=\\"16845.0\\" VPOS=\\"22059.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"90.0\\" WIDTH=\\"168.0\\" HPOS=\\"16917.0\\" VPOS=\\"22035.0\\" CONTENT=\\"for\\" WC=\\"1.0\\"/><SP WIDTH=\\"72.0\\" HPOS=\\"17085.0\\" VPOS=\\"22035.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"69.0\\" WIDTH=\\"267.0\\" HPOS=\\"17157.0\\" VPOS=\\"22050.0\\" CONTENT=\\"even\\" WC=\\"1.0\\"><ALTERNATIVE>cen</ALTERNATIVE><ALTERNATIVE>cent</ALTERNATIVE></String><SP WIDTH=\\"77.0\\" HPOS=\\"17434.0\\" VPOS=\\"22050.0\\"/><String STYLEREFS=\\"ID7\\" HEIGHT=\\"66.0\\" WIDTH=\\"63.0\\"
HPOS=\\"17502.0\\" VPOS=\\"22044.0\\" The file was imported without any newlines. So we need to insert them ourselves, by splitting the string in a clever way. 4.7.3.3 Splitting strings There are two functions included in {stringr} to split strings, str_split() and str_split_fixed(). Let’s go back to our ancient philosophers. Two of them, Seneca the Younger and Marcus Aurelius have something else in common than both being Roman Stoic philosophers. Their names are composed of several words. If we want to split their names at the space character, we can use str_split() like this: ancient_philosophers %>% str_split(" ") ## [[1]] ## [1] "aristotle" ## ## [[2]] ## [1] "plato" ## ## [[3]] ## [1] "epictetus" ## ## [[4]] ## [1] "seneca" "the" "younger" ## ## [[5]] ## [1] "epicurus" ## ## [[6]] ## [1] "marcus" "aurelius" str_split() also has a simplify = TRUE option: ancient_philosophers %>% str_split(" ", simplify = TRUE) ## [,1] [,2] [,3] ## [1,] "aristotle" "" "" ## [2,] "plato" "" "" ## [3,] "epictetus" "" "" ## [4,] "seneca" "the" "younger" ## [5,] "epicurus" "" "" ## [6,] "marcus" "aurelius" "" This time, the returned object is a matrix. What about str_split_fixed()? The difference is that here you can specify the number of pieces to return. For example, you could consider the name “Aurelius” to be the middle name of Marcus Aurelius, and the “the younger” to be the middle name of Seneca the younger. This means that you would want to split the name only at the first space character, and not at all of them. This is easily achieved with str_split_fixed(): ancient_philosophers %>% str_split_fixed(" ", 2) ## [,1] [,2] ## [1,] "aristotle" "" ## [2,] "plato" "" ## [3,] "epictetus" "" ## [4,] "seneca" "the younger" ## [5,] "epicurus" "" ## [6,] "marcus" "aurelius" This gives the expected result. So how does this help in our case? Well, if you look at how the ALTO file looks like, at the beginning of this section, you will notice that every line ends with the “>” character. So let’s split at that character! winchester_text <- winchester[43] %>% str_split(">") Let’s take a closer look at winchester_text: str(winchester_text) ## List of 1 ## $ : chr [1:19706] "</processingStepSettings" "<processingSoftware" "<softwareCreator" "iArchives</softwareCreator" ... So this is a list of length one, and the first, and only, element of that list is an atomic vector with 19706 elements. Since this is a list of only one element, we can simplify it by saving the atomic vector in a variable: winchester_text <- winchester_text[[1]] Let’s now look at some lines: winchester_text[1232:1245] ## [1] "<SP WIDTH=\\"66.0\\" HPOS=\\"5763.0\\" VPOS=\\"9696.0\\"/" ## [2] "<String STYLEREFS=\\"ID7\\" HEIGHT=\\"108.0\\" WIDTH=\\"612.0\\" HPOS=\\"5829.0\\" VPOS=\\"9693.0\\" CONTENT=\\"Louisville\\" WC=\\"1.0\\"" ## [3] "<ALTERNATIVE" ## [4] "Loniile</ALTERNATIVE" ## [5] "<ALTERNATIVE" ## [6] "Lenities</ALTERNATIVE" ## [7] "</String" ## [8] "</TextLine" ## [9] "<TextLine HEIGHT=\\"150.0\\" WIDTH=\\"2520.0\\" HPOS=\\"4032.0\\" VPOS=\\"9849.0\\"" ## [10] "<String STYLEREFS=\\"ID7\\" HEIGHT=\\"108.0\\" WIDTH=\\"510.0\\" HPOS=\\"4032.0\\" VPOS=\\"9861.0\\" CONTENT=\\"Tobacco\\" WC=\\"1.0\\"/" ## [11] "<SP WIDTH=\\"113.0\\" HPOS=\\"4542.0\\" VPOS=\\"9861.0\\"/" ## [12] "<String STYLEREFS=\\"ID7\\" HEIGHT=\\"105.0\\" WIDTH=\\"696.0\\" HPOS=\\"4656.0\\" VPOS=\\"9861.0\\" CONTENT=\\"Warehouse\\" WC=\\"1.0\\"" ## [13] "<ALTERNATIVE" ## [14] "WHrchons</ALTERNATIVE" This now looks easier to handle. 
We can narrow it down to the lines that only contain the string we are interested in, “CONTENT”. First, let’s get the indices: content_winchester_index <- winchester_text %>% str_which("CONTENT") How many lines contain the string “CONTENT”? length(content_winchester_index) ## [1] 4462 As you can see, this reduces the amount of data we have to work with. Let us save this in a new variable: content_winchester <- winchester_text[content_winchester_index] 4.7.3.4 Matching strings Matching strings is useful, but only in combination with regular expressions. As stated at the beginning of this section, we are going to learn about regular expressions in Chapter 10, but in order to make this section useful, we are going to learn the easiest, but perhaps the most useful regular expression: .*. Let’s go back to our ancient philosophers and use str_match() to see what happens. Let’s match the “us” string: ancient_philosophers %>% str_match("us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "us" ## [4,] NA ## [5,] "us" ## [6,] "us" Not very useful, but what about the regular expression .*? How could it help? ancient_philosophers %>% str_match(".*us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "epictetus" ## [4,] NA ## [5,] "epicurus" ## [6,] "marcus aurelius" That’s already very interesting! So how does .* work? To understand, let’s first start by using . alone: ancient_philosophers %>% str_match(".us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "tus" ## [4,] NA ## [5,] "rus" ## [6,] "cus" This also matched whatever symbol comes just before the “u” from “us”. What if we use two . instead? ancient_philosophers %>% str_match("..us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "etus" ## [4,] NA ## [5,] "urus" ## [6,] "rcus" This time, we get the two symbols that immediately precede “us”. Instead of continuing like this, we now use the *, which matches zero or more occurrences of the preceding symbol (here, .). So by combining * and ., we can match any symbol repeatedly, until there is nothing more to match. Note that there is also +, which works similarly to *, but it matches one or more symbols. There is also a str_match_all(): ancient_philosophers %>% str_match_all(".*us") ## [[1]] ## [,1] ## ## [[2]] ## [,1] ## ## [[3]] ## [,1] ## [1,] "epictetus" ## ## [[4]] ## [,1] ## ## [[5]] ## [,1] ## [1,] "epicurus" ## ## [[6]] ## [,1] ## [1,] "marcus aurelius" In this particular case it does not change the end result, but keep it in mind for cases like this one: c("haha", "huhu") %>% str_match("ha") ## [,1] ## [1,] "ha" ## [2,] NA and: c("haha", "huhu") %>% str_match_all("ha") ## [[1]] ## [,1] ## [1,] "ha" ## [2,] "ha" ## ## [[2]] ## [,1] What if we want to match names containing the letter “t”? Easy: ancient_philosophers %>% str_match(".*t.*") ## [,1] ## [1,] "aristotle" ## [2,] "plato" ## [3,] "epictetus" ## [4,] "seneca the younger" ## [5,] NA ## [6,] NA So how does this help us with our historical newspaper? Let’s try to get the strings that come after “CONTENT”: winchester_content <- winchester_text %>% str_match("CONTENT.*") Let’s use our faithful str() function to take a look: winchester_content %>% str ## chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ... Hmm, there are a lot of NA values! This is because a lot of the lines from the file did not have the string “CONTENT”, so there is no match possible. Let’s remove all these NAs. Because the result is a matrix, we cannot use the filter() function from {dplyr}.
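A hedged preview (capture groups are only covered properly in the chapter on regular expressions, so this example is mine): str_match() becomes even more useful with capture groups. Wrapping parts of the pattern in parentheses returns each captured piece in its own column, next to the full match: ancient_philosophers %>% str_match("(.*) (.*)") ## [,1] [,2] [,3] ## [1,] NA NA NA ## [2,] NA NA NA ## [3,] NA NA NA ## [4,] "seneca the younger" "seneca the" "younger" ## [5,] NA NA NA ## [6,] "marcus aurelius" "marcus" "aurelius" Here the pattern only matches names that contain a space, and splits them around the last space (because .* is greedy).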
So we need to convert it to a tibble first: winchester_content <- winchester_content %>% as_tibble() %>% filter(!is.na(V1)) Because matrix columns do not have names, when a matrix gets converted into a tibble, the first column automatically gets called V1. This is why I filter on this column. Let’s take a look at the data: head(winchester_content) ## # A tibble: 6 × 1 ## V1 ## <chr> ## 1 "CONTENT=\\"J\\" WC=\\"0.8095238\\"/" ## 2 "CONTENT=\\"a\\" WC=\\"0.8095238\\"/" ## 3 "CONTENT=\\"Ira\\" WC=\\"0.95238096\\"/" ## 4 "CONTENT=\\"mj\\" WC=\\"0.8095238\\"/" ## 5 "CONTENT=\\"iI\\" WC=\\"0.8095238\\"/" ## 6 "CONTENT=\\"tE1r\\" WC=\\"0.8095238\\"/" 4.7.3.5 Searching and replacing strings We are getting close to the final result. We still need to do some cleaning however. Since our data is inside a nice tibble, we might as well stick with it. So let’s first rename the column and change all the strings to lowercase: winchester_content <- winchester_content %>% mutate(content = tolower(V1)) %>% select(-V1) Let’s take a look at the result: head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "content=\\"j\\" wc=\\"0.8095238\\"/" ## 2 "content=\\"a\\" wc=\\"0.8095238\\"/" ## 3 "content=\\"ira\\" wc=\\"0.95238096\\"/" ## 4 "content=\\"mj\\" wc=\\"0.8095238\\"/" ## 5 "content=\\"ii\\" wc=\\"0.8095238\\"/" ## 6 "content=\\"te1r\\" wc=\\"0.8095238\\"/" The second part of the string, “wc=….” is not really interesting. Let’s search and replace this with an empty string, using str_replace(): winchester_content <- winchester_content %>% mutate(content = str_replace(content, "wc.*", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "content=\\"j\\" " ## 2 "content=\\"a\\" " ## 3 "content=\\"ira\\" " ## 4 "content=\\"mj\\" " ## 5 "content=\\"ii\\" " ## 6 "content=\\"te1r\\" " We need to use the regular expression from before to replace “wc” and every character that follows. The same can be used to remove “content=”: winchester_content <- winchester_content %>% mutate(content = str_replace(content, "content=", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "\\"j\\" " ## 2 "\\"a\\" " ## 3 "\\"ira\\" " ## 4 "\\"mj\\" " ## 5 "\\"ii\\" " ## 6 "\\"te1r\\" " We are almost done, but some cleaning is still necessary: 4.7.3.6 Extracting or removing strings Now, because I know the ALTO spec, I know how to find words that are split between two lines: winchester_content %>% filter(str_detect(content, "hyppart")) ## # A tibble: 64 × 1 ## content ## <chr> ## 1 "\\"aver\\" subs_type=\\"hyppart1\\" subs_content=\\"average\\" " ## 2 "\\"age\\" subs_type=\\"hyppart2\\" subs_content=\\"average\\" " ## 3 "\\"considera\\" subs_type=\\"hyppart1\\" subs_content=\\"consideration\\" " ## 4 "\\"tion\\" subs_type=\\"hyppart2\\" subs_content=\\"consideration\\" " ## 5 "\\"re\\" subs_type=\\"hyppart1\\" subs_content=\\"resigned\\" " ## 6 "\\"signed\\" subs_type=\\"hyppart2\\" subs_content=\\"resigned\\" " ## 7 "\\"install\\" subs_type=\\"hyppart1\\" subs_content=\\"installed\\" " ## 8 "\\"ed\\" subs_type=\\"hyppart2\\" subs_content=\\"installed\\" " ## 9 "\\"be\\" subs_type=\\"hyppart1\\" subs_content=\\"before\\" " ## 10 "\\"fore\\" subs_type=\\"hyppart2\\" subs_content=\\"before\\" " ## # … with 54 more rows For instance, the word “average” was split over two lines, the first part of the word, “aver” on the first line, and the second part of the word, “age”, on the second line. We want to keep what comes after “subs_content”.
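A small side note (assuming your version of {stringr} is recent enough to include them): replacing a pattern with the empty string is so common that {stringr} provides str_remove() and str_remove_all() as shortcuts, so str_replace(content, "content=", "") could also be written str_remove(content, "content="). For example: str_remove("marcus aurelius", "marcus ") ## [1] "aurelius"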
Let’s extract the word “average” using str_extract(). However, because only some words were split between two lines, we first need to detect where the string “hyppart1” is located, and only then can we extract what comes after “subs_content”. Thus, we need to combine str_detect() to first detect the string, and then str_extract_all() to extract what comes after “subs_content”: winchester_content <- winchester_content %>% mutate(content = if_else(str_detect(content, "hyppart1"), str_extract_all(content, "content=.*", simplify = TRUE), content)) Let’s take a look at the result: winchester_content %>% filter(str_detect(content, "content")) ## # A tibble: 64 × 1 ## content ## <chr> ## 1 "content=\\"average\\" " ## 2 "\\"age\\" subs_type=\\"hyppart2\\" subs_content=\\"average\\" " ## 3 "content=\\"consideration\\" " ## 4 "\\"tion\\" subs_type=\\"hyppart2\\" subs_content=\\"consideration\\" " ## 5 "content=\\"resigned\\" " ## 6 "\\"signed\\" subs_type=\\"hyppart2\\" subs_content=\\"resigned\\" " ## 7 "content=\\"installed\\" " ## 8 "\\"ed\\" subs_type=\\"hyppart2\\" subs_content=\\"installed\\" " ## 9 "content=\\"before\\" " ## 10 "\\"fore\\" subs_type=\\"hyppart2\\" subs_content=\\"before\\" " ## # … with 54 more rows We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”, which are not needed now: winchester_content <- winchester_content %>% mutate(content = str_replace(content, "content=", "")) %>% mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content)) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "\\"j\\" " ## 2 "\\"a\\" " ## 3 "\\"ira\\" " ## 4 "\\"mj\\" " ## 5 "\\"ii\\" " ## 6 "\\"te1r\\" " Almost done! We only need to remove the \\" characters: winchester_content <- winchester_content %>% mutate(content = str_replace_all(content, "\\"", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "j " ## 2 "a " ## 3 "ira " ## 4 "mj " ## 5 "ii " ## 6 "te1r " Let’s remove the leading and trailing space characters with str_trim(): winchester_content <- winchester_content %>% mutate(content = str_trim(content)) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 j ## 2 a ## 3 ira ## 4 mj ## 5 ii ## 6 te1r To finish off this section, let’s remove stop words (words that do not add any meaning to a sentence, such as “as”, “and”…) and words that are composed of 3 characters or fewer (which is what the nchar(content) > 3 filter below keeps out). You can find a dataset with stopwords inside the {stopwords} package: library(stopwords) data(data_stopwords_stopwordsiso) eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en) winchester_content <- winchester_content %>% anti_join(eng_stopwords) %>% filter(nchar(content) > 3) ## Joining, by = "content" head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 te1r ## 2 jilas ## 3 edition ## 4 winchester ## 5 news ## 6 injuries That’s it for this section! You now know how to work with strings, but in Chapter 10 we are going one step further by learning about regular expressions, which offer much more power. 4.7.4 Tidy data frames with {tibble} We have already seen and used several functions from the {tibble} package. Let’s now go through some more useful functions.
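As a purely illustrative extension of this cleaning pipeline (not part of the original text): with the tibble in this shape, {dplyr}’s count() would give you the word frequencies of the front page in one line: winchester_content %>% count(content, sort = TRUE) The sort = TRUE option puts the most frequent words at the top, which is a natural starting point for any further text analysis.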
4.7.4.1 Creating tibbles tribble() makes it easy to create a tibble row by row, manually; you list the column names, prefixed with ~, and then the values: tribble( ~id, ~var1, ~var2, "a", 1, 2.3, "b", 4, 5.6 ) ## # A tibble: 2 × 3 ## id var1 var2 ## <chr> <dbl> <dbl> ## 1 a 1 2.3 ## 2 b 4 5.6 It is also possible to create a tibble from a named list: as_tibble(list("combustion" = c("oil", "diesel", "oil", "electric"), "doors" = c(3, 5, 5, 5))) ## # A tibble: 4 × 2 ## combustion doors ## <chr> <dbl> ## 1 oil 3 ## 2 diesel 5 ## 3 oil 5 ## 4 electric 5 enframe() converts a named list into a two-column tibble of names and values, which works even when the list elements have different lengths: enframe(list("combustion" = c(1,2), "doors" = c(1,2,4), "cylinders" = c(1,8,9,10))) ## # A tibble: 3 × 2 ## name value ## <chr> <list> ## 1 combustion <dbl [2]> ## 2 doors <dbl [3]> ## 3 cylinders <dbl [4]> 4.8 List-columns To learn about list-columns, let’s first focus on a single character of the starwars dataset: data(starwars) starwars %>% filter(name == "Luke Skywalker") %>% glimpse() ## Rows: 1 ## Columns: 14 ## $ name <chr> "Luke Skywalker" ## $ height <int> 172 ## $ mass <dbl> 77 ## $ hair_color <chr> "blond" ## $ skin_color <chr> "fair" ## $ eye_color <chr> "blue" ## $ birth_year <dbl> 19 ## $ sex <chr> "male" ## $ gender <chr> "masculine" ## $ homeworld <chr> "Tatooine" ## $ species <chr> "Human" ## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return … ## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike"> ## $ starships <list> <"X-wing", "Imperial shuttle"> We see that the columns films, vehicles and starships (at the bottom) are all lists, and in the case of films, it lists all the films where Luke Skywalker has appeared. What if you want to take a closer look at films where Luke Skywalker appeared? starwars %>% filter(name == "Luke Skywalker") %>% pull(films) ## [[1]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" pull() is a {dplyr} function that extracts (pulls) the column you’re interested in. It is quite useful when you want to inspect a column. Instead of just looking at Luke Skywalker’s films, let’s pull the complete films column instead: starwars %>% head() %>% # let's just look at the first six rows pull(films) ## [[1]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" ## ## [[2]] ## [1] "The Empire Strikes Back" "Attack of the Clones" ## [3] "The Phantom Menace" "Revenge of the Sith" ## [5] "Return of the Jedi" "A New Hope" ## ## [[3]] ## [1] "The Empire Strikes Back" "Attack of the Clones" ## [3] "The Phantom Menace" "Revenge of the Sith" ## [5] "Return of the Jedi" "A New Hope" ## [7] "The Force Awakens" ## ## [[4]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## ## [[5]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" ## ## [[6]] ## [1] "Attack of the Clones" "Revenge of the Sith" "A New Hope" Let’s stop here a moment. As you see, the films column contains several items in it. How is it possible that a single cell contains more than one film? This is because what is actually contained in the cell is not the films stored as separate character values, but an atomic vector that happens to have several elements. But it is still only one vector. Zooming into the data frame helps to understand: In the picture above we see three columns. The first two, name and sex, are what you’re used to seeing: just one element defining the character’s name and sex respectively. The last one also contains only one element for each character; it just so happens to be a complete vector of characters.
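A hedged aside (assuming a recent enough version of {tidyr}, which provides unnest_longer()): if you find a list-column hard to inspect, you can expand it into one row per element, which makes its contents visible at a glance: starwars %>% filter(name == "Luke Skywalker") %>% select(name, films) %>% unnest_longer(films) ## # A tibble: 5 × 2 ## name films ## <chr> <chr> ## 1 Luke Skywalker The Empire Strikes Back ## 2 Luke Skywalker Revenge of the Sith ## 3 Luke Skywalker Return of the Jedi ## 4 Luke Skywalker A New Hope ## 5 Luke Skywalker The Force Awakens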
Because what is inside the cells of a list-column can be very different things (as lists can contain anything), you have to think a bit about it in order to extract insights from such columns. List-columns may seem arcane, but they are extremely powerful once you master them. As an example, suppose we want to create a numerical variable which counts the number of movies in which the characters have appeared. For this we need to compute the length of the list, or count the number of elements this list has. Let’s try with length(), a base R function: starwars %>% filter(name == "Luke Skywalker") %>% pull(films) %>% length() ## [1] 1 This might be surprising, but remember that a list with only one element has a length of 1: length( list(words) # this creates a list with one element. This element is a vector of 980 words. ) ## [1] 1 Even though words contains 980 words, if we put this very long vector inside a list as its only element, length(list(words)) will compute the length of the list, which is 1. Let’s see what happens if we create a more complex list: numbers <- seq(1, 5) length( list(words, # the vector of 980 words numbers) # numbers contains numbers 1 through 5 ) ## [1] 2 list(words, numbers) is now a list of two elements, words and numbers. If we want to compute the length of words and numbers, we need to learn about another powerful concept called higher-order functions. We are going to learn about this in greater detail in Chapter 8. For now, let’s use the fact that our list films is contained inside a data frame, and use a convenience function included in {dplyr} to handle situations like this: starwars <- starwars %>% rowwise() %>% # <- Apply the next steps for each row individually mutate(n_films = length(films)) dplyr::rowwise() is useful when working with list-columns because the instructions that follow get run once per row, on the single element of the list-column contained in that row. The picture below illustrates this: Let’s take a look at the characters and the number of films they have appeared in: starwars %>% select(name, films, n_films) ## # A tibble: 87 × 3 ## # Rowwise: ## name films n_films ## <chr> <list> <int> ## 1 Luke Skywalker <chr [5]> 5 ## 2 C-3PO <chr [6]> 6 ## 3 R2-D2 <chr [7]> 7 ## 4 Darth Vader <chr [4]> 4 ## 5 Leia Organa <chr [5]> 5 ## 6 Owen Lars <chr [3]> 3 ## 7 Beru Whitesun lars <chr [3]> 3 ## 8 R5-D4 <chr [1]> 1 ## 9 Biggs Darklighter <chr [1]> 1 ## 10 Obi-Wan Kenobi <chr [6]> 6 ## # … with 77 more rows Now we can, for example, create a categorical variable that groups characters according to whether they appeared in only 1 movie or more: starwars <- starwars %>% mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie", n_films > 1 ~ "More than 1 movie")) You can also create list-columns with your own datasets, by using tidyr::nest(). Remember the fake survey_data I created to illustrate pivot_longer() and pivot_wider()?
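Side note (my own addition, not from the original text): base R also has lengths(), with an s, which computes the length of each element of a list in one call: lengths(list(words, numbers)) ## [1] 980 5 So lengths(starwars$films) would also return the number of films per character; the rowwise() approach is shown above because it generalizes to any computation on list-columns, not just counting elements.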
Let’s go back to that dataset again: survey_data <- tribble( ~id, ~variable, ~value, 1, "var1", 1, 1, "var2", 0.2, NA, "var3", 0.3, 2, "var1", 1.4, 2, "var2", 1.9, 2, "var3", 4.1, 3, "var1", 0.1, 3, "var2", 2.8, 3, "var3", 8.9, 4, "var1", 1.7, NA, "var2", 1.9, 4, "var3", 7.6 ) print(survey_data) ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 nested_data <- survey_data %>% group_by(id) %>% nest() glimpse(nested_data) ## Rows: 5 ## Columns: 2 ## Groups: id [5] ## $ id <dbl> 1, NA, 2, 3, 4 ## $ data <list> [<tbl_df[2 x 2]>], [<tbl_df[2 x 2]>], [<tbl_df[3 x 2]>], [<tbl_df… This creates a new tibble, with columns id and data. data is a list-column that contains tibbles; each tibble holds the variable and value columns for one individual: nested_data %>% filter(id == "1") %>% pull(data) ## [[1]] ## # A tibble: 2 × 2 ## variable value ## <chr> <dbl> ## 1 var1 1 ## 2 var2 0.2 As you can see, for individual 1, the column data contains a 2x2 tibble with columns variable and value. Because group_by() followed by nest() is so useful, there is a wrapper around these two functions called group_nest(): survey_data %>% group_nest(id) ## # A tibble: 5 × 2 ## id data ## <dbl> <list<tibble[,2]>> ## 1 1 [2 × 2] ## 2 2 [3 × 2] ## 3 3 [3 × 2] ## 4 4 [2 × 2] ## 5 NA [2 × 2] You might be wondering why this is useful, because this seems to introduce an unnecessary layer of complexity. The usefulness of list-columns will become apparent in the next chapters, where we are going to learn how to repeat actions over, say, individuals. So if you’ve reached the end of this section and still didn’t really grok list-columns, go take some fresh air and come back to this section again later on. 4.9 Going beyond descriptive statistics and data manipulation The {tidyverse} collection of packages can do much more than simply data manipulation and descriptive statistics. You can use the principles we have covered and the functions you now know to do much more. For instance, you can use a few {tidyverse} functions to do Monte Carlo simulations, for example to estimate \\(\\pi\\). Draw the unit circle inside the unit square; the ratio of the area of the circle to the area of the square is \\(\\pi/4\\). So if you shoot N arrows at the square and M of them fall inside the circle, you have the approximate relationship \\(M = N*\\pi/4\\). You can thus compute \\(\\pi\\) like so: \\(\\pi = 4*M/N\\). The more arrows N you throw at the square, the better the approximation of \\(\\pi\\) you’ll have. Let’s try to do this with a tidy Monte Carlo simulation. First, let’s randomly pick some points inside the unit square: library(tidyverse) n <- 5000 set.seed(2019) points <- tibble("x" = runif(n), "y" = runif(n)) Now, to know if a point is inside the unit circle, we need to check whether \\(x^2 + y^2 < 1\\).
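To make the check concrete, a trivial sanity check of my own: a point like (0.5, 0.5) satisfies the condition and is inside the unit circle, while (0.9, 0.9) is not: 0.5^2 + 0.5^2 < 1 ## [1] TRUE 0.9^2 + 0.9^2 < 1 ## [1] FALSE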
Let’s add a new column to the points tibble, called inside, equal to 1 if the point is inside the unit circle and 0 if not: points <- points %>% mutate(inside = map2_dbl(.x = x, .y = y, ~ifelse(.x**2 + .y**2 < 1, 1, 0))) %>% rowid_to_column("N") Let’s take a look at points: points ## # A tibble: 5,000 × 4 ## N x y inside ## <int> <dbl> <dbl> <dbl> ## 1 1 0.770 0.984 0 ## 2 2 0.713 0.0107 1 ## 3 3 0.303 0.133 1 ## 4 4 0.618 0.0378 1 ## 5 5 0.0505 0.677 1 ## 6 6 0.0432 0.0846 1 ## 7 7 0.820 0.727 0 ## 8 8 0.00961 0.0758 1 ## 9 9 0.102 0.373 1 ## 10 10 0.609 0.676 1 ## # … with 4,990 more rows Now, I can compute the estimation of \\(\\pi\\) at each row, by computing the cumulative sum of the 1’s in the inside column and dividing that by the current value of the N column: points <- points %>% mutate(estimate = 4*cumsum(inside)/N) cumsum(inside) is the M from the formula. Now, we can finish by plotting the result: ggplot(points) + geom_line(aes(y = estimate, x = N)) + geom_hline(yintercept = pi) In the next chapter, we are going to learn all about {ggplot2}, the package I used in the lines above to create this plot. As the number of tries grows, the estimation of \\(\\pi\\) gets better. Using a data frame as a structure to hold our simulated points and the results makes it very easy to avoid loops, and thus write code that is more concise and easier to follow. If you studied a quantitative field in university, you might have done a similar exercise at the time, very likely by defining a matrix to hold your points, and an empty vector to hold whether a particular point was inside the unit circle. Then you wrote a loop to compute whether a point was inside the unit circle, saved this result in the previously defined empty vector and then computed the estimation of \\(\\pi\\). Again, I take this opportunity here to stress that there is nothing wrong with this approach per se, but R is better suited for a workflow where lists or data frames are the central objects and where the analyst operates over them with functional programming techniques. 4.10 Exercises Exercise 1 Combine mutate() with across() to exponentiate every column of type double of the gasoline dataset. To obtain the gasoline dataset, run the following lines: data(Gasoline, package = "plm") gasoline <- as_tibble(Gasoline) gasoline <- gasoline %>% mutate(country = tolower(country)) Exponentiate columns starting with the character \\"l\\" of the gasoline dataset. Convert all columns’ classes into the character class. Exercise 2 Load the LaborSupply dataset from the {Ecdat} package and answer the following questions: Compute the average annual hours worked by year (plus standard deviation) What age group worked the most hours in the year 1982? Create a variable, n_years, that equals the number of years an individual stays in the panel. Is the panel balanced? Which are the individuals that do not have any kids during the whole period? Create a variable, no_kids, that flags these individuals (1 = no kids, 0 = kids) Using the no_kids variable from before compute the average wage, standard deviation and number of observations in each group for the year 1980 (no kids group vs kids group). Create the lagged logarithm of hours worked and wages. Remember that this is a panel. Exercise 3 What does the following code do? Copy and paste it in an R interpreter to find out!
LaborSupply %>% group_by(id) %>% mutate(across(starts_with("l"), tibble::lst(lag, lead))) Using summarise() and across(), compute the mean, standard deviation and number of observations of lnhr and lnwg for each individual. Exercise 4 In the dataset folder you downloaded at the beginning of the chapter, there is a folder called “unemployment”. I used the data in the section about working with lists of datasets. Using rio::import_list(), read the 4 datasets into R. Using map(), map the janitor::clean_names() function to each dataset (just like in the example in the section on working with lists of datasets). Then, still with map() and mutate(), convert all commune names in the commune column to lowercase with the function tolower(), in a new column called lcommune. This is not an easy exercise, so here are some hints: Remember that all_datasets is a list of datasets. Which function do you use when you want to map a function to each element of a list? Each element of all_datasets is a data.frame object. Which function do you use to add a column to a data.frame? What symbol can you use to access a column of a data.frame? Chapter 5 Graphs By default, it is possible to make a lot of graphs with R without needing any external packages. However, in this chapter, we are going to learn how to make graphs using {ggplot2}, which is a very powerful package that produces amazing graphs. There is an entry cost to {ggplot2} as it works in a very different way from what you might expect, especially if you know how to make plots with the basic R functions already. But the resulting graphs are well worth the effort, and once you know more about {ggplot2} you will see that in a lot of situations it is actually faster and easier. Another advantage is that making plots with {ggplot2} is consistent, so you do not need to learn anything specific to make, say, density plots. There are a lot of extensions to {ggplot2}, such as {ggridges} to create so-called ridge plots and {gganimate} to create animated plots. By the end of this chapter you will know how to do basic plots with {ggplot2} and also how to use these two extensions. 5.1 Resources Before showing some examples and the general functionality of {ggplot2}, I list here some online resources that I keep coming back to: Data Visualization for Social Science R Graphics Cookbook R graph gallery Tufte in R ggplot2 extensions ggthemes function reference ggplot2 cheatsheet When I first started using {ggplot2}, I had a cookbook approach to it; I tried finding examples online that looked like what I needed, copied and pasted the code and then adapted it to my case. The above resources are the ones I consulted and keep consulting in these situations (I also go back to past code I’ve written, of course). Don’t hesitate to skim these resources for inspiration and to learn more about some extensions to {ggplot2}. In the next subsections I am going to show you how to draw the most common plots, as well as show you how to customize your plots with {ggthemes}, a package that contains pre-defined themes for {ggplot2}. 5.2 Examples I think that the best way to learn how to use {ggplot2} is to jump right into it. Let’s first start with barplots.
5.2.1 Barplots To follow the examples below, load the following libraries: library(ggplot2) library(ggthemes) {ggplot2} is an implementation of the Grammar of Graphics by Wilkinson (2006), but you don’t need to read the book to start using it. If we go back to the Star Wars data (contained in dplyr), and wish to draw a barplot of the gender, the following lines are enough: ggplot(starwars, aes(gender)) + geom_bar() The first argument of the function is the data (called starwars in this example), and then the function aes(). This function is where you list the variables that you want to map to the aesthetics of the geom functions. On the second line, you see that we use the geom_bar() function. This function creates a barplot of the gender variable. You can get different kinds of plots by using different geom_ functions. You can also provide the aes() argument to the geom_*() function: ggplot(starwars) + geom_bar(aes(gender)) The difference between these two approaches is that when you specify the aesthetics in the ggplot() function, all the geom_*() functions that follow will inherit these aesthetics. This is useful if you want to avoid writing the same code over and over again, but can be problematic if you need to specify different aesthetics to different geom_*() functions. This will become clear in a later example. You can add options to your plots, for instance, you can change the coordinate system in your barplot: ggplot(starwars, aes(gender)) + geom_bar() + coord_flip() This is the basic recipe to create plots using {ggplot2}: start with a call to ggplot() where you specify the data you want to plot, and optionally the aesthetics. Then, use the geom_*() function you need; if you did not specify the aesthetics in the call to the ggplot() function, do it here. Then, you can add different options, such as changing the coordinate system, changing the theme, the colour palette used, changing the position of the legend and much, much more. This chapter will only give you an overview of the capabilities of {ggplot2}. 5.2.2 Scatter plots Scatter plots are very useful, especially if you are trying to figure out the relationship between two variables. For instance, let’s make a scatter plot of height vs weight of Star Wars characters: ggplot(starwars) + geom_point(aes(height, mass)) As you can see there is an outlier; a very heavy character! Star Wars fans already guessed it, it’s Jabba the Hutt. To make the plot easier to read, let’s remove this outlier: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass)) We can highlight the positive correlation between height and mass by adding geom_smooth() with the option method = \\"lm\\": starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot(aes(height, mass)) + geom_point(aes(height, mass)) + geom_smooth(method = "lm") ## `geom_smooth()` using formula 'y ~ x' I’ve moved the aes(height, mass) up to the ggplot() function because both geom_point() and geom_smooth() need these aesthetics, and as explained in the beginning of this section, the aesthetics listed in ggplot() get passed down to the other geoms. If you omit method = \\"lm\\", you get a non-parametric curve: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot(aes(height, mass)) + geom_point(aes(height, mass)) + geom_smooth() ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' 5.2.3 Density Use geom_density() to get density plots: ggplot(starwars, aes(height)) + geom_density() ## Warning: Removed 6 rows containing non-finite values (stat_density).
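A hedged aside (my own illustration, and the object name p is arbitrary): since a ggplot is an ordinary R object, you can also store it in a variable and add layers to it later, which is convenient when building a plot up step by step: p <- ggplot(starwars, aes(gender)) p + geom_bar() p + geom_bar() + coord_flip() Each p + ... expression produces (and, at the console, prints) a new plot; the object p itself is not modified.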
Let’s go into more detail now; what if you would like to plot the densities for feminines and masculines only (removing the droids from the data first)? This can be done by first filtering the data using dplyr and then separating the dataset by gender: starwars %>% filter(gender %in% c("feminine", "masculine")) The above lines do the filtering; only keep rows where gender is in the vector \\"feminine\\", \\"masculine\\". This is much easier than having to write gender == \\"feminine\\" | gender == \\"masculine\\". Then, we pipe this dataset to ggplot: starwars %>% filter(gender %in% c("feminine", "masculine")) %>% ggplot(aes(height, fill = gender)) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). Let’s take a closer look at the aes() function: I’ve added fill = gender. This means that there will be one density plot for each gender in the data, and each will be coloured accordingly. This is where {ggplot2} might be confusing; there is no need to write explicitly (even if it is possible) that you want the feminine density to be red and the masculine density to be blue. You just map the variable gender to this particular aesthetic. You conclude the plot by adding geom_density(), which in this case is the plot you want. We will see later how to change the colours of your plot. An alternative way to write this code is first to save the filtered data in a variable, and define the aesthetics inside the geom_density() function: filtered_data <- starwars %>% filter(gender %in% c("feminine", "masculine")) ggplot(filtered_data) + geom_density(aes(height, fill = gender)) ## Warning: Removed 5 rows containing non-finite values (stat_density). 5.2.4 Line plots For the line plots, we are going to use official unemployment data (the same as in the previous chapter, but with all the available years). Get it from here (downloaded from the website of the Luxembourgish national statistical institute). Let’s plot the unemployment for the canton of Luxembourg only: unemp_lux_data <- import("datasets/unemployment/all/unemployment_lux_all.csv") unemp_lux_data %>% filter(division == "Luxembourg") %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = 1)) + geom_line() Because line plots are 2D, you need to specify the y and x axes. There is also another option you need to add, group = 1. This is to tell aes() that the dots have to be connected with a single line. What if you want to plot more than one commune? unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette")) %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) + geom_line() This time, I’ve specified group = division which means that there has to be one line per commune contained in the variable division. I do the same for the colours. I think the next example illustrates how {ggplot2} is actually brilliant; if you need to add a third commune, there is no need to specify anything else; no need to add anything to the legend, no need to specify a third colour etc: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) + geom_line() The three communes get mapped to the colour aesthetic so whatever the number of communes, as long as there are enough colours, the communes will each get mapped to one of these colours.
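One more illustrative option (an addition of mine, not in the original): when densities overlap, making them semi-transparent with the alpha parameter keeps both visible: starwars %>% filter(gender %in% c("feminine", "masculine")) %>% ggplot(aes(height, fill = gender)) + geom_density(alpha = 0.5) Note that alpha = 0.5 sits outside aes(), because it is a fixed value and not a mapping from a variable of the data.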
5.2.5 Facets In some cases you have a factor variable that separates the data you wish to plot into different categories. If you want to have a plot per category you can use the facet_grid() function. Careful though, this function does not take a variable as an argument, but a formula, hence the ~ symbol in the code below: starwars %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(. ~ human) + #<--- this is a formula geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). I first created a factor variable that specifies if a Star Wars character is human or not, and then used it for faceting. By changing the formula, you change how the faceting is done: starwars %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(human ~ .) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). Recall the categorical variable more_1 that we computed in the previous chapter? Let’s use it as a faceting variable: starwars %>% rowwise() %>% mutate(n_films = length(films)) %>% mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie", n_films != 1 ~ "More than 1 movie")) %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(human ~ more_1) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). 5.2.6 Pie Charts I am not a huge fan of pie charts, but sometimes this is what you have to do. So let’s see how you can create pie charts. First, let’s create a mock dataset with the function tibble::tribble() which allows you to create a dataset line by line: test_data <- tribble( ~id, ~var1, ~var2, ~var3, ~var4, ~var5, "a", 26.5, 38, 30, 32, 34, "b", 30, 30, 28, 32, 30, "c", 34, 32, 30, 28, 26.5 ) This data is in the wide format though; we need to have it in the long format for it to work with {ggplot2}. For this, let’s use tidyr::gather() as seen in the previous chapter: test_data_long = test_data %>% gather(variable, value, starts_with("var")) Now, let’s plot this data, first by creating 3 bar plots: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") In the code above, I introduce a new option, called stat = \\"identity\\". By default, geom_bar() counts the number of observations of each category that is plotted, which is a statistical transformation. By adding stat = \\"identity\\", I force the statistical transformation to be the identity function, and thus plot the data as is. To create the pie chart, first we need to compute the share of each variable (var1, var2, etc…) within each id. To do this, we first group by id, then compute the total. Then we use a new function ungroup().
After using ungroup() all the computations are done on the whole dataset instead of by group, which is what we need to compute the share: test_data_long <- test_data_long %>% group_by(id) %>% mutate(total = sum(value)) %>% ungroup() %>% mutate(share = value/total) Let’s take a look to see if this is what we wanted: print(test_data_long) ## # A tibble: 15 × 5 ## id variable value total share ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 a var1 26.5 160. 0.165 ## 2 b var1 30 150 0.2 ## 3 c var1 34 150. 0.226 ## 4 a var2 38 160. 0.237 ## 5 b var2 30 150 0.2 ## 6 c var2 32 150. 0.213 ## 7 a var3 30 160. 0.187 ## 8 b var3 28 150 0.187 ## 9 c var3 30 150. 0.199 ## 10 a var4 32 160. 0.199 ## 11 b var4 32 150 0.213 ## 12 c var4 28 150. 0.186 ## 13 a var5 34 160. 0.212 ## 14 b var5 30 150 0.2 ## 15 c var5 26.5 150. 0.176 If you didn’t understand what ungroup() did, rerun the last few lines without it and compare the output. To plot the pie chart, we create a barplot again, but specify polar coordinates: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(y = share, x = "", fill = variable), stat = "identity") + theme() + coord_polar("y", start = 0) As you can see, this typical pie chart is not very easy to read; compared to the barplots above it is not easy to distinguish if a has a higher share than b or c. You can change the look of the pie chart, for example by specifying variable as the x: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(y = share, x = variable, fill = variable), stat = "identity") + theme() + coord_polar("x", start = 0) But as a general rule, avoid pie charts if possible. I find that pie charts are only interesting if you need to show proportions that are hugely unequal, to really emphasize the difference between said proportions. 5.2.7 Adding text to plots Sometimes you might want to add some text to your plots. This is possible with geom_text(): ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) You can put anything after label =, but in general what you want are the values, so that’s what I put there. But you can also refine it, imagine the values are actually in euros: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = paste(value, "€"))) You can also achieve something similar with geom_label(): ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_label(aes(variable, value + 1.5, label = paste(value, "€"))) 5.3 Customization Every plot you’ve seen until now was made with the default look of {ggplot2}. If you want to change the look, you can apply a theme, and a colour scheme. Let’s take a look at themes first by using the ones found in the package ggthemes. But first, let’s learn how to change the names of the axes and how to title a plot. 5.3.1 Changing titles, axes labels, options, mixing geoms and changing themes The name of this subsection is quite long, but this is because everything is kind of linked. Let’s start by learning what the labs() function does.
To change the title of the plot, and of the axes, you need to pass the names to the labs() function: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() What if you want to make the lines thicker? unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line(size = 2) Each geom_*() function has its own options. Notice that the size = 2 argument is not inside an aes() function. This is because I do not want to map a variable of the data to the size of the line, in other words, I do not want to make the size of the line proportional to a certain variable in the data. Recall the scatter plot we did earlier, where we showed that height and mass of Star Wars characters increased together? Let’s take this plot again, but make the size of the dots proportional to the birth year of the character: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year)) Making the size proportional to the birth year (the age would have been more informative) allows us to see a third dimension. It is also possible to “see” a fourth dimension, the gender for instance, by changing the colour of the dots: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) As I promised above, we are now going to learn how to add a regression line to this scatter plot: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass), method = "lm") ## `geom_smooth()` using formula 'y ~ x' geom_smooth() adds a regression line, but only if you specify method = \\"lm\\" (“lm” stands for “linear model”). What happens if you remove this option? starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass)) ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' By default, geom_smooth() does a non-parametric regression called LOESS (locally estimated scatterplot smoothing), which is more flexible. It is also possible to have one regression line by gender: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass, colour = gender)) ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' Because there are only a few observations for feminine characters and for NAs, the regression lines are not very informative, but this was only an example to show you some options of geom_smooth(). Let’s go back to the unemployment line plots. For now, let’s keep the base {ggplot2} theme, but modify it a bit. For example, the legend placement is actually a feature of the theme. This means that if you want to change where the legend is placed you need to modify this feature of the theme.
This is done with the function theme(): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom") + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() What I also like to do is remove the title of the legend, because it is often superfluous: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() The legend title has to be an element_text object. element_text objects are used with theme to specify how text should be displayed. element_blank() draws nothing and assigns no space (not even blank space). If you want to keep the legend title but change it, you need to use element_text(): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom", legend.title = element_text(colour = "red")) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() If you want to change the word “division” to something else, you can do so by providing the colour argument to the labs() function: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom") + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate", colour = "Administrative division") + geom_line() You could modify every feature of the theme like that, but there are built-in themes that you can use: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() For example in the code above, I have used theme_minimal() which I like quite a lot. You can also use themes from the ggthemes package, which even contains a STATA theme, if you like it: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_stata() + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() As you can see, theme_stata() has the legend on the bottom by default, because this is how the legend position is defined within the theme. However, the legend title is still there.
Let’s remove it: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_stata() + theme(legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() ggthemes even features an Excel 2003 theme (don’t use it though): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_excel() + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() You can create your own theme by using a simple theme, such as theme_minimal() as a base and then add your options. We are going to create one theme after we learn how to create our own functions, in Chapter 7. Then, we are going to create a package to share this theme with the world, and we are going to learn how to make packages in Chapter 9. 5.3.2 Colour schemes You can also change colour schemes, by specifying either scale_colour_*() or scale_fill_*() functions. scale_colour_*() functions are used for geoms that use the colour aesthetic (points or lines for example), while scale_fill_*() functions are used for geoms that use the fill aesthetic (so for barplots for example). A colour scheme I like is the Highcharts colour scheme. unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + scale_colour_hc() + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() An example with a barplot: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) + theme_minimal() + scale_fill_hc() It is also possible to define and use your own palette. To use your own colours you can use scale_colour_manual() and scale_fill_manual() and specify the html codes of the colours you want to use. unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + scale_colour_manual(values = c("#FF336C", "#334BFF", "#2CAE00")) + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() To get html codes of colours you can use this online tool. There is also a very nice package, called colourpicker, that allows you to pick colours from within RStudio. Also, you do not even need to load it to use it, since it comes with an Addin: For a barplot you would do the same: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank()) + scale_fill_manual(values = c("#FF336C", "#334BFF", "#2CAE00", "#B3C9C6", "#765234")) For continuous variables, things are a bit different.
Let’s first create a plot where we map a continuous variable to the colour argument of aes(): ggplot(diamonds) + geom_point(aes(carat, price, colour = depth)) To change the colour, we need to use scale_color_gradient() and specify a value for low values of the variable, and a value for high values of the variable. For example, using the colours of the theme I made for my blog: ggplot(diamonds) + geom_point(aes(carat, price, colour = depth)) + scale_color_gradient(low = "#bec3b8", high = "#ad2c6c") 5.4 Saving plots to disk There are two ways to save plots to disk; one through the Plots pane in RStudio and another using the ggsave() function. Using RStudio, navigate to the Plots pane and click on Export. You can then choose where to save the plot and various other options: This is fine if you only generate one or two plots but if you generate a large number of them, it is less tedious to use the ggsave() function: my_plot1 <- ggplot(my_data) + geom_bar(aes(variable)) ggsave("path/you/want/to/save/the/plot/to/my_plot1.pdf", my_plot1) There are other options that you can specify such as the width and height, resolution, units, etc… 5.5 Exercises Exercise 1 Load the Bwages dataset from the Ecdat package. Your first task is to create a new variable, educ_level, which is a factor variable that equals: “Primary school” if educ == 1 “High school” if educ == 2 “Some university” if educ == 3 “Master’s degree” if educ == 4 “Doctoral degree” if educ == 5 Use case_when() for this. Then, plot a scatter plot of wages on experience, by education level. Add a theme that you like, and remove the title of the legend. The scatter plot is not very useful, because you cannot make anything out. Instead, use another geom that shows you a non-parametric fit with confidence bands. References Chapter 6 Statistical models In this chapter, we will not learn about all the models out there that you may or may not need. Instead, I will show you how you can use what you have learned until now and how you can apply these concepts to modeling. Also, as you read in the beginning of the book, R has many, many packages. So the model you need is most probably already implemented in some package and you will very likely not need to write your own from scratch. In the first section, I will discuss the terminology used in this book. Then I will discuss linear regression; showing how linear regression works illustrates very well how other models work too, without loss of generality. Then I will introduce the concept of hyper-parameters with ridge regression. This chapter will then finish with an introduction to cross-validation as a way to tune the hyper-parameters of models that feature them. 6.1 Terminology Before continuing the discussion about statistical models and model fitting, it is worthwhile to go over terminology a little bit. Depending on your background, you might call an explanatory variable a feature or the dependent variable the target. These are the same objects. The matrix of features is usually called a design matrix, and what statisticians call the intercept is what machine learning engineers call the bias.
Referring to the intercept as the bias is unfortunate, as bias also has a very different meaning; bias is also what we call the error of a model that may cause biased estimates. To finish up, the estimated parameters of the model may be called coefficients or weights. Here again, I don’t like using weight, as weight has a very different meaning in statistics. So, in the remainder of this chapter, and of the book, I will use the terminology from the statistical literature, using dependent and explanatory variables (y and x), and calling the estimated parameters coefficients and the intercept… well, the intercept (the \\(\\beta\\)s of the model). However, I will talk of training a model, instead of estimating a model. 6.2 Fitting a model to data Suppose you have a variable y that you wish to explain using a set of other variables x1, x2, x3, etc. Let’s take a look at the Housing dataset from the Ecdat package: library(Ecdat) data(Housing) You can read a description of the dataset by running: ?Housing Housing package:Ecdat R Documentation Sales Prices of Houses in the City of Windsor Description: a cross-section from 1987 _number of observations_ : 546 _observation_ : goods _country_ : Canada Usage: data(Housing) Format: A dataframe containing : price: sale price of a house lotsize: the lot size of a property in square feet bedrooms: number of bedrooms bathrms: number of full bathrooms stories: number of stories excluding basement driveway: does the house has a driveway ? recroom: does the house has a recreational room ? fullbase: does the house has a full finished basement ? gashw: does the house uses gas for hot water heating ? airco: does the house has central air conditioning ? garagepl: number of garage places prefarea: is the house located in the preferred neighbourhood of the city ? Source: Anglin, P.M. and R. Gencay (1996) “Semiparametric estimation of a hedonic price function”, _Journal of Applied Econometrics_, *11(6)*, 633-648. References: Verbeek, Marno (2004) _A Guide to Modern Econometrics_, John Wiley and Sons, chapter 3. Journal of Applied Econometrics data archive : <URL: http://qed.econ.queensu.ca/jae/>. See Also: ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’, ‘Index.Observations’ or by looking for Housing in the help pane of RStudio. Usually, you would take a look at the data before doing any modeling: glimpse(Housing) ## Rows: 546 ## Columns: 12 ## $ price <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 83800… ## $ lotsize <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800, 5500, 7… ## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4, 1, 2, 3… ## $ bathrms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1… ## $ stories <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1, 2… ## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, ye… ## $ recroom <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, no, no, n… ## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, no, no, … ## $ gashw <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ airco <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no, no, no… ## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1… ## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… Housing prices depend on a set of variables, such as the number of bedrooms, the area in which the house is located, and so on.
If you believe that housing prices depend linearly on a set of explanatory variables, you will want to estimate a linear model. To estimate a linear model, you will need to use the built-in lm() function: model1 <- lm(price ~ lotsize + bedrooms, data = Housing) lm() takes a formula as an argument, which defines the model you want to estimate. In this case, I ran the following regression: \\[ \\text{price} = \\beta_0 + \\beta_1 * \\text{lotsize} + \\beta_2 * \\text{bedrooms} + \\varepsilon \\] where \\(\\beta_0, \\beta_1\\) and \\(\\beta_2\\) are three parameters to estimate. To take a look at the results, you can use the summary() method (not to be confused with dplyr::summarise()): summary(model1) ## ## Call: ## lm(formula = price ~ lotsize + bedrooms, data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -65665 -12498 -2075 8970 97205 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.613e+03 4.103e+03 1.368 0.172 ## lotsize 6.053e+00 4.243e-01 14.265 < 2e-16 *** ## bedrooms 1.057e+04 1.248e+03 8.470 2.31e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 21230 on 543 degrees of freedom ## Multiple R-squared: 0.3703, Adjusted R-squared: 0.3679 ## F-statistic: 159.6 on 2 and 543 DF, p-value: < 2.2e-16 If you wish to remove the intercept (\\(\\beta_0\\) in the above equation) from your model, you can do so with -1: model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing) summary(model2) ## ## Call: ## lm(formula = price ~ -1 + lotsize + bedrooms, data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -67229 -12342 -1333 9627 95509 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## lotsize 6.283 0.390 16.11 <2e-16 *** ## bedrooms 11968.362 713.194 16.78 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 21250 on 544 degrees of freedom ## Multiple R-squared: 0.916, Adjusted R-squared: 0.9157 ## F-statistic: 2965 on 2 and 544 DF, p-value: < 2.2e-16 Or, if you want to use all the columns inside Housing, you can replace the column names with .: model3 <- lm(price ~ ., data = Housing) summary(model3) ## ## Call: ## lm(formula = price ~ ., data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -41389 -9307 -591 7353 74875 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4038.3504 3409.4713 -1.184 0.236762 ## lotsize 3.5463 0.3503 10.124 < 2e-16 *** ## bedrooms 1832.0035 1047.0002 1.750 0.080733 . ## bathrms 14335.5585 1489.9209 9.622 < 2e-16 *** ## stories 6556.9457 925.2899 7.086 4.37e-12 *** ## drivewayyes 6687.7789 2045.2458 3.270 0.001145 ** ## recroomyes 4511.2838 1899.9577 2.374 0.017929 * ## fullbaseyes 5452.3855 1588.0239 3.433 0.000642 *** ## gashwyes 12831.4063 3217.5971 3.988 7.60e-05 *** ## aircoyes 12632.8904 1555.0211 8.124 3.15e-15 *** ## garagepl 4244.8290 840.5442 5.050 6.07e-07 *** ## prefareayes 9369.5132 1669.0907 5.614 3.19e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ## ## Residual standard error: 15420 on 534 degrees of freedom ## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6664 ## F-statistic: 99.97 on 11 and 534 DF, p-value: < 2.2e-16 You can access different elements of model3 with $, because the result of lm() is a list (you can check this claim with typeof(model3)): print(model3$coefficients) ## (Intercept) lotsize bedrooms bathrms stories drivewayyes ## -4038.350425 3.546303 1832.003466 14335.558468 6556.945711 6687.778890 ## recroomyes fullbaseyes gashwyes aircoyes garagepl prefareayes ## 4511.283826 5452.385539 12831.406266 12632.890405 4244.829004 9369.513239 but I prefer to use the {broom} package, and more specifically the tidy() function, which converts model3 into a neat data.frame: results3 <- broom::tidy(model3) glimpse(results3) ## Rows: 12 ## Columns: 5 ## $ term <chr> "(Intercept)", "lotsize", "bedrooms", "bathrms", "stories", … ## $ estimate <dbl> -4038.350425, 3.546303, 1832.003466, 14335.558468, 6556.9457… ## $ std.error <dbl> 3409.4713, 0.3503, 1047.0002, 1489.9209, 925.2899, 2045.2458… ## $ statistic <dbl> -1.184451, 10.123618, 1.749764, 9.621691, 7.086369, 3.269914… ## $ p.value <dbl> 2.367616e-01, 3.732442e-22, 8.073341e-02, 2.570369e-20, 4.37… I explicitly write broom::tidy() because tidy() is a popular function name. For instance, it is also a function from the {yardstick} package, which does not do the same thing at all. Since I will also be using {yardstick}, I prefer to explicitly write broom::tidy() to avoid conflicts. Using broom::tidy() is useful, because you can then work on the results easily, for example if you wish to only keep results that are significant at the 5% level: results3 %>% filter(p.value < 0.05) ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 lotsize 3.55 0.350 10.1 3.73e-22 ## 2 bathrms 14336. 1490. 9.62 2.57e-20 ## 3 stories 6557. 925. 7.09 4.37e-12 ## 4 drivewayyes 6688. 2045. 3.27 1.15e- 3 ## 5 recroomyes 4511. 1900. 2.37 1.79e- 2 ## 6 fullbaseyes 5452. 1588. 3.43 6.42e- 4 ## 7 gashwyes 12831. 3218. 3.99 7.60e- 5 ## 8 aircoyes 12633. 1555. 8.12 3.15e-15 ## 9 garagepl 4245. 841. 5.05 6.07e- 7 ## 10 prefareayes 9370. 1669. 5.61 3.19e- 8 You can even add new columns, such as the confidence intervals: results3 <- broom::tidy(model3, conf.int = TRUE, conf.level = 0.95) print(results3) ## # A tibble: 12 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4038. 3409. -1.18 2.37e- 1 -10736. 2659. ## 2 lotsize 3.55 0.350 10.1 3.73e-22 2.86 4.23 ## 3 bedrooms 1832. 1047. 1.75 8.07e- 2 -225. 3889. ## 4 bathrms 14336. 1490. 9.62 2.57e-20 11409. 17262. ## 5 stories 6557. 925. 7.09 4.37e-12 4739. 8375. ## 6 drivewayyes 6688. 2045. 3.27 1.15e- 3 2670. 10705. ## 7 recroomyes 4511. 1900. 2.37 1.79e- 2 779. 8244. ## 8 fullbaseyes 5452. 1588. 3.43 6.42e- 4 2333. 8572. ## 9 gashwyes 12831. 3218. 3.99 7.60e- 5 6511. 19152. ## 10 aircoyes 12633. 1555. 8.12 3.15e-15 9578. 15688. ## 11 garagepl 4245. 841. 5.05 6.07e- 7 2594. 5896. ## 12 prefareayes 9370. 1669. 5.61 3.19e- 8 6091. 12648. Going back to model estimation, you can of course use lm() in a pipe workflow: Housing %>% select(-driveway, -stories) %>% lm(price ~ ., data = .) %>% broom::tidy() ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3025. 3263. 0.927 3.54e- 1 ## 2 lotsize 3.67 0.363 10.1 4.52e-22 ## 3 bedrooms 4140. 1036.
3.99 7.38e- 5 ## 4 bathrms 16443. 1546. 10.6 4.29e-24 ## 5 recroomyes 5660. 2010. 2.82 5.05e- 3 ## 6 fullbaseyes 2241. 1618. 1.38 1.67e- 1 ## 7 gashwyes 13568. 3411. 3.98 7.93e- 5 ## 8 aircoyes 15578. 1597. 9.75 8.53e-21 ## 9 garagepl 4232. 883. 4.79 2.12e- 6 ## 10 prefareayes 10729. 1753. 6.12 1.81e- 9 The first . in the lm() function is used to indicate that we wish to use all the data from Housing (minus driveway and stories, which I removed using select() and the - sign), and the second . indicates where the result of the two dplyr instructions that precede it should be placed. The picture below should help you understand: You have to specify this because, by default, when using %>%, the left hand side argument gets passed as the first argument of the function on the right hand side. Since version 4.2, R also natively includes a placeholder, _: Housing |> select(-driveway, -stories) |> lm(price ~ ., data = _) |> broom::tidy() ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3025. 3263. 0.927 3.54e- 1 ## 2 lotsize 3.67 0.363 10.1 4.52e-22 ## 3 bedrooms 4140. 1036. 3.99 7.38e- 5 ## 4 bathrms 16443. 1546. 10.6 4.29e-24 ## 5 recroomyes 5660. 2010. 2.82 5.05e- 3 ## 6 fullbaseyes 2241. 1618. 1.38 1.67e- 1 ## 7 gashwyes 13568. 3411. 3.98 7.93e- 5 ## 8 aircoyes 15578. 1597. 9.75 8.53e-21 ## 9 garagepl 4232. 883. 4.79 2.12e- 6 ## 10 prefareayes 10729. 1753. 6.12 1.81e- 9 For the example above, I’ve also switched from %>% to |>, or else I can’t use the _ placeholder. The advantage of the _ placeholder is that it removes the ambiguity around .: here, the . is a placeholder for all the variables in the dataset, and _ is a placeholder for the dataset itself. 6.3 Diagnostics Diagnostics are useful metrics to assess model fit. You can read some of these diagnostics, such as the \\(R^2\\), at the bottom of the summary (when running summary(my_model)), but if you want to do more than simply read these diagnostics in RStudio, you can put them in a data.frame too, using broom::glance(): glance(model3) ## # A tibble: 1 × 12 ## r.squared adj.r.…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.673 0.666 15423. 100. 6.18e-122 11 -6034. 12094. 12150. 1.27e11 ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹​adj.r.squared, ²​statistic, ³​deviance You can also plot the usual diagnostics plots using ggfortify::autoplot(), which uses the {ggplot2} package under the hood: library(ggfortify) autoplot(model3, which = 1:6) + theme_minimal() which = 1:6 is an additional option that shows you all the diagnostics plots. If you omit this option, you will only get 4 of them.
You can also get the residuals of the regression in two ways; either you grab them directly from the model fit: resi3 <- residuals(model3) or you can augment the original data with a residuals column, using broom::augment(): housing_aug <- augment(model3) Let’s take a look at housing_aug: glimpse(housing_aug) ## Rows: 546 ## Columns: 18 ## $ price <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 838… ## $ lotsize <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800, 5500,… ## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4, 1, 2,… ## $ bathrms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,… ## $ stories <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1,… ## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, … ## $ recroom <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, no, no,… ## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, no, no… ## $ gashw <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,… ## $ airco <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no, no, … ## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0, 1,… ## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,… ## $ .fitted <dbl> 66037.98, 41391.15, 39889.63, 63689.09, 49760.43, 66387.12,… ## $ .resid <dbl> -24037.9757, -2891.1515, 9610.3699, -3189.0873, 11239.5735,… ## $ .hat <dbl> 0.013477335, 0.008316321, 0.009893730, 0.021510891, 0.01033… ## $ .sigma <dbl> 15402.01, 15437.14, 15431.98, 15437.02, 15429.89, 15437.64,… ## $ .cooksd <dbl> 2.803214e-03, 2.476265e-05, 3.265481e-04, 8.004787e-05, 4.6… ## $ .std.resid <dbl> -1.56917096, -0.18823924, 0.62621736, -0.20903274, 0.732539… A few columns have been added to the original data, among them .resid, which contains the residuals. Let’s plot them: ggplot(housing_aug) + geom_density(aes(.resid)) Fitted values are also added to the original data, under the variable .fitted. It would also have been possible to get the fitted values with: fit3 <- fitted(model3) but I prefer using augment(), because the columns get merged into the original data, which then makes it easier to find specific individuals; for example, you might want to know for which housing units the model underestimates the price: total_pos <- housing_aug %>% filter(.resid > 0) %>% summarise(total = n()) %>% pull(total) We find 261 individuals where the residuals are positive.
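Because augment() keeps the fitted values and the residuals in the same data frame as the original variables, diagnostic plots combining them are easy to make. For instance, a quick residuals-versus-fitted-values check (a minimal sketch):

ggplot(housing_aug) +
  geom_point(aes(.fitted, .resid))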
It is also easier to extract outliers: housing_aug %>% mutate(prank = cume_dist(.cooksd)) %>% filter(prank > 0.99) %>% glimpse() ## Rows: 6 ## Columns: 19 ## $ price <dbl> 163000, 125000, 132000, 175000, 190000, 174500 ## $ lotsize <dbl> 7420, 4320, 3500, 9960, 7420, 7500 ## $ bedrooms <dbl> 4, 3, 4, 3, 4, 4 ## $ bathrms <dbl> 1, 1, 2, 2, 2, 2 ## $ stories <dbl> 2, 2, 2, 2, 3, 2 ## $ driveway <fct> yes, yes, yes, yes, yes, yes ## $ recroom <fct> yes, no, no, no, no, no ## $ fullbase <fct> yes, yes, no, yes, no, yes ## $ gashw <fct> no, yes, yes, no, no, no ## $ airco <fct> yes, no, no, no, yes, yes ## $ garagepl <dbl> 2, 2, 2, 2, 2, 3 ## $ prefarea <fct> no, no, no, yes, yes, yes ## $ .fitted <dbl> 94826.68, 77688.37, 85495.58, 108563.18, 115125.03, 118549.… ## $ .resid <dbl> 68173.32, 47311.63, 46504.42, 66436.82, 74874.97, 55951.00 ## $ .hat <dbl> 0.02671105, 0.05303793, 0.05282929, 0.02819317, 0.02008141,… ## $ .sigma <dbl> 15144.70, 15293.34, 15298.27, 15159.14, 15085.99, 15240.66 ## $ .cooksd <dbl> 0.04590995, 0.04637969, 0.04461464, 0.04616068, 0.04107317,… ## $ .std.resid <dbl> 4.480428, 3.152300, 3.098176, 4.369631, 4.904193, 3.679815 ## $ prank <dbl> 0.9963370, 1.0000000, 0.9945055, 0.9981685, 0.9926740, 0.99… prank is a variable I created with cume_dist(), which is a dplyr function that returns the proportion of all values less than or equal to the current rank. For example: example <- c(5, 4.6, 2, 1, 0.8, 0, -1) cume_dist(example) ## [1] 1.0000000 0.8571429 0.7142857 0.5714286 0.4285714 0.2857143 0.1428571 By filtering prank > 0.99 we get the top 1% of outliers according to Cook’s distance. 6.4 Interpreting models Model interpretation is essential in the social sciences, but it is also getting very important in machine learning. As usual, the terminology is different; in machine learning, we speak about explainability. There is a very important distinction that one has to understand when it comes to interpretability/explainability: the distinction between classical, parametric models and black-box models. This is very well explained in Breiman (2001), an absolute must read (link to paper, in PDF format: click here). The gist of the paper is that there are two cultures of statistical modeling; one culture relies on modeling the data generating process, for instance by considering that a variable y (the dependent variable, or target) is a linear combination of input variables x (the explanatory variables, or features) plus some noise. The other culture uses complex algorithms (random forests, neural networks) to model the relationship between y and x. The author argues that most statisticians have relied for too long on modeling data generating processes and do not use all the potential offered by these complex algorithms. I think that a lot of things have changed since then, and that nowadays any practitioner that uses data is open to using any type of model or algorithm, as long as it does the job. However, the paper is very interesting, and the discussion of the trade-off between the simplicity of a model and its interpretability/explainability is still relevant today. In this section, I will explain how one can go about interpreting or explaining models from these two cultures. Also, it is important to note here that the discussion that follows will be heavily influenced by my econometrics background.
I will focus on marginal effects as a way to interpret parametric models (models from the first culture described above), but depending on the field, practitioners might use something else (for instance, computing odds ratios in a logistic regression). I will start with the interpretability of classical statistical models. 6.4.1 Marginal effects If one wants to know the effect of variable x on the dependent variable y, so-called marginal effects have to be computed. This is easily done in R with the {marginaleffects} package. Formally, marginal effects are the partial derivative of the regression equation with respect to the variable we want to look at. library(marginaleffects) effects_model3 <- marginaleffects(model3) summary(effects_model3) ## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lotsize dY/dX 3.546 0.3503 10.124 < 2.22e-16 2.86 4.233 ## 2 bedrooms dY/dX 1832.003 1047.0056 1.750 0.08016056 -220.09 3884.097 ## 3 bathrms dY/dX 14335.558 1489.9557 9.621 < 2.22e-16 11415.30 17255.818 ## 4 stories dY/dX 6556.946 925.2943 7.086 1.3771e-12 4743.40 8370.489 ## 5 driveway yes - no 6687.779 2045.2459 3.270 0.00107580 2679.17 10696.387 ## 6 recroom yes - no 4511.284 1899.9577 2.374 0.01757689 787.44 8235.132 ## 7 fullbase yes - no 5452.386 1588.0239 3.433 0.00059597 2339.92 8564.855 ## 8 gashw yes - no 12831.406 3217.5970 3.988 6.6665e-05 6525.03 19137.781 ## 9 airco yes - no 12632.890 1555.0211 8.124 4.5131e-16 9585.11 15680.676 ## 10 garagepl dY/dX 4244.829 840.5965 5.050 4.4231e-07 2597.29 5892.368 ## 11 prefarea yes - no 9369.513 1669.0906 5.614 1.9822e-08 6098.16 12640.871 ## ## Model type: lm ## Prediction type: response Let’s go through this: summary(effects_model3) shows the average marginal effects for each of the explanatory variables that were used in model3. The way to interpret them is as follows: everything else held constant (often you’ll read the Latin ceteris paribus for this), a unit increase in lotsize increases the price by 3.546 units, on average. The everything held constant part is crucial; with marginal effects, you’re looking at just the effect of one variable at a time. For discrete variables, like driveway, this is simpler: imagine two houses that are exactly the same, except that one has a driveway and the other doesn’t. The one with the driveway is 6687 units more expensive, on average. Now it turns out that in the case of a linear model, the average marginal effects are exactly equal to the coefficients. Just compare summary(model3) to effects_model3 to see it (and remember, I told you that marginal effects were the partial derivative of the regression equation with respect to the variable of interest; so the derivative of \\(\\alpha X_1 + ...\\) with respect to \\(X_1\\) is \\(\\alpha\\)). But in the case of a more complex, non-linear model, this is not so obvious. This is where {marginaleffects} will make your life much easier. It is also possible to plot the results: plot(effects_model3) effects_model3 is a data frame containing the effects for each house in the data set.
For example, let’s take a look at the first house: effects_model3 %>% filter(rowid == 1) ## rowid type term contrast dydx std.error statistic ## 1 1 response lotsize dY/dX 3.546303 0.3502195 10.125944 ## 2 1 response bedrooms dY/dX 1832.003466 1046.1608842 1.751168 ## 3 1 response bathrms dY/dX 14335.558468 1490.4827945 9.618064 ## 4 1 response stories dY/dX 6556.945711 925.4764870 7.084940 ## 5 1 response driveway yes - no 6687.778890 2045.2460319 3.269914 ## 6 1 response recroom yes - no 4511.283826 1899.9577182 2.374413 ## 7 1 response fullbase yes - no 5452.385539 1588.0237538 3.433441 ## 8 1 response gashw yes - no 12831.406266 3217.5971931 3.987885 ## 9 1 response airco yes - no 12632.890405 1555.0207045 8.123937 ## 10 1 response garagepl dY/dX 4244.829004 840.8930857 5.048001 ## 11 1 response prefarea yes - no 9369.513239 1669.0904968 5.613544 ## p.value conf.low conf.high predicted predicted_hi predicted_lo ## 1 4.238689e-24 2.859885 4.232721 66037.98 66043.14 66037.98 ## 2 7.991698e-02 -218.434189 3882.441121 66037.98 66038.89 66037.98 ## 3 6.708200e-22 11414.265872 17256.851065 66037.98 66042.28 66037.98 ## 4 1.391042e-12 4743.045128 8370.846295 66037.98 66039.94 66037.98 ## 5 1.075801e-03 2679.170328 10696.387452 66037.98 66037.98 59350.20 ## 6 1.757689e-02 787.435126 8235.132526 66037.98 70549.26 66037.98 ## 7 5.959723e-04 2339.916175 8564.854903 66037.98 66037.98 60585.59 ## 8 6.666508e-05 6525.031651 19137.780882 66037.98 78869.38 66037.98 ## 9 4.512997e-16 9585.105829 15680.674981 66037.98 78670.87 66037.98 ## 10 4.464572e-07 2596.708842 5892.949167 66037.98 66039.25 66037.98 ## 11 1.982240e-08 6098.155978 12640.870499 66037.98 75407.49 66037.98 ## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw airco ## 1 42000 5850 3 1 2 yes no yes no no ## 2 42000 5850 3 1 2 yes no yes no no ## 3 42000 5850 3 1 2 yes no yes no no ## 4 42000 5850 3 1 2 yes no yes no no ## 5 42000 5850 3 1 2 yes no yes no no ## 6 42000 5850 3 1 2 yes no yes no no ## 7 42000 5850 3 1 2 yes no yes no no ## 8 42000 5850 3 1 2 yes no yes no no ## 9 42000 5850 3 1 2 yes no yes no no ## 10 42000 5850 3 1 2 yes no yes no no ## 11 42000 5850 3 1 2 yes no yes no no ## garagepl prefarea eps ## 1 1 no 1.4550 ## 2 1 no 0.0005 ## 3 1 no 0.0003 ## 4 1 no 0.0003 ## 5 1 no NA ## 6 1 no NA ## 7 1 no NA ## 8 1 no NA ## 9 1 no NA ## 10 1 no 0.0003 ## 11 1 no NA rowid is a column that identifies the houses in the original data set, so rowid == 1 selects the first house. This shows you the marginal effects (column dydx) computed for this house; but remember, since we’re dealing with a linear model, the values of the marginal effects are constant. If you don’t see the point of this discussion, don’t fret, the next example should make things clearer. Let’s estimate a logit model and compute the marginal effects. You might know logit models as logistic regression. Logit models can be estimated using the glm() function, which stands for generalized linear models. As an example, we are going to use the Participation data, also from the {Ecdat} package: data(Participation) ?Participation Participation package:Ecdat R Documentation Labor Force Participation Description: a cross-section _number of observations_ : 872 _observation_ : individuals _country_ : Switzerland Usage: data(Participation) Format: A dataframe containing : lfp labour force participation ?
lnnlinc the log of nonlabour income age age in years divided by 10 educ years of formal education nyc the number of young children (younger than 7) noc number of older children foreign foreigner ? Source: Gerfin, Michael (1996) “Parametric and semiparametric estimation of the binary response”, _Journal of Applied Econometrics_, *11(3)*, 321-340. References: Davidson, R. and James G. MacKinnon (2004) _Econometric Theory and Methods_, New York, Oxford University Press, <URL: http://www.econ.queensu.ca/ETM/>, chapter 11. Journal of Applied Econometrics data archive : <URL: http://qed.econ.queensu.ca/jae/>. See Also: ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’, ‘Index.Observations’ The variable of interest is lfp: whether the individual participates in the labour force or not. To know which variables are relevant in the decision to participate in the labour force, one could train a logit model, using glm(): logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial") broom::tidy(logit_participation) ## # A tibble: 7 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.4 2.17 4.79 1.69e- 6 ## 2 lnnlinc -0.815 0.206 -3.97 7.31e- 5 ## 3 age -0.510 0.0905 -5.64 1.72e- 8 ## 4 educ 0.0317 0.0290 1.09 2.75e- 1 ## 5 nyc -1.33 0.180 -7.39 1.51e-13 ## 6 noc -0.0220 0.0738 -0.298 7.66e- 1 ## 7 foreignyes 1.31 0.200 6.56 5.38e-11 From the results above, one can only interpret the sign of the coefficients. To know how much a variable influences the labour force participation, one has to use marginaleffects(): effects_logit_participation <- marginaleffects(logit_participation) summary(effects_logit_participation) ## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lnnlinc dY/dX -0.169940 0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858 ## 2 age dY/dX -0.106407 0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193 ## 3 educ dY/dX 0.006616 0.00604 1.0954 0.27335 -0.005222 0.01845 ## 4 nyc dY/dX -0.277463 0.03325 -8.3436 < 2.22e-16 -0.342642 -0.21229 ## 5 noc dY/dX -0.004584 0.01538 -0.2981 0.76563 -0.034725 0.02556 ## 6 foreign yes - no 0.283377 0.03984 7.1129 1.1361e-12 0.205292 0.36146 ## ## Model type: glm ## Prediction type: response As you can see, the average marginal effects here are not equal to the estimated coefficients of the model. 
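The reason is that a logit model is non-linear in its parameters: the effect of a variable on the probability depends on the predicted probability itself. For a logit, the partial derivative of the probability \\(p_i\\) with respect to a continuous variable \\(x_{ij}\\) is: \\[ \\frac{\\partial p_i}{\\partial x_{ij}} = \\beta_j p_i (1 - p_i) \\] so the marginal effect is different for every individual \\(i\\), and these individual effects then get averaged.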
Let’s take a look at the first row of the data: Participation[1, ] ## lfp lnnlinc age educ nyc noc foreign ## 1 no 10.7875 3 8 1 1 no and let’s now look at rowid == 1 in the marginal effects data frame: effects_logit_participation %>% filter(rowid == 1) ## rowid type term contrast dydx std.error statistic ## 1 1 response lnnlinc dY/dX -0.156661756 0.038522800 -4.0667282 ## 2 1 response age dY/dX -0.098097148 0.020123709 -4.8747052 ## 3 1 response educ dY/dX 0.006099266 0.005367036 1.1364310 ## 4 1 response nyc dY/dX -0.255784406 0.029367783 -8.7096942 ## 5 1 response noc dY/dX -0.004226368 0.014167283 -0.2983189 ## 6 1 response foreign yes - no 0.305630005 0.045174828 6.7654935 ## p.value conf.low conf.high predicted predicted_hi predicted_lo lfp ## 1 4.767780e-05 -0.232165056 -0.08115846 0.2596523 0.2595710 0.2596523 no ## 2 1.089711e-06 -0.137538892 -0.05865540 0.2596523 0.2596111 0.2596523 no ## 3 2.557762e-01 -0.004419931 0.01661846 0.2596523 0.2596645 0.2596523 no ## 4 3.046958e-18 -0.313344203 -0.19822461 0.2596523 0.2595755 0.2596523 no ## 5 7.654598e-01 -0.031993732 0.02354100 0.2596523 0.2596497 0.2596523 no ## 6 1.328556e-11 0.217088969 0.39417104 0.2596523 0.5652823 0.2596523 no ## lnnlinc age educ nyc noc foreign eps ## 1 10.7875 3 8 1 1 no 0.0005188749 ## 2 10.7875 3 8 1 1 no 0.0004200000 ## 3 10.7875 3 8 1 1 no 0.0020000000 ## 4 10.7875 3 8 1 1 no 0.0003000000 ## 5 10.7875 3 8 1 1 no 0.0006000000 ## 6 10.7875 3 8 1 1 no NA Let’s focus on the first row, where term is lnnlinc. What we see here is the effect of an infinitesimal increase in the variable lnnlinc on labour force participation, for an individual who has the following characteristics: lnnlinc = 10.7875, age = 3, educ = 8, nyc = 1, noc = 1 and foreign = no, which are the characteristics of the first individual in our data. So let’s look at the value of dydx for every individual: dydx_lnnlinc <- effects_logit_participation %>% filter(term == "lnnlinc") head(dydx_lnnlinc) ## rowid type term contrast dydx std.error statistic p.value ## 1 1 response lnnlinc dY/dX -0.15666176 0.03852280 -4.066728 4.767780e-05 ## 2 2 response lnnlinc dY/dX -0.20013939 0.05124543 -3.905507 9.402813e-05 ## 3 3 response lnnlinc dY/dX -0.18493932 0.04319729 -4.281271 1.858287e-05 ## 4 4 response lnnlinc dY/dX -0.05376281 0.01586468 -3.388837 7.018964e-04 ## 5 5 response lnnlinc dY/dX -0.18709356 0.04502973 -4.154890 3.254439e-05 ## 6 6 response lnnlinc dY/dX -0.19586185 0.04782143 -4.095692 4.209096e-05 ## conf.low conf.high predicted predicted_hi predicted_lo lfp lnnlinc age ## 1 -0.23216506 -0.08115846 0.25965227 0.25957098 0.25965227 no 10.78750 3.0 ## 2 -0.30057859 -0.09970018 0.43340025 0.43329640 0.43340025 yes 10.52425 4.5 ## 3 -0.26960445 -0.10027418 0.34808777 0.34799181 0.34808777 no 10.96858 4.6 ## 4 -0.08485701 -0.02266862 0.07101902 0.07099113 0.07101902 no 11.10500 3.1 ## 5 -0.27535020 -0.09883692 0.35704926 0.35695218 0.35704926 no 11.10847 4.4 ## 6 -0.28959014 -0.10213356 0.40160949 0.40150786 0.40160949 yes 11.02825 4.2 ## educ nyc noc foreign eps ## 1 8 1 1 no 0.0005188749 ## 2 8 0 1 no 0.0005188749 ## 3 9 0 0 no 0.0005188749 ## 4 11 2 0 no 0.0005188749 ## 5 12 0 2 no 0.0005188749 ## 6 12 0 1 no 0.0005188749 dydx_lnnlinc is a data frame with the individual marginal effects for the variable lnnlinc. What if we compute the mean of this column? dydx_lnnlinc %>% summarise(mean(dydx)) ## mean(dydx) ## 1 -0.1699405 Let’s compare this to the average marginal effects: summary(effects_logit_participation) ## Term Contrast Effect Std.
Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lnnlinc dY/dX -0.169940 0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858 ## 2 age dY/dX -0.106407 0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193 ## 3 educ dY/dX 0.006616 0.00604 1.0954 0.27335 -0.005222 0.01845 ## 4 nyc dY/dX -0.277463 0.03325 -8.3436 < 2.22e-16 -0.342642 -0.21229 ## 5 noc dY/dX -0.004584 0.01538 -0.2981 0.76563 -0.034725 0.02556 ## 6 foreign yes - no 0.283377 0.03984 7.1129 1.1361e-12 0.205292 0.36146 ## ## Model type: glm ## Prediction type: response Yep, it’s the same! This is why we speak of average marginal effects. Now that we know why these are called average marginal effects, let’s go back to interpreting them. This time, let’s plot them, because why not: plot(effects_logit_participation) So an infinitesimal increase of, say, 0.001 in non-labour income (lnnlinc) is associated with a decrease of the probability of labour force participation of 0.001 × 17 ≈ 0.017 percentage points (the average marginal effect of lnnlinc is about -0.17, that is, -17 percentage points per unit increase). This is just scratching the surface of interpreting these kinds of models. There are many more types of effects that you can compute and look at. I highly recommend you read the documentation of {marginaleffects}, which you can find here. The author of the package, Vincent Arel-Bundock, writes a lot of very helpful documentation for his packages, so if model interpretation is important for your job, definitely take a look. 6.4.2 Explainability of black-box models Just read Christoph Molnar’s Interpretable Machine Learning. Seriously, I cannot add anything meaningful to it. His book is brilliant. 6.5 Comparing models Consider this section more as an illustration of what is possible with the knowledge you have acquired at this point. Imagine that the task at hand is to compare two models. We would like to select the one with the best fit to the data. Let’s first estimate another model on the same data; prices are only positive, so a linear regression might not be the best model, because it could predict negative prices. Let’s look at the distribution of prices: ggplot(Housing) + geom_density(aes(price)) It looks like modeling the log of price might provide a better fit: model_log <- lm(log(price) ~ ., data = Housing) result_log <- broom::tidy(model_log) print(result_log) ## # A tibble: 12 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.0 0.0472 212. 0 ## 2 lotsize 0.0000506 0.00000485 10.4 2.91e-23 ## 3 bedrooms 0.0340 0.0145 2.34 1.94e- 2 ## 4 bathrms 0.168 0.0206 8.13 3.10e-15 ## 5 stories 0.0923 0.0128 7.20 2.10e-12 ## 6 drivewayyes 0.131 0.0283 4.61 5.04e- 6 ## 7 recroomyes 0.0735 0.0263 2.79 5.42e- 3 ## 8 fullbaseyes 0.0994 0.0220 4.52 7.72e- 6 ## 9 gashwyes 0.178 0.0446 4.00 7.22e- 5 ## 10 aircoyes 0.178 0.0215 8.26 1.14e-15 ## 11 garagepl 0.0508 0.0116 4.36 1.58e- 5 ## 12 prefareayes 0.127 0.0231 5.50 6.02e- 8 Let’s take a look at the diagnostics: glance(model_log) ## # A tibble: 1 × 12 ## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.677 0.670 0.214 102. 3.67e-123 11 73.9 -122.
-65.8 24.4 ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹​adj.r.squared, ²​statistic, ³​deviance Let’s compare these to the ones from the previous model: diag_lm <- glance(model3) diag_lm <- diag_lm %>% mutate(model = "lin-lin model") diag_log <- glance(model_log) diag_log <- diag_log %>% mutate(model = "log-lin model") diagnostics_models <- full_join(diag_lm, diag_log) %>% select(model, everything()) # put the `model` column first ## Joining, by = c("r.squared", "adj.r.squared", "sigma", "statistic", "p.value", "df", "logLik", "AIC", "BIC", "deviance", ## "df.residual", "nobs", "model") print(diagnostics_models) ## # A tibble: 2 × 13 ## model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 lin-li… 0.673 0.666 1.54e+4 100. 6.18e-122 11 -6034. 12094. 12150. ## 2 log-li… 0.677 0.670 2.14e-1 102. 3.67e-123 11 73.9 -122. -65.8 ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>, and ## # abbreviated variable names ¹​r.squared, ²​adj.r.squared, ³​statistic I saved the diagnostics in two different data.frame objects using the glance() function and added a model column to indicate which model the diagnostics come from. Then I merged both datasets using full_join(), a {dplyr} function. Using this approach, we can easily build a data frame with the diagnostics of several models and compare them. The model using the logarithm of prices has a lower AIC and BIC (and thus a higher likelihood), so if you’re worried about selecting the model with the better fit to the data, you’d go for this model. 6.6 Using a model for prediction Once you have estimated a model, you might want to use it for prediction. This is easily done using the predict() function, which works with most models. Prediction is also useful as a way to test the accuracy of your model: split your data into a training set (used for training) and a testing set (used for the pseudo-prediction), and see if your model overfits the data. We are going to see how to do that in a later section; for now, let’s just get acquainted with predict() and other functions. I insist: keep in mind that this section is only to get acquainted with these functions. We are going to explore prediction, overfitting and tuning of models in a later section. Let’s go back to the models we trained in the previous section, model3 and model_log. Let’s also take a subsample of the data, which we will be using for prediction: set.seed(1234) pred_set <- Housing %>% sample_n(20) In order to always get the same pred_set, I set the random seed first.
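As an aside, this is what setting the seed does: it makes the pseudo-random draw reproducible, so running the same code again returns the exact same sample. A quick illustration:

set.seed(1234)
sample(1:10, 3)

set.seed(1234)
sample(1:10, 3) # the exact same three numbers again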
Let’s take a look at the data: print(pred_set) ## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw ## 284 45000 6750 2 1 1 yes no no no ## 101 57000 4500 3 2 2 no no yes no ## 400 85000 7231 3 1 2 yes yes yes no ## 98 59900 8250 3 1 1 yes no yes no ## 103 125000 4320 3 1 2 yes no yes yes ## 326 99000 8880 3 2 2 yes no yes no ## 79 55000 3180 2 2 1 yes no yes no ## 270 59000 4632 4 1 2 yes no no no ## 382 112500 6550 3 1 2 yes no yes no ## 184 63900 3510 3 1 2 yes no no no ## 4 60500 6650 3 1 2 yes yes no no ## 212 42000 2700 2 1 1 no no no no ## 195 33000 3180 2 1 1 yes no no no ## 511 70000 4646 3 1 2 yes yes yes no ## 479 88000 5450 4 2 1 yes no yes no ## 510 64000 4040 3 1 2 yes no no no ## 424 62900 2880 3 1 2 yes no no no ## 379 84000 7160 3 1 1 yes no yes no ## 108 58500 3680 3 2 2 yes no no no ## 131 35000 4840 2 1 2 yes no no no ## airco garagepl prefarea ## 284 no 0 no ## 101 yes 0 no ## 400 yes 0 yes ## 98 no 3 no ## 103 no 2 no ## 326 yes 1 no ## 79 no 2 no ## 270 yes 0 no ## 382 yes 0 yes ## 184 no 0 no ## 4 no 0 no ## 212 no 0 no ## 195 no 0 no ## 511 no 2 no ## 479 yes 0 yes ## 510 no 1 no ## 424 no 0 yes ## 379 no 2 yes ## 108 no 0 no ## 131 no 0 no If we wish to use it for prediction, this is easily done with predict(): predict(model3, pred_set) ## 284 101 400 98 103 326 79 270 ## 51143.48 77286.31 93204.28 76481.82 77688.37 103751.72 66760.79 66486.26 ## 382 184 4 212 195 511 479 510 ## 86277.96 48042.41 63689.09 30093.18 38483.18 70524.34 91987.65 54166.78 ## 424 379 108 131 ## 55177.75 77741.03 62980.84 50926.99 This returns a vector of predicted prices. This can then be used, for instance, to compute the Root Mean Squared Error, \\[ \\text{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^n(\\widehat{y}_i - y_i)^2}. \\] Let’s do it within a tidyverse pipeline: rmse <- pred_set %>% mutate(predictions = predict(model3, .)) %>% summarise(sqrt(sum((predictions - price)**2)/n())) The root mean square error of model3 is 3646.0817347. I also used the n() function, which returns the number of observations in a group (or all the observations, if the data is not grouped). Let’s compare model3’s RMSE with the one from model_log: rmse2 <- pred_set %>% mutate(predictions = exp(predict(model_log, .))) %>% summarise(sqrt(sum((predictions - price)**2)/n())) Don’t forget to exponentiate the predictions; remember you’re dealing with a log-linear model! model_log’s RMSE is \\(1.2125133 \\times 10^{4}\\), which is higher than model3’s. However, keep in mind that the model was trained on the whole data, and then the prediction quality was assessed using a subsample of the data the model was trained on… so actually we can’t really say whether model_log’s predictions are very useful. Of course, this is the same for model3. In a later section we are going to learn how to do cross-validation to avoid this issue. Just as a side note, notice that I had to copy and paste basically the same lines twice to compute the predictions for both models. That’s not much, but if I wanted to compare 10 models, copy and paste mistakes could have sneaked in. Instead, it would have been nice to have a function that computes the RMSE and then to use it on my models. We are going to learn how to write our own functions and use them just as if they were built-in R functions.
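As a teaser, here is a sketch of such a function (the name compute_rmse() and its log_model argument are made up for this example; we will write functions properly in the next chapter):

compute_rmse <- function(model, data, log_model = FALSE){
  predictions <- predict(model, data)
  # for a log-linear model, the predictions must be exponentiated first
  if(log_model){
    predictions <- exp(predictions)
  }
  sqrt(mean((predictions - data$price)^2))
}

compute_rmse(model3, pred_set)
compute_rmse(model_log, pred_set, log_model = TRUE)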
6.7 Beyond linear regression R has a lot of other built-in functions for regression, such as glm() (for Generalized Linear Models) and nls() (for Nonlinear Least Squares). There are also functions and additional packages for time series, panel data, machine learning, Bayesian and nonparametric methods. Presenting everything here would take too much space, and would be pretty useless, as you can find whatever you need using an internet search engine. What you have learned until now is quite general and should work with many types of models. To help you out, here is a list of methods and the recommended packages that you can use:

Robust Linear Regression: {MASS}, rlm(y ~ x, data = mydata)
Nonlinear Least Squares: {stats}, nls(y ~ x1 / (1 + x2), data = mydata)
Logit: {stats}, glm(y ~ x, data = mydata, family = "binomial")
Probit: {stats}, glm(y ~ x, data = mydata, family = binomial(link = "probit"))
K-Means: {stats}, kmeans(data, n)
PCA: {stats}, prcomp(data, scale = TRUE, center = TRUE)
Multinomial Logit: {mlogit}, requires several steps of data pre-processing and formula definition; refer to the Vignette for more details.
Cox PH: {survival}, coxph(Surv(y_time, y_status) ~ x, data = mydata)
Time series: several packages, depending on your needs; time series in R is a vast subject that would require a very thick book to cover. You can get started with the following series of blog articles: Tidy time-series, part 1, Tidy time-series, part 2, Tidy time-series, part 3 and Tidy time-series, part 4.
Panel data: {plm}, plm(y ~ x, data = mydata, model = "within") (or model = "random")
Machine learning: several packages, depending on your needs; R is a very popular programming language for machine learning. This book is a must read if you need to do machine learning with R.
Nonparametric regression: {np}, several functions and options available; refer to the Vignette for more details.

This list is far from complete. Should you be a Bayesian, you’d want to look at packages such as {rstan}, which uses Stan, an external piece of software that must be installed on your system. It is also possible to train models using Bayesian inference without the need for external tools, with the {bayesm} package, which estimates the usual micro-econometric models. There really are a lot of packages available for Bayesian inference, and you can find them all in the related CRAN Task View. 6.8 Hyper-parameters Hyper-parameters are parameters of the model that cannot be directly learned from the data. A linear regression does not have any hyper-parameters, but a random forest, for instance, has several. You might have heard of ridge regression, lasso and elasticnet. These are extensions of linear models that avoid over-fitting by penalizing large models. These extensions of the linear regression have hyper-parameters that the practitioner has to tune. There are several ways one can tune these parameters, for example by doing a grid search, a random search over the grid, or by using more elaborate methods. To introduce hyper-parameters, let’s get to know ridge regression, also called Tikhonov regularization. 6.8.1 Ridge regression Ridge regression is used when the data you are working with has a lot of explanatory variables, or when there is a risk that a simple linear regression might overfit the training data because, for example, your explanatory variables are collinear. If you train a linear model and then notice that it generalizes very badly to new, unseen data, it is very likely that the linear model you trained overfit the data. In this case, ridge regression might prove useful. The way ridge regression works might seem counter-intuitive; it boils down to fitting a worse model to the training data, but in return, this worse model will generalize better to new data.
The closed form solution of the ordinary least squares estimator is defined as: \\[ \\widehat{\\beta} = (X'X)^{-1}X'Y \\] where \\(X\\) is the design matrix (the matrix made up of the explanatory variables) and \\(Y\\) is the dependent variable. For ridge regression, this closed form solution changes a little bit: \\[ \\widehat{\\beta} = (X'X + \\lambda I_p)^{-1}X'Y \\] where \\(\\lambda \\in \\mathbb{R}\\) is a hyper-parameter and \\(I_p\\) is the identity matrix of dimension \\(p\\) (\\(p\\) is the number of explanatory variables). This formula above is the closed form solution to the following optimisation program: \\[ \\min_{\\beta} \\sum_{i=1}^n \\left(y_i - \\sum_{j=1}^p x_{ij}\\beta_j\\right)^2 \\] such that: \\[ \\sum_{j=1}^p(\\beta_j)^2 < c \\] for any strictly positive \\(c\\). The glmnet() function from the {glmnet} package can be used for ridge regression, by setting the alpha argument to 0 (setting it to 1 would do LASSO, and setting it to a number between 0 and 1 would do elasticnet). But in order to compare linear regression and ridge regression, let me first divide the data into a training set and a testing set: index <- 1:nrow(Housing) set.seed(12345) train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE) test_index <- setdiff(index, train_index) train_x <- Housing[train_index, ] %>% select(-price) train_y <- Housing[train_index, ] %>% pull(price) test_x <- Housing[test_index, ] %>% select(-price) test_y <- Housing[test_index, ] %>% pull(price) I do the train/test split this way because glmnet() requires a design matrix as input, and not a formula. Design matrices can be created using the model.matrix() function: library("glmnet") train_matrix <- model.matrix(train_y ~ ., data = train_x) test_matrix <- model.matrix(test_y ~ ., data = test_x) Let’s now run a linear regression, by setting the penalty to 0: model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0) The model above provides the same result as a linear regression, because I set lambda to 0. Let’s compare the coefficients between the two: coef(model_lm_ridge) ## 13 x 1 sparse Matrix of class "dgCMatrix" ## s0 ## (Intercept) -2667.542863 ## (Intercept) . ## lotsize 3.397596 ## bedrooms 2081.087654 ## bathrms 13294.192823 ## stories 6400.454580 ## drivewayyes 6530.644895 ## recroomyes 5389.856794 ## fullbaseyes 4899.099463 ## gashwyes 12575.611265 ## aircoyes 13078.144146 ## garagepl 4155.249461 ## prefareayes 10260.781753 and now the coefficients of the linear regression (because I provide a design matrix, I have to use lm.fit() instead of lm(), which requires a formula, not a matrix): coef(lm.fit(x = train_matrix, y = train_y)) ## (Intercept) lotsize bedrooms bathrms stories drivewayyes ## -2667.052098 3.397629 2081.344118 13293.707725 6400.416730 6529.972544 ## recroomyes fullbaseyes gashwyes aircoyes garagepl prefareayes ## 5388.871137 4899.024787 12575.970220 13077.988867 4155.269629 10261.056772 As you can see, the coefficients are the same. Let’s compute the RMSE for the unpenalized linear regression: preds_lm <- predict(model_lm_ridge, test_matrix) rmse_lm <- sqrt(mean((preds_lm - test_y)^2)) The RMSE for the linear unpenalized regression is equal to 1731.5553157.
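As a sanity check, the closed form solution above can also be computed by hand, with a few matrix operations. This is only a sketch: with lambda equal to 0 it reproduces the lm.fit() coefficients from above, but for positive values of lambda the result will not match glmnet() exactly, because glmnet() standardizes the variables internally and does not penalize the intercept:

lambda <- 0
XtX <- t(train_matrix) %*% train_matrix
# the ridge closed form solution: (X'X + lambda * I_p)^(-1) X'Y
beta_ridge <- solve(XtX + lambda * diag(ncol(train_matrix))) %*% t(train_matrix) %*% train_y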
Let’s now run a ridge regression, with lambda equal to 100, and see if the RMSE is smaller: model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100) and let’s compute the RMSE again: preds <- predict(model_ridge, test_matrix) rmse <- sqrt(mean((preds - test_y)^2)) The RMSE for the linear penalized regression is equal to 1726.7632312, which is smaller than before. But which value of lambda gives the smallest RMSE? To find out, one must run the model over a grid of lambda values and pick the model with the lowest RMSE. This procedure is available in the cv.glmnet() function, which picks the best value for lambda: best_model <- cv.glmnet(train_matrix, train_y) # lambda that minimises the MSE best_model$lambda.min ## [1] 61.42681 According to cv.glmnet(), the best value for lambda is 61.4268056. In the next section, we will implement cross-validation ourselves, in order to find the hyper-parameters of a random forest. 6.9 Training, validating, and testing models Cross-validation is an important procedure which is used to compare models, but also to tune the hyper-parameters of a model. In this section, we are going to use several packages from the {tidymodels} collection of packages, namely {recipes}, {rsample} and {parsnip}, to train a random forest the tidy way. I will also use {mlrMBO} to tune the hyper-parameters of the random forest. 6.9.1 Set up Let’s load the needed packages: library("tidyverse") library("recipes") library("rsample") library("parsnip") library("yardstick") library("brotools") library("mlbench") Load the data, which is included in the {mlbench} package: data("BostonHousing2") I will train a random forest to predict the housing prices, which are in the cmedv column: head(BostonHousing2) ## town tract lon lat medv cmedv crim zn indus chas nox ## 1 Nahant 2011 -70.9550 42.2550 24.0 24.0 0.00632 18 2.31 0 0.538 ## 2 Swampscott 2021 -70.9500 42.2875 21.6 21.6 0.02731 0 7.07 0 0.469 ## 3 Swampscott 2022 -70.9360 42.2830 34.7 34.7 0.02729 0 7.07 0 0.469 ## 4 Marblehead 2031 -70.9280 42.2930 33.4 33.4 0.03237 0 2.18 0 0.458 ## 5 Marblehead 2032 -70.9220 42.2980 36.2 36.2 0.06905 0 2.18 0 0.458 ## 6 Marblehead 2033 -70.9165 42.3040 28.7 28.7 0.02985 0 2.18 0 0.458 ## rm age dis rad tax ptratio b lstat ## 1 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 ## 2 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 ## 3 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 ## 4 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 ## 5 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 ## 6 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 Only keep relevant columns: boston <- BostonHousing2 %>% select(-medv, -tract, -lon, -lat) %>% rename(price = cmedv) I remove tract, lat and lon because the information contained in the column town is enough. To train and evaluate the model’s performance, I split the data in two. One data set, called the training set, will be further split into two down below. I won’t touch the second data set, the test set, until the very end, to finally assess the model’s performance. train_test_split <- initial_split(boston, prop = 0.9) housing_train <- training(train_test_split) housing_test <- testing(train_test_split) initial_split(), training() and testing() are functions from the {rsample} package. I will train a random forest on the training data; but the question is, which random forest? Random forests have several hyper-parameters, and, as explained in the intro, these hyper-parameters cannot be directly learned from the data. So which one should we choose?
We could train 6 random forests for instance and compare their performance, but why only 6? Why not 16? In order to find the right hyper-parameters, the practitioner can use values from the literature that seem to have worked well (as is done in macro-econometrics), or further split the training set into two, create a grid of hyperparameters, train the model on one part of the data for all values of the grid, and compare the predictions of the models on the second part of the data. You then stick with the model that performed best, for example the model with the lowest RMSE. The thing is, you can’t estimate the true value of the RMSE with only one value. It’s as if you wanted to estimate the average height of a population by drawing one single observation from it. You need a few more observations. To approach the true value of the RMSE for a given set of hyperparameters, instead of doing one split, let’s do 30. Then we compute the average RMSE, which implies training 30 models for each combination of the values of the hyperparameters. First, let’s split the training data again, using the mc_cv() function from the {rsample} package. This function implements Monte Carlo cross-validation: validation_data <- mc_cv(housing_train, prop = 0.9, times = 30) What does validation_data look like? validation_data ## # Monte Carlo cross-validation (0.9/0.1) with 30 resamples ## # A tibble: 30 × 2 ## splits id ## <list> <chr> ## 1 <split [409/46]> Resample01 ## 2 <split [409/46]> Resample02 ## 3 <split [409/46]> Resample03 ## 4 <split [409/46]> Resample04 ## 5 <split [409/46]> Resample05 ## 6 <split [409/46]> Resample06 ## 7 <split [409/46]> Resample07 ## 8 <split [409/46]> Resample08 ## 9 <split [409/46]> Resample09 ## 10 <split [409/46]> Resample10 ## # … with 20 more rows Let’s look further down: validation_data$splits[[1]] ## <Analysis/Assess/Total> ## <409/46/455> The first value is the number of rows of the first set, the second value that of the second set, and the third is the original number of rows in the training data, before splitting again. What should we call these two new data sets? The author of {rsample}, Max Kuhn, talks about the analysis and the assessment sets, and I’m going to use this terminology as well. Now, in order to continue, I need to pre-process the data. I will do this in three steps. The first and second steps are used to center and scale the numeric variables, and the third step converts character and factor variables to dummy variables. This is needed because I will train a random forest, which cannot handle factor variables directly. Let’s define a recipe to do that, and start by pre-processing the testing set. I write a wrapper function around the recipe, because I will need to apply this recipe to various data sets: simple_recipe <- function(dataset){ recipe(price ~ ., data = dataset) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) %>% step_dummy(all_nominal()) } We have not yet learned about writing functions, and will do so in the next chapter. However, for now, you only need to know that you can write your own functions, and that these functions can take any arguments you need. In the case of the above function, which we called simple_recipe(), we only need one argument, a dataset, which we called dataset. Once the recipe is defined, I can use the prep() function, which estimates, from the data, the parameters that are needed to process it.
For example, for centering, prep() estimates the mean, which will then be subtracted from the variables. With bake() the estimates are then applied to the data: testing_rec <- prep(simple_recipe(housing_test), training = housing_test) test_data <- bake(testing_rec, new_data = housing_test) It is important to split the data before using prep() and bake(), because if not, you will use observations from the test set in the prep() step, and thus introduce knowledge from the test set into the training data. This is called data leakage, and must be avoided. This is why it is necessary to first split the training data into an analysis and an assessment set, and then also pre-process these sets separately. However, the validation_data object cannot be used with recipe() directly, because it is not a dataframe. No worries, I simply need to write a function that extracts the analysis and assessment sets from the validation_data object, applies the pre-processing, trains the model, and returns the RMSE. This will be a big function, at the center of the analysis. But before that, let’s run a simple linear regression, as a benchmark. For the linear regression, I will not use any CV, so let’s pre-process the training set: trainlm_rec <- prep(simple_recipe(housing_train), training = housing_train) trainlm_data <- bake(trainlm_rec, new_data = housing_train) linreg_model <- lm(price ~ ., data = trainlm_data) broom::augment(linreg_model, newdata = test_data) %>% yardstick::rmse(price, .fitted) ## Warning in predict.lm(x, newdata = newdata, na.action = na.pass, ...): ## prediction from a rank-deficient fit may be misleading ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.439 broom::augment() adds the predictions to the test_data in a new column, .fitted. I won’t use this trick with the random forest, because there is no augment() method for random forests from the {ranger} package, which I’ll use. I’ll add the predictions to the data myself. Ok, now let’s go back to the random forest and write the big function: my_rf <- function(mtry, trees, split, id){ analysis_set <- analysis(split) analysis_prep <- prep(simple_recipe(analysis_set), training = analysis_set) analysis_processed <- bake(analysis_prep, new_data = analysis_set) model <- rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% set_engine("ranger", importance = 'impurity') %>% fit(price ~ ., data = analysis_processed) assessment_set <- assessment(split) assessment_prep <- prep(simple_recipe(assessment_set), training = assessment_set) assessment_processed <- bake(assessment_prep, new_data = assessment_set) tibble::tibble("id" = id, "truth" = assessment_processed$price, "prediction" = unlist(predict(model, new_data = assessment_processed))) } The rand_forest() function is available in the {parsnip} package. This package provides a unified interface to a lot of other machine learning packages. This means that instead of having to learn the syntax of ranger() and randomForest() and so on, you can simply use the rand_forest() function and change the engine argument to the one you want (ranger, randomForest, etc.).
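To make this unified interface concrete, here is a small sketch (nothing in it is specific to our data): the same model specification can be sent to a different engine just by changing set_engine():

# one specification...
rf_spec <- rand_forest(mode = "regression", mtry = 3, trees = 200)

# ...two different backends
rf_spec %>% set_engine("ranger")
rf_spec %>% set_engine("randomForest")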
Let’s try this function: results_example <- map2_df(.x = validation_data$splits, .y = validation_data$id, ~my_rf(mtry = 3, trees = 200, split = .x, id = .y)) head(results_example) ## # A tibble: 6 × 3 ## id truth prediction ## <chr> <dbl> <dbl> ## 1 Resample01 -0.328 -0.0274 ## 2 Resample01 1.06 0.686 ## 3 Resample01 1.04 0.726 ## 4 Resample01 -0.418 -0.0190 ## 5 Resample01 0.909 0.642 ## 6 Resample01 0.0926 -0.134 I can now compute the RMSE when mtry = 3 and trees = 200: results_example %>% group_by(id) %>% yardstick::rmse(truth, prediction) %>% summarise(mean_rmse = mean(.estimate)) %>% pull ## [1] 0.6305034 Note that this RMSE, averaged over the 30 assessment sets, is not directly comparable to the linear regression RMSE above, which was computed on the test set. The goal now is to lower this RMSE by tuning the mtry and trees hyperparameters. For this, I will use Bayesian optimization methods implemented in the {mlrMBO} package. 6.9.2 Bayesian hyperparameter optimization I will re-use the code from above, and define a function that does everything from pre-processing to returning the metric I want to minimize by tuning the hyperparameters, the RMSE: tuning <- function(param, validation_data){ mtry <- param[1] trees <- param[2] results <- purrr::map2_df(.x = validation_data$splits, .y = validation_data$id, ~my_rf(mtry = mtry, trees = trees, split = .x, id = .y)) results %>% group_by(id) %>% yardstick::rmse(truth, prediction) %>% summarise(mean_rmse = mean(.estimate)) %>% pull } This is exactly the code from before, but it now returns the RMSE. Let’s try the function with the values from before: tuning(c(3, 200), validation_data) ## [1] 0.6319843 I now follow the code that can be found in the arXiv paper to run the optimization. A simpler model, called the surrogate model, is used to look for promising points and to evaluate the value of the function at these points. This seems somewhat similar (in spirit) to the Indirect Inference method as described in Gourieroux, Monfort, Renault. If you don’t really get what follows, no worries, it is not really important as such. The idea is simply to look for hyper-parameters in an efficient way, and Bayesian optimization provides this efficient way. However, you could use another method, for example a grid search (a rough sketch follows below). This would not change the general approach. So I will not spend too much time explaining what is going on below, as you can read the details in the paper cited above as well as the package’s documentation. The focus here is not on this particular method, but rather on showing you how you can use various packages to solve a data science problem. Let’s first load the package and create the function to optimize: library("mlrMBO") fn <- makeSingleObjectiveFunction(name = "tuning", fn = tuning, par.set = makeParamSet(makeIntegerParam("x1", lower = 3, upper = 8), makeIntegerParam("x2", lower = 100, upper = 500))) This function is based on the function I defined before. The parameters to optimize are also defined, as are their bounds. I will look for mtry between the values of 3 and 8, and trees between 100 and 500.
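As promised, here is the rough sketch of the grid search alternative (hypothetical code, and slow to run, since each point of the grid implies training 30 random forests); we simply evaluate tuning() on every combination and keep the one with the lowest average RMSE:

grid <- expand.grid(mtry = 3:8, trees = seq(100, 500, by = 100))
# One call to tuning() per grid point; each call trains 30 random forests
grid_rmse <- purrr::map2_dbl(grid$mtry, grid$trees,
                             ~tuning(c(.x, .y), validation_data))
grid[which.min(grid_rmse), ]

The Bayesian approach below typically needs far fewer evaluations of tuning() to land on a good combination.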
We still need to define some other objects before continuing: # Create initial random Latin Hypercube Design of 10 points library(lhs)# for randomLHS des <- generateDesign(n = 5L * 2L, getParamSet(fn), fun = randomLHS) Then we choose the surrogate model, a random forest too: # Specify the surrogate model: a random forest with standard error estimation surrogate <- makeLearner("regr.ranger", predict.type = "se", keep.inbag = TRUE) Here I define some options: # Set general controls ctrl <- makeMBOControl() ctrl <- setMBOControlTermination(ctrl, iters = 10L) ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI()) And this is the optimization part: # Start optimization result <- mbo(fn, des, surrogate, ctrl, more.args = list("validation_data" = validation_data)) result ## Recommended parameters: ## x1=8; x2=314 ## Objective: y = 0.484 ## ## Optimization path ## 10 + 10 entries in total, displaying last 10 (or less): ## x1 x2 y dob eol error.message exec.time ei error.model ## 11 8 283 0.4855415 1 NA <NA> 7.353 -3.276847e-04 <NA> ## 12 8 284 0.4852047 2 NA <NA> 7.321 -3.283713e-04 <NA> ## 13 8 314 0.4839817 3 NA <NA> 7.703 -3.828517e-04 <NA> ## 14 8 312 0.4841398 4 NA <NA> 7.633 -2.829713e-04 <NA> ## 15 8 318 0.4841066 5 NA <NA> 7.692 -2.668354e-04 <NA> ## 16 8 314 0.4845221 6 NA <NA> 7.574 -1.382333e-04 <NA> ## 17 8 321 0.4843018 7 NA <NA> 7.693 -3.828924e-05 <NA> ## 18 8 318 0.4868457 8 NA <NA> 7.696 -8.692828e-07 <NA> ## 19 8 310 0.4862687 9 NA <NA> 7.594 -1.061185e-07 <NA> ## 20 8 313 0.4878694 10 NA <NA> 7.628 -5.153015e-07 <NA> ## train.time prop.type propose.time se mean ## 11 0.011 infill_ei 0.450 0.0143886864 0.5075765 ## 12 0.011 infill_ei 0.427 0.0090265872 0.4971003 ## 13 0.012 infill_ei 0.443 0.0062693960 0.4916927 ## 14 0.012 infill_ei 0.435 0.0037308971 0.4878950 ## 15 0.012 infill_ei 0.737 0.0024446891 0.4860699 ## 16 0.013 infill_ei 0.442 0.0012713838 0.4850705 ## 17 0.012 infill_ei 0.444 0.0006371109 0.4847248 ## 18 0.013 infill_ei 0.467 0.0002106381 0.4844576 ## 19 0.014 infill_ei 0.435 0.0002182254 0.4846214 ## 20 0.013 infill_ei 0.748 0.0002971160 0.4847383 So the recommended parameters are 8 for mtry and 314 for trees. The user can access these recommended parameters with result$x$x1 and result$x$x2. The value of the RMSE is lower than before, and equals 0.4839817. It can be accessed with result$y. Let’s now train the random forest on the training data with these values.
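Before doing that, a quick aside: if you want to inspect or plot the whole optimization path shown above, it can be extracted as a data frame. This is a sketch assuming {mlrMBO}'s usual API, where the result object carries an opt.path element:

# Hypothetical inspection of the optimization path (assumes result$opt.path exists)
opt_path <- as.data.frame(result$opt.path)
head(opt_path[, c("x1", "x2", "y")])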
First, I pre-process the training data: training_rec <- prep(simple_recipe(housing_train), training = housing_train) train_data <- bake(training_rec, new_data = housing_train) Let’s now train our final model and predict the prices: final_model <- rand_forest(mode = "regression", mtry = result$x$x1, trees = result$x$x2) %>% set_engine("ranger", importance = 'impurity') %>% fit(price ~ ., data = train_data) price_predict <- predict(final_model, new_data = select(test_data, -price)) Let’s transform the data back and compare the predicted prices to the true ones visually: cbind(price_predict * sd(housing_train$price) + mean(housing_train$price), housing_test$price) ## .pred housing_test$price ## 1 16.76938 13.5 ## 2 27.59510 30.8 ## 3 23.14952 24.7 ## 4 21.92390 21.2 ## 5 21.35030 20.0 ## 6 23.15809 22.9 ## 7 23.00947 23.9 ## 8 25.74268 26.6 ## 9 24.13122 22.6 ## 10 34.97671 43.8 ## 11 19.30543 18.8 ## 12 18.09146 15.7 ## 13 18.82922 19.2 ## 14 18.63397 13.3 ## 15 19.14438 14.0 ## 16 17.05549 15.6 ## 17 23.79491 27.0 ## 18 20.30125 17.4 ## 19 22.99200 23.6 ## 20 32.77092 33.3 ## 21 31.66258 34.6 ## 22 28.79583 34.9 ## 23 39.02755 50.0 ## 24 23.53336 21.7 ## 25 24.66551 24.3 ## 26 24.91737 24.0 ## 27 25.11847 25.1 ## 28 24.42518 23.7 ## 29 24.59139 23.7 ## 30 24.91760 26.2 ## 31 38.73875 43.5 ## 32 29.71848 35.1 ## 33 36.89490 46.0 ## 34 24.04041 26.4 ## 35 20.91349 20.3 ## 36 21.18602 23.1 ## 37 22.57069 22.2 ## 38 25.21751 23.9 ## 39 28.55841 50.0 ## 40 14.38216 7.2 ## 41 12.76573 8.5 ## 42 11.78237 9.5 ## 43 13.29279 13.4 ## 44 14.95076 16.4 ## 45 15.79182 19.1 ## 46 18.26510 19.6 ## 47 14.84985 13.3 ## 48 16.01508 16.7 ## 49 24.09930 25.0 ## 50 20.75357 21.8 ## 51 19.49487 19.7 Let’s now compute the RMSE: tibble::tibble("truth" = test_data$price, "prediction" = unlist(price_predict)) %>% yardstick::rmse(truth, prediction) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.425 As I mentioned above, the whole part about looking for hyper-parameters could be swapped for another method. The general approach though remains what I have described, and can be applied to any model that has hyper-parameters. References "],["defining-your-own-functions.html", "Chapter 7 Defining your own functions 7.1 Control flow 7.2 Writing your own functions 7.3 Exercises 7.4 Functions that take functions as arguments: writing your own higher-order functions 7.5 Functions that return functions 7.6 Functions that take columns of data as arguments 7.7 Functions that use loops 7.8 Anonymous functions 7.9 Exercises", " Chapter 7 Defining your own functions In this section we are going to learn some advanced concepts that are going to make you into a full-fledged R programmer. Before this chapter you only used whatever R came with, as well as the functions contained in packages. We did define some functions ourselves in Chapter 6 already, but without going into many details. In this chapter, we will learn about building functions ourselves, and do so in greater detail than what we did before. 7.1 Control flow Knowing about control flow is essential to building your own functions. Without control flow statements, such as if-else statements or loops (or, in the case of pure functional programming languages, recursion), programming languages would be very limited. 7.1.1 If-else Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical problem that requires the if ... else ... construct.
For instance: a <- 4 b <- 5 Suppose that if a > b then f should be equal to 20, else f should be equal to 10. Using if ... else ... you can achieve this like so: if (a > b) { f <- 20 } else { f <- 10 } Obviously, here f = 10. Another way to achieve this is by using the ifelse() function: f <- ifelse(a > b, 20, 10) if...else... and ifelse() might seem interchangeable, but they’re not. ifelse() is vectorized, while if...else... is not. Let’s try the following: ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no") ## [1] "no" "yes" "yes" The result is a vector. Now, let’s see what happens if we use if...else... instead of ifelse(): if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") > Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") : the condition has length > 1 This results in an error (in previous R versions, only the first element of the vector would get used). We have already discussed this in Chapter 2, remember? If you want to make sure that such an expression evaluates to TRUE, then you need to use all(): ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater") ## [1] "not all elements are greater" You may also remember the any() function: ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater") ## [1] "at least one element is greater" These are the basics. But sometimes, you might need to test for more complex conditions, which can lead to using nested if...else... constructs. These, however, can get messy: if (10 %% 3 == 0) { print("10 is divisible by 3") } else if (10 %% 2 == 0) { print("10 is divisible by 2") } ## [1] "10 is divisible by 2" 10 being obviously divisible by 2 and not 3, it is the second sentence that will be printed. The %% operator is the modulus operator, which gives the remainder of the division of 10 by 2. In such cases, it is easier to use dplyr::case_when(): case_when(10 %% 3 == 0 ~ "10 is divisible by 3", 10 %% 2 == 0 ~ "10 is divisible by 2") ## [1] "10 is divisible by 2" We have already encountered this function in Chapter 4, inside a dplyr::mutate() call to create a new column. Let’s now discuss loops. 7.1.2 For loops For loops make it possible to repeat a set of instructions i times. For example, try the following: for (i in 1:10){ print("hello") } ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" It is also possible to do computations using for loops. Let’s compute the sum of the first 100 integers: result <- 0 for (i in 1:100){ result <- result + i } print(result) ## [1] 5050 result is equal to 5050, the expected result. What happened in that loop? First, we defined a variable called result and set it to 0. Then, when the loop starts, i equals 1, so we add i to result, which gives 1. Then, i equals 2, and again, we add i to result. This time, result equals 1 and i equals 2, so now result equals 3, and we repeat this until i equals 100. If you know a programming language like C, this probably looks familiar. However, R is not C, and you should, if possible, avoid writing code that looks like this. You should always ask yourself the following questions: Is there a built-in function to achieve what I need? In this case we have sum(), so we could use sum(seq(1, 100)). Is there a way to use matrix algebra? This can sometimes make things easier, but it depends on how comfortable you are with matrix algebra. This would be the solution with matrix algebra: rep(1, 100) %*% seq(1, 100).
Is there a way to use building blocks that are already available? For instance, suppose that sum() were not available in R. Another way to solve this issue would be to use the following building blocks: +, which computes the sum of two numbers, and Reduce(), which reduces a list of elements using an operator. Sounds complicated? Let’s see how Reduce() works. First, let me show you how I combine these two functions to achieve the same result as when using sum(): Reduce(`+`, seq(1, 100)) ## [1] 5050 We will see how Reduce() works in greater detail in the next chapter, but what happened was something like this: Reduce(`+`, seq(1, 100)) = 1 + Reduce(`+`, seq(2, 100)) = 1 + 2 + Reduce(`+`, seq(3, 100)) = 1 + 2 + 3 + Reduce(`+`, seq(4, 100)) = .... If you ask yourself these questions, it turns out that you only rarely actually need to write loops, but loops are still important, because sometimes there simply isn’t an alternative. There are also other situations where loops make sense, so I refer you to the relevant section of Hadley Wickham’s Advanced R for an in-depth discussion on situations where loops make more sense than using functions such as Reduce(). 7.1.3 While loops While loops are very similar to for loops. The instructions inside a while loop are repeated while a certain condition holds true. Let’s consider the sum of the first 100 integers again: result <- 0 i <- 1 while (i<=100){ result = result + i i = i + 1 } print(result) ## [1] 5050 Here, we first set result to 0 and i to 1. Then, while i is less than, or equal to, 100, we add i to result. Notice that there is one more line than in the for loop version of this code: we need to increment the value of i at each iteration; if not, i would stay equal to 1, the condition would always be fulfilled, and the loop would run forever (well, until you interrupt R, or until the heat death of the universe, whichever comes first). Now that we know how to write loops, and know about if...else... constructs, we have (almost) all the ingredients to write our own functions. 7.2 Writing your own functions As you have seen by now, R includes a very large number of built-in functions, and many more functions are available in packages. However, there will be a lot of situations where you will need to write your own. In this section we are going to learn how to write our own functions. 7.2.1 Declaring functions in R Suppose you want to create the following function: \\(f(x) = \\dfrac{1}{\\sqrt{x}}\\). Writing this in R is quite simple: my_function <- function(x){ 1/sqrt(x) } The argument of the function, x, gets passed to the function() function, and the body of the function (more on that in the next chapter) contains its instructions. Of course, you could define functions that use more than one input: my_function <- function(x, y){ 1/sqrt(x + y) } or inputs with names longer than one character: my_function <- function(argument1, argument2){ 1/sqrt(argument1 + argument2) } Functions written by the user get called just the same way as functions included in R: my_function(1, 10) ## [1] 0.3015113 It is also possible to provide default values to the function’s arguments, which are values that are used if the user omits them: my_function <- function(argument1, argument2 = 10){ 1/sqrt(argument1 + argument2) } my_function(1) ## [1] 0.3015113 This is especially useful for functions with many arguments.
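As a quick illustration (a small sketch using the function just defined above), a default can be overridden either by position or by name:

my_function(1, 24)
## [1] 0.2
my_function(argument2 = 24, argument1 = 1)
## [1] 0.2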
Consider also the following example, where the function has a default method: my_function <- function(argument1, argument2, method = "foo"){ x <- argument1 + argument2 if(method == "foo"){ 1/sqrt(x) } else if (method == "bar"){ "this is a string" } } my_function(10, 11) ## [1] 0.2182179 my_function(10, 11, "bar") ## [1] "this is a string" As you see, depending on the “method” chosen, the returned result is either a numeric, or a string. What happens if the user provides a “method” that is neither “foo” nor “bar”? my_function(10, 11, "spam") As you can see, nothing happens. It is possible to add safeguards to your function to avoid such situations: my_function <- function(argument1, argument2, method = "foo"){ if(!(method %in% c("foo", "bar"))){ return("Method must be either 'foo' or 'bar'") } x <- argument1 + argument2 if(method == "foo"){ 1/sqrt(x) } else if (method == "bar"){ "this is a string" } } my_function(10, 11) ## [1] 0.2182179 my_function(10, 11, "bar") ## [1] "this is a string" my_function(10, 11, "foobar") ## [1] "Method must be either 'foo' or 'bar'" Notice that I have used return() inside my first if statement. This is to immediately stop evaluation of the function and return a value. If I had omitted it, evaluation would have continued, as it is always the last expression that gets evaluated. Remove return() and run the function again, and see what happens. Later, we are going to learn how to add better safeguards to your functions and to avoid runtime errors. While in general, it is a good idea to add comments to your functions to explain what they do, I would avoid adding comments to functions that do things that are very obvious, such as with this one. Function names should be of the form: function_name(). Always give your functions very explicit names! In mathematics it is standard to give functions just one letter as a name, but I would advise against doing that in your code. Functions that you write are not special in any way; this means that R will treat them the same way, and they will work in conjunction with any other function just as if they were built into R. They have one limitation though (which is shared with R’s native functions): just like in math, they can only return one value. However, sometimes, you may need to return more than one value. To be able to do this, you can put your values in a vector or a list, and return that. For example: average_and_sd <- function(x){ c(mean(x), sd(x)) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 You’re still returning a single object, but it’s a vector. You can also return a named list: average_and_sd <- function(x){ list("mean_x" = mean(x), "sd_x" = sd(x)) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## $mean_x ## [1] 7.166667 ## ## $sd_x ## [1] 4.262237 As described before, you can use return() at the end of your functions: average_and_sd <- function(x){ result <- c(mean(x), sd(x)) return(result) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 But this is only needed if you need to return a value early: average_and_sd <- function(x){ if(any(is.na(x))){ return(NA) } else { c(mean(x), sd(x)) } } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 average_and_sd(c(1, 3, NA, 9, 10, 12)) ## [1] NA If you need to use a function from a package inside your function, use the :: operator: my_sum <- function(a_vector){ purrr::reduce(a_vector, `+`) } However, if you need to use more than one function, this can become tedious.
A quick and dirty way of doing that is to use library(package_name) inside the function: my_sum <- function(a_vector){ library(purrr) reduce(a_vector, `+`) } Loading the library inside the function has the advantage that you will be sure that the package upon which your function depends will be loaded. If the package is already loaded, it will not be loaded again and thus not impact performance, but if you forgot to load it at the beginning of your script, then, no worries, your function will load it the first time you use it! However, the very best way would be to write your own package and declare the packages upon which your functions depend as dependencies. This is something we are going to explore in Chapter 9. You can put a lot of instructions inside a function, such as loops. Let’s create a function that returns Fibonacci numbers. 7.2.2 Fibonacci numbers The Fibonacci sequence is the following: \\[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\\] Each subsequent number is the sum of the two preceding ones. In R, it is possible to define a function that returns the \\(n^{th}\\) Fibonacci number: my_fibo <- function(n){ a <- 0 b <- 1 for (i in 1:n){ temp <- b b <- a a <- a + temp } a } Inside the loop, we defined a variable called temp. Defining temporary variables is usually very useful. Let’s try to understand what happens inside this loop: First, we assign the value 0 to variable a and value 1 to variable b. We start a loop, that goes from 1 to n. We assign the value inside of b to a temporary variable, called temp. b becomes a. We assign the sum of a and temp to a. When the loop is finished, we return a. What happens if we want the 3rd Fibonacci number? At n = 1 we have first a = 0 and b = 1, then temp = 1, b = 0 and a = 0 + 1. Then n = 2. Now b = 0 and temp = 0. The previous result, a = 0 + 1, is now assigned to b, so b = 1. Then, a = 1 + 0. Finally, n = 3. temp = 1 (because b = 1), the previous result a = 1 is assigned to b and finally, a = 1 + 1. So the third Fibonacci number equals 2. Reading this might be a bit confusing; I strongly advise you to run the algorithm on a sheet of paper, step by step. The above algorithm is called an iterative algorithm, because it uses a loop to compute the result. Let’s look at another way to think about the problem, with a so-called recursive function: fibo_recur <- function(n){ if (n == 0 || n == 1){ return(n) } else { fibo_recur(n-1) + fibo_recur(n-2) } } This algorithm should be easier to understand: if n = 0 or n = 1 the function should return n (0 or 1). If n is strictly bigger than 1, fibo_recur() should return the sum of fibo_recur(n-1) and fibo_recur(n-2). This version of the function is very much the same as the mathematical definition of the Fibonacci sequence. So why not use only recursive algorithms then? Try to run the following: system.time(my_fibo(30)) ## user system elapsed ## 0.007 0.000 0.007 The result should be printed very fast (the system.time() function returns the time that it took to execute my_fibo(30)). Let’s try with the recursive version: system.time(fibo_recur(30)) ## user system elapsed ## 1.720 0.044 1.772 It takes much longer to execute! Recursive algorithms are very CPU-demanding, so if speed is critical, it’s best to avoid recursive algorithms. Also, in fibo_recur(), try removing the line if (n == 0 || n == 1), then run fibo_recur(5) and see what happens. You should get an error: this is because for recursive algorithms you need a stopping condition, or else the function would run forever.
This is not the case for iterative algorithms, because the stopping condition is part of the loop itself. So as you can see, for recursive relationships, for or while loops are the way to go in R, whether you’re writing these loops inside functions or not. 7.3 Exercises Exercise 1 In this exercise, you will write a function to compute the sum of the n first integers. Combine the algorithm we saw in the section about while loops and what you learned about functions in this section. Exercise 2 Write a function called my_fact() that computes the factorial of a number n. Do it using a loop, using a recursive function, and using a functional. Exercise 3 Write a function to find the roots of quadratic functions. Your function should take 3 arguments, a, b and c, and return the two roots. Only consider the case where there are two real roots (delta > 0). 7.4 Functions that take functions as arguments: writing your own higher-order functions Functions that take functions as arguments are very powerful and useful tools. You already know a couple, purrr::map() and purrr::reduce(), discussed briefly in Chapter 4. But you can also write your own! A very simple example would be the following: my_func <- function(x, func){ func(x) } my_func() is a very simple function that takes x and func() as arguments and simply executes func(x). This might not seem very useful (after all, you could simply use func(x)!), but this is just for illustration purposes; in practice, your functions would be more useful than that! Let’s try to use my_func(): my_func(c(1, 8, 1, 0, 8), mean) ## [1] 3.6 As expected, this returns the mean of the given vector. But now suppose the following: my_func(c(1, 8, 1, NA, 8), mean) ## [1] NA Because one element of the vector is NA, the whole mean is NA. mean() has a na.rm argument that you can set to TRUE to ignore the NAs in the vector. However, here, there is no way to provide this argument to the function mean()! Let’s see what happens when we try to: my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) : unused argument (na.rm = TRUE) So what you could do is pass the value TRUE to the na.rm argument of mean() from your own function: my_func <- function(x, func, remove_na){ func(x, na.rm = remove_na) } my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE) ## [1] 4.5 This is one solution, but mean() also has another argument called trim. What if some other user needs this argument? Should you also add it to your function? Surely there’s a way to avoid this problem? Yes, there is, and it is by using the dots. The ... simply mean “any other argument as needed”, and they’re very easy to use: my_func <- function(x, func, ...){ func(x, ...) } my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) ## [1] 4.5 or, now, if you need the trim argument: my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1) ## [1] 4.5 The ... are very useful when writing higher-order functions such as my_func(), because they allow you to pass arguments down to the underlying functions. 7.5 Functions that return functions The example from before, my_func(), took three arguments: some x, a function func, and ... (dots). my_func() was a kind of wrapper that evaluated func on its arguments x and .... But sometimes this is not quite what you need or want. It is sometimes useful to write a function that returns a modified function. This type of function is called a function factory, as it builds functions. For instance, suppose that we want to time how long functions take to run.
An idea would be to proceed like this: tic <- Sys.time() very_slow_function(x) toc <- Sys.time() running_time <- toc - tic but if you want to time several functions, this gets very tedious. It would be much easier if functions would time themselves. We could achieve this by writing a wrapper, like this: timed_very_slow_function <- function(...){ tic <- Sys.time() result <- very_slow_function(...) toc <- Sys.time() running_time <- toc - tic list("result" = result, "running_time" = running_time) } The problem here is that we would have to write such a wrapper for each function we need to time. But thanks to the concept of function factories, we can write a function that does this for us: time_f <- function(.f, ...){ function(...){ tic <- Sys.time() result <- .f(...) toc <- Sys.time() running_time <- toc - tic list("result" = result, "running_time" = running_time) } } time_f() is a function that returns a function, a function factory. Calling it on a function returns, as expected, a function: t_mean <- time_f(mean) t_mean ## function(...){ ## ## tic <- Sys.time() ## result <- .f(...) ## toc <- Sys.time() ## ## running_time <- toc - tic ## ## list("result" = result, ## "running_time" = running_time) ## ## } ## <environment: 0x55a89cb7fb78> This function can now be used like any other function: output <- t_mean(seq(-500000, 500000)) output is a list of two elements, the first being simply the result of mean(seq(-500000, 500000)), and the other being the running time. This approach is super flexible. For instance, imagine that there is an NA in the vector. This would result in the mean of this vector being NA: t_mean(c(NA, seq(-500000, 500000))) ## $result ## [1] NA ## ## $running_time ## Time difference of 0.006837606 secs But because we use the ... in the definition of time_f(), we can now simply pass mean()’s na.rm option down to it: t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE) ## $result ## [1] 0 ## ## $running_time ## Time difference of 0.01413321 secs 7.6 Functions that take columns of data as arguments 7.6.1 The enquo() - !!() approach In many situations, you will want to write functions that look similar to this: my_function(my_data, one_column_inside_data) Such a function would be useful in situations where you have to apply a certain number of operations to columns of different data frames. For example, if you need to create tables of descriptive statistics or graphs periodically, it might be very interesting to put these operations inside a function and then call the function whenever you need it, on a fresh batch of data. However, if you try to write something like that, something that might seem unexpected, at first, will happen: data(mtcars) simple_function <- function(dataset, col_name){ dataset %>% group_by(col_name) %>% summarise(mean_speed = mean(speed)) } simple_function(cars, "dist") Error: unknown variable to group by : col_name The variable col_name is passed to simple_function() as a string, but group_by() requires a variable name. So why not try to convert col_name to a name? simple_function <- function(dataset, col_name){ col_name <- as.name(col_name) dataset %>% group_by(col_name) %>% summarise(mean_speed = mean(speed)) } simple_function(cars, "dist") Error: unknown variable to group by : col_name This is because R is literally looking for the variable \"dist\" somewhere in the global environment, and not as a column of the data. R does not understand that you are referring to the column \"dist\" that is inside the dataset. So how can we make R understand what you mean?
To be able to do that, we need to use a framework that was introduced in the {tidyverse}, called tidy evaluation. This framework can be used by installing the {rlang} package. {rlang} is quite a technical package, so I will spare you the details. But you should at the very least take a look at the following documents here and here. The discussion can get complicated, but you don’t need to know everything about {rlang}. As you will see, knowing some of the capabilities {rlang} provides can be incredibly useful. Take a look at the code below: simple_function <- function(dataset, col_name){ col_name <- enquo(col_name) dataset %>% group_by(!!col_name) %>% summarise(mean_mpg = mean(mpg)) } simple_function(mtcars, cyl) ## # A tibble: 3 × 2 ## cyl mean_mpg ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1 As you can see, the previous idea we had, which was using as.name(), was not very far from the solution. The solution, with {rlang}, consists in using enquo(), which (for our purposes) does something similar to as.name(). Now that col_name is (R programmers call it) quoted, or defused, we need to tell group_by() to evaluate the input as is. This is done with !!(), called the injection operator, which is another {rlang} function. I say it again; don’t worry if you don’t understand everything. Just remember to use enquo() on your column names and then !!() inside the {dplyr} function you want to use. Let’s see some other examples: simple_function <- function(dataset, col_name, value){ col_name <- enquo(col_name) dataset %>% filter((!!col_name) == value) %>% summarise(mean_cyl = mean(cyl)) } simple_function(mtcars, am, 1) ## mean_cyl ## 1 5.076923 Notice that I’ve written: filter((!!col_name) == value) and not: filter(!!col_name == value) I have enclosed !!col_name inside parentheses. This is because operators such as == have precedence over !!, so you have to be explicit. Also, notice that I didn’t have to quote 1. This is because it’s a standard variable, not a column inside the dataset. Let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable? simple_function <- function(dataset, filter_col, mean_col, value){ filter_col <- enquo(filter_col) mean_col <- enquo(mean_col) dataset %>% filter((!!filter_col) == value) %>% summarise(mean((!!mean_col))) } simple_function(mtcars, am, cyl, 1) ## mean(cyl) ## 1 5.076923 Notice that I had to quote mean_col too. Using the ... that we discovered in the previous section, we can pass more than one column: simple_function <- function(dataset, ...){ col_vars <- quos(...) dataset %>% summarise_at(vars(!!!col_vars), funs(mean, sd)) } Because these dots contain more than one variable, you have to use quos() instead of enquo(). This will put the arguments provided via the dots in a list. Then, because we have a list of columns, we have to use summarise_at(), which you should know if you did the exercises of Chapter 4. So if you didn’t do them, go back to them and finish them first. Doing the exercises will also teach you what vars() and funs() are. The last thing you have to pay attention to is to use !!!() if you used quos(). So 3 ! instead of only 2. This allows you to then do things like this: simple_function(mtcars, am, cyl, mpg) ## am_mean cyl_mean mpg_mean am_sd cyl_sd mpg_sd ## 1 0.40625 6.1875 20.09062 0.4989909 1.785922 6.026948 Using ... with !!!() allows you to write very flexible functions.
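As an aside, recent versions of {dplyr} have deprecated summarise_at() and funs() in favour of across(). Here is a sketch of an equivalent function, assuming {dplyr} 1.0 or later, and using the {{}} syntax presented in the next section:

# Hypothetical across()-based rewrite; cols is a tidy selection of columns
simple_function_across <- function(dataset, cols){
  dataset %>%
    summarise(across({{ cols }}, list(mean = mean, sd = sd)))
}
simple_function_across(mtcars, c(am, cyl, mpg))

The tidy selection c(am, cyl, mpg) plays the role that quos() played above.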
If you need to be even more general, you can also provide the summary functions as arguments of your function, but you have to rewrite your function a little bit: simple_function <- function(dataset, cols, funcs){ dataset %>% summarise_at(vars(!!!cols), funs(!!!funcs)) } You might be wondering where the quos() went. Well, because we are now passing two lists, a list of columns and a list of functions, both of which have to be quoted, we need to use quos() when calling the function: simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum)) ## am_mean cyl_mean mpg_mean am_sd cyl_sd mpg_sd am_sum cyl_sum mpg_sum ## 1 0.40625 6.1875 20.09062 0.4989909 1.785922 6.026948 13 198 642.9 This works, but I don’t think you’ll need to have that much flexibility; either the columns are variables, or the functions, but rarely both at the same time. To conclude this section, I should also talk about as_label(), which allows you to change the name of a variable, for instance if you want to call the resulting column mean_mpg when you compute the mean of the mpg column: simple_function <- function(dataset, filter_col, mean_col, value){ filter_col <- enquo(filter_col) mean_col <- enquo(mean_col) mean_name <- paste0("mean_", as_label(mean_col)) dataset %>% filter((!!filter_col) == value) %>% summarise(!!(mean_name) := mean((!!mean_col))) } Pay attention to the := operator in the last line. This is needed when the name of the column you are creating is itself stored in a variable, as is the case when using as_label(). 7.6.2 Curly Curly, a simplified approach to enquo() and !!() The previous section might have been a bit difficult to grasp, but there is a simplified way of doing it, which consists in using {{}}, introduced in {rlang} version 0.4.0. The suggested pronunciation of {{}} is curly-curly, but there is no consensus yet. Let’s suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments, just like before: how_many_na <- function(dataframe, column_name){ dataframe %>% filter(is.na(column_name)) %>% count() } Let’s try this function out on the starwars data: data(starwars) head(starwars) ## # A tibble: 6 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo ## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi… ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color, ## # ³​eye_color, ⁴​birth_year, ⁵​homeworld As you can see, there are missing values in the hair_color column. Let’s try to count how many missing values are in this column: how_many_na(starwars, hair_color) Error: object 'hair_color' not found Just as expected, this does not work. The issue is that the column is inside the dataframe, but when calling the function with hair_color as the second argument, R is looking for a variable called hair_color that does not exist. What about trying with \"hair_color\"? how_many_na(starwars, "hair_color") ## # A tibble: 1 × 1 ## n ## <int> ## 1 0 Now we get something, but something wrong!
One way to solve this issue is to not use the filter() function, and instead rely on base R: how_many_na_base <- function(dataframe, column_name){ na_index <- is.na(dataframe[, column_name]) nrow(dataframe[na_index, column_name]) } how_many_na_base(starwars, "hair_color") ## [1] 5 This works, but not using the {tidyverse} at all is not always an option. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}: summarise_groups <- function(dataframe, grouping_var, column_name){ dataframe %>% group_by(grouping_var) %>% summarise(mean(column_name, na.rm = TRUE)) } Calling this function results in the following error message, as expected: Error: Column `grouping_var` is unknown In the previous section, we solved the issue like so: summarise_groups <- function(dataframe, grouping_var, column_name){ grouping_var <- enquo(grouping_var) column_name <- enquo(column_name) mean_name <- paste0("mean_", as_label(column_name)) dataframe %>% group_by(!!grouping_var) %>% summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE)) } The core of the function remained very similar to the version from before, but now one has to use the enquo()-!! syntax. Now this can be simplified using the new {{}} syntax: summarise_groups <- function(dataframe, grouping_var, column_name){ dataframe %>% group_by({{grouping_var}}) %>% summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE)) } Much easier and cleaner! You still have to use the := operator instead of = for the column name, however, and if you want to modify the column names, for instance returning \"mean_height\" instead of height in this case, you have to keep using the enquo()-!! syntax. 7.7 Functions that use loops It is entirely possible to put a loop inside a function. For example, consider the following function that returns the square root of a number using Newton’s algorithm: sqrt_newton <- function(a, init = 1, eps = 0.01){ stopifnot(a >= 0) while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) } init } This function contains a while loop inside its body. Let’s see if it works: sqrt_newton(16) ## [1] 4.000001 In the definition of the function, I wrote init = 1 and eps = 0.01, which means that these arguments can be omitted and will take the provided values (1 and 0.01) as defaults. You can then use this function as any other, for example with map(): map(c(16, 7, 8, 9, 12), sqrt_newton) ## [[1]] ## [1] 4.000001 ## ## [[2]] ## [1] 2.645767 ## ## [[3]] ## [1] 2.828469 ## ## [[4]] ## [1] 3.000092 ## ## [[5]] ## [1] 3.464616 This is what I meant before with “your functions are nothing special”. Once the function is defined, you can use it like any other base R function. Notice the use of stopifnot() inside the body of the function. This is a way to return an error in case a condition is not fulfilled. We are going to learn more about this type of function in the next chapter. 7.8 Anonymous functions As the name implies, anonymous functions are functions that do not have a name. These are useful inside functions that have functions as arguments, such as purrr::map() or purrr::reduce(): map(c(1,2,3,4), function(x){1/sqrt(x)}) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 These anonymous functions get defined in a very similar way to regular functions; you just skip the name and that’s it.
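As a small aside (a quick sketch), an anonymous function can even be defined and called in one go, by wrapping its definition in parentheses:

(function(x){1/sqrt(x)})(4)
## [1] 0.5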
{tidyverse} functions also support formulas; these get converted to anonymous functions: map(c(1,2,3,4), ~{1/sqrt(.)}) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 Using a formula instead of an anonymous function is less verbose; you use ~ instead of function(x) and a single dot . instead of x. What if you need an anonymous function that requires more than one argument? This is not a problem: map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), function(x, y){(x**2)/y}) ## [[1]] ## [1] 0.1111111 ## ## [[2]] ## [1] 0.5 ## ## [[3]] ## [1] 1.285714 ## ## [[4]] ## [1] 2.666667 ## ## [[5]] ## [1] 5 or, using a formula: map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y}) ## [[1]] ## [1] 0.1111111 ## ## [[2]] ## [1] 0.5 ## ## [[3]] ## [1] 1.285714 ## ## [[4]] ## [1] 2.666667 ## ## [[5]] ## [1] 5 Because you now have two arguments, a single dot cannot work, so instead you use .x and .y to avoid confusion. Since version 4.1, R has a shorthand for defining anonymous functions: map(c(1,2,3,4), \\(x)(1/sqrt(x))) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 \\(x) is supposed to look like this notation: \\(\\lambda(x)\\). This notation comes from the lambda calculus, where functions are written like this: \\[ \\lambda x.\\ 1/\\sqrt{x} \\] which is equivalent to \\(f(x) = 1/\\sqrt{x}\\). You can use \\(x) or function(x) interchangeably. You now know a lot about writing your own functions. In the next chapter, we are going to learn about functional programming, the programming paradigm I described in the introduction of this book. 7.9 Exercises Exercise 1 Create the following vector: \\[a = (1,6,7,8,8,9,2)\\] Using a for loop and a while loop, compute the sum of its elements. To avoid issues, use i as the counter inside the for loop, and j as the counter for the while loop. How would you achieve that with a functional (a function that takes a function as an argument)? Exercise 2 Let’s use a loop to get the matrix product of a matrix A and B. Follow these steps to create the loop: Create matrix A: \\[A = \\left( \\begin{array}{ccc} 9 & 4 & 12 \\\\ 5 & 0 & 7 \\\\ 2 & 6 & 8 \\\\ 9 & 2 & 9 \\end{array} \\right) \\] Create matrix B: \\[B = \\left( \\begin{array}{cccc} 5 & 4 & 2 & 5 \\\\ 2 & 7 & 2 & 1 \\\\ 8 & 3 & 2 & 6 \\\\ \\end{array} \\right) \\] Create a matrix C, with dimension 4x4, that will hold the result. Use this command: C = matrix(rep(0,16), nrow = 4). Using a for loop, loop over the rows of A first: for(i in 1:nrow(A)). Inside this loop, loop over the columns of B: for(j in 1:ncol(B)). Again, inside this loop, loop over the rows of B: for(k in 1:nrow(B)). Inside this last loop, compute the result and save it inside C: C[i,j] = C[i,j] + A[i,k] * B[k,j]. Now write a function that takes two matrices as arguments, and returns their product. R has a built-in function to compute the product of two matrices. Which is it? Exercise 3 Fizz Buzz: Print integers from 1 to 100. If a number is divisible by 3, print the word \"Fizz\"; if it’s divisible by 5, print \"Buzz\". Use a for loop and if statements. Write a function that takes an integer as argument, and prints \"Fizz\" or \"Buzz\" up to that integer. Exercise 4 Fizz Buzz 2: Same as above, but now add this third condition: if a number is divisible by both 3 and 5, print \"FizzBuzz\". Write a function that takes an integer as argument, and prints Fizz, Buzz or FizzBuzz up to that integer.
"],["functional-programming.html", "Chapter 8 Functional programming 8.1 Function definitions 8.2 Properties of functions 8.3 Functional programming with {purrr} 8.4 List-based workflows for efficiency 8.5 Exercises", " Chapter 8 Functional programming Functional programming is a paradigm that I find very suitable for data science. In functional programming, your code is organised into functions that perform the operations you need. Your scripts will only be a sequence of calls to these functions, making them easier to understand. R is not a pure functional programming language, so we need some self-discipline to apply pure functional programming principles. However, these efforts are worth it, because pure functions are easier to debug, extend and document. In this chapter, we are going to learn about functional programming principles that you can adopt and start using to make your code better. 8.1 Function definitions You should now be familiar with function definitions in R. Let’s suppose you want to write a function to compute the square root of a number and want to do so using Newton’s algorithm: sqrt_newton <- function(a, init, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) } init } You can then use this function to get the square root of a number: sqrt_newton(16, 2) ## [1] 4.00122 We are using a while loop inside the body of the function. The body of a function are the instructions that define the function. You can get the body of a function with body(some_func). In pure functional programming languages, like Haskell, loops do not exist. How can you program without loops, you may ask? In functional programming, loops are replaced by recursion, which we already discussed in the previous chapter. Let’s rewrite our little example above with recursion: sqrt_newton_recur <- function(a, init, eps = 0.01){ if(abs(init**2 - a) < eps){ result <- init } else { init <- 1/2 * (init + a/init) result <- sqrt_newton_recur(a, init, eps) } result } sqrt_newton_recur(16, 2) ## [1] 4.00122 R is not a pure functional programming language though, so we can still use loops (be it while or for loops) in the bodies of our functions. As discussed in the previous chapter, it is actually better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized. I won’t got into the details of what tail-call optimization is but just remember that if performance is important a loop will be faster. However, sometimes, it is easier to write a function using recursion. I personally tend to avoid loops if performance is not important, because I find that code that avoids loops is easier to read and debug. However, knowing that you can use loops is reassuring, and encapsulating loops inside functions gives you the benefits of both using functions, and loops. In the coming sections I will show you some built-in functions that make it possible to avoid writing loops and that don’t rely on recursion, so performance won’t be penalized. 8.2 Properties of functions Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way. For example, the following function: increment <- function(x){ x + 1 } Is a referential transparent function. We always get the same result for any x that we give to this function. This: increment(10) ## [1] 11 will always produce 11. 
However, this one: increment_opaque <- function(x){ x + spam } is not a referentially transparent function, because its value depends on the global variable spam. spam <- 1 increment_opaque(10) ## [1] 11 will produce 11 if spam = 1. But what if spam = 19? spam <- 19 increment_opaque(10) ## [1] 29 To make increment_opaque() a referentially transparent function, it is enough to make spam an argument: increment_not_opaque <- function(x, spam){ x + spam } Now even if there is a global variable called spam, this will not influence our function: spam <- 19 increment_not_opaque(10, 34) ## [1] 44 This is because the spam used inside the function is now an argument, and thus a local variable. It could have been called anything else, really. Avoiding opaque functions makes our life easier. Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program. For example this function (which is not a referentially transparent function): count_iter <- 0 sqrt_newton_side_effect <- function(a, init, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the } # RHS value to a variable inside the global environment init } If you look in the environment pane, you will see that count_iter equals 0. Now call this function with the following arguments: sqrt_newton_side_effect(16000, 2) ## [1] 126.4911 print(count_iter) ## [1] 9 If you check the value of count_iter now, you will see that it increased! This is a side effect, because the function changed something outside of its scope. It changed a value in the global environment. In general, it is good practice to avoid side-effects. For example, we could make the above function not have any side effects like this: sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) count_iter <- count_iter + 1 } c(init, count_iter) } Now, this function returns a vector with two elements, the result, and the number of iterations it took to get the result: sqrt_newton_count(16000, 2) ## [1] 126.4911 9.0000 Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided since you want to write to disk. Just remember: try to avoid having functions changing variables in the global environment unless you have a very good reason for doing so. Very long scripts that don’t use functions and use a lot of global variables with loops changing the values of global variables are a nightmare to debug. If something goes wrong, it might be very difficult to pinpoint where the problem is. Is there an error in one of the loops? Is your code running for a particular value of a particular variable in the global environment, but not for other values? Which values? And of which variables? It can be very difficult to know what is wrong with such a script. With functional programming, you can avoid a lot of this pain for free (well not entirely for free, it still requires some effort, since R is not a pure functional language). Writing functions also makes it easier to parallelize your code. We are going to learn about that later in this chapter too. Finally, another property of mathematical functions is that they do one single thing. Functional programming purists also program their functions to do one single task.
This has benefits, but can complicate things. The function we wrote previously does two things: it computes the square root of a number and also returns the number of iterations it took to compute the result. However, this is not a bad thing; the function is doing two tasks, but these tasks are related to each other and it makes sense to have them together. My piece of advice: avoid having functions that do many unrelated things. This makes debugging harder. In conclusion: you should strive for referential transparency, try to avoid side effects unless you have a good reason to have them, and try to keep your functions short and doing as few tasks as possible. This makes testing and debugging easier, as you will see in the next chapter, but also improves readability and maintainability of your code. 8.3 Functional programming with {purrr} I mentioned it several times already, but R is not a pure functional programming language. It is possible to write R code using the functional programming paradigm, but some effort is required. The {purrr} package extends R’s base functional programming capabilities with some very interesting functions. We have already seen map() and reduce(), which we are going to see in more detail now. Then, we are going to learn about some other functions included in {purrr} that make functional programming easier in R. 8.3.1 Doing away with loops: the map*() family of functions Instead of using loops, pure functional programming languages use functions that achieve the same result. These functions are often called Map or Reduce (also called Fold). R comes with the *apply() family of functions (which are implementations of Map), as well as Reduce() for functional programming. Within this family, you can find lapply(), sapply(), vapply(), tapply(), mapply(), rapply(), eapply() and apply() (I might have forgotten one or the other, but that’s not important). Each version of an *apply() function has a different purpose, but it is not very easy to remember which does what exactly. To add even more confusion, the arguments are sometimes different between each of these. In the {purrr} package, these functions are replaced by the map*() family of functions. As you will shortly see, they are very consistent, and thus easier to use. The first part of these functions’ names all start with map_ and the second part tells you what this function is going to return. For example, if you want doubles out, you would use map_dbl(). If you are working on data frames and want a data frame back, you would use map_df(). Let’s start with the basic map() function. The following gif (source: Wikipedia) illustrates what map() does fairly well: \\(X\\) is a vector composed of the following scalars: \\((0, 5, 8, 3, 2, 1)\\). The function we want to map to each element of \\(X\\) is \\(f(x) = x + 1\\). \\(X'\\) is the result of this operation.
Using R, we would do the following: library("purrr") numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- map(numbers, plus_one) my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 Using a loop, you would write: numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- vector("list", 6) for(number in seq_along(numbers)){ my_results[[number]] <- plus_one(numbers[number]) } my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 Now I don’t know about you, but I prefer the first option. Using functional programming, you don’t need to create an empty list to hold your results, and the code is more concise. Plus, it is less error prone. I had to try several times to get the loop right (and I’ve been using R for almost 10 years now). Why? Well, first of all I used %in% instead of in. Then, I forgot about seq_along(). After that, I made a typo, plos_one() instead of plus_one() (ok, that one is unrelated to the loop). Let’s also see how this works using base R: numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- lapply(numbers, plus_one) my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 So what is the added value of using {purrr}, you might ask. Well, imagine that instead of a list, I need an atomic vector of numerics. This is fairly easy with {purrr}: library("purrr") numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- map_dbl(numbers, plus_one) my_results ## [1] 1 6 9 4 3 2 We’re going to discuss these functions below, but know that in base R, outputting something else involves more effort. Let’s go back to our sqrt_newton() function. This function has more than one parameter. Often, we would like to map functions with more than one parameter to a list, while holding constant some of the function’s parameters. This is easily achieved like so: library("purrr") numbers <- c(7, 8, 19, 64) map(numbers, sqrt_newton, init = 1) ## [[1]] ## [1] 2.645767 ## ## [[2]] ## [1] 2.828469 ## ## [[3]] ## [1] 4.358902 ## ## [[4]] ## [1] 8.000002 It is also possible to use a formula: library("purrr") numbers <- c(7, 8, 19, 64) map(numbers, ~sqrt_newton(., init = 1)) ## [[1]] ## [1] 2.645767 ## ## [[2]] ## [1] 2.828469 ## ## [[3]] ## [1] 4.358902 ## ## [[4]] ## [1] 8.000002 Another function that is similar to map() is rerun().
You guessed it, this one simply reruns an expression: rerun(10, "hello") ## [[1]] ## [1] "hello" ## ## [[2]] ## [1] "hello" ## ## [[3]] ## [1] "hello" ## ## [[4]] ## [1] "hello" ## ## [[5]] ## [1] "hello" ## ## [[6]] ## [1] "hello" ## ## [[7]] ## [1] "hello" ## ## [[8]] ## [1] "hello" ## ## [[9]] ## [1] "hello" ## ## [[10]] ## [1] "hello" rerun() simply runs an expression (which can be arbitrarily complex) n times, whereas map() maps a function to a list of inputs, so to achieve the same with map(), you need to map the print() function to a vector of characters: map(rep("hello", 10), print) ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [[1]] ## [1] "hello" ## ## [[2]] ## [1] "hello" ## ## [[3]] ## [1] "hello" ## ## [[4]] ## [1] "hello" ## ## [[5]] ## [1] "hello" ## ## [[6]] ## [1] "hello" ## ## [[7]] ## [1] "hello" ## ## [[8]] ## [1] "hello" ## ## [[9]] ## [1] "hello" ## ## [[10]] ## [1] "hello" rep() is a function that creates a vector by repeating something, in this case the string “hello”, as many times as needed, here 10. The output here is a bit different than before though, because first you will see “hello” printed 10 times and then the list where each element is “hello”. This is because the print() function has a side effect, which is, well, printing to the console. We see this side effect 10 times, plus then the list created with map(). rerun() is useful if you want to run simulations. For instance, let’s suppose that I perform a simulation where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance: mean_var_throws <- function(n){ throws <- sample(1:6, n, replace = TRUE) mean_throws <- mean(throws) var_throws <- var(throws) tibble::tribble(~mean_throws, ~var_throws, mean_throws, var_throws) } mean_var_throws(5) ## # A tibble: 1 × 2 ## mean_throws var_throws ## <dbl> <dbl> ## 1 2.2 1.7 mean_var_throws() returns a tibble object with the mean and the variance of the points. Now suppose I want to compute the expected value of the distribution of throwing dice. We know from theory that it should be equal to \\(3.5 (= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6)\\). Let’s rerun the simulation 50 times: simulations <- rerun(50, mean_var_throws(5)) Let’s see what the simulations object is made of: str(simulations) ## List of 50 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2 ## ..$ var_throws : num 3 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 0.2 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 0.7 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 1.7 ..... simulations is a list of 50 data frames. We can easily combine them into a single data frame, and compute the mean of the means, which should return something close to the expected value of 3.5: bind_rows(simulations) %>% summarise(expected_value = mean(mean_throws)) ## # A tibble: 1 × 1 ## expected_value ## <dbl> ## 1 3.44 Pretty close!
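As an aside, the rerun()-then-bind_rows() combination can also be written with map_df(), which maps and row-binds in one step. A sketch, mapping over a dummy index that the formula simply ignores:

# Each of the 50 iterations ignores its index and just calls mean_var_throws(5)
simulations_df <- map_df(seq(1, 50), ~mean_var_throws(5))
simulations_df %>%
  summarise(expected_value = mean(mean_throws))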
Now of course, one could have simply done something like this: mean(sample(1:6, 1000, replace = TRUE)) ## [1] 3.481 but the point was to illustrate that rerun() can run any arbitrarily complex expression, and that it is good practice to put the result in a data frame or list, for easier further manipulation. You now know the standard map() function, and also rerun(), which both return lists, but there are a number of variants of map(). map_dbl() returns an atomic vector of doubles, as we’ve seen before. A little reminder below: map_dbl(numbers, sqrt_newton, init = 1) ## [1] 2.645767 2.828469 4.358902 8.000002 In a similar fashion, map_chr() returns an atomic vector of strings: map_chr(numbers, sqrt_newton, init = 1) ## [1] "2.645767" "2.828469" "4.358902" "8.000002" map_lgl() returns an atomic vector of TRUE or FALSE: divisible <- function(x, y){ if_else(x %% y == 0, TRUE, FALSE) } map_lgl(seq(1:100), divisible, 3) ## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [25] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [37] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [49] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [61] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [73] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [85] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [97] FALSE FALSE TRUE FALSE There are also other interesting variants, such as map_if(): a <- seq(1,10) map_if(a, (function(x) divisible(x, 2)), sqrt) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ## ## [[3]] ## [1] 3 ## ## [[4]] ## [1] 2 ## ## [[5]] ## [1] 5 ## ## [[6]] ## [1] 2.44949 ## ## [[7]] ## [1] 7 ## ## [[8]] ## [1] 2.828427 ## ## [[9]] ## [1] 9 ## ## [[10]] ## [1] 3.162278 I used map_if() to take the square root of only those numbers in vector a that are divisible by 2, by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()). map_at() is similar to map_if() but maps the function at positions specified by the user: map_at(numbers, c(1, 3), sqrt) ## [[1]] ## [1] 2.645751 ## ## [[2]] ## [1] 8 ## ## [[3]] ## [1] 4.358899 ## ## [[4]] ## [1] 64 or if you have a named list: recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10) map_at(recipe, "bacon", `*`, 2) ## $spam ## [1] 1 ## ## $eggs ## [1] 3 ## ## $bacon ## [1] 20 I used map_at() to double the quantity of bacon in the recipe (by using the `*` function and specifying its second argument, 2; try the following at the command prompt: `*`(3, 4)). map2() is the equivalent of mapply() and pmap() is the generalisation of map2() for more than 2 arguments: print(a) ## [1] 1 2 3 4 5 6 7 8 9 10 b <- seq(1, 2, length.out = 10) print(b) ## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778 ## [9] 1.888889 2.000000 map2(a, b, `*`) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 2.222222 ## ## [[3]] ## [1] 3.666667 ## ## [[4]] ## [1] 5.333333 ## ## [[5]] ## [1] 7.222222 ## ## [[6]] ## [1] 9.333333 ## ## [[7]] ## [1] 11.66667 ## ## [[8]] ## [1] 14.22222 ## ## [[9]] ## [1] 17 ## ## [[10]] ## [1] 20 Each element of a gets multiplied by the element of b that is in the same position. Let’s see what pmap() does. Can you guess from the code below what is going on?
I will print a and b again for clarity: a ## [1] 1 2 3 4 5 6 7 8 9 10 b ## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778 ## [9] 1.888889 2.000000 n <- seq(1:10) pmap(list(a, b, n), rnorm) ## [[1]] ## [1] -0.1758315 ## ## [[2]] ## [1] -0.2162863 1.1033912 ## ## [[3]] ## [1] 4.5731231 -0.3743379 6.8130737 ## ## [[4]] ## [1] 0.8933089 4.1930837 7.5276030 -2.3575522 ## ## [[5]] ## [1] 2.1814981 -1.7455750 5.0548288 2.7848458 0.9230675 ## ## [[6]] ## [1] 2.806217 5.667499 -5.032922 6.741065 -2.757928 12.414101 ## ## [[7]] ## [1] -3.314145 -7.912019 -3.865292 4.307842 18.022049 1.278158 1.083208 ## ## [[8]] ## [1] 6.2629161 2.1213552 0.3543566 2.1041606 -0.2643654 8.7600450 3.3616206 ## [8] -7.7446668 ## ## [[9]] ## [1] -7.609538 5.472267 -4.869374 -11.943063 4.707929 -7.730088 13.431771 ## [8] 1.606800 -6.578745 ## ## [[10]] ## [1] -9.101480 4.404571 -16.071437 1.110689 7.168097 15.848579 ## [7] 16.710863 1.998482 -17.856521 -2.021087 Let’s take a closer look at what a, b and n look like, when they are placed next to each other: cbind(a, b, n) ## a b n ## [1,] 1 1.000000 1 ## [2,] 2 1.111111 2 ## [3,] 3 1.222222 3 ## [4,] 4 1.333333 4 ## [5,] 5 1.444444 5 ## [6,] 6 1.555556 6 ## [7,] 7 1.666667 7 ## [8,] 8 1.777778 8 ## [9,] 9 1.888889 9 ## [10,] 10 2.000000 10 rnorm() gets first called with the parameters from the first line, meaning rnorm(a[1], b[1], n[1]). The second time rnorm() gets called, you guessed it, it is called with the parameters on the second line of the array above, rnorm(a[2], b[2], n[2]), etc. There are other functions in the map() family of functions, but we will discover them in the exercises! The map() family of functions does not have any more secrets for you. Let’s now take a look at the reduce() family of functions. 8.3.2 Reducing with purrr Reducing is another important concept in functional programming. It allows you to go from a list of elements to a single element, by somehow combining the elements into one. For instance, using the base R Reduce() function, you can sum the elements of a list like so: Reduce(`+`, seq(1:100)) ## [1] 5050 using purrr::reduce(), this becomes: reduce(seq(1:100), `+`) ## [1] 5050 If you don’t really get what is happening, don’t worry. Things should get clearer once I introduce another version of reduce(), called accumulate(), below. Sometimes, the direction from which we start to reduce is quite important. You can “start from the end” of the list by using the .dir argument: reduce(seq(1:100), `+`, .dir = "backward") ## [1] 5050 Of course, for commutative operations, direction does not matter. But it does matter for non-commutative operations: reduce(seq(1:100), `-`) ## [1] -5048 reduce(seq(1:100), `-`, .dir = "backward") ## [1] -50 Let’s now take a look at accumulate(). accumulate() is very similar to reduce(), but keeps the intermediary results. Which intermediary results? Let’s try and see what happens: a <- seq(1, 10) accumulate(a, `-`) ## [1] 1 -1 -4 -8 -13 -19 -26 -34 -43 -53 accumulate() illustrates pretty well what is happening; the first element, 1, is simply the first element of seq(1, 10). The second element of the result, however, is the difference between 1 and 2, -1. The next element in a is 3. Thus the next result is -1-3, -4, and so on until we run out of elements in a.
The below illustration shows the algorithm step-by-step: (1-2-3-4-5-6-7-8-9-10) ((1)-2-3-4-5-6-7-8-9-10) ((1-2)-3-4-5-6-7-8-9-10) ((-1-3)-4-5-6-7-8-9-10) ((-4-4)-5-6-7-8-9-10) ((-8-5)-6-7-8-9-10) ((-13-6)-7-8-9-10) ((-19-7)-8-9-10) ((-26-8)-9-10) ((-34-9)-10) (-43-10) -53 reduce() only shows the final result of all these operations. accumulate() and reduce() also have an .init argument, which makes it possible to start the reducing procedure from an initial value that is different from the first element of the vector: reduce(a, `+`, .init = 1000) accumulate(a, `-`, .init = 1000, .dir = "backward") ## [1] 1055 ## [1] 995 -994 996 -993 997 -992 998 -991 999 -990 1000 reduce() generalizes functions that only take two arguments. If you were to write a function that returns the minimum between two numbers: my_min <- function(a, b){ if(a < b){ return(a) } else { return(b) } } You could use reduce() to get the minimum of a list of numbers: numbers2 <- c(3, 1, -8, 9) reduce(numbers2, my_min) ## [1] -8 map() and reduce() are arguably the most useful higher-order functions, and perhaps also the most famous ones, true ambassadors of functional programming. You might have read about MapReduce, a programming model for processing big data in parallel. The way MapReduce works is inspired by both these map() and reduce() functions, which are always included in functional programming languages. This illustrates that the functional programming paradigm is very well suited to parallel computing. Something else that is very important to understand at this point: up until now, we only used these functions on lists, or atomic vectors, of numbers. However, map() and reduce(), and other higher-order functions for that matter, do not care about the contents of the list. What these functions do is take another function and make it do something to the elements of the list. It does not matter if it’s a list of numbers, of characters, of data frames, even of models. All that matters is that the function that will be applied to these elements can operate on them. So if you have a list of fitted models, you can map summary() on this list to get summaries of each model. Or if you have a list of data frames, you can map a function that performs several cleaning steps. This will be explored in a future section, but it is important to keep this in mind. 8.3.3 Error handling with safely() and possibly() safely() and possibly() are very useful functions. Consider the following situation: a <- list("a", 4, 5) sqrt(a) Error in sqrt(a) : non-numeric argument to mathematical function Using map() or Map() will result in a similar error. safely() is a higher-order function that takes one function as an argument and executes it… safely, meaning the execution of the function will not stop if there is an error. The error message gets captured alongside valid results.
a <- list("a", 4, 5) safe_sqrt <- safely(sqrt) map(a, safe_sqrt) ## [[1]] ## [[1]]$result ## NULL ## ## [[1]]$error ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## ## [[2]] ## [[2]]$result ## [1] 2 ## ## [[2]]$error ## NULL ## ## ## [[3]] ## [[3]]$result ## [1] 2.236068 ## ## [[3]]$error ## NULL possibly() works similarly, but also allows you to specify a return value in case of an error: possible_sqrt <- possibly(sqrt, otherwise = NA_real_) map(a, possible_sqrt) ## [[1]] ## [1] NA ## ## [[2]] ## [1] 2 ## ## [[3]] ## [1] 2.236068 Of course, in this particular example, the same effect could be obtained way more easily: sqrt(as.numeric(a)) ## Warning: NAs introduced by coercion ## [1] NA 2.000000 2.236068 However, in some situations, this trick does not work as intended (or at all). possibly() and safely() allow the programmer to model errors explicitly, and to then provide a consistent way of dealing with them. For instance, consider the following example: data(mtcars) write.csv(mtcars, "my_data/mtcars.csv") Error in file(file, ifelse(append, "a", "w")) : cannot open the connection In addition: Warning message: In file(file, ifelse(append, "a", "w")) : cannot open file 'my_data/mtcars.csv': No such file or directory The folder path/to/save/ does not exist, and as such this code produces an error. You might want to catch this error, and create the directory for instance: possibly_write.csv <- possibly(write.csv, otherwise = NULL) if(is.null(possibly_write.csv(mtcars, "my_data/mtcars.csv"))) { print("Creating folder...") dir.create("my_data/") print("Saving file...") write.csv(mtcars, "my_data/mtcars.csv") } [1] "Creating folder..." [1] "Saving file..." Warning message: In file(file, ifelse(append, "a", "w")) : cannot open file 'my_data/mtcars.csv': No such file or directory The warning message comes from the first time we try to write the .csv, inside the if statement. Because this fails, we create the directory and then actually save the file. In the exercises, you’ll discover quietly(), which also captures warnings and messages. To conclude this section: remember function factories? Turns out that safely(), purely() and quietly() are function factories. 
8.3.4 Partial applications with partial() Consider the following simple function: add <- function(a, b) a+b It is possible to create a new function, where one of the parameters is fixed, for instance, where a = 10: add_to_10 <- partial(add, a = 10) add_to_10(12) ## [1] 22 This is equivalent to the following: add_to_10_2 <- function(b){ add(a = 10, b) } Using partial() is much less verbose, however, and allows you to define new functions very quickly: head10 <- partial(head, n = 10) head10(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 8.3.5 Function composition using compose Function composition is another handy tool, which makes chaining functions much more elegant: compose(sqrt, log10, exp)(10) ## [1] 2.083973 You can read this expression as sqrt() after log10() after exp(), and it is equivalent to: sqrt(log10(exp(10))) ## [1] 2.083973 It is also possible to reverse the order in which the functions get called using the .dir option: compose(sqrt, log10, exp, .dir = "forward")(10) ## [1] 1.648721 One could also use the %>% operator to achieve the same result: 10 %>% sqrt %>% log10 %>% exp ## [1] 1.648721 but strictly speaking, this is not function composition. 8.3.6 «Transposing lists» Another interesting function is transpose(). It is not an alternative to the function t() from base R, but it has a similar effect. transpose() works on lists. Let’s take a look at the example from before: safe_sqrt <- safely(sqrt, otherwise = NA_real_) map(a, safe_sqrt) ## [[1]] ## [[1]]$result ## [1] NA ## ## [[1]]$error ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## ## [[2]] ## [[2]]$result ## [1] 2 ## ## [[2]]$error ## NULL ## ## ## [[3]] ## [[3]]$result ## [1] 2.236068 ## ## [[3]]$error ## NULL The output is a list whose elements are lists containing a result and an error. One might want to have all the results in a single list, and all the error messages in another list. This is possible with transpose(): purrr::transpose(map(a, safe_sqrt)) ## $result ## $result[[1]] ## [1] NA ## ## $result[[2]] ## [1] 2 ## ## $result[[3]] ## [1] 2.236068 ## ## ## $error ## $error[[1]] ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## $error[[2]] ## NULL ## ## $error[[3]] ## NULL I explicitly call purrr::transpose() because there is also a data.table::transpose(), which is not the same function. You have to be careful about that sort of thing, because it can cause errors in your programs and debugging this type of error is a nightmare. Now that we are familiar with functional programming, let’s try to apply some of its principles to data manipulation.
8.4 List-based workflows for efficiency You can use your own functions in pipe workflows: double_number <- function(x){ x+x } mtcars %>% head() %>% mutate(double_mpg = double_number(mpg)) ## mpg cyl disp hp drat wt qsec vs am gear carb double_mpg ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 42.0 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 42.0 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 45.6 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 42.8 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 37.4 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 36.2 It is important to understand that your functions, functions that are built into R, and functions that come from packages are exactly the same thing. Every function is a first-class object in R, no matter where it comes from. The consequence of functions being first-class objects is that functions can take functions as arguments, functions can return functions (the function factories from the previous chapter) and functions can be assigned to any variable: plop <- sqrt plop(4) ## [1] 2 bacon <- function(.f){ message("Bacon is tasty") .f } bacon(sqrt) # `bacon` is a function factory, as it returns a function (alongside an informative message) ## Bacon is tasty ## function (x) .Primitive("sqrt") # To actually call it: bacon(sqrt)(4) ## Bacon is tasty ## [1] 2 Now, let’s step back for a bit and think about what we learned up until now, and especially the map() family of functions. Let’s read the list of datasets from the previous chapter: paths <- Sys.glob("datasets/unemployment/*.csv") all_datasets <- import_list(paths) str(all_datasets) ## List of 4 ## $ unemp_2013:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ... ## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ... ## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ... ## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ... ## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ... ## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ... ## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ... ## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ... ## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ... ## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ... ## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ... ## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ... ## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ... ## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ... ## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ... ## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ... ## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" all_datasets is a list with 4 elements, each of which is a data.frame. The first thing we are going to do is use a function to clean the names of the datasets. These names are not very easy to work with; there are spaces, and it would be better if the names of the columns were all lowercase. For this we are going to use the function clean_names() from the janitor package. For a single dataset, I would write this: library(janitor) one_dataset <- one_dataset %>% clean_names() and I would get a dataset with column names in lowercase and spaces replaced by _ (and other corrections). How can I apply, or map, this function to each dataset in the list? To do this I need to use purrr::map(), which we’ve seen in the previous section: library(purrr) all_datasets <- all_datasets %>% map(clean_names) all_datasets %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## ..$ of_which_wage_earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## ..$ of_which_non_wage_earners : int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ... ## ..$ unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ... ## ..$ active_population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.95 5.67 6.27 2.88 4.92 ... ## ..$ year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of_which_wage_earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ... ## ..$ of_which_non_wage_earners : int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ... ## ..$ unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ... ## ..$ active_population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.81 5.54 6.46 2.2 4.2 ... ## ..$ year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ... ## ..$ of_which_wage_earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ... ## ..$ of_which_non_wage_earners : int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ... ## ..$ unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ... ## ..$ active_population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.46 5.12 5.62 3.23 4.73 ... ## ..$ year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ... ## ..$ of_which_wage_earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ... ## ..$ of_which_non_wage_earners : int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ... ## ..$ unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ... ## ..$ active_population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.15 5.04 4.84 3.01 4.3 ... ## ..$ year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" Remember that map(list, function) simply applies function to each element of list. So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this: one_dataset %>% filter(unemployment_rate_in_percent < 3) but since we’re dealing with a list of datasets, we cannot simply use filter() on it. This is because filter() expects a data frame, not a list of data frames. The way around this is to use map().
all_datasets %>% map(~filter(., unemployment_rate_in_percent < 3)) ## $unemp_2013 ## commune total_employed_population of_which_wage_earners ## 1 Garnich 844 750 ## 2 Leudelange 1064 937 ## 3 Bech 526 463 ## of_which_non_wage_earners unemployed active_population ## 1 94 25 869 ## 2 127 32 1096 ## 3 63 16 542 ## unemployment_rate_in_percent year ## 1 2.876870 2013 ## 2 2.919708 2013 ## 3 2.952030 2013 ## ## $unemp_2014 ## commune total_employed_population of_which_wage_earners ## 1 Garnich 845 757 ## 2 Leudelange 1102 965 ## 3 Bech 543 476 ## 4 Flaxweiler 879 789 ## of_which_non_wage_earners unemployed active_population ## 1 88 19 864 ## 2 137 34 1136 ## 3 67 15 558 ## 4 90 27 906 ## unemployment_rate_in_percent year ## 1 2.199074 2014 ## 2 2.992958 2014 ## 3 2.688172 2014 ## 4 2.980132 2014 ## ## $unemp_2015 ## commune total_employed_population of_which_wage_earners ## 1 Bech 520 450 ## 2 Bous 750 680 ## of_which_non_wage_earners unemployed active_population ## 1 70 14 534 ## 2 70 22 772 ## unemployment_rate_in_percent year ## 1 2.621723 2015 ## 2 2.849741 2015 ## ## $unemp_2016 ## commune total_employed_population of_which_wage_earners ## 1 Reckange-sur-Mess 980 850 ## 2 Bech 520 450 ## 3 Betzdorf 1500 1350 ## 4 Flaxweiler 910 820 ## of_which_non_wage_earners unemployed active_population ## 1 130 30 1010 ## 2 70 11 531 ## 3 150 45 1545 ## 4 90 24 934 ## unemployment_rate_in_percent year ## 1 2.970297 2016 ## 2 2.071563 2016 ## 3 2.912621 2016 ## 4 2.569593 2016 map() needs a function to map to each element of the list. all_datasets is the list to which I want to map the function. But what function? filter() is the function I need, so why doesn’t: all_datasets %>% map(filter(unemployment_rate_in_percent < 3)) work? This is what happens if we try it: Error in filter(unemployment_rate_in_percent < 3) : object 'unemployment_rate_in_percent' not found This is because filter() needs both the dataset, and a so-called predicate (a predicate is an expression that evaluates to TRUE or FALSE). You need to make explicit which is the dataset and which is the predicate, because here, filter() thinks that unemployment_rate_in_percent is the dataset. The way to do this is to use an anonymous function (discussed in Chapter 7), which allows you to explicitly state which is the dataset and which is the predicate. As we’ve seen, there are three ways to define anonymous functions: Using a formula (only works within {tidyverse} functions): all_datasets %>% map(~filter(., unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs.
of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" (notice the . in the formula, which makes explicit that the dataset is the first argument to filter()) or using an anonymous function (using the function(x) keyword): all_datasets %>% map(function(x)filter(x, unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs.
of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" or, since R 4.1, using the shorthand \(x): all_datasets %>% map(\(x)filter(x, unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" As you see, everything is starting to come together: lists, to hold complex objects, over which anonymous functions are mapped using higher-order functions. Let’s continue cleaning this dataset. Before merging these datasets together, we would need them to have a year column indicating the year the data was measured in each data frame. It would also be helpful if we gave names to these datasets, meaning converting the list to a named list.
For this task, we can use purrr::set_names(): all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016))) Let’s take a look at the list now: str(all_datasets) As you can see, each data.frame object contained in the list has been renamed. You can thus access them with the $ operator. Using map() we now know how to apply a function to each dataset of a list. But maybe it would be easier to merge all the datasets first, and then manipulate them? This can be the case sometimes, but not always. As long as you provide a function and a list of elements to reduce(), you will get a single output. So how could reduce() help us with merging all the datasets that are in the list? dplyr comes with a lot of functions to merge two datasets. Remember that I said before that reduce() allows you to generalize a function of two arguments? Let’s try it with our list of datasets: unemp_lux <- reduce(all_datasets, full_join) ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", ## "active_population", "unemployment_rate_in_percent", "year") ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", ## "active_population", "unemployment_rate_in_percent", "year") ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", ## "active_population", "unemployment_rate_in_percent", "year") glimpse(unemp_lux) ## Rows: 472 ## Columns: 8 ## $ commune <chr> "Grand-Duche de Luxembourg", "Canton Cape… ## $ total_employed_population <int> 223407, 17802, 1703, 844, 1431, 4094, 214… ## $ of_which_wage_earners <int> 203535, 15993, 1535, 750, 1315, 3800, 187… ## $ of_which_non_wage_earners <int> 19872, 1809, 168, 94, 116, 294, 272, 113,… ## $ unemployed <int> 19287, 1071, 114, 25, 74, 261, 98, 45, 66… ## $ active_population <int> 242694, 18873, 1817, 869, 1505, 4355, 224… ## $ unemployment_rate_in_percent <dbl> 7.947044, 5.674773, 6.274078, 2.876870, 4… ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,… full_join() is one of the dplyr functions that merge data. There are others that might be useful depending on the kind of join operation you need. Let’s write this data to disk as we’re going to keep using it for the next chapters: export(unemp_lux, "datasets/unemp_lux.csv") 8.4.1 Functional programming and plotting In this section, we are going to learn how to use the possibilities offered by the purrr package and how it can work together with ggplot2 to generate many plots. This is a more advanced topic, but what comes next is also what makes R, and the functional programming paradigm, so powerful. For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot per commune: unemp_lux_data %>% filter(division == "Luxembourg") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") + geom_line() and then you would write the same for “Esch-sur-Alzette” and also for “Wiltz”.
If you only have to make these 3 plots, copy and pasting the above lines is no big deal: unemp_lux_data %>% filter(division == "Esch-sur-Alzette") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") + geom_line() unemp_lux_data %>% filter(division == "Wiltz") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") + geom_line() But copy and pasting is error prone. Can you spot the copy-paste mistake I made? And what if you have to create the above plots for all 108 Luxembourgish communes? That’s a lot of copy pasting. What if, once you are done copy pasting, you have to change something, for example, the theme? You could use the search and replace function of RStudio, true, but sometimes search and replace can also introduce bugs and typos. You can avoid all these issues by using purrr::map(). What do you need to map over? The commune names. So let’s create a list of commune names: communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz") Now we can create the graphs using map(), or map2() to be exact: plots_tibble <- unemp_lux_data %>% filter(division %in% communes) %>% group_by(division) %>% nest() %>% mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) + theme_minimal() + geom_line(aes(year, unemployment_rate_in_percent, group = 1)) + labs(title = paste("Unemployment in", .y)))) Let’s study this line by line: the first line is easy, we simply use filter() to keep only the communes we are interested in. Then we group by division and use tidyr::nest(). As a refresher, let’s take a look at what this does: unemp_lux_data %>% filter(division %in% communes) %>% group_by(division) %>% nest() ## # A tibble: 3 × 2 ## # Groups: division [3] ## division data ## <chr> <list> ## 1 Esch-sur-Alzette <tibble [15 × 7]> ## 2 Luxembourg <tibble [15 × 7]> ## 3 Wiltz <tibble [15 × 7]> This creates a tibble with two columns, division and data, where the data for each individual (or commune, in this case) is another tibble with all the original variables. This is very useful, because now we can pass these tibbles to map2(), to generate the plots. But why map2() and what’s the difference with map()? map2() works the same way as map(), but maps over two inputs: numbers1 <- list(1, 2, 3, 4, 5) numbers2 <- list(9, 8, 7, 6, 5) map2(numbers1, numbers2, `*`) ## [[1]] ## [1] 9 ## ## [[2]] ## [1] 16 ## ## [[3]] ## [1] 21 ## ## [[4]] ## [1] 24 ## ## [[5]] ## [1] 25 In our example with the graphs, the two inputs are the data, and the names of the communes. This is useful to create the title with labs(title = paste("Unemployment in", .y)), where .y is the second input of map2(), the commune names contained in variable division. So what happened? We now have a tibble called plots_tibble that looks like this: print(plots_tibble) ## # A tibble: 3 × 3 ## # Groups: division [3] ## division data plot ## <chr> <list> <list> ## 1 Esch-sur-Alzette <tibble [15 × 7]> <gg> ## 2 Luxembourg <tibble [15 × 7]> <gg> ## 3 Wiltz <tibble [15 × 7]> <gg> This tibble contains three columns, division, data and now a new one called plot, that we created before using the last line mutate(plot = ...) (remember that mutate() adds columns to tibbles). plot is a list-column, with elements… being plots! Yes you read that right, the elements of the column plot are literally plots. This is what I meant by list-columns.
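If list-columns still feel abstract, here is a minimal sketch (with made-up data, not from the book’s datasets) showing that a tibble column can hold arbitrary objects, as long as the column is a list: library(tibble) tibble(id = 1:3, stuff = list(1:2, letters[1:3], mtcars)) # each element of stuff is a different kind of object: an integer vector, a character vector and a whole data frame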
Let’s see what is inside the data and the plot columns exactly: plots_tibble %>% pull(data) ## [[1]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 11.3 665 10.1 10.8 561 4.95 ## 2 2002 11.7 677 10.3 11.0 696 5.96 ## 3 2003 11.7 674 10.2 10.9 813 6.94 ## 4 2004 12.2 659 10.6 11.3 899 7.38 ## 5 2005 11.9 654 10.3 11.0 952 7.97 ## 6 2006 12.2 657 10.5 11.2 1.07 8.71 ## 7 2007 12.6 634 10.9 11.5 1.03 8.21 ## 8 2008 12.9 638 11.0 11.6 1.28 9.92 ## 9 2009 13.2 652 11.0 11.7 1.58 11.9 ## 10 2010 13.6 638 11.2 11.8 1.73 12.8 ## 11 2011 13.9 630 11.5 12.1 1.77 12.8 ## 12 2012 14.3 684 11.8 12.5 1.83 12.8 ## 13 2013 14.8 694 12.0 12.7 2.05 13.9 ## 14 2014 15.2 703 12.5 13.2 2.00 13.2 ## 15 2015 15.3 710 12.6 13.3 2.03 13.2 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent ## ## [[2]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 34.4 2.89 30.4 33.2 1.14 3.32 ## 2 2002 34.8 2.94 30.3 33.2 1.56 4.5 ## 3 2003 35.2 3.03 30.1 33.2 2.04 5.78 ## 4 2004 35.6 3.06 30.1 33.2 2.39 6.73 ## 5 2005 35.6 3.13 29.8 33.0 2.64 7.42 ## 6 2006 35.5 3.12 30.3 33.4 2.03 5.72 ## 7 2007 36.1 3.25 31.1 34.4 1.76 4.87 ## 8 2008 37.5 3.39 31.9 35.3 2.23 5.95 ## 9 2009 37.9 3.49 31.6 35.1 2.85 7.51 ## 10 2010 38.6 3.54 32.1 35.7 2.96 7.66 ## 11 2011 40.3 3.66 33.6 37.2 3.11 7.72 ## 12 2012 41.8 3.81 34.6 38.4 3.37 8.07 ## 13 2013 43.4 3.98 35.5 39.5 3.86 8.89 ## 14 2014 44.6 4.11 36.7 40.8 3.84 8.6 ## 15 2015 45.2 4.14 37.5 41.6 3.57 7.9 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent ## ## [[3]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 2.13 223 1.79 2.01 122 5.73 ## 2 2002 2.14 220 1.78 2.00 134 6.27 ## 3 2003 2.18 223 1.79 2.02 163 7.48 ## 4 2004 2.24 227 1.85 2.08 156 6.97 ## 5 2005 2.26 229 1.85 2.08 187 8.26 ## 6 2006 2.20 206 1.82 2.02 181 8.22 ## 7 2007 2.27 198 1.88 2.08 197 8.67 ## 8 2008 2.30 200 1.90 2.10 201 8.75 ## 9 2009 2.36 201 1.94 2.15 216 9.14 ## 10 2010 2.42 195 1.97 2.17 256 10.6 ## 11 2011 2.48 190 2.02 2.21 269 10.9 ## 12 2012 2.59 188 2.10 2.29 301 11.6 ## 13 2013 2.66 195 2.15 2.34 318 12.0 ## 14 2014 2.69 185 2.19 2.38 315 11.7 ## 15 2015 2.77 180 2.27 2.45 321 11.6 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent each element of data is a tibble for the specific commune, with the original columns (year, active_population, etc.). But obviously, there is no division column. So to plot the data, and connect all the dots together, we need to add group = 1 in the call to ggplot() (whereas if you plot multiple lines in the same graph, you need to write group = division). But more interestingly, how can you actually see the plots?
If you want to simply look at them, it is enough to use pull(): plots_tibble %>% pull(plot) ## [[1]] ## ## [[2]] ## ## [[3]] And if we want to save these plots, we can do so using map2(): map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave) Saving 7 x 5 in image Saving 6.01 x 3.94 in image Saving 6.01 x 3.94 in image This was probably the most advanced topic we have studied yet, but you probably agree with me that it is among the most useful ones. This section is a perfect illustration of the power of functional programming; you can mix and match functions as long as you give them the correct arguments. You can pass data to functions that use data and then pass these functions to other functions that use functions as arguments, such as map().7 map() does not care if the function you pass to it produces tables, graphs or even another function. map() will simply map this function to a list of inputs, and as long as these inputs are correct arguments to the function, map() will do its magic. If you combine this with list-columns, you can even use map() alongside dplyr functions and map your function by first grouping, filtering, etc… 8.4.2 Modeling with functional programming As written just above, map() simply applies a function to a list of inputs, and in the previous section we mapped ggplot() to generate many plots at once. This approach can also be used to map any modeling function, for instance lm(), to a list of datasets. For instance, suppose that you wish to perform a Monte Carlo simulation. Suppose that you are dealing with a binary choice problem; usually, you would use a logistic regression for this. However, in certain disciplines, especially in the social sciences, the so-called Linear Probability Model is often used as well. The LPM is a simple linear regression, but unlike the standard setting of a linear regression, the dependent variable, or target, is a binary variable, and not a continuous variable. Before you yell “Wait, that’s illegal”, you should know that in practice LPMs do a good job of estimating marginal effects, which is what social scientists and econometricians are often interested in. Marginal effects are another way of interpreting models, describing how the outcome (or target) changes given a change in an independent variable (or feature). For instance, a marginal effect of 0.10 for age would mean that the probability of success increases by 0.10 (10 percentage points) for each additional year of age. We already discussed marginal effects in Chapter 6. There has been a lot of discussion on logistic regression vs LPMs, and there are pros and cons of using LPMs. Micro-econometricians are still fond of LPMs, even though the pros of LPMs are not really convincing. However, quoting Angrist and Pischke: “While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little” (source: Mostly Harmless Econometrics), so LPMs are still used for estimating marginal effects. Let us check this assessment with one example. First, we simulate some data, then run a logistic regression and compute the marginal effects, and then compare with an LPM: set.seed(1234) x1 <- rnorm(100) x2 <- rnorm(100) z <- .5 + 2*x1 + 4*x2 p <- 1/(1 + exp(-z)) y <- rbinom(100, 1, p) df <- tibble(y = y, x1 = x1, x2 = x2) This data generating process generates data from a binary choice model.
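Written out in equation form (this is simply the code above restated): \(z_i = 0.5 + 2 x_{1,i} + 4 x_{2,i}\), \(p_i = \frac{1}{1 + e^{-z_i}}\) and \(y_i \sim \text{Bernoulli}(p_i)\), so the true structural parameters we will try to recover are 0.5, 2 and 4.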
Fitting the model using a logistic regression allows us to recover the structural parameters: logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit")) Let’s see a summary of the model fit: summary(logistic_regression) ## ## Call: ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.91941 -0.44872 0.00038 0.42843 2.55426 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.0960 0.3293 0.292 0.770630 ## x1 1.6625 0.4628 3.592 0.000328 *** ## x2 3.6582 0.8059 4.539 5.64e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 138.629 on 99 degrees of freedom ## Residual deviance: 60.576 on 97 degrees of freedom ## AIC: 66.576 ## ## Number of Fisher Scoring iterations: 7 We roughly recover the parameters that generated the data, but what about the marginal effects? We can get the marginal effects easily using the {margins} package: library(margins) margins(logistic_regression) ## Average marginal effects ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df) ## x1 x2 ## 0.1598 0.3516 Or, even better, we can compute the true marginal effects, since we know the data generating process: meffects <- function(dataset, coefs){ X <- dataset %>% select(-y) %>% as.matrix() dydx_x1 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[2]) dydx_x2 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[3]) tribble(~term, ~true_effect, "x1", dydx_x1, "x2", dydx_x2) } (true_meffects <- meffects(df, c(0.5, 2, 4))) ## # A tibble: 2 × 2 ## term true_effect ## <chr> <dbl> ## 1 x1 0.175 ## 2 x2 0.350 Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects? lpm <- lm(y ~ ., data = df) summary(lpm) ## ## Call: ## lm(formula = y ~ ., data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.83953 -0.31588 -0.02885 0.28774 0.77407 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.51340 0.03587 14.314 < 2e-16 *** ## x1 0.16771 0.03545 4.732 7.58e-06 *** ## x2 0.31250 0.03449 9.060 1.43e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3541 on 97 degrees of freedom ## Multiple R-squared: 0.5135, Adjusted R-squared: 0.5034 ## F-statistic: 51.18 on 2 and 97 DF, p-value: 6.693e-16 It’s not too bad; perhaps with more observations, or for a different set of structural parameters, the results of the LPM would have been even closer. The LPM estimates the marginal effect of x1 to be 0.1677134 vs 0.1597956 for the logistic regression, and for x2, the LPM estimate is 0.3124966 vs 0.351607. The true marginal effects are 0.1750963 and 0.3501926 for x1 and x2 respectively. Just as data scientists perform cross-validation to assess the accuracy of a model, a Monte Carlo study can be performed to assess how close the marginal effects estimated with an LPM are to the true marginal effects derived from the (logistic) data generating process. It will allow us to test with datasets of different sizes, generated using different structural parameters. First, let’s write a function that generates data.
The function below generates 10 datasets of size 100 (the code is inspired by this StackExchange answer): generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10){ generate_one_dataset <- function(coefs, sample_size){ x1 <- rnorm(sample_size) x2 <- rnorm(sample_size) z <- coefs[1] + coefs[2]*x1 + coefs[3]*x2 p <- 1/(1 + exp(-z)) y <- rbinom(sample_size, 1, p) tibble(y = y, x1 = x1, x2 = x2) } simulations <- rerun(.n = repeats, generate_one_dataset(coefs, sample_size)) tibble("coefs" = list(coefs), "sample_size" = sample_size, "repeats" = repeats, "simulations" = list(simulations)) } Let’s first generate one dataset: one_dataset <- generate_datasets(repeats = 1) Let’s take a look at one_dataset: one_dataset ## # A tibble: 1 × 4 ## coefs sample_size repeats simulations ## <list> <dbl> <dbl> <list> ## 1 <dbl [3]> 100 1 <list [1]> As you can see, the tibble with the simulated data is inside a list-column called simulations. Let’s take a closer look: str(one_dataset$simulations) ## List of 1 ## $ :List of 1 ## ..$ : tibble [100 × 3] (S3: tbl_df/tbl/data.frame) ## .. ..$ y : int [1:100] 0 1 1 1 0 1 1 0 0 1 ... ## .. ..$ x1: num [1:100] 0.437 1.06 0.452 0.663 -1.136 ... ## .. ..$ x2: num [1:100] -2.316 0.562 -0.784 -0.226 -1.587 ... The structure is quite complex, and it’s important to understand this, because it will have an impact on the next lines of code; it is a list, containing a list, containing a dataset! No worries though, we can still map over the datasets directly, by using modify_depth() instead of map(). Now, let’s fit an LPM and compare the estimation of the marginal effects with the true marginal effects. In order to have some confidence in our results, we will not simply run a linear regression on that single dataset, but will instead simulate hundreds, then thousands, then tens of thousands of datasets, get the marginal effects and compare them to the true ones (but here I won’t simulate more than 500 datasets). Let’s first generate 10 datasets: many_datasets <- generate_datasets() Now comes the tricky part. I have this object, many_datasets, looking like this: many_datasets ## # A tibble: 1 × 4 ## coefs sample_size repeats simulations ## <list> <dbl> <dbl> <list> ## 1 <dbl [3]> 100 10 <list [10]> I would like to fit LPMs to the 10 datasets. For this, I will need to use all the power of functional programming and the {tidyverse}. I will be adding columns to this data frame using mutate() and mapping over the simulations list-column using modify_depth(). The list of data frames is at the second level (remember, it’s a list containing a list containing data frames). I’ll start by fitting the LPMs, then using broom::tidy() I will get a nice data frame of the estimated parameters. I will then only select what I need, and then bind the rows of all the data frames. I will do the same for the true marginal effects. I highly suggest that you run the following lines, one after another. It is complicated to understand what’s going on if you are not used to such workflows. However, I hope to convince you that once it clicks, it’ll be much more intuitive than doing all this inside a loop.
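Since the pipeline below leans heavily on modify_depth(), here is a quick, self-contained sketch of what it does (toy data, not from our simulation): nested <- list(list(1, 2), list(3, 4)) # a list of lists, like our simulations column modify_depth(nested, 2, ~.x * 10) # applies the function at depth 2, i.e. to the innermost elements, returning list(list(10, 20), list(30, 40))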
Here’s the code: results <- many_datasets %>% mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% mutate(lpm = map(lpm, bind_rows)) %>% mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% mutate(true_effect = map(true_effect, bind_rows)) This is what results looks like: results ## # A tibble: 1 × 6 ## coefs sample_size repeats simulations lpm true_effect ## <list> <dbl> <dbl> <list> <list> <list> ## 1 <dbl [3]> 100 10 <list [10]> <tibble [20 × 2]> <tibble [20 × 2]> Let’s take a closer look at the lpm and true_effect columns: results$lpm ## [[1]] ## # A tibble: 20 × 2 ## term estimate ## <chr> <dbl> ## 1 x1 0.228 ## 2 x2 0.353 ## 3 x1 0.180 ## 4 x2 0.361 ## 5 x1 0.165 ## 6 x2 0.374 ## 7 x1 0.182 ## 8 x2 0.358 ## 9 x1 0.125 ## 10 x2 0.345 ## 11 x1 0.171 ## 12 x2 0.331 ## 13 x1 0.122 ## 14 x2 0.309 ## 15 x1 0.129 ## 16 x2 0.332 ## 17 x1 0.102 ## 18 x2 0.374 ## 19 x1 0.176 ## 20 x2 0.410 results$true_effect ## [[1]] ## # A tibble: 20 × 2 ## term true_effect ## <chr> <dbl> ## 1 x1 0.183 ## 2 x2 0.366 ## 3 x1 0.166 ## 4 x2 0.331 ## 5 x1 0.174 ## 6 x2 0.348 ## 7 x1 0.169 ## 8 x2 0.339 ## 9 x1 0.167 ## 10 x2 0.335 ## 11 x1 0.173 ## 12 x2 0.345 ## 13 x1 0.157 ## 14 x2 0.314 ## 15 x1 0.170 ## 16 x2 0.340 ## 17 x1 0.182 ## 18 x2 0.365 ## 19 x1 0.161 ## 20 x2 0.321 Let’s bind the columns, and compute the difference between the true and estimated marginal effects: simulation_results <- results %>% mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% mutate(difference = map(difference, ~select(., term, difference))) %>% pull(difference) %>% .[[1]] ## Joining, by = "term" Let’s take a look at the simulation results: simulation_results %>% group_by(term) %>% summarise(mean = mean(difference), sd = sd(difference)) ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.0122 0.0368 ## 2 x2 -0.0141 0.0311 Already with only 10 simulated datasets, the mean difference between the estimated and the true effects is close to zero. Let’s rerun the analysis, but for different parameters and sample sizes.
In order to make things easier, we can put all the code into a nifty function: monte_carlo <- function(coefs, sample_size, repeats){ many_datasets <- generate_datasets(coefs, sample_size, repeats) results <- many_datasets %>% mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% mutate(lpm = map(lpm, bind_rows)) %>% mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% mutate(true_effect = map(true_effect, bind_rows)) simulation_results <- results %>% mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% mutate(difference = map(difference, ~select(., term, difference))) %>% pull(difference) %>% .[[1]] simulation_results %>% group_by(term) %>% summarise(mean = mean(difference), sd = sd(difference)) } And now, let’s run the simulation for different parameters and sizes: monte_carlo(c(.5, 2, 4), 100, 10) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00826 0.0318 ## 2 x2 -0.00732 0.0421 monte_carlo(c(.5, 2, 4), 100, 100) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.00360 0.0408 ## 2 x2 0.00517 0.0459 monte_carlo(c(.5, 2, 4), 100, 500) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00152 0.0388 ## 2 x2 -0.000701 0.0462 monte_carlo(c(pi, 6, 9), 100, 10) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00829 0.0421 ## 2 x2 0.00178 0.0397 monte_carlo(c(pi, 6, 9), 100, 100) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.0107 0.0576 ## 2 x2 0.00831 0.0772 monte_carlo(c(pi, 6, 9), 100, 500) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.00879 0.0518 ## 2 x2 0.0113 0.0687 We see that, at least for these sets of parameters, the LPM does a good job of estimating marginal effects. Now, this study might in itself not be very interesting to you, but I believe the general approach is quite useful and flexible enough to be adapted to all kinds of use-cases. 8.5 Exercises Exercise 1 Suppose you have an Excel workbook that contains data on three sheets. Create a function that reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one sheet (download the example Excel workbook, example_workbook.xlsx, from the assets folder on the book’s Github). Exercise 2 Use one of the map() functions to combine two lists into one.
Consider the following two lists: mediterranean <- list("starters" = list("humous", "lasagna"), "dishes" = list("sardines", "olives")) continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut")) The result we’d like to have would look like this: $starters $starters[[1]] [1] "humous" $starters[[2]] [1] "lasagna" $starters[[3]] [1] "pea soup" $starters[[4]] [1] "terrine" $dishes $dishes[[1]] [1] "sardines" $dishes[[2]] [1] "olives" $dishes[[3]] [1] "frikadelle" $dishes[[4]] [1] "sauerkraut" Functions that have other functions as input are called higher order functions↩︎ "],["package-development.html", "Chapter 9 Package development 9.1 Why you need to write your own package 9.2 Starting easy: creating a package to share data 9.3 Including data inside the package 9.4 Adding functions to your package 9.5 Documenting your package 9.6 Unit testing your package", " Chapter 9 Package development 9.1 Why you need to write your own package One of the reasons you might have tried R in the first place is the abundance of packages. As I’m writing these lines (in November 2020) 16523 packages are available on CRAN (in August 2019, there were 14762, and in August 2016, when I first wrote the number of packages down for my first ebook, it was 8922 packages). This is a staggering number of packages and to help you look for the right ones, you can check out CRAN Task Views. You might wonder why the heck you should write your own packages. After all, with so many packages you’re sure to find something that suits your needs, right? Well, it depends. Of course, you will not need to write your own function to perform non-linear regression, or to train a neural network. But as time goes by, you will start writing your own functions, functions that fit your needs, and that you use daily. It may be functions that prepare and shape data that you use at work for analysis. Or maybe you want to deliver an analysis to a client, with data and source code, so you decide to deliver a package that contains everything (something I’ve already done in the past). Maybe you want to develop a Shiny application using the {golem} framework, which allows you to build apps as packages. Ok, but is it necessary to write a package? Why not just write functions inside some scripts and then simply run or share these scripts (and in the case of Shiny, you don’t have to use {golem})? This seems like a valid solution at first. However, it quickly becomes tedious, especially if you have multiple scripts scattered around your computer or inside different subfolders. You’ll also have to write the documentation in separate files and these can easily get lost or become outdated. Relying on scripts does not scale well; even if you are not sharing your code outside of your computer (maybe you’re working on super secret projects at NASA), you always have to think about future you. And in general, future you thinks that past you is an asshole, exactly because you put zero effort into documenting, testing and making your code easy to use. Having everything inside a package takes care of these headaches for you, and will make future you proud of past you. And if you have to share your code, or deliver to a client, believe me, it will make things a thousand times easier. Code that is inside packages is very easy to document and test, especially if you’re using Rstudio. It also makes it possible to use the wonderful {covr} package, which tells you which lines in which functions are called by your tests.
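If you want to try {covr} on a package of yours, the entry points are simple; a minimal sketch (run from the root directory of the package, and the name coverage is just an example of mine): coverage <- covr::package_coverage() covr::report(coverage) The first call runs your tests and records which lines get executed; the second opens a report highlighting the lines that no test touches.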
If some lines are missing, write tests that invoke them and increase the coverage of your tests! Documenting and testing your code is very important; it gives you assurance that the code you’re writing works, but most importantly, it gives others assurance that what you wrote works. And I include future you in these others too. In order to share this package with these others we are going to use Git. If you’re familiar with Git, great, you’ll be able to skip some sections. If not, then buckle up, you’re in for a wild ride. As I mentioned in the introduction, if you want to learn much more about packages than I’ll show, read Wickham (2015). I will only show you the basics, but it should be enough to get you productive. 9.2 Starting easy: creating a package to share data We will start a package from scratch, in order to share data with the world. For this, we are first going to scrape a table off Wikipedia, prepare the data and then include it in a package. To make distributing this package easy, we’re going to put it up on Github, so you’ll need a Github account. Let’s start by creating a Github account. 9.2.1 Setting up a Github account Setting up a Github account is very easy; just go over to https://github.com/ and simply sign up! Then you will need to generate an ssh key on your computer. This is a way for you to securely interact with your Github account, and push your code to the repository without having to always type your password. I will assume you never created any ssh keys before, because if you already did, you can skip these steps. I will also assume that you are on a GNU+Linux or macOS system; if you’re using Windows, the instructions are very similar, but you’ll first need to install Git, available here. Git is available by default on any GNU+Linux system, and as far as I know also on macOS, but I might be wrong and you might also need to install git on macOS (but then the instructions are the same whether you’re using GNU+Linux or macOS). If you have trouble installing git, read the following section from the Pro Git book. Then, open a terminal (or the git command line on Windows) and type the following: ssh-keygen This command will generate several files in the .ssh directory inside your HOME directory. Look for the file that ends with the .pub extension, and copy its contents. You will need to paste these contents on Github. So now sign in to Github; once you are signed in, go to settings and then SSH and GPG keys: In the screenshot above, you see my ssh key associated with my account; this will be empty for you. Click on the top right, New SSH key: Give your key a name, and paste the key you generated before. You’re done! You can now configure git a bit more by telling it who you are. Open a terminal, adapt and type the following commands: git config --global user.name "Harold Zurcher" git config --global user.email harold.zurcher@madisonbus.com You’re ready to go!8 You can now push code to github to share it with the world. Or if you do not want to share your package (for confidentiality reasons for instance), you can still benefit from using git, as it is possible to have an internal git server that could be managed by your company’s IT team. There is also the possibility to set up corporate, and thus private, git servers by buying the service from github, or other providers such as gitlab. 9.2.2 Starting your package To start writing a package, the easiest way is to load up Rstudio and start a new project, under the File menu.
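As an aside, if you prefer the console over menus, the {usethis} package, which we are going to use extensively below, can create the same project skeleton; a sketch, where the path is just an example of mine: usethis::create_package("~/projects/arcade") This generates the folder structure described next and, if you are in Rstudio, opens the new project. But let’s follow the Rstudio route here.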
If you’re starting from scratch, just choose the first option, New Directory and then R package. Give a name to your package, for example arcade (you’ll see why in a bit) and you can also choose to use git for version control. Now if you check the folder where you chose to save your package, you will see a folder with the same name as your package, and inside this folder a lot of new files and other folders. The most important folder for now is the R folder. This is the folder that will hold your .R source code files. You can also see these files and folders inside the Files panel from within Rstudio. Rstudio will also have hello.R opened, which is a single demo source file inside the R folder. You can get rid of this file, or keep it and edit it. I would advise you to keep it and even distribute it inside your package. You can save this file in a special directory called data-raw. You don’t need to manually create this folder now; we will do so in a bit. For now, just follow along. Now, to start working on your package, the best is to use a package called {usethis}. {usethis} is a package that makes writing packages very easy; it includes functions that create the required subfolders and necessary template files so that you do not need to constantly check where file so-and-so should be placed or how it should be named. Let’s start by adding a readme file. This is easily achieved by using the following function from {usethis}: usethis::use_readme_md() This creates a template README.md file in the root directory of your package. You can now edit this file accordingly, and that’s it. The next step could be setting up your package to work with {roxygen2}, which will help you write the documentation of your package: usethis::use_roxygen_md() The output tells you to run devtools::document(); we will do this later. Since you have learned about the tidyverse by reading this book, I am willing to bet that you will want to use the %>% operator inside the functions contained in your package. To do this without issues, which will become apparent later, use the following command: usethis::use_pipe() This will make the %>% operator available internally to your package’s functions, but also to the users that will load the package. We are almost done setting up the package. If you plan on distributing data with your package, you might want to also share the code that prepared the data. For instance, if you receive the data from your finance department, but this data needs some cleaning before being useful, you could write a script to do so and then distribute this script also with the package, for reproducibility purposes. These scripts, while not central to the package, could still be of interest to the users. The directory to place them is called data-raw: usethis::use_data_raw() One final folder is inst. You can add files to this folder, and they will be available to the users that install the package. Users can find the files in the folder where packages get installed. On GNU+Linux systems, that would be somewhere like: /home/user/R/amd64-linux-gnu-library/3.6. There, you will find the installation folders of all the packages. If the package you make is called {spam}, you will find the files you put inside the inst folder at the root of the installation folder of spam. You can simply create the inst folder yourself, or use the following command: usethis::use_directory("inst") Finally, the last step is to give your package a license; this again is only useful if you plan on distributing it to the world.
If you are writing your own package for yourself, or for purposes internal to your company, this is probably superfluous. I won’t discuss the particularities of licenses, so let’s just say that for the sake of this example package we are writing, we are going to use the MIT license: usethis::use_mit_license() This again creates the right file at the right spot. There are other interesting functions inside the {usethis} package, and we will come back to it later. 9.3 Including data inside the package Many packages include data and we are going to learn how to do this. I’ll assume that we already have a dataset on hand that we have to share. This is quite simple to do; first, let’s load the data: arcade <- readr::read_csv("~/path/to/data/arcade.csv") and then, once again, {usethis} comes to our rescue: usethis::use_data(arcade, compress = "xz") That’s it! Well, almost. We still need to write a little script that will allow users of your package to load the data. This script is simply called data.R and contains the following lines: #' List of highest-grossing games #' #' Source: https://en.wikipedia.org/wiki/Arcade_game#List_of_highest-grossing_games #' #' @format A data frame with 6 variables: \\code{game}, \\code{release_year}, #' \\code{hardware_units_sold}, \\code{comment_hardware}, \\code{estimated_gross_revenue}, #' \\code{comment_revenue} #' \\describe{ #' \\item{game}{The name of the game} #' \\item{release_year}{The year the game was released} #' \\item{hardware_units_sold}{The amount of hardware units sold} #' \\item{comment_hardware}{Comment accompanying the amount of hardware units sold} #' \\item{estimated_gross_revenue}{Estimated gross revenue in US$ with 2019 inflation} #' \\item{comment_revenue}{Comment accompanying the estimated gross revenue} #' } "arcade" Basically this is a description of the data, and the name with which the user will invoke the data. To conclude this part, remember the data-raw folder? If you used a script to scrape/get the data from somewhere, or if you had to write code to prepare the data to make it fit for sharing, this is where you can put that script. I have written such a script, and I will discuss it in the next chapter, where I’ll show you how to scrape data from the internet. You can also save the file where you wrote all your calls to {usethis} functions if you want. 9.4 Adding functions to your package Functions will be added inside the R folder of your package. In there, you will find the hello.R file. You can edit this file if you kept it, or you can create a new script. This script can hold one function, or several functions. Let’s start with the simplest case; one function inside one script. 9.4.1 One function inside one script Create a new R script, or edit the hello.R file, and add in the following code: #' Compute descriptive statistics for the numeric columns of a data frame. #' @param df The data frame to summarise. #' @param ... Optional. Columns in the data frame #' @return A data frame with descriptive statistics. If you are only interested in certain columns #' you can add these columns. #' @import dplyr #' @importFrom tidyr gather #' @export #' @examples #' \\dontrun{ #' describe_numeric(dataset) #' describe_numeric(dataset, col1, col2) #' } describe_numeric <- function(df, ...){ if (nargs() > 1) df <- select(df, ...)
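# if the user supplied columns through ..., df was just reduced to those columns above
# the pipeline that follows reshapes the numeric columns into long format and computes one row of summary statistics per variable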
df %>% select_if(is.numeric) %>% gather(variable, value) %>% group_by(variable) %>% summarise_all(list(mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE), nobs = ~length(.), min = ~min(., na.rm = TRUE), max = ~max(., na.rm = TRUE), q05 = ~quantile(., 0.05, na.rm = TRUE), q25 = ~quantile(., 0.25, na.rm = TRUE), mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)), median = ~quantile(., 0.5, na.rm = TRUE), q75 = ~quantile(., 0.75, na.rm = TRUE), q95 = ~quantile(., 0.95, na.rm = TRUE), n_missing = ~sum(is.na(.)))) %>% mutate(type = "Numeric") } Save the script under the name describe.R. This function shows you pretty much everything you need to know when writing functions for packages. First, there are the comment lines, which start with #' and not with #. These lines will be converted into the function’s documentation, which you and your package’s users will be able to read in Rstudio’s Help pane. Notice the keywords that start with @. These are quite important: @param: used to define the function’s parameters; @return: used to define the object returned by the function; @import: if the function needs functions from another package, in the present case {dplyr}, then this is where you would define these. Separate several packages with a space; @importFrom: if the function only needs one function from a package, define it here. Read it as from tidyr import gather, very similar to how it is done in Python; @export: makes the function available to the users. If you omit this, this function will not be available to the users and only available internally to the other functions of the package. Not making functions available to users can be useful if you need to write functions that are used by other functions but never used by anyone directly. It is still possible to access these internal, private, functions by using :::, as in, package:::private_function(); @examples: lists examples in the documentation. The \\dontrun{} tag is used for when you do not want these examples to run when building the package. As explained before, if the function depends on functions from other packages, then @import or @importFrom must be used. But it is also possible to use the package::function() syntax like I did on the following line: mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)), This function uses the sample_mode() function from my {brotools} package. Since it is the only function that I am using, I don’t import the whole package with @import. I could have done the same for gather() from {tidyr} instead of using @importFrom, but I wanted to showcase @importFrom, which can also be used to import several functions: @importFrom package function_1 function_2 function_3 The way I’m doing this, however, is not optimal. If your package depends on many functions from other packages that are not available on CRAN, but rather on Github, you might want to do that in a cleaner way. The cleaner way is to add a “Remotes” field in the package’s DESCRIPTION file (not to be confused with the NAMESPACE file, a very important file that gets generated automatically by devtools::document()). I won’t cover this here, but you can read more about it here. What I will cover is how to declare dependencies to other CRAN packages. These dependencies also get declared inside the DESCRIPTION file, which we will cover in the next section. Because I’m doing that in this hacky way, my {brotools} package should be installed: devtools::install_github("b-rodrigues/brotools") Again, I want to emphasize that this is not the best way of doing it.
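For reference, here is what the clean way looks like; a minimal sketch of the relevant DESCRIPTION fields (assuming, as above, that the package lives at b-rodrigues/brotools on Github): Imports: brotools Remotes: b-rodrigues/brotools With these two fields, devtools and {remotes} know that {brotools} must be fetched from Github rather than from CRAN when your package’s dependencies get installed.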
However, using the “Remotes” field as described in the document I linked above is not complicated. Now comes the function itself. The function is written in pretty much the same way as usual, but there are some particularities. First of all, the second argument of the function is the ..., which was already covered in Chapter 7. I want to give my users the option to specify any number of columns, so that only these columns get summarised, instead of all of them, which is the default behaviour. But because I cannot know beforehand how many columns the user wants to summarize, and also because I do not want to limit the user to 2 or 3 columns, I use the .... But what if the user wants to summarize all the columns? This is taken care of in this line: if (nargs() > 1) df <- select(df, ...) nargs() counts the number of arguments of the function. If the user calls the function like so: describe_numeric(mtcars) nargs() will return 1. If, instead, the user calls the function with one or more columns: describe_numeric(mtcars, hp, mpg) then nargs() will return 3 (in this case). And thus, this piece of code will be executed: df <- select(df, ...) which selects the columns hp and mpg from the mtcars dataset. This reduced data set is then the one that is being summarized. 9.4.2 Many functions inside a script If you need to add more functions, you can add more in the same script, or create one script per function. The advantage of writing more than one function per script is that you can keep functions that are conceptually similar in the same place. For instance, if you want to add a function called describe_character() to your package, adding it to the same script where describe_numeric() is might be a good idea, so let’s do just that: #' Compute descriptive statistics for the numeric columns of a data frame. #' @param df The data frame to summarise. #' @param ... Optional. Columns in the data frame #' @return A data frame with descriptive statistics. If you are only interested in certain columns #' you can add these columns. #' @import dplyr #' @importFrom tidyr pivot_longer #' @export #' @examples #' \\dontrun{ #' describe_numeric(dataset) #' describe_numeric(dataset, col1, col2) #' } describe_numeric <- function(df, ...){ if (nargs() > 1) df <- select(df, ...) df %>% select(where(is.numeric)) %>% pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% group_by(variable) %>% summarise(across(everything(), tibble::lst(mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE), nobs = ~length(.), min = ~min(., na.rm = TRUE), max = ~max(., na.rm = TRUE), q05 = ~quantile(., 0.05, na.rm = TRUE), q25 = ~quantile(., 0.25, na.rm = TRUE), mode = ~as.character(brotools::sample_mode(., na.rm = TRUE)), median = ~quantile(., 0.5, na.rm = TRUE), q75 = ~quantile(., 0.75, na.rm = TRUE), q95 = ~quantile(., 0.95, na.rm = TRUE), n_missing = ~sum(is.na(.))))) %>% mutate(type = "Numeric") } #' Compute descriptive statistics for the character or factor columns of a data frame. #' @param df The data frame to summarise. #' @param type The value of the type column in the returned data frame. #' @return A data frame with a description of the character or factor columns.
#' @import dplyr #' @importFrom tidyr pivot_longer describe_character_or_factors <- function(df, type){ df %>% pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% group_by(variable) %>% summarise(across(everything(), list(mode = ~brotools::sample_mode(., na.rm = TRUE), nobs = ~length(.), n_missing = ~sum(is.na(.)), n_unique = ~length(unique(.))))) %>% mutate(type = type) } #' Compute descriptive statistics for the character columns of a data frame. #' @param df The data frame to summarise. #' @return A data frame with a description of the character columns. #' @import dplyr #' @export #' @examples #' \\dontrun{ #' describe_character(dataset) #' } describe_character <- function(df){ df %>% select(where(is.character)) %>% describe_character_or_factors(type = "Character") } Let’s now continue on to the next section, where we will learn how to document the package. 9.5 Documenting your package There are several files that you must edit to fully document the package; for now, only the functions are documented. The first of these files is the DESCRIPTION file. 9.5.1 Description By default, the DESCRIPTION file, which you can find in the root of your package project, contains the following lines: Package: arcade Type: Package Title: What the Package Does (Title Case) Version: 0.1.0 Author: Who wrote it Maintainer: The package maintainer <yourself@somewhere.net> Description: More about what it does (maybe more than one line) Use four spaces when indenting paragraphs within the Description. License: What license is it under? Encoding: UTF-8 LazyData: true RoxygenNote: 7.0.2 Each field is quite self-explanatory. This is how it could look once you’re done editing it: Package: arcade Type: Package Title: List of highest-grossing Arcade Games Version: 0.1.0 Authors@R: person("Harold", "Zurcher", email = "harold.zurcher@madisonbus.com", role = c("aut", "cre")) Description: This package contains data about the highest-grossing arcade games from the 70's until 2010's. Also contains some functions to summarize data. License: CC0 Encoding: UTF-8 LazyData: true RoxygenNote: 7.0.2 The Author and Maintainer fields need some further explanation; I have added Harold Zurcher as the author and creator through the Authors@R field, with the role = c("aut", "cre") bit. "cre" is the role used for the maintainer, so I removed the Maintainer line. 9.6 Unit testing your package References "],["further-topics.html", "Chapter 10 Further topics 10.1 Using Python from R with {reticulate} 10.2 Generating Pdf or Word reports with R 10.3 Scraping the internet 10.4 Regular expressions 10.5 Setting up a blog with {blogdown}", " Chapter 10 Further topics This chapter is a collection of short sections that show some of the very nice things you can use R for. These sections are based on past blog posts. 10.1 Using Python from R with {reticulate} There is a lot of discussion online about the benefits of Python over R and vice versa. When it comes to data science, they are for the most part interchangeable. I would say that R has an advantage over Python when it comes to offering specialized packages for certain topics such as econometrics, bioinformatics, actuarial sciences, etc… while Python seems to offer more possibilities when it comes to integrating a machine learning model into an app. However, if most of your work is data analysis/machine learning, both languages are practically interchangeable. But it can happen that you need access to a very specific library with no R equivalent.
Well, in that case, no need to completely switch to Python, as you can call Python code from R using the {reticulate} package. {reticulate} allows you to seamlessly call Python functions from an R session. An easy way to use {reticulate} is to start a new notebook, but you can also use {reticulate} and the included functions interactively. However, I find that Rstudio notebooks work very well for this particular use-case, because you can write R and Python chunks, and thus keep the code of the two languages clearly separated. Let’s see how this works. First of all, you might need to specify the path to your Python executable. In my case, because I’ve installed Python using Anaconda, I need to specify it: # This is an R chunk library(reticulate) use_python("~/miniconda3/bin/python") 10.2 Generating Pdf or Word reports with R 10.3 Scraping the internet 10.4 Regular expressions 10.5 Setting up a blog with {blogdown} "],["references.html", "References", " References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "Modern R with the tidyverse Preface Note to the reader What is R? Who is this book for? Why this book? Why modern R? What is RStudio? What to expect from this book? Prerequisites What are packages? The author", " Modern R with the tidyverse Bruno Rodrigues 2022-10-16 Preface Note to the reader I have been working on this on and off for the past 4 years or so. In 2022, I have updated the contents of the book to reflect updates introduced with R 4.1 and in several packages (especially those from the {tidyverse}). I have also cut some content that I think is not that useful, especially in later chapters. This book is still being written. Chapters 1 to 8 are almost ready, but more content is being added (especially to Chapter 8). Chapters 9 and 10 are empty for now. Some exercises might be in the wrong place too, and more are coming. You can purchase an ebook version of this book on leanpub. The version on leanpub is quite out of date, so if you buy it, it’s really just to send me some money, so many thanks for that! You can also support me by buying me a coffee or paypal.me. What is R? Read R’s official answer to this question here. To make it short: R is a multi-paradigm (procedural, imperative, object-oriented and functional)1 programming language that focuses on applications in statistics. By statistics I mean any field that uses statistics such as official statistics, economics, finance, data science, machine learning, etc. For the sake of simplicity, I will use the word “statistics” as a general term that encompasses all these fields and disciplines for the remainder of this book. Who is this book for? This book can be useful to different audiences. If you have never used R in your life, and want to start, start with Chapter 1 of this book. Chapters 1 to 3 are the very basics, and the book should be easy to follow up to Chapter 7. Starting with Chapter 7, it gets more technical, and will be harder to follow. But I suggest you keep on going, and do not hesitate to contact me for help if you struggle! Chapter 7 is also where you can start if you are already familiar with R and the {tidyverse}, but not functional programming. If you are familiar with R but not the {tidyverse} (or have no clue what the {tidyverse} is), then you can start with Chapter 4.
If you are familiar with R, the {tidyverse} and functional programming, you might still be interested in this book, especially Chapters 9 and 10, which deal with package development and further advanced topics respectively. Why this book? This book is first and foremost for myself. This book is the result of years of using and teaching R at university and then at my jobs. During my university time, I wrote some notes to help me teach R, which I shared with my students. These are still the basis of Chapter 2. Then, once I had left university, and continued using R at my first “real” job, I wrote another book that dealt mostly with package development and functional programming. This book is now merged into this one and is the basis of Chapters 9 and 10. During these years at my first job, I was also tasked with teaching R. By that time, I was already quite familiar with the {tidyverse} so I wrote a lot of notes that were internal and adapted for the audience of my first job. These are now the basis of Chapters 3 to 8. Then, during all these years, I kept blogging about R, and reading blogs and further books. All this knowledge is condensed here, so if you are familiar with my blog, you’ll definitely recognize a lot of my blog posts in here. So this book is first and foremost for me, because I need to write all of this down in a central place. So because my target audience is myself, this book is free. If you find it useful, and are in the mood to buy me a coffee, you can, but if this book is not useful to you, no harm done (unless you paid for it before reading it, in which case, I am sorry to have wasted your time). But I am quite sure you’ll find some of the things written here useful, regardless of your current experience level with R. Why modern R? Modern R instead of “just” R because we are going to learn how to use modern packages (mostly those from the tidyverse) and concepts, such as functional programming (which is quite an old concept actually, but one that came into fashion recently). R is derived from S, which is a programming language that has roots in FORTRAN and other languages too. If you learned R at university, you’ve probably learned to use it as you would have used FORTRAN; very long scripts where data are represented as matrices and where row-wise (or column-wise) operations are implemented with for loops. There’s nothing wrong with that, mind you, but R was also influenced by Scheme and Common Lisp, which are functional programming languages. In my opinion, functional programming is a programming paradigm that works really well when dealing with statistical problems. This is because programming in a functional style is just like writing math. For instance, suppose you want to sum all the elements of a vector. In mathematical notation, you would write something like: \\[ \\sum_{i = 1}^{100} x_{i} \\] where \\(x\\) is a vector of length 100. Solving this using a loop would look something like this: res <- 0 for(i in 1:length(x)){ res <- x[i] + res } This does not look like the math notation at all! You have to define a variable that will hold the result outside of the loop, and then you have to define res as something plus res inside the body of the loop. This is really unnatural. The functional programming approach is much easier: Reduce(`+`, x) We will learn about Reduce() later (to be more precise, we will learn about purrr::reduce(), the “tidy” version of Reduce()), but already you see that the notation looks a lot more like the mathematical notation.
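To make this concrete, here is a tiny example of mine with a shorter vector: x <- c(1, 2, 3, 4) Reduce(`+`, x) ## [1] 10 purrr::reduce(x, `+`) ## [1] 10 Both fold the vector from left to right with +, which is exactly what the sum sign expresses.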
At its core, functional programming uses functions, and functions are so-called first class objects in R, which means that there is nothing special about them… you can pass them to other functions, create functions that return functions and do any kind of operation on them just as with any other object. This means that functions in R are extremely powerful and flexible tools. In the first part of the book, we are going to use functions that are already available in R, and then use those available in packages, mostly those from the tidyverse. The tidyverse is a collection of packages developed by Hadley Wickham, and several of his colleagues at RStudio, Inc. By using the packages from the tidyverse and R’s built-in functional programming capabilities, we can write code that is faster and easier to explain to colleagues, and also easier to maintain. This also means that you might have to change your expectations and what you know already from R, if you learned it at University but haven’t touched it in a long time. For example, for and while loops are relegated to Chapter 8. This does not mean that you will have to wait for 8 chapters to know how to repeat instructions N times, but that for and while loops are tools that are very useful for very specific situations that will be discussed at that point. In the second part of the book, we are going to move from using R to solve statistical problems to developing with R. We are going to learn about creating your own package. If you do not know what packages are, don’t worry, this will be discussed just below. What is RStudio? RStudio is a modern IDE that makes writing R code easier. The first thing we are going to learn is how to use it. R and RStudio are both open source: this means that the source code is freely available on the internet and contributions by anyone are welcome and integrated; provided they are meaningful and useful. What to expect from this book? The idea of Chapters 1 to 7 is to make you efficient with R as quickly as possible, especially if you already have prior programming knowledge. Starting with Chapter 8 you will learn more advanced topics, especially programming with R. R is a programming language, and you can’t write “programming language” without “language”. And just as you wouldn’t expect to learn French, Portuguese or Icelandic by reading a single book, you shouldn’t expect to become fluent in R by reading a single book, not even by reading 10 books. Programming is an art which requires a lot of practice. Teach yourself programming in 10 years is a blog post written by Peter Norvig which explains that just as with any craft, mastering programming takes time. And even if you don’t need or want to become an expert in R, if you wish to use R effectively and in a way that ultimately saves you time, you need to have some fluency in it, and this only comes by continuing to learn about the language, and most importantly practicing. If you keep using R every day, you’ll definitely become very fluent. To stay informed about developments of the language, and the latest news, I advise you to read blogs, especially R-bloggers which aggregates blog posts by more than 750 blogs discussing R. So what you can expect from this book is that this book is not the only one you should read. Prerequisites R and RStudio are the two main pieces of software that we are going to use. R is the programming language and RStudio is a modern IDE for it. You can use R without RStudio; but you cannot use RStudio without R.
If you wish to install R and RStudio at home to follow the examples in this book, you can do so, as both pieces of software are available free of charge (paid options for RStudio exist, for companies that need technical support). Installation is simple, but operating system dependent. To download and install R for Windows, follow this link. For macOS, follow this one. If you run a GNU+Linux distribution, you can install R using the system’s package manager. If you’re running Ubuntu, you might want to take a look at r2u, which provides very fast installation of packages, full integration with apt (so dependencies get solved automatically) and covers the entirety of CRAN. For RStudio, look for your operating system here. What are packages? There is one more step; we are going to install some packages. Packages are additional pieces of code that can be installed from within R with the following function: install.packages(). These packages extend R’s capabilities significantly, and are probably one of the main reasons R is so popular. As of November 2018, R has over 13000 packages. To install the packages we need, first open RStudio and then copy and paste this line in the console: install.packages(c("tidyverse", "rsample", "recipes", "blogdown", "yardstick", "parsnip", "plm", "pwt9", "checkpoint", "Ecdat", "ggthemes", "ggfortify", "margins", "janitor", "rio", "stopwords", "colourpicker", "glmnet", "lhs", "mlrMBO", "mlbench", "ranger")) or go to the Packages pane and then click on Install: The author My name is Bruno Rodrigues and I program almost exclusively in R and have been teaching some R courses for a few years now. I first started teaching R to students at the University of Strasbourg while working on my PhD. I hold a PhD in economics, with a focus on quantitative methods. I’m currently head of the statistics department of the Ministry of Higher education and Research in Luxembourg, and before that worked as a manager in the data science team of PWC Luxembourg. This book is an adaptation of notes I’ve used in the past during my time as a teacher, but also a lot of things I’ve learned about R since I left academia. In my free time I like cooking, working out and blogging, while listening to Fip or Chillsky Radio. I also like to get my butt handed to me by playing roguelikes such as NetHack, for which I wrote a package that contains functions to analyze the data that is saved on your computer after you win or lose (it will be a loss 99% of the time) the game. You can follow me on twitter; I tweet mostly about R or what’s happening in Luxembourg. In this book we are going to focus on R’s functional programming capabilities↩︎ "],["getting-to-know-rstudio.html", "Chapter 1 Getting to know RStudio 1.1 Panes 1.2 Console 1.3 Scripts 1.4 Options 1.5 Keyboard shortcuts 1.6 Projects 1.7 History 1.8 Plots 1.9 Addins 1.10 Packages 1.11 Exercises", " Chapter 1 Getting to know RStudio RStudio is a company that develops and maintains several products. Their best-known product is an IDE (Integrated development environment) for the R programming language, also called RStudio. You can install RStudio by visiting this link. There is also a server version that can be used to have a centralized version of R within, say, a company. RStudio, the company, also develops Shiny, a package to create full-fledged web-apps. I am not going to cover Shiny in this book, since there’s already a lot of material that you can learn from. Once you have installed RStudio, launch it and let’s go through the interface together.
1.1 Panes RStudio is divided into different panes. Each pane has a specific function. The gif below shows some of these panes: Take some time to look around and see what each pane shows you. Some panes are empty; for example the Plots pane or the Viewer pane. Plots shows you the plots you make. You can browse the plots and save them. We will see this in more detail in a later chapter. Viewer shows you previews of documents that you generate with R. More on this later. 1.2 Console The Console pane is where you can execute R code. Write the following in the console: 2 + 3 and you’ll get the answer, 5. However, do not write a lot of lines in the console. It is better to write your code inside a script. Output is also shown inside the console. 1.3 Scripts Look at the gif below: In this gif, we see the user creating a new R script. R scripts are simple text files that hold R code. Think of .do files in STATA or .c files for C. R scripts have the extension .r or .R. It is possible to create a lot of other files. We’ll take a look at R Markdown files in Chapter 10. 1.3.1 The help pane The Help pane allows you to consult documentation for functions or packages. The gif below shows how it works: You can also access help using the following syntax: ?lm. This will bring up the documentation for the function lm(). You can also type ??lm which will search for the string lm in every package. 1.3.2 The Environment pane The Environment pane shows every object created in the current session. It is especially useful if you have defined lists or have loaded data into R as it makes it easy to explore these more complex objects. 1.4 Options It is also possible to customize RStudio’s look and feel: Take some time to go through the options. 1.5 Keyboard shortcuts It is a good idea to familiarize yourself with at least some keyboard shortcuts. This is more convenient than having to move the mouse around: If there is only one keyboard shortcut you need to know, it’s Ctrl-Enter that executes a line of code from your script. However, these other shortcuts are also worth knowing: CTRL-ALT-R: run entire script CTRL-ALT-UP or DOWN: make cursor taller or shorter, allowing you to edit multiple lines at the same time CTRL-F: Search and replace ALT-UP or DOWN: Move line up or down CTRL-SHIFT-C: Comment/uncomment line ALT-SHIFT-K: Bring up the list of keyboard shortcuts CTRL-SHIFT-M: Insert the pipe operator (%>%, more on this later) CTRL-S: Save script These are just a few keyboard shortcuts that I personally find useful. However, I strongly advise you to learn and use whatever shortcuts are useful and feel natural to you! 1.6 Projects One of the best features of RStudio is projects. Creating a project is simple; the gif below shows how you can create a project and how you can switch between projects. Projects make a lot of things easier, such as managing paths. More on this in the chapter about reading data. Another useful feature of projects is that the scripts you open in project A will stay open even if you switch to another project B, and then switch back to project A again. You can also use version control (with git) inside a project. Version control is very useful, but I won’t discuss it here. You can find a lot of resources online to get you started with git. 1.7 History The History pane saves all the previous lines you executed. You can then select these lines and send them back to the console or the script. 1.8 Plots All the plots you make during a session are visible in the Plots pane.
From there, you can export them in different formats. The plots shown in the gif are made using basic R functions. Later, we will learn how to make nicer looking plots using the package ggplot2. 1.9 Addins Some packages install addins, which are accessible through the addins button: These addins make it easier to use some functions and you can read more about them here. My favorite addins are the ones you get when installing the {datapasta} package. Read more about it here. There are other panes that I will not discuss here, but you will naturally discover their use as you go. For example, we will discuss the Build pane in Chapter 9. 1.10 Packages You can think of packages as addons that extend R’s core functionality. You can browse all available packages on CRAN. To make it easier to find what you might be interested in, you can also browse the CRAN Task Views. Each package has a landing page that summarises its dependencies, version number etc. For example, for the dplyr package: https://cran.r-project.org/web/packages/dplyr/index.html. Take a look at the Downloads section, and especially at the Reference Manual and Vignettes: Vignettes are valuable documents; inside vignettes, the purpose of the package is explained in plain English, usually with accompanying examples. The reference manuals list the available functions inside the packages. You can also find vignettes from within Rstudio: Go to the Packages pane and click on the package you’re interested in. Then you can consult the help for the functions that come with the package as well as the package’s vignettes. Once you have installed a package, you have to load it before you can use it. To load packages you use the library() function: library(dplyr) library(janitor) # and so on... If you only need to use one single function once, you don’t need to load an entire package. You can write the following: dplyr::full_join(A, B) Using the :: operator, you can access functions from packages without having to load the whole package beforehand. It is possible and easy to create your own packages. This is useful if you have to write a lot of functions that you use daily. We will learn about that in Chapter 9. 1.11 Exercises Exercise 1 Change the look and feel of RStudio to suit your tastes! I personally like to move the console to the right and use a dark theme. Take five minutes or so to customize it and browse through all the options. "],["objects-their-classes-and-types-and-useful-r-functions-to-get-you-started.html", "Chapter 2 Objects, their classes and types, and useful R functions to get you started 2.1 The numeric class 2.2 The character class 2.3 The factor class 2.4 The Date class 2.5 The logical class 2.6 Vectors and matrices 2.7 The list class 2.8 The data.frame and tibble classes 2.9 Formulas 2.10 Models 2.11 NULL, NA and NaN 2.12 Useful functions to get you started 2.13 Exercises", " Chapter 2 Objects, their classes and types, and useful R functions to get you started All objects in R have a given type. You already know most of them, as these types are also used in mathematics. Integers, floating point numbers (floats), matrices, etc, are all objects you are already familiar with. But R has other, maybe lesser known data types (that you can find in a lot of other programming languages) that you need to become familiar with. But first, we need to learn how to assign a value to a variable. This can be done in two ways: a <- 3 or a = 3 In very practical terms, there is no difference between the two.
I prefer using <- for assigning values to variables and reserve = for passing arguments to functions, for example: spam <- mean(x = c(1,2,3)) I think this is less confusing than: spam = mean(x = c(1,2,3)) but as I explained above you can use whatever you feel most comfortable with. 2.1 The numeric class To define single numbers, you can do the following: a <- 3 The class() function allows you to check the class of an object: class(a) ## [1] "numeric" Decimals are defined with the character .: a <- 3.14 R also supports integers. If you find yourself in a situation where you explicitly need an integer and not a floating point number, you can use the following: a <- as.integer(3) class(a) ## [1] "integer" The as.integer() function is very useful, because it converts its argument into an integer. There is a whole family of as.*() functions. To convert a into a floating point number again: class(as.numeric(a)) ## [1] "numeric" There is also is.numeric() which tests whether a number is of the numeric class: is.numeric(a) ## [1] TRUE It is also possible to create an integer using L: a <- 5L class(a) ## [1] "integer" Another way to convert this integer back to a floating point number is to use as.double() instead of as.numeric(): class(as.double(a)) ## [1] "numeric" The functions prefixed with is.* and as.* are quite useful; there is one for each of the supported types in R, such as as/is.character(), as/is.factor(), etc… 2.2 The character class Use \" \" to define characters (called strings in other programming languages): a <- "this is a string" class(a) ## [1] "character" To convert something to a character you can use the as.character() function: a <- 4.392 class(a) ## [1] "numeric" Now let’s convert it: class(as.character(a)) ## [1] "character" It is also possible to convert a character to a numeric: a <- "4.392" class(a) ## [1] "character" class(as.numeric(a)) ## [1] "numeric" But this only works if it makes sense: a <- "this won't work, chief" class(a) ## [1] "character" as.numeric(a) ## Warning: NAs introduced by coercion ## [1] NA A very nice package to work with characters is {stringr}, which is also part of the {tidyverse}. 2.3 The factor class Factors look like characters, but are very different. They are the representation of categorical variables. A {tidyverse} package to work with factors is {forcats}. You would rarely use factor variables outside of datasets, so for now, it is enough to know that this class exists. We are going to learn more about factor variables in Chapter 4, by using the {forcats} package. 2.4 The Date class Dates also look like characters, but are very different too: as.Date("2019/03/19") ## [1] "2019-03-19" class(as.Date("2019/03/19")) ## [1] "Date" Manipulating dates and time can be tricky, but thankfully there’s a {tidyverse} package for that, called {lubridate}. We are going to go over this package in Chapter 4. 2.5 The logical class This is the class of predicates, expressions that evaluate to true or false. For example, if you type: 4 > 3 ## [1] TRUE R returns TRUE, which is an object of class logical: k <- 4 > 3 class(k) ## [1] "logical" In other programming languages, logicals are often called bools. A logical variable can only have two values, either TRUE or FALSE. You can test the truthiness of a variable with isTRUE(): k <- 4 > 3 isTRUE(k) ## [1] TRUE How can you test if a variable is false?
Base R includes an isFALSE() function (since R 3.5), but another way to do it is: k <- 4 > 3 !isTRUE(k) ## [1] FALSE The ! operator indicates negation, so the above expression could be translated as is k not TRUE?. There are other operators for boolean algebra, namely &, &&, |, ||. & means and and | stands for or. You might be wondering what the difference between & and && is, or between | and ||. & and | work on vectors, doing pairwise comparisons: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) one & two ## [1] FALSE FALSE TRUE FALSE Compare this to the && operator: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) one && two ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] FALSE The && and || operators only compare the first element of the vectors and stop as soon as the return value can be safely determined. This is called short-circuiting. Consider the following: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) three <- c(TRUE, TRUE, FALSE, FALSE) one && two && three ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## Warning in one && two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] FALSE one || two || three ## Warning in one || two: 'length(x) = 4 > 1' in coercion to 'logical(1)' ## [1] TRUE The || operator stops as soon as it evaluates to TRUE whereas the && stops as soon as it evaluates to FALSE. Personally, I rarely use || or && because I get confused. I find using | or & in combination with the all() or any() functions much more useful: one <- c(TRUE, FALSE, TRUE, FALSE) two <- c(FALSE, TRUE, TRUE, TRUE) any(one & two) ## [1] TRUE all(one & two) ## [1] FALSE any() checks whether any of the vector’s elements are TRUE and all() checks if all elements of the vector are TRUE. As a final note, you should know that it is possible to use T for TRUE and F for FALSE but I would advise against doing this, because it is not very explicit. 2.6 Vectors and matrices You can create a vector in different ways. But first of all, it is important to understand that a vector in most programming languages is nothing more than a list of things. These things can be numbers (either integers or floats), strings, or even other vectors. A vector in R can only contain elements of one single type. This is not the case for a list, which is much more flexible. We will talk about lists shortly, but let’s first focus on vectors and matrices. 2.6.1 The c() function A very important function that allows you to build a vector is c(): a <- c(1,2,3,4,5) This creates a vector with elements 1, 2, 3, 4, 5. If you check its class: class(a) ## [1] "numeric" This can be confusing: you were probably expecting a to be of class vector or something similar. This is not the case if you use c() to create the vector, because c() doesn’t build a vector in the mathematical sense, but a so-called atomic vector. Checking its dimension: dim(a) ## NULL returns NULL because an atomic vector doesn’t have a dimension. If you want to create a true vector, you need to use cbind() or rbind(). But before continuing, be aware that atomic vectors can only contain elements of the same type: c(1, 2, "3") ## [1] "1" "2" "3" because “3” is a character, all the other values get implicitly converted to characters.
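The conversion always goes towards the most flexible type (logical, then integer, then double, then character); a quick illustration of mine: c(TRUE, 2L, 3.14) ## [1] 1.00 2.00 3.14 c(TRUE, 2L, 3.14, "a") ## [1] "TRUE" "2" "3.14" "a" In the first case everything becomes a double, in the second everything becomes a character.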
You have to be very careful about this, and if you use atomic vectors in your programming, you have to make absolutely sure that no characters or logicals or whatever else are going to convert your atomic vector to something you were not expecting. 2.6.2 cbind() and rbind() You can create a true vector with cbind(): a <- cbind(1, 2, 3, 4, 5) Check its class now: class(a) ## [1] "matrix" "array" This is exactly what we expected. Let’s check its dimension: dim(a) ## [1] 1 5 This returns the dimension of a using the LICO notation (number of LInes first, the number of COlumns). It is also possible to bind vectors together to create a matrix. b <- cbind(6,7,8,9,10) Now let’s put vectors a and b into a matrix called matrix_c using rbind(). rbind() functions the same way as cbind() but glues the vectors together by rows and not by columns. matrix_c <- rbind(a,b) print(matrix_c) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 2 3 4 5 ## [2,] 6 7 8 9 10 2.6.3 The matrix class R also has support for matrices. For example, you can create a matrix of dimension (5,5) filled with 0’s with the matrix() function: matrix_a <- matrix(0, nrow = 5, ncol = 5) If you want to create the following matrix: \\[ B = \\left( \\begin{array}{ccc} 2 & 4 & 3 \\\\ 1 & 5 & 7 \\end{array} \\right) \\] you would do it like this: B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE) The option byrow = TRUE means that the rows of the matrix will be filled first. You can access individual elements of matrix_a like so: matrix_a[2, 3] ## [1] 0 and R returns its value, 0. We can assign a new value to this element if we want. Try: matrix_a[2, 3] <- 7 and now take a look at matrix_a again. print(matrix_a) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 0 0 0 0 ## [2,] 0 0 7 0 0 ## [3,] 0 0 0 0 0 ## [4,] 0 0 0 0 0 ## [5,] 0 0 0 0 0 Recall our vector b: b <- cbind(6,7,8,9,10) To access its third element, you can simply write: b[3] ## [1] 8 I have heard many people praising R for being a matrix-based language. Matrices are indeed useful, and statisticians are used to working with them. However, I very rarely use matrices in my day to day work, and prefer an approach based on data frames (which will be discussed below). This is because working with data frames makes it easier to use R’s advanced functional programming capabilities, and this is where R really shines in my opinion. Working with matrices almost automatically implies using loops and all the iterative programming techniques, à la Fortran, which I personally believe are ill-suited for interactive statistical programming (as discussed in the introduction). 2.7 The list class The list class is a very flexible class, and thus, very useful. You can put anything inside a list, such as numbers: list1 <- list(3, 2) or vectors constructed with c(): list2 <- list(c(1, 2), c(3, 4)) you can also put objects of different classes in the same list: list3 <- list(3, c(1, 2), "lists are amazing!") and of course create lists of lists: my_lists <- list(list1, list2, list3) To check the contents of a list, you can use the structure function str(): str(my_lists) ## List of 3 ## $ :List of 2 ## ..$ : num 3 ## ..$ : num 2 ## $ :List of 2 ## ..$ : num [1:2] 1 2 ## ..$ : num [1:2] 3 4 ## $ :List of 3 ## ..$ : num 3 ## ..$ : num [1:2] 1 2 ## ..$ : chr "lists are amazing!"
or you can use RStudio's Environment pane. You can also create named lists: list4 <- list("name_1" = 2, "name_2" = 8, "name_3" = "this is a named list") and you can access the elements in two ways: list4[[1]] ## [1] 2 or, for named lists: list4$name_3 ## [1] "this is a named list" Take note of the $ operator, because it is going to be quite useful for data.frames as well, which we are going to get to know in the next section. Lists are used extensively because they are so flexible. You can build lists of datasets and apply functions to all the datasets at once, build lists of models, lists of plots, etc… In the later chapters we are going to learn all about them. Lists are central objects in a functional programming workflow for interactive statistical analysis. 2.8 The data.frame and tibble classes In the next chapter we are going to learn how to import datasets into R. Once you import data, the resulting object is either a data.frame or a tibble depending on which package you used to import the data. tibbles extend data.frames, so if you know about data.frame objects already, working with tibbles will be very easy. tibbles have a better print() method, and some other niceties. However, I want to stress that these objects are central to R and are thus very important; they are actually special cases of lists, discussed above. There are different ways to print a data.frame or a tibble if you wish to inspect it. You can use View(my_data) to show the my_data data.frame in the View pane of RStudio. You can also use the str() function: str(my_data) And if you need to access an individual column, you can use the $ sign, same as for a list: my_data$col1 2.9 Formulas We will learn more about formulas later, but because they are important objects, it is useful to know about them early on. A formula is defined in the following way: my_formula <- ~x class(my_formula) ## [1] "formula" Formula objects are defined using the ~ symbol. Formulas are useful to define statistical models, for example for a linear regression: lm(y ~ x) or also to define anonymous functions, but more on this later. 2.10 Models A statistical model is an object like any other in R. Here, I already have a model that I ran on some test data: class(my_model) ## [1] "lm" my_model is an object of class lm, for linear model. You can apply different functions to a model object: summary(my_model) ## ## Call: ## lm(formula = mpg ~ hp, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7121 -2.1122 -0.8854 1.5819 8.2360 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 30.09886 1.63392 18.421 < 2e-16 *** ## hp -0.06823 0.01012 -6.742 1.79e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.863 on 30 degrees of freedom ## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892 ## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07 This class will be explored in later chapters. 2.11 NULL, NA and NaN The NULL, NA and NaN classes are pretty special. NULL is returned when the result of a function is undetermined.
For example, consider list4: list4 ## $name_1 ## [1] 2 ## ## $name_2 ## [1] 8 ## ## $name_3 ## [1] "this is a named list" If you try to access an element that does not exist, such as d, you will get NULL back: list4$d ## NULL NaN means "Not a Number" and is returned when a function returns something that is not a number: sqrt(-1) ## Warning in sqrt(-1): NaNs produced ## [1] NaN or: 0/0 ## [1] NaN Basically, numbers that cannot be represented as floating point numbers are NaN. Finally, there's NA, which is closely related to NaN but is used for missing values. NA stands for Not Available. There are several types of NAs: NA_integer_, NA_real_, NA_complex_ and NA_character_, but these are in principle only used when you need to program your own functions and need to explicitly test for the missingness of, say, a character value. To test whether a value is NA, use the is.na() function. 2.12 Useful functions to get you started This section will list several basic R functions that are very useful and should be part of your toolbox. 2.12.1 Sequences There are several functions that create sequences: seq(), seq_along() and rep(). rep() is easy enough: rep(1, 10) ## [1] 1 1 1 1 1 1 1 1 1 1 This simply repeats 1 ten times. You can repeat other objects too: rep("HAHA", 10) ## [1] "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" "HAHA" To create a sequence, things are not as straightforward. There is seq(): seq(1, 10) ## [1] 1 2 3 4 5 6 7 8 9 10 seq(70, 80) ## [1] 70 71 72 73 74 75 76 77 78 79 80 It is also possible to provide a by argument: seq(1, 10, by = 2) ## [1] 1 3 5 7 9 seq_along() behaves similarly, but returns a sequence going from 1 up to the length of the object passed to it. So if you pass list4 to seq_along(), it will return a sequence from 1 to 3: seq_along(list4) ## [1] 1 2 3 which is also true for seq() actually: seq(list4) ## [1] 1 2 3 but these two functions behave differently for arguments of length equal to 1: seq(10) ## [1] 1 2 3 4 5 6 7 8 9 10 seq_along(10) ## [1] 1 So be quite careful about that. I would advise you not to use seq(), but only seq_along() and seq_len(). seq_len() only takes arguments of length 1: seq_len(10) ## [1] 1 2 3 4 5 6 7 8 9 10 seq_along(10) ## [1] 1 The problem with seq() is that it is unpredictable; depending on its input, the output will either be a sequence up to the given value or a sequence along the given object. When programming, it is better to have functions that are stricter and fail when confronted with special cases, instead of silently returning some result. This is a bit of a recurrent issue with R, and the functions from the {tidyverse} mitigate this issue by being stricter than their base R counterparts. For example, consider the ifelse() function from base R: ifelse(3 > 5, 1, "this is false") ## [1] "this is false" and compare it to {dplyr}'s implementation, if_else(): if_else(3 > 5, 1, "this is false") Error: `false` must be type double, not character Call `rlang::last_error()` to see a backtrace if_else() fails because the return value when FALSE is not a double (a real number) but a character. This might seem unnecessarily strict, but at least it is predictable. This makes debugging easier when used inside functions. In Chapter 8 we are going to learn how to write our own functions, and being strict makes programming easier. 2.12.2 Basic string manipulation So far, we have not closely studied character objects; we have only learned how to define them. Later, in Chapter 5, we will learn about the {stringr} package, which provides useful functions to work with strings.
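For a small taste of what is coming (assuming {stringr} is installed; str_detect() is one of its functions, and the strings here are made up for the example):

```r
library(stringr)
str_detect("Hello amigo", "amigo")  # TRUE: the pattern was found
```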
However, several base R string functions are very useful and worth knowing nonetheless, such as paste() and paste0(): paste("Hello", "amigo") ## [1] "Hello amigo" but you can also change the separator if needed: paste("Hello", "amigo", sep = "--") ## [1] "Hello--amigo" paste0() is the same as paste() but does not have any sep argument: paste0("Hello", "amigo") ## [1] "Helloamigo" If you provide a vector of characters, you can also use the collapse argument, which places whatever you provide for collapse between the elements of the vector: paste0(c("Joseph", "Mary", "Jesus"), collapse = ", and ") ## [1] "Joseph, and Mary, and Jesus" To change the case of characters, you can use toupper() and tolower(): tolower("HAHAHAHAH") ## [1] "hahahahah" toupper("hueuehuehuheuhe") ## [1] "HUEUEHUEHUHEUHE" Finally, there are the classical mathematical functions that you know and love: sqrt(), exp(), log(), abs(), sin(), cos(), tan(), sum(), cumsum(), prod(), cumprod(), max(), min() and many others… 2.13 Exercises Exercise 1 Try to create the following vector: \[a = (6,3,8,9)\] and add to it this other vector: \[b = (9,1,3,5)\] and save the result to a new variable called result. Exercise 2 Using a and b from before, try to get their dot product. Try with a * b in the R console. What happened? Try to find the right function to get the dot product. Don't hesitate to google the answer! Exercise 3 How can you create a matrix of dimension (30,30) filled with 2's by only using the function matrix()? Exercise 4 Save your first name in a variable a and your surname in a variable b. What does the function paste(a, b) do? Look at the help for paste() with ?paste or using the Help pane in RStudio. What does the optional argument sep do? Exercise 5 Define the following variables: a <- 8, b <- 3, c <- 19. What do the following lines check? What do they return? a > b a == b a != b a < b (a > b) && (a < c) (a > b) && (a > c) (a > b) || (a < b) Exercise 6 Define the following matrix: \[ \text{matrix\_a} = \left( \begin{array}{ccc} 9 & 4 & 12 \\ 5 & 0 & 7 \\ 2 & 6 & 8 \\ 9 & 2 & 9 \end{array} \right) \] What does matrix_a >= 5 do? What does matrix_a[ , 2] do? Can you find which function gives you the transpose of this matrix? Exercise 7 Solve the following system of equations using the solve() function: \[ \left( \begin{array}{cccc} 9 & 4 & 12 & 2 \\ 5 & 0 & 7 & 9 \\ 2 & 6 & 8 & 0 \\ 9 & 2 & 9 & 11 \end{array} \right) \times \left( \begin{array}{c} x \\ y \\ z \\ t \end{array} \right) = \left( \begin{array}{c} 7 \\ 18 \\ 1 \\ 0 \end{array} \right) \] Exercise 8 Load the mtcars data (mtcars is included in R, so you only need to use the data() function to load it): data(mtcars) If you run class(mtcars), you get "data.frame". Try now with typeof(mtcars). The answer is now "list"! This is because the class of an object is an attribute of that object, which can even be assigned by the user: class(mtcars) <- "don't do this" class(mtcars) ## [1] "don't do this" The type of an object is R's internal type of that object, which cannot be manipulated by the user. It is always useful to know the type of an object (not just its class). For example, in the particular case of data frames, because the type of a data frame is a list, you can use all that you learned about lists to manipulate data frames!
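Since we just clobbered mtcars's class for demonstration purposes, here is a small sketch (reloading the data first) of what that last point buys you: because a data frame is a list of columns, list tools apply directly to it.

```r
data(mtcars)          # restore the original object
sapply(mtcars, mean)  # loops over the columns, as over list elements,
                      # and returns the mean of each one
```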
Recall that $ allowed you to select the element of a list, for instance: my_list <- list("one" = 1, "two" = 2, "three" = 3) my_list$one ## [1] 1 Because data frames are nothing but fancy lists, this is why you can access columns the same way: mtcars$mpg ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 ## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 ## [31] 15.0 21.4 Chapter 3 Reading and writing data In this chapter, we are going to import example datasets that are available in R, mtcars and iris. I have converted these datasets into several formats. Download those datasets here if you want to follow the examples below. R can import some formats without the need of external packages, such as the .csv format. However, for other formats, you will need to use different packages. Because there are a lot of different formats available, I suggest you use the {rio} package. {rio} is a wrapper around different packages that import/export data in different formats. This package is nice because you don't need to remember which package to use to import, say, STATA datasets, which one for SAS datasets, and so on. Read {rio}'s vignette for more details. Below I show some of {rio}'s functions presented in the vignette. It is also possible to import data from other, less "traditional" sources, such as your clipboard. Also note that it is possible to import more than one dataset at once. There are two ways of doing that: either by importing all the datasets, binding their rows together and adding a new variable with the name of the data, or by importing all the datasets into a list, where each element of that list is a data frame. We are going to explore this second option later. 3.1 The swiss army knife of data import and export: {rio} To import data with {rio}, import() is all you need: library(rio) mtcars <- import("datasets/mtcars.csv") head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 import() needs the path to the data, and you can specify additional options if needed. On a Windows computer, you have to pay attention to the path; you cannot simply copy and paste it, because paths in Windows use the \ symbol whereas R uses / (just like on Linux or macOS).
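A small sketch can help avoid hand-writing separators altogether: base R's file.path() builds paths using the right separator for you (the file names here are just for the example).

```r
file.path("datasets", "mtcars.csv")
## [1] "datasets/mtcars.csv"
```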
Importing a STATA or a SAS file is done just the same: mtcars_stata <- import("datasets/mtcars.dta") head(mtcars_stata) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 mtcars_sas <- import("datasets/mtcars.sas7bdat") head(mtcars_sas) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 It is also possible to import Excel files where each sheet is a single table, but you will need import_list() for that. The file multi.xlsx has two sheets, each with a table in it: multi <- import_list("datasets/multi.xlsx") str(multi) ## List of 2 ## $ mtcars:'data.frame': 32 obs. of 11 variables: ## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ... ## ..$ disp: num [1:32] 160 160 108 258 360 ... ## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ... ## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ... ## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ... ## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ... ## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ... ## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ... ## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ... ## $ iris :'data.frame': 150 obs. of 5 variables: ## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ... As you can see, multi is a list of datasets. Told you lists were very flexible! It is also possible to import all the datasets in a single directory at once. For this, you first need a vector of paths: paths <- Sys.glob("datasets/unemployment/*.csv") Sys.glob() allows you to find files using wildcard (glob) patterns, not regular expressions. "datasets/unemployment/*.csv" matches all the .csv files inside the "datasets/unemployment/" folder. all_data <- import_list(paths) str(all_data) ## List of 4 ## $ unemp_2013:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ... ## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ... ## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ... ## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 118 obs.
of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ... ## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ... ## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ... ## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ... ## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ... ## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ... ## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ... ## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ... ## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ... ## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ... ## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ... ## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ... ## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ... ## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" in a subsequent chapter we will learn how to actually use these lists of datasets. If you know that each dataset in each file has the same columns, you can also import them directly into a single dataset by binding each dataset together using rbind = TRUE: bind_data <- import_list(paths, rbind = TRUE) str(bind_data) ## 'data.frame': 472 obs. of 9 variables: ## $ Commune : chr "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## $ Total employed population : int 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## $ of which: Wage-earners : int 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## $ of which: Non-wage-earners: int 19872 1809 168 94 116 294 272 113 189 338 ... ## $ Unemployed : int 19287 1071 114 25 74 261 98 45 66 207 ... ## $ Active population : int 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## $ Unemployment rate (in %) : num 7.95 5.67 6.27 2.88 4.92 ... ## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... 
## $ _file : chr "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" ... ## - attr(*, ".internal.selfref")=<externalptr> This also adds a further column called _file indicating the name of the file that contained the original data. If something goes wrong, you might need to take a look at the underlying function {rio} is actually using to import the file. Let's look at the following example: testdata <- import("datasets/problems/mtcars.csv") head(testdata) ## mpg&cyl&disp&hp&drat&wt&qsec&vs&am&gear&carb ## 1 21&6&160&110&3.9&2.62&16.46&0&1&4&4 ## 2 21&6&160&110&3.9&2.875&17.02&0&1&4&4 ## 3 22.8&4&108&93&3.85&2.32&18.61&1&1&4&1 ## 4 21.4&6&258&110&3.08&3.215&19.44&1&0&3&1 ## 5 18.7&8&360&175&3.15&3.44&17.02&0&0&3&2 ## 6 18.1&6&225&105&2.76&3.46&20.22&1&0&3&1 As you can see, the import didn't go quite as expected! This is because the separator is the & for some reason. Because we are trying to read a .csv file, rio::import() is using data.table::fread() under the hood (you can read this in import()'s help). If you then read data.table::fread()'s help, you see that the fread() function has an optional sep = argument that you can use to specify the separator. You can use this argument in import() too, and it will be passed down to data.table::fread(): testdata <- import("datasets/problems/mtcars.csv", sep = "&") head(testdata) ## mpg cyl disp hp drat wt qsec vs am gear carb ## 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4 ## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 ## 3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 ## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## 5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 ## 6 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 export() allows you to write data to disk by simply providing the path and name of the file you wish to save. export(testdata, "path/where/to/save/testdata.csv") If you end the name with .csv the file is exported to the csv format, if instead you write .dta the data will be exported to the STATA format, and so on. If you wish to export to Excel, this is possible, but it may require that you change a file on your computer (you only have to do this once). Try running: export(testdata, "path/where/to/save/testdata.xlsx") If this results in an error, run the following lines in RStudio: if(!file.exists("~/.Rprofile")) # only create if not already there file.create("~/.Rprofile") # (don't overwrite it) file.edit("~/.Rprofile") These lines, taken shamelessly from Efficient R programming (go read it, it's a great resource), look for and open the .Rprofile file, which is a file that is run every time you open RStudio. This means that you can put any line of code there that will always be executed whenever you launch RStudio. Add this line to the file: Sys.setenv("R_ZIPCMD" = "C:/Program Files (x86)/Rtools/zip.exe") This tells R to use zip.exe as the default zip tool, which is needed to export files to the Excel format. Try it out by restarting RStudio, and then running the following lines: library(rio) data(mtcars) export(mtcars, "mtcars.xlsx") You should find the mtcars.xlsx inside your working directory. You can check your working directory with getwd(). {rio} should cover all your needs, but if not, there is very likely a package out there that will import the data you need. 3.2 Writing any object to disk {rio} is an amazing package, but is only able to write tabular representations of data.
What if you would like to save, say, a list containing any arbitrary object? This is possible with the saveRDS() function. Literally anything can be saved with saveRDS(): my_list <- list("this is a list", list("which contains a list", 12), c(1, 2, 3, 4), matrix(c(2, 4, 3, 1, 5, 7), nrow = 2)) str(my_list) ## List of 4 ## $ : chr "this is a list" ## $ :List of 2 ## ..$ : chr "which contains a list" ## ..$ : num 12 ## $ : num [1:4] 1 2 3 4 ## $ : num [1:2, 1:3] 2 4 3 1 5 7 my_list is a list containing a string, a list which contains a string and a number, a vector and a matrix… Now suppose that computing this list takes a very long time. For example, imagine that each element of the list is the result of estimating a very complex model on a simulated dataset, which takes hours to run. Because this takes so long to compute, you'd want to save it to disk. This is possible with saveRDS(): saveRDS(my_list, "my_list.RDS") The next day, after having freshly started your computer and launched RStudio, it is possible to retrieve the object exactly like it was using readRDS(): my_list <- readRDS("my_list.RDS") str(my_list) ## List of 4 ## $ : chr "this is a list" ## $ :List of 2 ## ..$ : chr "which contains a list" ## ..$ : num 12 ## $ : num [1:4] 1 2 3 4 ## $ : num [1:2, 1:3] 2 4 3 1 5 7 Even if you want to save a regular dataset, using saveRDS() might be a good idea, because the data gets compressed (saveRDS() compresses by default; you can control this with its compress argument). However, keep in mind that this will only be readable by R, so if you need to share this data with colleagues that use another tool, save it in another format. 3.3 Using RStudio projects to manage paths Managing paths can be painful, especially if you're collaborating with a colleague and both of you saved the data in paths that are different. Whenever one of you wants to work on the script, the path will need to be adapted first. The best way to avoid that is to use projects with RStudio. Imagine that you are working on a project entitled "housing". You will create a folder called "housing" somewhere on your computer and inside this folder have another folder called "data", then a bunch of other folders containing different files or the outputs of your analysis. What matters here is that you have a folder called "data" which contains the datasets you will analyze. When you are inside an RStudio project, provided that you chose your "housing" folder as the folder to host the project, you can read the data by simply specifying the relative path like so: my_data <- import("data/data.csv") Contrast this with what you would need to write if you were not using a project: my_data <- import("C:/My Documents/Castor/Work/Projects/Housing/data/data.csv") Not only is that longer, but if Castor is working on this project with Pollux, Pollux would need to change the above line to this: my_data <- import("C:/My Documents/Pollux/Work/Projects/Housing/data/data.csv") whenever Pollux needs to work on it. Another, similar issue is that if you need to write something to disk, such as a dataset or a plot, you would also need to specify the whole path: export(my_data, "C:/My Documents/Pollux/Work/Projects/Housing/data/data.csv") If you forget to write the whole path, then the dataset will be saved in the standard working directory, which is your "My Documents" folder on Windows, and "Home" on GNU+Linux or macOS.
You can check what the working directory is with the getwd() function: getwd() On a fresh session on my computer this returns: "/home/bruno" or, on Windows: "C:/Users/Bruno/Documents" but if you call this function inside a project, it will return the path to your project. It is also possible to set the working directory with setwd(), so you don't need to always write the full path, meaning that you can do this: setwd("the/path/I/want/") import("data/my_data.csv") export(processed_data, "processed_data.xlsx") instead of: import("the/path/I/want/data/my_data.csv") export(processed_data, "the/path/I/want/processed_data.xlsx") However, I really, really, really urge you never to use setwd(). Use projects instead! Using projects saves a lot of pain in the long run. Chapter 4 Descriptive statistics and data manipulation Now that we are familiar with some R objects and know how to import data, it is time to write some code. In this chapter, we are going to compute descriptive statistics for a single dataset, but also for a list of datasets later in the chapter. However, I will not give a list of functions to compute descriptive statistics; if you need a specific function, you can find it easily in the Help pane in RStudio or using any modern internet search engine. What I will do is show you a workflow that allows you to compute the descriptive statistics you need fast. R has a lot of built-in functions for descriptive statistics; however, if you want to compute statistics for different sub-groups, some more complex manipulations are needed. At least this was true in the past. Nowadays, thanks to the packages from the {tidyverse}, it is very easy and fast to compute descriptive statistics by any stratifying variable(s). The package we are going to use for this is called {dplyr}. {dplyr} contains a lot of functions that make manipulating data and computing descriptive statistics very easy. To make things easier for now, we are going to use example data included with {dplyr}. So no need to import an external dataset; this does not change anything about the example that we are going to study here; the source of the data does not matter. Using {dplyr} is possible only if the data you are working with is already in a useful shape. When data is messier, you will first need to manipulate it to bring it into a tidy format. For this, we will use {tidyr}, which is a very useful package to reshape data and to do advanced cleaning of your data. All these tidyverse functions are also called verbs. However, before getting to know these verbs, let's do an analysis using standard, or base, R functions. This will be the benchmark against which we are going to measure a {tidyverse} workflow.
4.1 A data exploration exercise using base R Let's first load the starwars data set, included in the {dplyr} package: library(dplyr) data(starwars) Then let's take a look at the data: head(starwars) ## # A tibble: 6 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo ## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi… ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color, ## # ³​eye_color, ⁴​birth_year, ⁵​homeworld This data contains information on Star Wars characters. The first task is to find the average height of the characters: mean(starwars$height) ## [1] NA As discussed in Chapter 2, $ allows you to access columns of a data.frame object. Because there are NA values in the data, the result is also NA. To get the result, you need to add an option to mean(): mean(starwars$height, na.rm = TRUE) ## [1] 174.358 Let's also take a look at the standard deviation: sd(starwars$height, na.rm = TRUE) ## [1] 34.77043 It might be more informative to compute these two statistics by sex, so for this, we are going to use aggregate(): aggregate(starwars$height, by = list(sex = starwars$sex), mean) ## sex x ## 1 female NA ## 2 hermaphroditic 175 ## 3 male NA ## 4 none NA Oh, shoot! Most groups have missing values in them, so we get NA back. We need to use na.rm = TRUE just like before. Thankfully, it is possible to pass this option to mean() inside aggregate() as well: aggregate(starwars$height, by = list(sex = starwars$sex), mean, na.rm = TRUE) ## sex x ## 1 female 169.2667 ## 2 hermaphroditic 175.0000 ## 3 male 179.1053 ## 4 none 131.2000 Later in the book, we are also going to see how to define our own functions (with the default options that are useful to us), and this will also help in this sort of situation. Even though we can use na.rm = TRUE, let's also use subset() to filter out the NA values beforehand: starwars_no_nas <- subset(starwars, !is.na(height)) aggregate(starwars_no_nas$height, by = list(sex = starwars_no_nas$sex), mean) ## sex x ## 1 female 169.2667 ## 2 hermaphroditic 175.0000 ## 3 male 179.1053 ## 4 none 131.2000 (aggregate() also has a subset = option, but I prefer to explicitly subset the data set with subset()). Even if you are not familiar with aggregate(), I believe the above lines are quite self-explanatory. You need to provide aggregate() with three things: the variable you want to summarize (or only the data frame, if you want to summarize all variables), a list of grouping variables and then the function that will be applied to each subgroup. And by the way, to test for NA, one uses the function is.na(), not something like species == "NA" or anything like that. !is.na() does the opposite (! reverses booleans, so !TRUE becomes FALSE and vice-versa).
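A tiny sketch to make this concrete (the height vector here is made up for the example):

```r
height <- c(180, NA, 165)
is.na(height)   # TRUE where a value is missing
!is.na(height)  # TRUE where a value is present
```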
You can easily add another grouping variable: aggregate(starwars_no_nas$height, by = list(Sex = starwars_no_nas$sex, `Hair color` = starwars_no_nas$hair_color), mean) ## Sex Hair color x ## 1 female auburn 150.0000 ## 2 male auburn, grey 180.0000 ## 3 male auburn, white 182.0000 ## 4 female black 166.3333 ## 5 male black 176.2500 ## 6 male blond 176.6667 ## 7 female blonde 168.0000 ## 8 female brown 160.4000 ## 9 male brown 182.6667 ## 10 male brown, grey 178.0000 ## 11 male grey 170.0000 ## 12 female none 188.2500 ## 13 male none 182.2414 ## 14 none none 148.0000 ## 15 female white 167.0000 ## 16 male white 152.3333 or use another function: aggregate(starwars_no_nas$height, by = list(Sex = starwars_no_nas$sex), sd) ## Sex x ## 1 female 15.32256 ## 2 hermaphroditic NA ## 3 male 36.01075 ## 4 none 49.14977 (let’s ignore the NAs). It is important to note that aggregate() returns a data.frame object. You can only give one function to aggregate(), so if you need the mean and the standard deviation of height, you must do it in two steps. Since R 4.1, a new infix operator |> has been introduced, which is really handy for writing the kind of code we’ve been looking at in this chapter. |> is also called a pipe, or the base pipe to distinguish it from another pipe that we’ll discuss in the next section. For now, let’s learn about |>. Consider the following: 10 |> sqrt() ## [1] 3.162278 This computes sqrt(10); so what |> does, is pass the left hand side (10, in the example above) to the right hand side (sqrt()). Using |> might seem more complicated and verbose than not using it, but you will see in a bit why it can be useful. The next function I would like to introduce at this point is with(). with() makes it possible to apply functions on data.frame columns without having to write $ all the time. For example, consider this: mean(starwars$height, na.rm = TRUE) ## [1] 174.358 with(starwars, mean(height, na.rm = TRUE)) ## [1] 174.358 The advantage of using with() is that we can directly reference height without using $. Here again, this is more verbose than simply using $… so why bother with it? It turns out that by combining |> and with(), we can write very clean and concise code. Let’s go back to a previous example to illustrate this idea: starwars_no_nas <- subset(starwars, !is.na(height)) aggregate(starwars_no_nas$height, by = list(sex = starwars_no_nas$sex), mean) ## sex x ## 1 female 169.2667 ## 2 hermaphroditic 175.0000 ## 3 male 179.1053 ## 4 none 131.2000 First, we created a new dataset where we filtered out rows where height is NA. This dataset is useless otherwise, but we need it for the next part, where we actually do what we want (computing the average height by sex). 
Using |> and with(), we can write this in one go: starwars |> subset(!is.na(sex)) |> with(aggregate(height, by = list(Species = species, Sex = sex), mean)) ## Species Sex x ## 1 Clawdite female 168.0000 ## 2 Human female NA ## 3 Kaminoan female 213.0000 ## 4 Mirialan female 168.0000 ## 5 Tholothian female 184.0000 ## 6 Togruta female 178.0000 ## 7 Twi'lek female 178.0000 ## 8 Hutt hermaphroditic 175.0000 ## 9 Aleena male 79.0000 ## 10 Besalisk male 198.0000 ## 11 Cerean male 198.0000 ## 12 Chagrian male 196.0000 ## 13 Dug male 112.0000 ## 14 Ewok male 88.0000 ## 15 Geonosian male 183.0000 ## 16 Gungan male 208.6667 ## 17 Human male NA ## 18 Iktotchi male 188.0000 ## 19 Kaleesh male 216.0000 ## 20 Kaminoan male 229.0000 ## 21 Kel Dor male 188.0000 ## 22 Mon Calamari male 180.0000 ## 23 Muun male 191.0000 ## 24 Nautolan male 196.0000 ## 25 Neimodian male 191.0000 ## 26 Pau'an male 206.0000 ## 27 Quermian male 264.0000 ## 28 Rodian male 173.0000 ## 29 Skakoan male 193.0000 ## 30 Sullustan male 160.0000 ## 31 Toong male 163.0000 ## 32 Toydarian male 137.0000 ## 33 Trandoshan male 190.0000 ## 34 Twi'lek male 180.0000 ## 35 Vulptereen male 94.0000 ## 36 Wookiee male 231.0000 ## 37 Xexto male 122.0000 ## 38 Yoda's species male 66.0000 ## 39 Zabrak male 173.0000 ## 40 Droid none NA So let's unpack this. In the first two lines, using |>, we pass the starwars data.frame to subset(): starwars |> subset(!is.na(sex)) ## # A tibble: 83 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo ## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi… ## 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi… ## 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi… ## 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi… ## 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon ## # … with 73 more rows, 4 more variables: species <chr>, films <list>, ## # vehicles <list>, starships <list>, and abbreviated variable names ## # ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld As I explained before, this is exactly the same as subset(starwars, !is.na(sex)). Then, we pass the result of subset() to the next function, with(). The first argument of with() must be a data.frame, and this is exactly what subset() returns! So now the output of subset() is passed down to with(), which makes it possible to reference the columns of the data.frame in aggregate() directly. If you have a hard time understanding what is going on, you can use quote() to inspect the code. quote() returns an expression without evaluating it: quote(log(10)) ## log(10) Why am I bringing this up? Well, since a |> f() is exactly equal to f(a), quoting the code above will return an expression without |>.
For instance: quote(10 |> log()) ## log(10) So let's quote the big block of code from above: quote( starwars |> subset(!is.na(sex)) |> with(aggregate(height, by = list(Species = species, Sex = sex), mean)) ) ## with(subset(starwars, !is.na(sex)), aggregate(height, by = list(Species = species, ## Sex = sex), mean)) I think now you see why using |> makes code much clearer; the nested expression you would need to write otherwise is much less readable, unless you define intermediate objects. And without with(), this is what you would need to write: b <- subset(starwars, !is.na(height)) aggregate(b$height, by = list(Species = b$species, Sex = b$sex), mean) To finish this section, let's say that you wanted to have the average height and mass by sex. In this case you need to specify the columns in aggregate() with cbind() (let's use na.rm = TRUE again instead of subset()ing the data beforehand): starwars |> with(aggregate(cbind(height, mass), by = list(Sex = sex), FUN = mean, na.rm = TRUE)) ## Sex height mass ## 1 female 169.2667 54.68889 ## 2 hermaphroditic 175.0000 1358.00000 ## 3 male 179.1053 81.00455 ## 4 none 131.2000 69.75000 Let's now continue with some more advanced operations using this fake dataset: survey_data_base <- as.data.frame( tibble::tribble( ~id, ~var1, ~var2, ~var3, 1, 1, 0.2, 0.3, 2, 1.4, 1.9, 4.1, 3, 0.1, 2.8, 8.9, 4, 1.7, 1.9, 7.6 ) ) survey_data_base ## id var1 var2 var3 ## 1 1 1.0 0.2 0.3 ## 2 2 1.4 1.9 4.1 ## 3 3 0.1 2.8 8.9 ## 4 4 1.7 1.9 7.6 Depending on what you want to do with this data, it is not in the right shape. For example, it would not be possible to simply compute the average of var1, var2 and var3 for each id. This would require running mean() by row, which is not very easy, because R is not really suited to row-based workflows. Well, I'm lying a little bit here; it turns out that R comes with a rowMeans() function. So this would work: survey_data_base |> transform(mean_id = rowMeans(cbind(var1, var2, var3))) # transform adds a column to a data.frame ## id var1 var2 var3 mean_id ## 1 1 1.0 0.2 0.3 0.500000 ## 2 2 1.4 1.9 4.1 2.466667 ## 3 3 0.1 2.8 8.9 3.933333 ## 4 4 1.7 1.9 7.6 3.733333 But there is no rowSD() or rowMax(), etc… so it is much better to reshape the data and put it in a format that gives us maximum flexibility. To reshape the data, we'll be using the aptly-named reshape() function: survey_data_long <- reshape(survey_data_base, varying = list(2:4), v.names = "variable", direction = "long") We can now easily compute the average of variable for each id: aggregate(survey_data_long$variable, by = list(Id = survey_data_long$id), mean) ## Id x ## 1 1 0.500000 ## 2 2 2.466667 ## 3 3 3.933333 ## 4 4 3.733333 or any other statistic: aggregate(survey_data_long$variable, by = list(Id = survey_data_long$id), max) ## Id x ## 1 1 1.0 ## 2 2 4.1 ## 3 3 8.9 ## 4 4 7.6 As you can see, R comes with very powerful functions right out of the box, ready to use. When I was studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do all this using loops (not reshaping, thankfully), which was not so easy. Now that we have seen how base R works, let's redo the analysis using {tidyverse} verbs. The {tidyverse} provides many more functions, each of them doing only one single thing. You will shortly see why this is quite important; by focusing on just one task, and by focusing on the data frame as the central object, it becomes possible to build really complex workflows, piece by piece, very easily.
But before deep diving into the {tidyverse}, let's take a moment to discuss another infix operator, %>%. 4.2 Smoking is bad for you, but pipes are your friend The title of this section might sound weird at first, but by the end of it, you'll get this (terrible) pun. You probably know the following painting by René Magritte, La trahison des images. It turns out there's an R package from the tidyverse that is called magrittr. What does this package do? This package introduced pipes to R, way before |> in R 4.1. Pipes are a concept from the Unix operating system; if you're using a GNU+Linux distribution or macOS, you're basically using a modern unix (that's an oversimplification, but I'm an economist by training, and outrageously oversimplifying things is what we do, deal with it). The magrittr pipe is written as %>%. Just like |>, %>% takes the left-hand side to feed it as the first argument of the function on the right-hand side. Try the following: library(magrittr) 16 %>% sqrt ## [1] 4 You can chain multiple functions, as you can with |>: 16 %>% sqrt %>% log ## [1] 1.386294 But unlike with |>, you can omit the (). %>% also has other features. For example, you can pipe things to other infix operators, such as +. You can use + as usual: 2 + 12 ## [1] 14 Or as a prefix operator: `+`(2, 12) ## [1] 14 You can use this notation with %>%: 16 %>% sqrt %>% `+`(18) ## [1] 22 This also works using |> since R version 4.2, but only if you use the _ pipe placeholder: 16 |> sqrt() |> `+`(x = _, 18) ## [1] 22 The value 16 got fed to sqrt(), and the output of sqrt(16) (4) got fed to `+`(18) (so we got `+`(4, 18) = 22). Without %>% you'd write the line just above like this: sqrt(16) + 18 ## [1] 22 Just like before, with |>, this might seem overly complicated, but using these pipes will make our code much more readable. I'm sure you'll be convinced by the end of this chapter. %>% is not the only pipe operator in magrittr. There's %T>%, %<>% and %$%. All have their uses, but are basically shortcuts to some common tasks with %>% plus another function. Which means that you can live without them, and because of this, I will not discuss them. 4.3 The {tidyverse}'s enfant prodige: {dplyr} The best way to get started with the tidyverse packages is to get to know {dplyr}. {dplyr} provides a lot of very useful functions that make it very easy to get descriptive statistics or add new columns to your data. 4.3.1 A first taste of data manipulation with {dplyr} This section will walk you through a typical analysis using {dplyr} functions. Just go with it; I will give more details in the next sections. First, let's load {dplyr} and the included starwars dataset.
Let's also take a look at the first few lines of the dataset: library(dplyr) data(starwars) head(starwars) ## # A tibble: 6 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo ## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi… ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color, ## # ³​eye_color, ⁴​birth_year, ⁵​homeworld data(starwars) loads the example dataset called starwars that is included in the package {dplyr}. As I said earlier, this is just an example; you could have loaded an external dataset, from a .csv file for instance. This does not matter for what comes next. As we saw earlier, R includes a lot of functions for descriptive statistics, such as mean(), sd(), cov(), and many more. What {dplyr} brings to the table is a grammar of data manipulation that makes it very easy to apply descriptive statistics functions, or any others. Just like before, we are going to compute the average height by sex: starwars %>% group_by(sex) %>% summarise(mean_height = mean(height, na.rm = TRUE)) ## # A tibble: 5 × 2 ## sex mean_height ## <chr> <dbl> ## 1 female 169. ## 2 hermaphroditic 175 ## 3 male 179. ## 4 none 131. ## 5 <NA> 181. The very nice thing about using %>% and {dplyr} verbs/functions is that this is really readable. The above three lines can be translated like so in English: Take the starwars dataset, then group by sex, then compute the mean height (for each subgroup) by omitting missing values. %>% can be translated by "then". Without %>% you would need to change the code to: summarise(group_by(starwars, sex), mean(height, na.rm = TRUE)) ## # A tibble: 5 × 2 ## sex `mean(height, na.rm = TRUE)` ## <chr> <dbl> ## 1 female 169. ## 2 hermaphroditic 175 ## 3 male 179. ## 4 none 131. ## 5 <NA> 181. Unlike with the base approach, each function does only one thing. With the base approach, aggregate() was used to also define the subgroups. This is not the case with {dplyr}; one function creates the groups (group_by()) and then one function computes the summaries (summarise()). Also, group_by() creates a specific subgroup for individuals where sex is missing. This is the last line in the data frame, where sex is NA. Another nice thing is that you can name the column containing the average height; I chose to name it mean_height. Now, let's suppose that we want to filter some data first: starwars %>% filter(gender == "masculine") %>% group_by(sex) %>% summarise(mean_height = mean(height, na.rm = TRUE)) ## # A tibble: 3 × 2 ## sex mean_height ## <chr> <dbl> ## 1 hermaphroditic 175 ## 2 male 179. ## 3 none 140 Again, the %>% makes the above lines of code very easy to read. Without it, one would need to write: summarise(group_by(filter(starwars, gender == "masculine"), sex), mean(height, na.rm = TRUE)) ## # A tibble: 3 × 2 ## sex `mean(height, na.rm = TRUE)` ## <chr> <dbl> ## 1 hermaphroditic 175 ## 2 male 179. ## 3 none 140 I think you agree with me that this is not very readable.
One way to make it more readable would be to save intermediary variables: filtered_data <- filter(starwars, gender == "masculine") grouped_data <- group_by(filtered_data, sex) summarise(grouped_data, mean(height)) ## # A tibble: 3 × 2 ## sex `mean(height)` ## <chr> <dbl> ## 1 hermaphroditic 175 ## 2 male NA ## 3 none NA But this can get very tedious. Once you're used to %>%, you won't go back to not using it. Before continuing, and to make things clearer: filter(), group_by() and summarise() are functions that are included in {dplyr}. %>% is actually a function from {magrittr}, but this package gets loaded on the fly when you load {dplyr}, so you do not need to worry about it. The results of all these operations that use {dplyr} functions are actually other datasets, or tibbles. This means that you can save them in a variable, or write them to disk, and then work with them as with any other dataset. mean_height <- starwars %>% group_by(sex) %>% summarise(mean(height)) class(mean_height) ## [1] "tbl_df" "tbl" "data.frame" head(mean_height) ## # A tibble: 5 × 2 ## sex `mean(height)` ## <chr> <dbl> ## 1 female NA ## 2 hermaphroditic 175 ## 3 male NA ## 4 none NA ## 5 <NA> NA You could then write this data to disk using rio::export() for instance. If you need more than the mean of the height, you can keep adding as many functions as needed (another advantage over aggregate()): summary_table <- starwars %>% group_by(sex) %>% summarise(mean_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) summary_table ## # A tibble: 5 × 4 ## sex mean_height var_height n_obs ## <chr> <dbl> <dbl> <int> ## 1 female 169. 235. 16 ## 2 hermaphroditic 175 NA 1 ## 3 male 179. 1297. 60 ## 4 none 131. 2416. 6 ## 5 <NA> 181. 8.33 4 I've added more functions, namely var(), to get the variance of height, and n(), which is a function from {dplyr}, not base R, to get the number of observations. This is quite useful, because we see that there is a group with only one individual. Let's focus on the sexes for which we have more than one individual. Since we saved all the previous operations (which produce a tibble) in a variable, we can keep going from there: summary_table2 <- summary_table %>% filter(n_obs > 1) summary_table2 ## # A tibble: 4 × 4 ## sex mean_height var_height n_obs ## <chr> <dbl> <dbl> <int> ## 1 female 169. 235. 16 ## 2 male 179. 1297. 60 ## 3 none 131. 2416. 6 ## 4 <NA> 181. 8.33 4 As mentioned before, there are a lot of NAs; this is because by default, mean() and var() return NA if even one single observation is NA. This is good, because it forces you to look at the data to see what is going on. If you got a number back even though there were NAs, you could very easily miss these missing values. It is better for functions to fail early and often than the opposite. This is why we keep using na.rm = TRUE for mean() and var().
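A one-line illustration of this behaviour (using a throwaway vector):

```r
mean(c(1, 2, NA))                # NA: missingness propagates by default
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5: NAs removed explicitly
```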
Now let's actually take a look at the rows where sex is NA: starwars %>% filter(is.na(sex)) ## # A tibble: 4 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Ric Olié 183 NA brown fair blue NA <NA> <NA> Naboo ## 2 Quarsh Pana… 183 NA black dark brown 62 <NA> <NA> Naboo ## 3 Sly Moore 178 48 none pale white NA <NA> <NA> Umbara ## 4 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA> ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color, ## # ³​eye_color, ⁴​birth_year, ⁵​homeworld There are only 4 rows where sex is NA. Let's ignore them: starwars %>% filter(!is.na(sex)) %>% group_by(sex) %>% summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>% filter(n_obs > 1) ## # A tibble: 3 × 4 ## sex ave_height var_height n_obs ## <chr> <dbl> <dbl> <int> ## 1 female 169. 235. 16 ## 2 male 179. 1297. 60 ## 3 none 131. 2416. 6 And why not compute the same table, but first add another stratifying variable? starwars %>% filter(!is.na(sex)) %>% group_by(sex, eye_color) %>% summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>% filter(n_obs > 1) ## `summarise()` has grouped output by 'sex'. You can override using the `.groups` ## argument. ## # A tibble: 12 × 5 ## # Groups: sex [3] ## sex eye_color ave_height var_height n_obs ## <chr> <chr> <dbl> <dbl> <int> ## 1 female black 196. 612. 2 ## 2 female blue 167 118. 6 ## 3 female brown 160 42 5 ## 4 female hazel 178 NA 2 ## 5 male black 182 1197 7 ## 6 male blue 190. 434. 12 ## 7 male brown 167. 1663. 15 ## 8 male orange 181. 1306. 7 ## 9 male red 190. 0.5 2 ## 10 male unknown 136 6498 2 ## 11 male yellow 180. 2196. 9 ## 12 none red 131 3571 3 Ok, that's it for a first taste. We have already discovered some very useful {dplyr} functions: filter(), group_by() and summarise(). Now, we are going to study these functions in more detail. 4.3.2 Filter the rows of a dataset with filter() We're going to use the Gasoline dataset from the plm package, so install that first: install.packages("plm") Then load the required data: data(Gasoline, package = "plm") and load dplyr: library(dplyr) This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load the data like this, it is a standard data.frame. {dplyr} functions can be used on standard data.frame objects, but also on tibbles. tibbles are just like data frames, but with a better print method (and other niceties). I'll discuss the {tibble} package later, but for now, let's convert the data to a tibble and change its name, and also transform the country column to lower case: gasoline <- as_tibble(Gasoline) gasoline <- gasoline %>% mutate(country = tolower(country)) filter() is pretty straightforward. What if you would like to subset the data to focus on the year 1969?
Simple: filter(gasoline, year == 1969) ## # A tibble: 18 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 3 canada 1969 4.86 -5.56 -1.04 -8.10 ## 4 denmark 1969 4.17 -5.72 -0.407 -8.47 ## 5 france 1969 3.77 -5.84 -0.315 -8.37 ## 6 germany 1969 3.90 -5.83 -0.589 -8.44 ## 7 greece 1969 4.89 -6.59 -0.180 -10.7 ## 8 ireland 1969 4.21 -6.38 -0.272 -8.95 ## 9 italy 1969 3.74 -6.28 -0.248 -8.67 ## 10 japan 1969 4.52 -6.16 -0.417 -9.61 ## 11 netherla 1969 3.99 -5.88 -0.417 -8.63 ## 12 norway 1969 4.09 -5.74 -0.338 -8.69 ## 13 spain 1969 3.99 -5.60 0.669 -9.72 ## 14 sweden 1969 3.99 -7.77 -2.73 -8.20 ## 15 switzerl 1969 4.21 -5.91 -0.918 -8.47 ## 16 turkey 1969 5.72 -7.39 -0.298 -12.5 ## 17 u.k. 1969 3.95 -6.03 -0.383 -8.47 ## 18 u.s.a. 1969 4.84 -5.41 -1.22 -7.79 Let’s use %>%, since we’re familiar with it now: gasoline %>% filter(year == 1969) ## # A tibble: 18 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 3 canada 1969 4.86 -5.56 -1.04 -8.10 ## 4 denmark 1969 4.17 -5.72 -0.407 -8.47 ## 5 france 1969 3.77 -5.84 -0.315 -8.37 ## 6 germany 1969 3.90 -5.83 -0.589 -8.44 ## 7 greece 1969 4.89 -6.59 -0.180 -10.7 ## 8 ireland 1969 4.21 -6.38 -0.272 -8.95 ## 9 italy 1969 3.74 -6.28 -0.248 -8.67 ## 10 japan 1969 4.52 -6.16 -0.417 -9.61 ## 11 netherla 1969 3.99 -5.88 -0.417 -8.63 ## 12 norway 1969 4.09 -5.74 -0.338 -8.69 ## 13 spain 1969 3.99 -5.60 0.669 -9.72 ## 14 sweden 1969 3.99 -7.77 -2.73 -8.20 ## 15 switzerl 1969 4.21 -5.91 -0.918 -8.47 ## 16 turkey 1969 5.72 -7.39 -0.298 -12.5 ## 17 u.k. 1969 3.95 -6.03 -0.383 -8.47 ## 18 u.s.a. 
1969 4.84 -5.41 -1.22 -7.79 You can also filter more than just one year by using the %in% operator: gasoline %>% filter(year %in% seq(1969, 1973)) ## # A tibble: 90 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1970 4.08 -6.08 -0.597 -8.73 ## 3 austria 1971 4.11 -6.04 -0.654 -8.64 ## 4 austria 1972 4.13 -5.98 -0.596 -8.54 ## 5 austria 1973 4.20 -5.90 -0.594 -8.49 ## 6 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 7 belgium 1970 3.87 -5.80 -0.378 -8.45 ## 8 belgium 1971 3.87 -5.76 -0.399 -8.41 ## 9 belgium 1972 3.91 -5.71 -0.311 -8.36 ## 10 belgium 1973 3.90 -5.64 -0.373 -8.31 ## # … with 80 more rows It is also possible to use between(), a helper function: gasoline %>% filter(between(year, 1969, 1973)) ## # A tibble: 90 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1970 4.08 -6.08 -0.597 -8.73 ## 3 austria 1971 4.11 -6.04 -0.654 -8.64 ## 4 austria 1972 4.13 -5.98 -0.596 -8.54 ## 5 austria 1973 4.20 -5.90 -0.594 -8.49 ## 6 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 7 belgium 1970 3.87 -5.80 -0.378 -8.45 ## 8 belgium 1971 3.87 -5.76 -0.399 -8.41 ## 9 belgium 1972 3.91 -5.71 -0.311 -8.36 ## 10 belgium 1973 3.90 -5.64 -0.373 -8.31 ## # … with 80 more rows To select non-consecutive years: gasoline %>% filter(year %in% c(1969, 1973, 1977)) ## # A tibble: 54 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1969 4.05 -6.15 -0.559 -8.79 ## 2 austria 1973 4.20 -5.90 -0.594 -8.49 ## 3 austria 1977 3.93 -5.83 -0.422 -8.25 ## 4 belgium 1969 3.85 -5.86 -0.355 -8.52 ## 5 belgium 1973 3.90 -5.64 -0.373 -8.31 ## 6 belgium 1977 3.85 -5.56 -0.432 -8.14 ## 7 canada 1969 4.86 -5.56 -1.04 -8.10 ## 8 canada 1973 4.90 -5.41 -1.13 -7.94 ## 9 canada 1977 4.81 -5.34 -1.07 -7.77 ## 10 denmark 1969 4.17 -5.72 -0.407 -8.47 ## # … with 44 more rows %in% tests if an object is part of a set. 4.3.3 Select columns with select() While filter() allows you to keep or discard rows of data, select() allows you to keep or discard entire columns.
To keep columns: gasoline %>% select(country, year, lrpmg) ## # A tibble: 342 × 3 ## country year lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows To discard them: gasoline %>% select(-country, -year, -lrpmg) ## # A tibble: 342 × 3 ## lgaspcar lincomep lcarpcap ## <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -9.77 ## 2 4.10 -6.43 -9.61 ## 3 4.07 -6.41 -9.46 ## 4 4.06 -6.37 -9.34 ## 5 4.04 -6.32 -9.24 ## 6 4.03 -6.29 -9.12 ## 7 4.05 -6.25 -9.02 ## 8 4.05 -6.23 -8.93 ## 9 4.05 -6.21 -8.85 ## 10 4.05 -6.15 -8.79 ## # … with 332 more rows To rename them: gasoline %>% select(country, date = year, lrpmg) ## # A tibble: 342 × 3 ## country date lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows There’s also rename(): gasoline %>% rename(date = year) ## # A tibble: 342 × 6 ## country date lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows rename() does not do any kind of selection, but just renames. You can also use select() to re-order columns: gasoline %>% select(year, country, lrpmg, everything()) ## # A tibble: 342 × 6 ## year country lrpmg lgaspcar lincomep lcarpcap ## <int> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1960 austria -0.335 4.17 -6.47 -9.77 ## 2 1961 austria -0.351 4.10 -6.43 -9.61 ## 3 1962 austria -0.380 4.07 -6.41 -9.46 ## 4 1963 austria -0.414 4.06 -6.37 -9.34 ## 5 1964 austria -0.445 4.04 -6.32 -9.24 ## 6 1965 austria -0.497 4.03 -6.29 -9.12 ## 7 1966 austria -0.467 4.05 -6.25 -9.02 ## 8 1967 austria -0.506 4.05 -6.23 -8.93 ## 9 1968 austria -0.522 4.05 -6.21 -8.85 ## 10 1969 austria -0.559 4.05 -6.15 -8.79 ## # … with 332 more rows everything() is a helper function, and there are also starts_with() and ends_with(). For example, what if we are only interested in columns whose names start with “l”? gasoline %>% select(starts_with("l")) ## # A tibble: 342 × 4 ## lgaspcar lincomep lrpmg lcarpcap ## <dbl> <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -0.335 -9.77 ## 2 4.10 -6.43 -0.351 -9.61 ## 3 4.07 -6.41 -0.380 -9.46 ## 4 4.06 -6.37 -0.414 -9.34 ## 5 4.04 -6.32 -0.445 -9.24 ## 6 4.03 -6.29 -0.497 -9.12 ## 7 4.05 -6.25 -0.467 -9.02 ## 8 4.05 -6.23 -0.506 -8.93 ## 9 4.05 -6.21 -0.522 -8.85 ## 10 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows ends_with() works in a similar fashion.
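As a short sketch of ends_with() on the gasoline data, the following should keep the two columns whose names end with “p”:

```r
# select columns whose names end with "p"
gasoline %>%
  select(ends_with("p"))  # lincomep and lcarpcap
```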
There is also contains(): gasoline %>% select(country, year, contains("car")) ## # A tibble: 342 × 4 ## country year lgaspcar lcarpcap ## <chr> <int> <dbl> <dbl> ## 1 austria 1960 4.17 -9.77 ## 2 austria 1961 4.10 -9.61 ## 3 austria 1962 4.07 -9.46 ## 4 austria 1963 4.06 -9.34 ## 5 austria 1964 4.04 -9.24 ## 6 austria 1965 4.03 -9.12 ## 7 austria 1966 4.05 -9.02 ## 8 austria 1967 4.05 -8.93 ## 9 austria 1968 4.05 -8.85 ## 10 austria 1969 4.05 -8.79 ## # … with 332 more rows You can read more about these helper functions here, but we’re going to look more into them in a coming section. Another verb, similar to select(), is pull(). Let’s compare the two: gasoline %>% select(lrpmg) ## # A tibble: 342 × 1 ## lrpmg ## <dbl> ## 1 -0.335 ## 2 -0.351 ## 3 -0.380 ## 4 -0.414 ## 5 -0.445 ## 6 -0.497 ## 7 -0.467 ## 8 -0.506 ## 9 -0.522 ## 10 -0.559 ## # … with 332 more rows gasoline %>% pull(lrpmg) %>% head() # using head() because there are 342 elements in total ## [1] -0.3345476 -0.3513276 -0.3795177 -0.4142514 -0.4453354 -0.4970607 pull(), unlike select(), does not return a tibble, but only the column you want, as a vector. 4.3.4 Group the observations of your dataset with group_by() group_by() is a very useful verb; as the name implies, it allows you to create groups and then, for example, compute descriptive statistics by group. For example, let’s group our data by country: gasoline %>% group_by(country) ## # A tibble: 342 × 6 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows It looks like nothing much happened, but if you look at the second line of the output you can read the following: ## # Groups: country [18] This means that the data is grouped, and every computation you will do now will take these groups into account. It is also possible to group by more than one variable: gasoline %>% group_by(country, year) ## # A tibble: 342 × 6 ## # Groups: country, year [342] ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows and so on.
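If you ever need to check how a tibble is currently grouped, {dplyr}'s group_vars() returns the grouping variables; a quick sketch:

```r
gasoline %>%
  group_by(country, year) %>%
  group_vars()  # c("country", "year")
```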
You can then also ungroup: gasoline %>% group_by(country, year) %>% ungroup() ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Once your data is grouped, the operations that will follow will be executed inside each group. 4.3.5 Get summary statistics with summarise() Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For example, one might want to compute the average gasoline consumption in each country, for the whole period: gasoline %>% group_by(country) %>% summarise(mean(lgaspcar)) ## # A tibble: 18 × 2 ## country `mean(lgaspcar)` ## <chr> <dbl> ## 1 austria 4.06 ## 2 belgium 3.92 ## 3 canada 4.86 ## 4 denmark 4.19 ## 5 france 3.82 ## 6 germany 3.89 ## 7 greece 4.88 ## 8 ireland 4.23 ## 9 italy 3.73 ## 10 japan 4.70 ## 11 netherla 4.08 ## 12 norway 4.11 ## 13 spain 4.06 ## 14 sweden 4.01 ## 15 switzerl 4.24 ## 16 turkey 5.77 ## 17 u.k. 3.98 ## 18 u.s.a. 4.82 mean() was given as an argument to summarise(), which is a {dplyr} verb. What we get is another tibble, that contains the variable we used to group, as well as the average per country. We can also rename this column: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar)) ## # A tibble: 18 × 2 ## country mean_gaspcar ## <chr> <dbl> ## 1 austria 4.06 ## 2 belgium 3.92 ## 3 canada 4.86 ## 4 denmark 4.19 ## 5 france 3.82 ## 6 germany 3.89 ## 7 greece 4.88 ## 8 ireland 4.23 ## 9 italy 3.73 ## 10 japan 4.70 ## 11 netherla 4.08 ## 12 norway 4.11 ## 13 spain 4.06 ## 14 sweden 4.01 ## 15 switzerl 4.24 ## 16 turkey 5.77 ## 17 u.k. 3.98 ## 18 u.s.a. 4.82 and because the output is a tibble, we can continue to use {dplyr} verbs on it: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar)) %>% filter(country == "france") ## # A tibble: 1 × 2 ## country mean_gaspcar ## <chr> <dbl> ## 1 france 3.82 summarise() is a very useful verb. For example, we can compute several descriptive statistics at once: gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar), sd_gaspcar = sd(lgaspcar), max_gaspcar = max(lgaspcar), min_gaspcar = min(lgaspcar)) ## # A tibble: 18 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 austria 4.06 0.0693 4.20 3.92 ## 2 belgium 3.92 0.103 4.16 3.82 ## 3 canada 4.86 0.0262 4.90 4.81 ## 4 denmark 4.19 0.158 4.50 4.00 ## 5 france 3.82 0.0499 3.91 3.75 ## 6 germany 3.89 0.0239 3.93 3.85 ## 7 greece 4.88 0.255 5.38 4.48 ## 8 ireland 4.23 0.0437 4.33 4.16 ## 9 italy 3.73 0.220 4.05 3.38 ## 10 japan 4.70 0.684 6.00 3.95 ## 11 netherla 4.08 0.286 4.65 3.71 ## 12 norway 4.11 0.123 4.44 3.96 ## 13 spain 4.06 0.317 4.75 3.62 ## 14 sweden 4.01 0.0364 4.07 3.91 ## 15 switzerl 4.24 0.102 4.44 4.05 ## 16 turkey 5.77 0.329 6.16 5.14 ## 17 u.k. 3.98 0.0479 4.10 3.91 ## 18 u.s.a. 
4.82 0.0219 4.86 4.79 Because the output is a tibble, you can save it in a variable of course: desc_gasoline <- gasoline %>% group_by(country) %>% summarise(mean_gaspcar = mean(lgaspcar), sd_gaspcar = sd(lgaspcar), max_gaspcar = max(lgaspcar), min_gaspcar = min(lgaspcar)) And then you can answer questions such as, which country has the maximum average gasoline consumption?: desc_gasoline %>% filter(max(mean_gaspcar) == mean_gaspcar) ## # A tibble: 1 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 turkey 5.77 0.329 6.16 5.14 Turns out it’s Turkey. What about the minimum consumption? desc_gasoline %>% filter(min(mean_gaspcar) == mean_gaspcar) ## # A tibble: 1 × 5 ## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 italy 3.73 0.220 4.05 3.38 Because the output of {dplyr} verbs is a tibble, it is possible to continue working with it. This is one shortcoming of using the base summary() function. The object returned by that function is not very easy to manipulate. 4.3.6 Adding columns with mutate() and transmute() mutate() adds a column to the tibble, which can contain any transformation of any other variable: gasoline %>% group_by(country) %>% mutate(n()) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap `n()` ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows Using mutate() I’ve added a column that counts how many times the country appears in the tibble, using n(), another {dplyr} function. There’s also count() and tally(), which we are going to see further down. 
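As a small preview sketch of those two: count() groups and counts in a single step, while tally() counts rows within the current grouping:

```r
# count() implicitly groups by country and counts rows
gasoline %>%
  count(country)

# tally() counts within groups created beforehand
gasoline %>%
  group_by(country) %>%
  tally()
```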
It is also possible to rename the column on the fly: gasoline %>% group_by(country) %>% mutate(count = n()) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap count ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19 ## # … with 332 more rows It is possible to do any arbitrary operation: gasoline %>% group_by(country) %>% mutate(spam = exp(lgaspcar + lincomep)) ## # A tibble: 342 × 7 ## # Groups: country [18] ## country year lgaspcar lincomep lrpmg lcarpcap spam ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 0.100 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 0.0978 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 0.0969 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 0.0991 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 0.102 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 0.104 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 0.110 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 0.113 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 0.115 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 0.122 ## # … with 332 more rows transmute() is the same as mutate(), but only returns the created variable: gasoline %>% group_by(country) %>% transmute(spam = exp(lgaspcar + lincomep)) ## # A tibble: 342 × 2 ## # Groups: country [18] ## country spam ## <chr> <dbl> ## 1 austria 0.100 ## 2 austria 0.0978 ## 3 austria 0.0969 ## 4 austria 0.0991 ## 5 austria 0.102 ## 6 austria 0.104 ## 7 austria 0.110 ## 8 austria 0.113 ## 9 austria 0.115 ## 10 austria 0.122 ## # … with 332 more rows 4.3.7 Joining tibbles with full_join(), left_join(), right_join() and all the others I will end this section on {dplyr} with some very useful verbs: the *_join() verbs. Let’s first start by loading another dataset from the plm package, SumHes, and let’s convert it to a tibble and clean up the country names: data(SumHes, package = "plm") pwt <- SumHes %>% as_tibble() %>% mutate(country = tolower(country)) Let’s take a quick look at the data: glimpse(pwt) ## Rows: 3,250 ## Columns: 7 ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19… ## $ country <chr> "algeria", "algeria", "algeria", "algeria", "algeria", "algeri… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ pop <int> 10800, 11016, 11236, 11460, 11690, 11923, 12267, 12622, 12986,… ## $ gdp <int> 1723, 1599, 1275, 1517, 1589, 1584, 1548, 1600, 1758, 1835, 18… ## $ sr <dbl> 19.9, 21.1, 15.0, 13.9, 10.6, 11.0, 8.3, 11.3, 15.1, 18.2, 19.… We can merge both gasoline and pwt by country and year, as these two variables are common to both datasets. There are more countries and years in the pwt dataset, so when merging both, and depending on which function you use, you will either have NAs for the variables where there is no match, or rows that will be dropped.
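To make that distinction concrete before joining the real datasets, here is a toy sketch with two invented tibbles (the names x, y and key are made up for the illustration):

```r
x <- tibble(key = c(1, 2), val_x = c("x1", "x2"))
y <- tibble(key = c(2, 3), val_y = c("y2", "y3"))

full_join(x, y, by = "key")   # 3 rows: unmatched keys get NAs
inner_join(x, y, by = "key")  # 1 row: unmatched keys are dropped
```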
Let’s start with full_join: gas_pwt_full <- gasoline %>% full_join(pwt, by = c("country", "year")) Let’s see which countries and years are included: gas_pwt_full %>% count(country, year) ## # A tibble: 3,307 × 3 ## country year n ## <chr> <int> <int> ## 1 algeria 1960 1 ## 2 algeria 1961 1 ## 3 algeria 1962 1 ## 4 algeria 1963 1 ## 5 algeria 1964 1 ## 6 algeria 1965 1 ## 7 algeria 1966 1 ## 8 algeria 1967 1 ## 9 algeria 1968 1 ## 10 algeria 1969 1 ## # … with 3,297 more rows As you see, every country and year was included, but what happened for, say, the U.S.S.R? This country is in pwt but not in gasoline at all: gas_pwt_full %>% filter(country == "u.s.s.r.") ## # A tibble: 26 × 11 ## country year lgaspcar lincomep lrpmg lcarp…¹ opec com pop gdp sr ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <int> <int> <dbl> ## 1 u.s.s.r. 1960 NA NA NA NA no yes 214400 2397 37.9 ## 2 u.s.s.r. 1961 NA NA NA NA no yes 217896 2542 39.4 ## 3 u.s.s.r. 1962 NA NA NA NA no yes 221449 2656 38.4 ## 4 u.s.s.r. 1963 NA NA NA NA no yes 225060 2681 38.4 ## 5 u.s.s.r. 1964 NA NA NA NA no yes 227571 2854 39.5 ## 6 u.s.s.r. 1965 NA NA NA NA no yes 230109 3049 39.9 ## 7 u.s.s.r. 1966 NA NA NA NA no yes 232676 3247 39.9 ## 8 u.s.s.r. 1967 NA NA NA NA no yes 235272 3454 40.2 ## 9 u.s.s.r. 1968 NA NA NA NA no yes 237896 3730 40.6 ## 10 u.s.s.r. 1969 NA NA NA NA no yes 240550 3808 37.9 ## # … with 16 more rows, and abbreviated variable name ¹​lcarpcap As you probably guessed, the variables from gasoline that are not included in pwt are filled with NAs. One could remove all these lines and only keep countries for which these variables are not NA everywhere with filter(), but there is a simpler solution: gas_pwt_inner <- gasoline %>% inner_join(pwt, by = c("country", "year")) Let’s use tabyl() from the {janitor} package, which is a very nice alternative to the table() function from base R: library(janitor) gas_pwt_inner %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 Only countries with values in both datasets were returned. It’s almost every country from gasoline: if you look at the table above, Germany, the Netherlands and Switzerland are missing. Germany is called “germany west” in pwt and “germany” in gasoline (I left it as is to provide an example of a country not in pwt), and the other two appear under the truncated names “netherla” and “switzerl” in gasoline, so they don’t match either.
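If you want to see which countries fail to match before joining, you can compare the key columns directly; a quick sketch using base R's setdiff():

```r
# countries present in gasoline but absent from pwt
setdiff(unique(gasoline$country), unique(pwt$country))
```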
Let’s also look at the variables: glimpse(gas_pwt_inner) ## Rows: 285 ## Columns: 11 ## $ country <chr> "austria", "austria", "austria", "austria", "austria", "austr… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.033983, 4… ## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247, -6.294… ## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.4453354, -… ## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739, -9.123… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ pop <int> 7048, 7087, 7130, 7172, 7215, 7255, 7308, 7338, 7362, 7384, 7… ## $ gdp <int> 5143, 5388, 5481, 5688, 5978, 6144, 6437, 6596, 6847, 7162, 7… ## $ sr <dbl> 24.3, 24.5, 23.3, 22.9, 25.2, 25.2, 26.7, 25.6, 25.7, 26.1, 2… The variables from both datasets are in the joined data. Contrast this to semi_join(): gas_pwt_semi <- gasoline %>% semi_join(pwt, by = c("country", "year")) glimpse(gas_pwt_semi) ## Rows: 285 ## Columns: 6 ## $ country <chr> "austria", "austria", "austria", "austria", "austria", "austr… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.033983, 4… ## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247, -6.294… ## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.4453354, -… ## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739, -9.123… gas_pwt_semi %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 Only columns of gasoline are returned, and only rows of gasoline that were matched with rows from pwt. semi_join() is not a commutative operation: pwt_gas_semi <- pwt %>% semi_join(gasoline, by = c("country", "year")) glimpse(pwt_gas_semi) ## Rows: 285 ## Columns: 7 ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19… ## $ country <chr> "canada", "canada", "canada", "canada", "canada", "canada", "c… ## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no… ## $ pop <int> 17910, 18270, 18614, 18963, 19326, 19678, 20049, 20411, 20744,… ## $ gdp <int> 7258, 7261, 7605, 7876, 8244, 8664, 9093, 9231, 9582, 9975, 10… ## $ sr <dbl> 22.7, 21.5, 22.1, 21.9, 22.9, 24.8, 25.4, 23.1, 22.6, 23.4, 21… gas_pwt_semi %>% tabyl(country) ## country n percent ## austria 19 0.06666667 ## belgium 19 0.06666667 ## canada 19 0.06666667 ## denmark 19 0.06666667 ## france 19 0.06666667 ## greece 19 0.06666667 ## ireland 19 0.06666667 ## italy 19 0.06666667 ## japan 19 0.06666667 ## norway 19 0.06666667 ## spain 19 0.06666667 ## sweden 19 0.06666667 ## turkey 19 0.06666667 ## u.k. 19 0.06666667 ## u.s.a. 19 0.06666667 The rows are the same, but not the columns. 
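A useful way to think about semi_join() is as a filtering join. As a sketch, here is a rough equivalent built from verbs we already know; joining on de-duplicated keys keeps the same rows while adding no new columns:

```r
# roughly equivalent to semi_join(gasoline, pwt, by = c("country", "year"))
gasoline %>%
  inner_join(distinct(pwt, country, year),
             by = c("country", "year"))
```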
left_join() and right_join() return all the rows from either the dataset that is on the “left” (the first argument of the function) or on the “right” (the second argument of the function), and all columns from both datasets. So depending on which countries you’re interested in, you’re going to use either one of these functions: gas_pwt_left <- gasoline %>% left_join(pwt, by = c("country", "year")) gas_pwt_left %>% tabyl(country) ## country n percent ## austria 19 0.05555556 ## belgium 19 0.05555556 ## canada 19 0.05555556 ## denmark 19 0.05555556 ## france 19 0.05555556 ## germany 19 0.05555556 ## greece 19 0.05555556 ## ireland 19 0.05555556 ## italy 19 0.05555556 ## japan 19 0.05555556 ## netherla 19 0.05555556 ## norway 19 0.05555556 ## spain 19 0.05555556 ## sweden 19 0.05555556 ## switzerl 19 0.05555556 ## turkey 19 0.05555556 ## u.k. 19 0.05555556 ## u.s.a. 19 0.05555556 gas_pwt_right <- gasoline %>% right_join(pwt, by = c("country", "year")) gas_pwt_right %>% tabyl(country) %>% head() ## country n percent ## algeria 26 0.008 ## angola 26 0.008 ## argentina 26 0.008 ## australia 26 0.008 ## austria 26 0.008 ## bangladesh 26 0.008 The last merge function is anti_join(): gas_pwt_anti <- gasoline %>% anti_join(pwt, by = c("country", "year")) glimpse(gas_pwt_anti) ## Rows: 57 ## Columns: 6 ## $ country <chr> "germany", "germany", "germany", "germany", "germany", "germa… ## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1… ## $ lgaspcar <dbl> 3.916953, 3.885345, 3.871484, 3.848782, 3.868993, 3.861049, 3… ## $ lincomep <dbl> -6.159837, -6.120923, -6.094258, -6.068361, -6.013442, -5.966… ## $ lrpmg <dbl> -0.1859108, -0.2309538, -0.3438417, -0.3746467, -0.3996526, -… ## $ lcarpcap <dbl> -9.342481, -9.183841, -9.037280, -8.913630, -8.811013, -8.711… gas_pwt_anti %>% tabyl(country) ## country n percent ## germany 19 0.3333333 ## netherla 19 0.3333333 ## switzerl 19 0.3333333 gas_pwt_anti has the columns of the gasoline dataset, as well as the rows for the countries from gasoline that are not matched in pwt: “germany”, “netherla” and “switzerl”. That was it for the basic {dplyr} verbs. Next, we’re going to learn about {tidyr}. 4.4 Reshaping and sprucing up data with {tidyr} Note: this section is going to be a lot harder than anything you’ve seen until now. Reshaping data is tricky, and to really grok it, you need time, and you need to run each line, and see what happens. Take your time, and don’t be discouraged. Another important package from the {tidyverse} that goes hand in hand with {dplyr} is {tidyr}. {tidyr} is the package you need when it’s time to reshape data. I will start by presenting pivot_wider() and pivot_longer(). 4.4.1 pivot_wider() and pivot_longer() Let’s first create a fake dataset: library(tidyr) survey_data <- tribble( ~id, ~variable, ~value, 1, "var1", 1, 1, "var2", 0.2, NA, "var3", 0.3, 2, "var1", 1.4, 2, "var2", 1.9, 2, "var3", 4.1, 3, "var1", 0.1, 3, "var2", 2.8, 3, "var3", 8.9, 4, "var1", 1.7, NA, "var2", 1.9, 4, "var3", 7.6 ) head(survey_data) ## # A tibble: 6 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 I used the tribble() function from the {tibble} package to create this fake dataset. I’ll discuss this package later, for now, let’s focus on {tidyr}. Let’s suppose that we need the data to be in the wide format which means var1, var2 and var3 need to be their own columns. To do this, we need to use the pivot_wider() function. Why wide?
Because the data set will be wide, meaning it will have more columns than rows. survey_data %>% pivot_wider(id_cols = id, names_from = variable, values_from = value) ## # A tibble: 5 × 4 ## id var1 var2 var3 ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 0.2 NA ## 2 NA NA 1.9 0.3 ## 3 2 1.4 1.9 4.1 ## 4 3 0.1 2.8 8.9 ## 5 4 1.7 NA 7.6 Let’s go through pivot_wider()’s arguments: the first is id_cols =, which requires the variable that uniquely identifies the rows to be supplied. names_from = is where you input the variable that will generate the names of the new columns. In our case, the variable column has three values; var1, var2 and var3, and these are now the names of the new columns. Finally, values_from = is where you can specify the column containing the values that will fill the data frame. I find the argument names names_from = and values_from = quite explicit. As you can see, there are some missing values. Let’s suppose that we know that these missing values are true 0’s. pivot_wider() has an argument called values_fill =, which makes it easy to replace the missing values: survey_data %>% pivot_wider(id_cols = id, names_from = variable, values_from = value, values_fill = list(value = 0)) ## # A tibble: 5 × 4 ## id var1 var2 var3 ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 0.2 0 ## 2 NA 0 1.9 0.3 ## 3 2 1.4 1.9 4.1 ## 4 3 0.1 2.8 8.9 ## 5 4 1.7 0 7.6 A list of variables and their respective values to replace NA’s with must be supplied to values_fill. Let’s now use another dataset, which you can get from here (downloaded from: http://www.statistiques.public.lu/stat/TableViewer/tableView.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91). This data set gives the unemployment rate for each Luxembourgish canton from 2001 to 2015. We will come back to this data later on to learn how to plot it. For now, let’s use it to learn more about {tidyr}. unemp_lux_data <- rio::import( "https://raw.githubusercontent.com/b-rodrigues/modern_R/master/datasets/unemployment/all/unemployment_lux_all.csv" ) head(unemp_lux_data) ## division year active_population of_which_non_wage_earners ## 1 Beaufort 2001 688 85 ## 2 Beaufort 2002 742 85 ## 3 Beaufort 2003 773 85 ## 4 Beaufort 2004 828 80 ## 5 Beaufort 2005 866 96 ## 6 Beaufort 2006 893 87 ## of_which_wage_earners total_employed_population unemployed ## 1 568 653 35 ## 2 631 716 26 ## 3 648 733 40 ## 4 706 786 42 ## 5 719 815 51 ## 6 746 833 60 ## unemployment_rate_in_percent ## 1 5.09 ## 2 3.50 ## 3 5.17 ## 4 5.07 ## 5 5.89 ## 6 6.72 Now, let’s suppose that for our purposes, it would make more sense to have the data in a wide format, where columns are “division times year” and the value is the unemployment rate. This can easily be done by providing more columns to names_from =.
unemp_lux_data2 <- unemp_lux_data %>% filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>% select(division, year, unemployment_rate_in_percent) %>% rowid_to_column() unemp_lux_data2 %>% pivot_wider(names_from = c(division, year), values_from = unemployment_rate_in_percent) ## # A tibble: 48 × 49 ## rowid Bertr…¹ Bertr…² Bertr…³ Diffe…⁴ Diffe…⁵ Diffe…⁶ Dudel…⁷ Dudel…⁸ Dudel…⁹ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 5.69 NA NA NA NA NA NA NA NA ## 2 2 NA 5.65 NA NA NA NA NA NA NA ## 3 3 NA NA 5.35 NA NA NA NA NA NA ## 4 4 NA NA NA 13.2 NA NA NA NA NA ## 5 5 NA NA NA NA 12.6 NA NA NA NA ## 6 6 NA NA NA NA NA 11.4 NA NA NA ## 7 7 NA NA NA NA NA NA 9.35 NA NA ## 8 8 NA NA NA NA NA NA NA 9.37 NA ## 9 9 NA NA NA NA NA NA NA NA 8.53 ## 10 10 NA NA NA NA NA NA NA NA NA ## # … with 38 more rows, 39 more variables: Frisange_2013 <dbl>, ## # Frisange_2014 <dbl>, Frisange_2015 <dbl>, Hesperange_2013 <dbl>, ## # Hesperange_2014 <dbl>, Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, ## # Leudelange_2014 <dbl>, Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, ## # Mondercange_2014 <dbl>, Mondercange_2015 <dbl>, Pétange_2013 <dbl>, ## # Pétange_2014 <dbl>, Pétange_2015 <dbl>, Rumelange_2013 <dbl>, ## # Rumelange_2014 <dbl>, Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, … In the filter() statement, I only kept data from 2013 to 2017, “division”s ending with the string “ange” (“division” can be a canton or a commune, for example “Canton Redange”, a canton, or “Hesperange” a commune), and removed the cantons as I’m only interested in communes. If you don’t understand this filter() statement, don’t fret; this is not important for what follows. I then only kept the columns I’m interested in and pivoted the data to a wide format. Also, I needed to add a unique identifier to the data frame. For this, I used the rowid_to_column() function, from the {tibble} package, which adds a new column to the data frame with an id, going from 1 to the number of rows in the data frame. If I did not add this identifier, the statement would still work: unemp_lux_data3 <- unemp_lux_data %>% filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>% select(division, year, unemployment_rate_in_percent) unemp_lux_data3 %>% pivot_wider(names_from = c(division, year), values_from = unemployment_rate_in_percent) ## # A tibble: 1 × 48 ## Bertrange_2013 Bertr…¹ Bertr…² Diffe…³ Diffe…⁴ Diffe…⁵ Dudel…⁶ Dudel…⁷ Dudel…⁸ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5.69 5.65 5.35 13.2 12.6 11.4 9.35 9.37 8.53 ## # … with 39 more variables: Frisange_2013 <dbl>, Frisange_2014 <dbl>, ## # Frisange_2015 <dbl>, Hesperange_2013 <dbl>, Hesperange_2014 <dbl>, ## # Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, Leudelange_2014 <dbl>, ## # Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, Mondercange_2014 <dbl>, ## # Mondercange_2015 <dbl>, Pétange_2013 <dbl>, Pétange_2014 <dbl>, ## # Pétange_2015 <dbl>, Rumelange_2013 <dbl>, Rumelange_2014 <dbl>, ## # Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, Schifflange_2014 <dbl>, … and it actually looks even better, but only because there are no repeated values; there is only one unemployment rate for each “commune times year”. I will come back to this later on, with another example that might be clearer. These last two code blocks are intense; make sure you go through each line step by step and understand what is going on.
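One more option worth knowing about, in case your rows are not uniquely identified and adding a row id is not what you want: pivot_wider() has a values_fn = argument that summarises duplicate values instead. A hedged sketch, reusing unemp_lux_data3 from above:

```r
# aggregate duplicates with mean(); here each cell holds a single
# value anyway, so mean() simply returns it
unemp_lux_data3 %>%
  pivot_wider(names_from = c(division, year),
              values_from = unemployment_rate_in_percent,
              values_fn = mean)
```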
You might have noticed that because there is no data for the years 2016 and 2017, these columns do not appear in the data. But suppose that we need to have these columns, so that a colleague from another department can fill in the values. This is possible by providing a data frame with the detailed specifications of the result data frame. This optional data frame must have at least two columns: .name, which contains the column names you want, and .value, which contains the values. Also, the function that uses this spec is pivot_wider_spec(), and not pivot_wider(). unemp_spec <- unemp_lux_data %>% tidyr::expand(division, year = c(year, 2016, 2017), .value = "unemployment_rate_in_percent") %>% unite(".name", division, year, remove = FALSE) unemp_spec Here, I use another function, tidyr::expand(), which returns every combination (the Cartesian product) of the variables from a dataset. To make it work, we still need to create a column that uniquely identifies each row in the data: unemp_lux_data4 <- unemp_lux_data %>% select(division, year, unemployment_rate_in_percent) %>% rowid_to_column() %>% pivot_wider_spec(spec = unemp_spec) unemp_lux_data4 ## # A tibble: 1,770 × 2,007 ## rowid Beauf…¹ Beauf…² Beauf…³ Beauf…⁴ Beauf…⁵ Beauf…⁶ Beauf…⁷ Beauf…⁸ Beauf…⁹ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 5.09 NA NA NA NA NA NA NA NA ## 2 2 NA 3.5 NA NA NA NA NA NA NA ## 3 3 NA NA 5.17 NA NA NA NA NA NA ## 4 4 NA NA NA 5.07 NA NA NA NA NA ## 5 5 NA NA NA NA 5.89 NA NA NA NA ## 6 6 NA NA NA NA NA 6.72 NA NA NA ## 7 7 NA NA NA NA NA NA 4.3 NA NA ## 8 8 NA NA NA NA NA NA NA 7.08 NA ## 9 9 NA NA NA NA NA NA NA NA 8.52 ## 10 10 NA NA NA NA NA NA NA NA NA ## # … with 1,760 more rows, 1,997 more variables: Beaufort_2010 <dbl>, ## # Beaufort_2011 <dbl>, Beaufort_2012 <dbl>, Beaufort_2013 <dbl>, ## # Beaufort_2014 <dbl>, Beaufort_2015 <dbl>, Beaufort_2016 <dbl>, ## # Beaufort_2017 <dbl>, Bech_2001 <dbl>, Bech_2002 <dbl>, Bech_2003 <dbl>, ## # Bech_2004 <dbl>, Bech_2005 <dbl>, Bech_2006 <dbl>, Bech_2007 <dbl>, ## # Bech_2008 <dbl>, Bech_2009 <dbl>, Bech_2010 <dbl>, Bech_2011 <dbl>, ## # Bech_2012 <dbl>, Bech_2013 <dbl>, Bech_2014 <dbl>, Bech_2015 <dbl>, … You will notice that we now have columns for 2016 and 2017 too. Let’s clean the data a little bit more: unemp_lux_data4 %>% select(-rowid) %>% fill(matches(".*"), .direction = "down") %>% slice(n()) ## # A tibble: 1 × 2,006 ## Beaufort_2001 Beaufo…¹ Beauf…² Beauf…³ Beauf…⁴ Beauf…⁵ Beauf…⁶ Beauf…⁷ Beauf…⁸ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 5.09 3.5 5.17 5.07 5.89 6.72 4.3 7.08 8.52 ## # … with 1,997 more variables: Beaufort_2010 <dbl>, Beaufort_2011 <dbl>, ## # Beaufort_2012 <dbl>, Beaufort_2013 <dbl>, Beaufort_2014 <dbl>, ## # Beaufort_2015 <dbl>, Beaufort_2016 <dbl>, Beaufort_2017 <dbl>, ## # Bech_2001 <dbl>, Bech_2002 <dbl>, Bech_2003 <dbl>, Bech_2004 <dbl>, ## # Bech_2005 <dbl>, Bech_2006 <dbl>, Bech_2007 <dbl>, Bech_2008 <dbl>, ## # Bech_2009 <dbl>, Bech_2010 <dbl>, Bech_2011 <dbl>, Bech_2012 <dbl>, ## # Bech_2013 <dbl>, Bech_2014 <dbl>, Bech_2015 <dbl>, Bech_2016 <dbl>, … We will learn about fill(), another {tidyr} function, a bit later in this chapter, but its basic purpose is to fill rows with whatever value comes before or after the missing values. slice(n()) then only keeps the last row of the data frame, which is the row that contains all the values (except for 2016 and 2017, which have missing values, as we wanted).
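For simpler cases, {tidyr} also provides complete(), which adds the missing combinations directly to a long data frame, without going through a spec; a minimal sketch:

```r
# add rows for every division/year combination, including 2016 and
# 2017; the new cells are filled with NA
unemp_lux_data %>%
  select(division, year, unemployment_rate_in_percent) %>%
  complete(division, year = c(unique(year), 2016, 2017))
```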
Here is another example of the importance of having an identifier column when using a spec: data(mtcars) mtcars_spec <- mtcars %>% tidyr::expand(am, cyl, .value = "mpg") %>% unite(".name", am, cyl, remove = FALSE) mtcars_spec We can now transform the data: mtcars %>% pivot_wider_spec(spec = mtcars_spec) ## # A tibble: 32 × 14 ## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4` `1_6` ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA 21 ## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA 21 ## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8 NA ## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA NA ## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA NA ## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA NA ## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA NA ## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA NA ## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA NA ## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA NA ## # … with 22 more rows, and 1 more variable: `1_8` <dbl> As you can see, there are several values of “mpg” for some combinations of “am” times “cyl”. If we remove the other columns, each row will not be uniquely identified anymore. This results in a warning message, and a tibble that contains list-columns: mtcars %>% select(am, cyl, mpg) %>% pivot_wider_spec(spec = mtcars_spec) ## Warning: Values from `mpg` are not uniquely identified; output will contain list-cols. ## * Use `values_fn = list` to suppress this warning. ## * Use `values_fn = {summary_fun}` to summarise duplicates. ## * Use the following dplyr code to identify duplicates. ## {data} %>% ## dplyr::group_by(am, cyl) %>% ## dplyr::summarise(n = dplyr::n(), .groups = "drop") %>% ## dplyr::filter(n > 1L) ## # A tibble: 1 × 6 ## `0_4` `0_6` `0_8` `1_4` `1_6` `1_8` ## <list> <list> <list> <list> <list> <list> ## 1 <dbl [3]> <dbl [4]> <dbl [12]> <dbl [8]> <dbl [3]> <dbl [2]> We are going to learn about list-columns in the next section. List-columns are very powerful, and mastering them will be important. But generally speaking, when reshaping data, if you get list-columns back it often means that something went wrong. So you have to be careful with this. pivot_longer() is used when you need to go from a wide to a long dataset, meaning a dataset where some columns should not be columns, but rather the levels of a factor variable. Let’s suppose that the “am” column is split into two columns, 0 for automatic and 1 for manual transmissions, and that the values filling these columns are miles per gallon, “mpg”: mtcars_wide_am <- mtcars %>% pivot_wider(names_from = am, values_from = mpg) mtcars_wide_am %>% select(`0`, `1`, everything()) ## # A tibble: 32 × 11 ## `0` `1` cyl disp hp drat wt qsec vs gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 NA 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 NA 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 4 21.4 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 5 18.7 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 6 18.1 NA 6 225 105 2.76 3.46 20.2 1 3 1 ## 7 14.3 NA 8 360 245 3.21 3.57 15.8 0 3 4 ## 8 24.4 NA 4 147. 62 3.69 3.19 20 1 4 2 ## 9 22.8 NA 4 141. 95 3.92 3.15 22.9 1 4 2 ## 10 19.2 NA 6 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows As you can see, the “0” and “1” columns should not be their own columns, unless there is a very specific and good reason they should… but rather, they should be the levels of another column (in our case, “am”).
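The idea is easier to see on a tiny invented example first (toy_wide is made up for the illustration):

```r
toy_wide <- tibble(id = c(1, 2), a = c(10, 20), b = c(30, 40))

toy_wide %>%
  pivot_longer(cols = c(a, b), names_to = "key", values_to = "value")
# 4 rows: each id now has one row per former column
```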
We can go back to a long dataset like so: mtcars_wide_am %>% pivot_longer(cols = c(`1`, `0`), names_to = "am", values_to = "mpg") %>% select(am, mpg, everything()) ## # A tibble: 64 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4 ## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4 ## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1 ## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## # … with 54 more rows In the cols argument, you need to list all the variables that need to be transformed. Only 1 and 0 must be pivoted, so I list them. Just for illustration purposes, imagine that we would need to pivot 50 columns. It would be faster to list the columns that do not need to be pivoted. This can be achieved by listing the columns that must be excluded with - in front, and maybe using matches() with a regular expression: mtcars_wide_am %>% pivot_longer(cols = -matches("^[[:alpha:]]"), names_to = "am", values_to = "mpg") %>% select(am, mpg, everything()) ## # A tibble: 64 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4 ## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4 ## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1 ## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1 ## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2 ## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## # … with 54 more rows Every column that starts with a letter is ok, so there is no need to pivot them. I use the matches() function with a regular expression so that I don’t have to type the names of all the columns. select() is used to re-order the columns, only for viewing purposes. names_to = takes a string as argument, which will be the name of the new column containing the levels 0 and 1, and values_to = also takes a string as argument, which will be the name of the column containing the values. Finally, you can see that there are a lot of NAs in the output. These can be removed easily: mtcars_wide_am %>% pivot_longer(cols = c(`1`, `0`), names_to = "am", values_to = "mpg", values_drop_na = TRUE) %>% select(am, mpg, everything()) ## # A tibble: 32 × 11 ## am mpg cyl disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 21 6 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 18.1 6 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 14.3 8 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 24.4 4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 19.2 6 168.
123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Now for a more advanced example, let’s suppose that we are dealing with the following wide dataset: mtcars_wide <- mtcars %>% pivot_wider_spec(spec = mtcars_spec) mtcars_wide ## # A tibble: 32 × 14 ## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4` `1_6` ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA 21 ## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA 21 ## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8 NA ## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA NA ## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA NA ## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA NA ## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA NA ## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA NA ## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA NA ## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA NA ## # … with 22 more rows, and 1 more variable: `1_8` <dbl> The difficulty here is that we have columns with two levels of information. For instance, the column “0_4” contains the miles per gallon values for manual cars (0) with 4 cylinders. The first step is to first pivot the columns: mtcars_wide %>% pivot_longer(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", values_drop_na = TRUE) %>% select(am_cyl, mpg, everything()) ## # A tibble: 32 × 10 ## am_cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1_6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1_6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1_4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0_6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0_8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0_6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0_8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0_4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0_4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0_6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Now we only need to separate the “am_cyl” column into two new columns, “am” and “cyl”: mtcars_wide %>% pivot_longer(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", values_drop_na = TRUE) %>% separate(am_cyl, into = c("am", "cyl"), sep = "_") %>% select(am, cyl, mpg, everything()) ## # A tibble: 32 × 11 ## am cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows It is also possible to construct a specification data frame, just like for pivot_wider_spec(). 
This time, I’m using the build_longer_spec() function that makes it easy to build specifications: mtcars_spec_long <- mtcars_wide %>% build_longer_spec(matches("0|1"), values_to = "mpg") %>% separate(name, c("am", "cyl"), sep = "_") mtcars_spec_long ## # A tibble: 6 × 4 ## .name .value am cyl ## <chr> <chr> <chr> <chr> ## 1 0_4 mpg 0 4 ## 2 0_6 mpg 0 6 ## 3 0_8 mpg 0 8 ## 4 1_4 mpg 1 4 ## 5 1_6 mpg 1 6 ## 6 1_8 mpg 1 8 This spec can now be passed to pivot_longer_spec(): mtcars_wide %>% pivot_longer_spec(spec = mtcars_spec_long, values_drop_na = TRUE) %>% select(am, cyl, mpg, everything()) ## # A tibble: 32 × 11 ## am cyl mpg disp hp drat wt qsec vs gear carb ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4 ## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4 ## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1 ## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1 ## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2 ## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1 ## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4 ## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2 ## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2 ## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4 ## # … with 22 more rows Defining specifications gives a lot of flexibility, and in some complicated cases it is the way to go. 4.4.2 fill() and full_seq() fill() is pretty useful to… fill in missing values. For instance, in survey_data, some “id”s are missing: survey_data ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 It seems pretty obvious that the first NA is supposed to be 1 and the second one is supposed to be 4. With fill(), this is pretty easy to achieve: survey_data %>% fill(.direction = "down", id) full_seq() is similar: full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1) ## [1] "2018-08-01" "2018-08-02" "2018-08-03" We can add this as the date column to our survey data: survey_data %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) ## # A tibble: 12 × 4 ## id variable value date ## <dbl> <chr> <dbl> <date> ## 1 1 var1 1 2018-08-01 ## 2 1 var2 0.2 2018-08-02 ## 3 NA var3 0.3 2018-08-03 ## 4 2 var1 1.4 2018-08-01 ## 5 2 var2 1.9 2018-08-02 ## 6 2 var3 4.1 2018-08-03 ## 7 3 var1 0.1 2018-08-01 ## 8 3 var2 2.8 2018-08-02 ## 9 3 var3 8.9 2018-08-03 ## 10 4 var1 1.7 2018-08-01 ## 11 NA var2 1.9 2018-08-02 ## 12 4 var3 7.6 2018-08-03 I use the base rep() function to repeat the date 4 times and then, using mutate(), I have added it to the data frame. Putting all these operations together: survey_data %>% fill(.direction = "down", id) %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) ## # A tibble: 12 × 4 ## id variable value date ## <dbl> <chr> <dbl> <date> ## 1 1 var1 1 2018-08-01 ## 2 1 var2 0.2 2018-08-02 ## 3 1 var3 0.3 2018-08-03 ## 4 2 var1 1.4 2018-08-01 ## 5 2 var2 1.9 2018-08-02 ## 6 2 var3 4.1 2018-08-03 ## 7 3 var1 0.1 2018-08-01 ## 8 3 var2 2.8 2018-08-02 ## 9 3 var3 8.9 2018-08-03 ## 10 4 var1 1.7 2018-08-01 ## 11 4 var2 1.9 2018-08-02 ## 12 4 var3 7.6 2018-08-03 You should be careful when imputing missing values though. The method described above is called Last Observation Carried Forward, and sometimes it makes sense, like here, but sometimes it doesn’t and doing this will introduce bias in your analysis.
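Note that fill()'s .direction = argument also accepts "up", "downup" and "updown", which matters when the known value comes after the gap rather than before it; a quick sketch:

```r
# Next Observation Carried Backward: missing ids take the value
# that follows them instead of the one that precedes them
survey_data %>%
  fill(id, .direction = "up")
```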
Discussing how to handle missing values in your analysis is outside of the scope of this book, but there are many resources available; you may want to check out the vignettes of the {mice} package to get you started. 4.4.3 Put order in your columns with separate(), unite(), and in your rows with separate_rows() Sometimes, data can be in a format that makes working with it needlessly painful. For example, you get this: survey_data_not_tidy ## # A tibble: 12 × 3 ## id variable_date value ## <dbl> <chr> <dbl> ## 1 1 var1/2018-08-01 1 ## 2 1 var2/2018-08-02 0.2 ## 3 1 var3/2018-08-03 0.3 ## 4 2 var1/2018-08-01 1.4 ## 5 2 var2/2018-08-02 1.9 ## 6 2 var3/2018-08-03 4.1 ## 7 3 var1/2018-08-01 0.1 ## 8 3 var2/2018-08-02 2.8 ## 9 3 var3/2018-08-03 8.9 ## 10 4 var1/2018-08-01 1.7 ## 11 4 var2/2018-08-02 1.9 ## 12 4 var3/2018-08-03 7.6 Dealing with this is simple, thanks to separate(): survey_data_not_tidy %>% separate(variable_date, into = c("variable", "date"), sep = "/") ## # A tibble: 12 × 4 ## id variable date value ## <dbl> <chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 The variable_date column gets separated into two columns, variable and date. One also needs to specify the separator, in this case “/”. unite() is the reverse operation, which can be useful when you are confronted with this situation: survey_data2 ## # A tibble: 12 × 6 ## id variable year month day value ## <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 1 var1 2018 08 01 1 ## 2 1 var2 2018 08 02 0.2 ## 3 1 var3 2018 08 03 0.3 ## 4 2 var1 2018 08 01 1.4 ## 5 2 var2 2018 08 02 1.9 ## 6 2 var3 2018 08 03 4.1 ## 7 3 var1 2018 08 01 0.1 ## 8 3 var2 2018 08 02 2.8 ## 9 3 var3 2018 08 03 8.9 ## 10 4 var1 2018 08 01 1.7 ## 11 4 var2 2018 08 02 1.9 ## 12 4 var3 2018 08 03 7.6 In some situations, it is better to have the date as a single column: survey_data2 %>% unite(date, year, month, day, sep = "-") ## # A tibble: 12 × 4 ## id variable date value ## <dbl> <chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 Another awful situation is the following: survey_data_from_hell ## id variable value ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1, var2, var3 1.4, 1.9, 4.1 ## 5 3 var1, var2 0.1, 2.8 ## 6 3 var3 8.9 ## 7 4 var1 1.7 ## 8 NA var2 1.9 ## 9 4 var3 7.6 separate_rows() saves the day: survey_data_from_hell %>% separate_rows(variable, value) ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <chr> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 So to summarise… you can go from this: survey_data_from_hell ## id variable value ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1, var2, var3 1.4, 1.9, 4.1 ## 5 3 var1, var2 0.1, 2.8 ## 6 3 var3 8.9 ## 7 4 var1 1.7 ## 8 NA var2 1.9 ## 9 4 var3 7.6 to this: survey_data_clean ## # A tibble: 12 × 4 ## id variable date value ## <dbl>
<chr> <chr> <dbl> ## 1 1 var1 2018-08-01 1 ## 2 1 var2 2018-08-02 0.2 ## 3 1 var3 2018-08-03 0.3 ## 4 2 var1 2018-08-01 1.4 ## 5 2 var2 2018-08-02 1.9 ## 6 2 var3 2018-08-03 4.1 ## 7 3 var1 2018-08-01 0.1 ## 8 3 var2 2018-08-02 2.8 ## 9 3 var3 2018-08-03 8.9 ## 10 4 var1 2018-08-01 1.7 ## 11 4 var2 2018-08-02 1.9 ## 12 4 var3 2018-08-03 7.6 quite easily: survey_data_from_hell %>% separate_rows(variable, value, convert = TRUE) %>% fill(.direction = "down", id) %>% mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) 4.5 Working on many columns with if_any(), if_all() and across() 4.5.1 Filtering rows where several columns verify a condition Let’s go back to the gasoline data from the {Ecdat} package. When using filter(), it is only possible to filter one column at a time. For example, you can only filter rows where a column equals “France” for instance. But suppose that we have a condition that we want to use to filter on a lot of columns at once. For example, for every column that is of type numeric, keep only the lines where the condition value > -8 is satisfied. The next line does that: gasoline %>% filter(if_any(where(is.numeric), \(x)(`>`(x, -8)))) ## # A tibble: 342 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows The above code is using the if_any() function, included in {dplyr}. It also uses where(), which must be used for predicate functions like is.numeric() or is.character(), etc. You can think of if_any() as a function that helps you select the columns to which to apply the function. You can read the code above like this: Start with the gasoline data, then keep the rows where the value is greater than -8 in any of the columns which are numeric. if_any(), if_all() and across() make operations like these very easy to achieve. Sometimes, you’d want to filter rows from columns whose labels end with a letter, for instance “p”. This can again be achieved using another helper, ends_with(), instead of where(): gasoline %>% filter(if_any(ends_with("p"), \(x)(`>`(x, -8)))) ## # A tibble: 340 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 330 more rows We already know about ends_with() and starts_with(). So the above line means “for the columns whose name ends with a ‘p’, only keep the lines where, for at least one of the selected columns, the values are strictly greater than -8”. if_all() works exactly the same way, but think of the conditions in if_all() as being combined with and, while the conditions in if_any() are combined with or.
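By the way, the anonymous function can also be written with the {purrr}-style formula shorthand, which you will encounter a lot in tidyverse code; this sketch is equivalent to the if_any() call above:

```r
# ~ .x > -8 is shorthand for \(x) x > -8
gasoline %>%
  filter(if_any(where(is.numeric), ~ .x > -8))
```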
So for example, the code above, where if_any() is replaced by if_all(), results in a much smaller data frame: gasoline %>% filter(if_all(ends_with("p"), \(x)(`>`(x, -8)))) ## # A tibble: 30 × 6 ## country year lgaspcar lincomep lrpmg lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 canada 1972 4.89 -5.44 -1.10 -7.99 ## 2 canada 1973 4.90 -5.41 -1.13 -7.94 ## 3 canada 1974 4.89 -5.42 -1.12 -7.90 ## 4 canada 1975 4.89 -5.38 -1.19 -7.87 ## 5 canada 1976 4.84 -5.36 -1.06 -7.81 ## 6 canada 1977 4.81 -5.34 -1.07 -7.77 ## 7 canada 1978 4.86 -5.31 -1.07 -7.79 ## 8 germany 1978 3.88 -5.56 -0.628 -7.95 ## 9 sweden 1975 3.97 -7.68 -2.77 -7.99 ## 10 sweden 1976 3.98 -7.67 -2.82 -7.96 ## # … with 20 more rows because here, we only keep rows where ALL of the columns ending with “p” are simultaneously greater than -8. 4.5.2 Selecting several columns at once In a previous section we already played around a little bit with select() and some helpers, everything(), starts_with() and ends_with(). But there are many ways that you can use helper functions to select several columns easily: gasoline %>% select(where(is.numeric)) ## # A tibble: 342 × 5 ## year lgaspcar lincomep lrpmg lcarpcap ## <int> <dbl> <dbl> <dbl> <dbl> ## 1 1960 4.17 -6.47 -0.335 -9.77 ## 2 1961 4.10 -6.43 -0.351 -9.61 ## 3 1962 4.07 -6.41 -0.380 -9.46 ## 4 1963 4.06 -6.37 -0.414 -9.34 ## 5 1964 4.04 -6.32 -0.445 -9.24 ## 6 1965 4.03 -6.29 -0.497 -9.12 ## 7 1966 4.05 -6.25 -0.467 -9.02 ## 8 1967 4.05 -6.23 -0.506 -8.93 ## 9 1968 4.05 -6.21 -0.522 -8.85 ## 10 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Selecting by column position is also possible: gasoline %>% select(c(1, 2, 5)) ## # A tibble: 342 × 3 ## country year lrpmg ## <chr> <int> <dbl> ## 1 austria 1960 -0.335 ## 2 austria 1961 -0.351 ## 3 austria 1962 -0.380 ## 4 austria 1963 -0.414 ## 5 austria 1964 -0.445 ## 6 austria 1965 -0.497 ## 7 austria 1966 -0.467 ## 8 austria 1967 -0.506 ## 9 austria 1968 -0.522 ## 10 austria 1969 -0.559 ## # … with 332 more rows As is selecting columns starting or ending with a certain string of characters, as discussed previously: gasoline %>% select(starts_with("l")) ## # A tibble: 342 × 4 ## lgaspcar lincomep lrpmg lcarpcap ## <dbl> <dbl> <dbl> <dbl> ## 1 4.17 -6.47 -0.335 -9.77 ## 2 4.10 -6.43 -0.351 -9.61 ## 3 4.07 -6.41 -0.380 -9.46 ## 4 4.06 -6.37 -0.414 -9.34 ## 5 4.04 -6.32 -0.445 -9.24 ## 6 4.03 -6.29 -0.497 -9.12 ## 7 4.05 -6.25 -0.467 -9.02 ## 8 4.05 -6.23 -0.506 -8.93 ## 9 4.05 -6.21 -0.522 -8.85 ## 10 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows Another very neat trick is selecting columns that may or may not exist in your data frame.
For this quick example, let’s use the mtcars dataset: sort(colnames(mtcars)) ## [1] "am" "carb" "cyl" "disp" "drat" "gear" "hp" "mpg" "qsec" "vs" ## [11] "wt" Let’s create a vector with some column names: cols_to_select <- c("mpg", "cyl", "am", "nonsense") The following selects the columns that exist in the data frame, silently ignoring the column that does not exist: mtcars %>% select(any_of(cols_to_select)) ## mpg cyl am ## Mazda RX4 21.0 6 1 ## Mazda RX4 Wag 21.0 6 1 ## Datsun 710 22.8 4 1 ## Hornet 4 Drive 21.4 6 0 ## Hornet Sportabout 18.7 8 0 ## Valiant 18.1 6 0 ## Duster 360 14.3 8 0 ## Merc 240D 24.4 4 0 ## Merc 230 22.8 4 0 ## Merc 280 19.2 6 0 ## Merc 280C 17.8 6 0 ## Merc 450SE 16.4 8 0 ## Merc 450SL 17.3 8 0 ## Merc 450SLC 15.2 8 0 ## Cadillac Fleetwood 10.4 8 0 ## Lincoln Continental 10.4 8 0 ## Chrysler Imperial 14.7 8 0 ## Fiat 128 32.4 4 1 ## Honda Civic 30.4 4 1 ## Toyota Corolla 33.9 4 1 ## Toyota Corona 21.5 4 0 ## Dodge Challenger 15.5 8 0 ## AMC Javelin 15.2 8 0 ## Camaro Z28 13.3 8 0 ## Pontiac Firebird 19.2 8 0 ## Fiat X1-9 27.3 4 1 ## Porsche 914-2 26.0 4 1 ## Lotus Europa 30.4 4 1 ## Ford Pantera L 15.8 8 1 ## Ferrari Dino 19.7 6 1 ## Maserati Bora 15.0 8 1 ## Volvo 142E 21.4 4 1 and finally, if you want it to fail, don’t use any helper: mtcars %>% select(cols_to_select) Error: Can't subset columns that don't exist. The column `nonsense` doesn't exist. or use all_of(): mtcars %>% select(all_of(cols_to_select)) ✖ Column `nonsense` doesn't exist. Bulk-renaming can be achieved using rename_with(): gasoline %>% rename_with(toupper, is.numeric) ## Warning: Predicate functions must be wrapped in `where()`. ## ## # Bad ## data %>% select(is.numeric) ## ## # Good ## data %>% select(where(is.numeric)) ## ## ℹ Please update your code. ## This message is displayed once per session. ## # A tibble: 342 × 6 ## country YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows You can also pass anonymous functions to rename_with(): gasoline %>% rename_with(\(x)(paste0("new_", x))) ## # A tibble: 342 × 6 ## new_country new_year new_lgaspcar new_lincomep new_lrpmg new_lcarpcap ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 austria 1960 4.17 -6.47 -0.335 -9.77 ## 2 austria 1961 4.10 -6.43 -0.351 -9.61 ## 3 austria 1962 4.07 -6.41 -0.380 -9.46 ## 4 austria 1963 4.06 -6.37 -0.414 -9.34 ## 5 austria 1964 4.04 -6.32 -0.445 -9.24 ## 6 austria 1965 4.03 -6.29 -0.497 -9.12 ## 7 austria 1966 4.05 -6.25 -0.467 -9.02 ## 8 austria 1967 4.05 -6.23 -0.506 -8.93 ## 9 austria 1968 4.05 -6.21 -0.522 -8.85 ## 10 austria 1969 4.05 -6.15 -0.559 -8.79 ## # … with 332 more rows The reason I’m talking about renaming in a section about selecting is because you can also rename with select: gasoline %>% select(YEAR = year) ## # A tibble: 342 × 1 ## YEAR ## <int> ## 1 1960 ## 2 1961 ## 3 1962 ## 4 1963 ## 5 1964 ## 6 1965 ## 7 1966 ## 8 1967 ## 9 1968 ## 10 1969 ## # … with 332 more rows but of course here, you only keep that one column, and you can’t rename with a function. 4.5.3 Summarising with across() across() is used for summarising data.
It allows you to compute aggregations… across several columns. It is especially useful with group_by(). To illustrate how group_by() works with across() I have to first modify the gasoline data a little bit. As you can see below, the year column is of type integer:

gasoline %>%
  lapply(typeof)

## $country
## [1] "character"
##
## $year
## [1] "integer"
##
## $lgaspcar
## [1] "double"
##
## $lincomep
## [1] "double"
##
## $lrpmg
## [1] "double"
##
## $lcarpcap
## [1] "double"

(we’ll discuss lapply() in a later chapter, but just to give you a little taste: lapply() applies a function to each element of a list or of a data frame. In this case, lapply() applied the typeof() function to each column of the gasoline data set, returning the type of each column.)

Let’s change that to character:

gasoline <- gasoline %>%
  mutate(year = as.character(year),
         country = as.character(country))

This now allows me to group by type of column, for instance:

gasoline %>%
  group_by(across(where(is.character))) %>%
  summarise(mean_lincomep = mean(lincomep))

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

## # A tibble: 342 × 3
## # Groups: country [18]
## country year mean_lincomep
## <chr> <chr> <dbl>
## 1 austria 1960 -6.47
## 2 austria 1961 -6.43
## 3 austria 1962 -6.41
## 4 austria 1963 -6.37
## 5 austria 1964 -6.32
## 6 austria 1965 -6.29
## 7 austria 1966 -6.25
## 8 austria 1967 -6.23
## 9 austria 1968 -6.21
## 10 austria 1969 -6.15
## # … with 332 more rows

This is faster than having to write:

gasoline %>%
  group_by(country, year) %>%
  summarise(mean_lincomep = mean(lincomep))

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

## # A tibble: 342 × 3
## # Groups: country [18]
## country year mean_lincomep
## <chr> <chr> <dbl>
## 1 austria 1960 -6.47
## 2 austria 1961 -6.43
## 3 austria 1962 -6.41
## 4 austria 1963 -6.37
## 5 austria 1964 -6.32
## 6 austria 1965 -6.29
## 7 austria 1966 -6.25
## 8 austria 1967 -6.23
## 9 austria 1968 -6.21
## 10 austria 1969 -6.15
## # … with 332 more rows

You may think that having to write the names of two variables is not a huge deal, which is true. But imagine that you have dozens of character columns that you want to group by. With across() and the helper functions, it doesn’t matter if the data frame has 2 columns you need to group by or 2000. All that matters is that you can find some commonalities between all these columns that make it easy to select them. It can be their type, as we have seen before, or their label:

gasoline %>%
  group_by(across(contains("y"))) %>%
  summarise(mean_licomep = mean(lincomep))

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

## # A tibble: 342 × 3
## # Groups: country [18]
## country year mean_licomep
## <chr> <chr> <dbl>
## 1 austria 1960 -6.47
## 2 austria 1961 -6.43
## 3 austria 1962 -6.41
## 4 austria 1963 -6.37
## 5 austria 1964 -6.32
## 6 austria 1965 -6.29
## 7 austria 1966 -6.25
## 8 austria 1967 -6.23
## 9 austria 1968 -6.21
## 10 austria 1969 -6.15
## # … with 332 more rows

but it’s also possible to group_by() position:

gasoline %>%
  group_by(across(c(1, 2))) %>%
  summarise(mean_licomep = mean(lincomep))

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.
## # A tibble: 342 × 3
## # Groups: country [18]
## country year mean_licomep
## <chr> <chr> <dbl>
## 1 austria 1960 -6.47
## 2 austria 1961 -6.43
## 3 austria 1962 -6.41
## 4 austria 1963 -6.37
## 5 austria 1964 -6.32
## 6 austria 1965 -6.29
## 7 austria 1966 -6.25
## 8 austria 1967 -6.23
## 9 austria 1968 -6.21
## 10 austria 1969 -6.15
## # … with 332 more rows

Using a sequence is also possible:

gasoline %>%
  group_by(across(seq(1:2))) %>%
  summarise(mean_lincomep = mean(lincomep))

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

## # A tibble: 342 × 3
## # Groups: country [18]
## country year mean_lincomep
## <chr> <chr> <dbl>
## 1 austria 1960 -6.47
## 2 austria 1961 -6.43
## 3 austria 1962 -6.41
## 4 austria 1963 -6.37
## 5 austria 1964 -6.32
## 6 austria 1965 -6.29
## 7 austria 1966 -6.25
## 8 austria 1967 -6.23
## 9 austria 1968 -6.21
## 10 austria 1969 -6.15
## # … with 332 more rows

but be careful, selecting by position is dangerous. If the position of columns changes, your code will fail. Selecting by type or label is much more robust, especially by label, since types can change as well (for example a date column can easily be exported as a character column, etc.).

4.5.4 summarise() across many columns

Summarising across many columns is really incredibly useful and in my opinion one of the best arguments in favour of switching to a {tidyverse} only workflow:

gasoline %>%
  group_by(country) %>%
  summarise(across(starts_with("l"), mean))

## # A tibble: 18 × 5
## country lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 austria 4.06 -6.12 -0.486 -8.85
## 2 belgium 3.92 -5.85 -0.326 -8.63
## 3 canada 4.86 -5.58 -1.05 -8.08
## 4 denmark 4.19 -5.76 -0.358 -8.58
## 5 france 3.82 -5.87 -0.253 -8.45
## 6 germany 3.89 -5.85 -0.517 -8.51
## 7 greece 4.88 -6.61 -0.0339 -10.8
## 8 ireland 4.23 -6.44 -0.348 -9.04
## 9 italy 3.73 -6.35 -0.152 -8.83
## 10 japan 4.70 -6.25 -0.287 -9.95
## 11 netherla 4.08 -5.92 -0.370 -8.82
## 12 norway 4.11 -5.75 -0.278 -8.77
## 13 spain 4.06 -5.63 0.739 -9.90
## 14 sweden 4.01 -7.82 -2.71 -8.25
## 15 switzerl 4.24 -5.93 -0.902 -8.54
## 16 turkey 5.77 -7.34 -0.422 -12.5
## 17 u.k. 3.98 -6.02 -0.459 -8.55
## 18 u.s.a. 4.82 -5.45 -1.21 -7.78

But where summarise() and across() really shine is when you want to apply several functions to many columns at once:

gasoline %>%
  group_by(country) %>%
  summarise(across(starts_with("l"),
                   tibble::lst(mean, sd, max, min),
                   .names = "{fn}_{col}"))

## # A tibble: 18 × 17
## country mean_lgasp…¹ sd_lg…² max_l…³ min_l…⁴ mean_…⁵ sd_li…⁶ max_l…⁷ min_l…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 4.06 0.0693 4.20 3.92 -6.12 0.235 -5.76 -6.47
## 2 belgium 3.92 0.103 4.16 3.82 -5.85 0.227 -5.53 -6.22
## 3 canada 4.86 0.0262 4.90 4.81 -5.58 0.193 -5.31 -5.89
## 4 denmark 4.19 0.158 4.50 4.00 -5.76 0.176 -5.48 -6.06
## 5 france 3.82 0.0499 3.91 3.75 -5.87 0.241 -5.53 -6.26
## 6 germany 3.89 0.0239 3.93 3.85 -5.85 0.193 -5.56 -6.16
## 7 greece 4.88 0.255 5.38 4.48 -6.61 0.331 -6.15 -7.16
## 8 ireland 4.23 0.0437 4.33 4.16 -6.44 0.162 -6.19 -6.72
## 9 italy 3.73 0.220 4.05 3.38 -6.35 0.217 -6.08 -6.73
## 10 japan 4.70 0.684 6.00 3.95 -6.25 0.425 -5.71 -6.99
## 11 netherla 4.08 0.286 4.65 3.71 -5.92 0.193 -5.66 -6.22
## 12 norway 4.11 0.123 4.44 3.96 -5.75 0.201 -5.42 -6.09
## 13 spain 4.06 0.317 4.75 3.62 -5.63 0.278 -5.29 -6.17
## 14 sweden 4.01 0.0364 4.07 3.91 -7.82 0.126 -7.67 -8.07
## 15 switzerl 4.24 0.102 4.44 4.05 -5.93 0.124 -5.75 -6.16
## 16 turkey 5.77 0.329 6.16 5.14 -7.34 0.331 -6.89 -7.84
## 17 u.k. 3.98 0.0479 4.10 3.91 -6.02 0.107 -5.84 -6.19
## 18 u.s.a. 4.82 0.0219 4.86 4.79 -5.45 0.148 -5.22 -5.70
## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, max_lrpmg <dbl>,
## # min_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>,
## # max_lcarpcap <dbl>, min_lcarpcap <dbl>, and abbreviated variable names
## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​max_lgaspcar, ⁴​min_lgaspcar, ⁵​mean_lincomep,
## # ⁶​sd_lincomep, ⁷​max_lincomep, ⁸​min_lincomep

Here, I first started by grouping by country, then I applied the mean(), sd(), max() and min() functions to every column starting with the character "l". tibble::lst() allows you to create a list just like with list(), but it names its arguments automatically. So the mean() function gets the name "mean", and so on. Finally, I use the .names = argument to create the template for the new column names. {fn}_{col} creates new column names of the form function name _ column name.
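The .names = template is just a glue specification, so you can arrange it however you like. Here is a minimal sketch (restricted to mean and sd to keep the output narrow) showing that reversing the template yields names like lgaspcar_mean instead of mean_lgaspcar:

gasoline %>%
  group_by(country) %>%
  # {col}_{fn} puts the column name first and the function name second
  summarise(across(starts_with("l"),
                   tibble::lst(mean, sd),
                   .names = "{col}_{fn}"))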
As mentioned before, across() works with other helper functions:

gasoline %>%
  group_by(country) %>%
  summarise(across(contains("car"),
                   tibble::lst(mean, sd, max, min),
                   .names = "{fn}_{col}"))

## # A tibble: 18 × 9
## country mean_lgasp…¹ sd_lg…² max_l…³ min_l…⁴ mean_…⁵ sd_lc…⁶ max_l…⁷ min_l…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 4.06 0.0693 4.20 3.92 -8.85 0.473 -8.21 -9.77
## 2 belgium 3.92 0.103 4.16 3.82 -8.63 0.417 -8.10 -9.41
## 3 canada 4.86 0.0262 4.90 4.81 -8.08 0.195 -7.77 -8.38
## 4 denmark 4.19 0.158 4.50 4.00 -8.58 0.349 -8.20 -9.33
## 5 france 3.82 0.0499 3.91 3.75 -8.45 0.344 -8.01 -9.15
## 6 germany 3.89 0.0239 3.93 3.85 -8.51 0.406 -7.95 -9.34
## 7 greece 4.88 0.255 5.38 4.48 -10.8 0.839 -9.57 -12.2
## 8 ireland 4.23 0.0437 4.33 4.16 -9.04 0.345 -8.55 -9.70
## 9 italy 3.73 0.220 4.05 3.38 -8.83 0.639 -8.11 -10.1
## 10 japan 4.70 0.684 6.00 3.95 -9.95 1.20 -8.59 -12.2
## 11 netherla 4.08 0.286 4.65 3.71 -8.82 0.617 -8.16 -10.0
## 12 norway 4.11 0.123 4.44 3.96 -8.77 0.438 -8.17 -9.68
## 13 spain 4.06 0.317 4.75 3.62 -9.90 0.960 -8.63 -11.6
## 14 sweden 4.01 0.0364 4.07 3.91 -8.25 0.242 -7.96 -8.74
## 15 switzerl 4.24 0.102 4.44 4.05 -8.54 0.378 -8.03 -9.26
## 16 turkey 5.77 0.329 6.16 5.14 -12.5 0.751 -11.2 -13.5
## 17 u.k. 3.98 0.0479 4.10 3.91 -8.55 0.281 -8.26 -9.12
## 18 u.s.a. 4.82 0.0219 4.86 4.79 -7.78 0.162 -7.54 -8.02
## # … with abbreviated variable names ¹​mean_lgaspcar, ²​sd_lgaspcar,
## # ³​max_lgaspcar, ⁴​min_lgaspcar, ⁵​mean_lcarpcap, ⁶​sd_lcarpcap, ⁷​max_lcarpcap,
## # ⁸​min_lcarpcap

This is very likely the quickest, most elegant way to summarise that many columns. There’s also a way to summarise using where():

gasoline %>%
  group_by(country) %>%
  summarise(across(where(is.numeric),
                   tibble::lst(mean, sd, min, max),
                   .names = "{fn}_{col}"))

## # A tibble: 18 × 17
## country mean_lgasp…¹ sd_lg…² min_l…³ max_l…⁴ mean_…⁵ sd_li…⁶ min_l…⁷ max_l…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 4.06 0.0693 3.92 4.20 -6.12 0.235 -6.47 -5.76
## 2 belgium 3.92 0.103 3.82 4.16 -5.85 0.227 -6.22 -5.53
## 3 canada 4.86 0.0262 4.81 4.90 -5.58 0.193 -5.89 -5.31
## 4 denmark 4.19 0.158 4.00 4.50 -5.76 0.176 -6.06 -5.48
## 5 france 3.82 0.0499 3.75 3.91 -5.87 0.241 -6.26 -5.53
## 6 germany 3.89 0.0239 3.85 3.93 -5.85 0.193 -6.16 -5.56
## 7 greece 4.88 0.255 4.48 5.38 -6.61 0.331 -7.16 -6.15
## 8 ireland 4.23 0.0437 4.16 4.33 -6.44 0.162 -6.72 -6.19
## 9 italy 3.73 0.220 3.38 4.05 -6.35 0.217 -6.73 -6.08
## 10 japan 4.70 0.684 3.95 6.00 -6.25 0.425 -6.99 -5.71
## 11 netherla 4.08 0.286 3.71 4.65 -5.92 0.193 -6.22 -5.66
## 12 norway 4.11 0.123 3.96 4.44 -5.75 0.201 -6.09 -5.42
## 13 spain 4.06 0.317 3.62 4.75 -5.63 0.278 -6.17 -5.29
## 14 sweden 4.01 0.0364 3.91 4.07 -7.82 0.126 -8.07 -7.67
## 15 switzerl 4.24 0.102 4.05 4.44 -5.93 0.124 -6.16 -5.75
## 16 turkey 5.77 0.329 5.14 6.16 -7.34 0.331 -7.84 -6.89
## 17 u.k. 3.98 0.0479 3.91 4.10 -6.02 0.107 -6.19 -5.84
## 18 u.s.a. 4.82 0.0219 4.79 4.86 -5.45 0.148 -5.70 -5.22
## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, min_lrpmg <dbl>,
## # max_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>,
## # min_lcarpcap <dbl>, max_lcarpcap <dbl>, and abbreviated variable names
## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​min_lgaspcar, ⁴​max_lgaspcar, ⁵​mean_lincomep,
## # ⁶​sd_lincomep, ⁷​min_lincomep, ⁸​max_lincomep

This allows you to summarise every column that contains real numbers.
The difference between is.double() and is.numeric() is that is.numeric() returns TRUE for integers too, whereas is.double() returns TRUE for real numbers only (integers are real numbers too, but you know what I mean). It is also possible to summarise every column at once:

gasoline %>%
  select(-year) %>%
  group_by(country) %>%
  summarise(across(everything(),
                   tibble::lst(mean, sd, min, max),
                   .names = "{fn}_{col}"))

## # A tibble: 18 × 17
## country mean_lgasp…¹ sd_lg…² min_l…³ max_l…⁴ mean_…⁵ sd_li…⁶ min_l…⁷ max_l…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 4.06 0.0693 3.92 4.20 -6.12 0.235 -6.47 -5.76
## 2 belgium 3.92 0.103 3.82 4.16 -5.85 0.227 -6.22 -5.53
## 3 canada 4.86 0.0262 4.81 4.90 -5.58 0.193 -5.89 -5.31
## 4 denmark 4.19 0.158 4.00 4.50 -5.76 0.176 -6.06 -5.48
## 5 france 3.82 0.0499 3.75 3.91 -5.87 0.241 -6.26 -5.53
## 6 germany 3.89 0.0239 3.85 3.93 -5.85 0.193 -6.16 -5.56
## 7 greece 4.88 0.255 4.48 5.38 -6.61 0.331 -7.16 -6.15
## 8 ireland 4.23 0.0437 4.16 4.33 -6.44 0.162 -6.72 -6.19
## 9 italy 3.73 0.220 3.38 4.05 -6.35 0.217 -6.73 -6.08
## 10 japan 4.70 0.684 3.95 6.00 -6.25 0.425 -6.99 -5.71
## 11 netherla 4.08 0.286 3.71 4.65 -5.92 0.193 -6.22 -5.66
## 12 norway 4.11 0.123 3.96 4.44 -5.75 0.201 -6.09 -5.42
## 13 spain 4.06 0.317 3.62 4.75 -5.63 0.278 -6.17 -5.29
## 14 sweden 4.01 0.0364 3.91 4.07 -7.82 0.126 -8.07 -7.67
## 15 switzerl 4.24 0.102 4.05 4.44 -5.93 0.124 -6.16 -5.75
## 16 turkey 5.77 0.329 5.14 6.16 -7.34 0.331 -7.84 -6.89
## 17 u.k. 3.98 0.0479 3.91 4.10 -6.02 0.107 -6.19 -5.84
## 18 u.s.a. 4.82 0.0219 4.79 4.86 -5.45 0.148 -5.70 -5.22
## # … with 8 more variables: mean_lrpmg <dbl>, sd_lrpmg <dbl>, min_lrpmg <dbl>,
## # max_lrpmg <dbl>, mean_lcarpcap <dbl>, sd_lcarpcap <dbl>,
## # min_lcarpcap <dbl>, max_lcarpcap <dbl>, and abbreviated variable names
## # ¹​mean_lgaspcar, ²​sd_lgaspcar, ³​min_lgaspcar, ⁴​max_lgaspcar, ⁵​mean_lincomep,
## # ⁶​sd_lincomep, ⁷​min_lincomep, ⁸​max_lincomep

I removed the year variable because it’s not a variable for which we want to have descriptive statistics.

4.6 Other useful {tidyverse} functions

4.6.1 if_else(), case_when() and recode()

Some other very useful {tidyverse} functions are if_else() and case_when(). These two functions, combined with mutate(), make it easy to create a new variable whose values must respect certain conditions. For instance, we might want to have a dummy that equals 1 if a country is in the European Union (to simplify, say as of 2017) and 0 if not. First let’s create a list of countries that are in the EU:

eu_countries <- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus",
                  "czech republic", "denmark", "estonia", "finland", "france",
                  "germany", "greece", "hungary", "ireland", "italy", "latvia",
                  "lithuania", "luxembourg", "malta", "netherla", "poland",
                  "portugal", "romania", "slovakia", "slovenia", "spain",
                  "sweden", "u.k.")

I’ve had to change “netherlands” to “netherla” because that’s how the country is called in the gasoline data.
Now let’s create a dummy variable that equals 1 for EU countries, and 0 for the others:

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(in_eu = if_else(country %in% eu_countries, 1, 0))

## # A tibble: 342 × 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 1
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 1
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 1
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 1
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 1
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 1
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 1
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 1
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 1
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 1
## # … with 332 more rows

Instead of 1 and 0, we can of course use strings (I add filter(year == 1960) at the end to have a better view of what happened):

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(in_eu = if_else(country %in% eu_countries, "yes", "no")) %>%
  filter(year == 1960)

## # A tibble: 18 × 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 yes
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 yes
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 no
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 yes
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 yes
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 yes
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 yes
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 yes
## 9 italy 1960 4.05 -6.73 0.165 -10.1 yes
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 no
## 11 netherla 1960 4.65 -6.22 -0.201 -10.0 yes
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 no
## 13 spain 1960 4.75 -6.17 1.13 -11.6 yes
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 yes
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 no
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 no
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 yes
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 no

I think that if_else() is fairly straightforward, especially if you know ifelse() already. You might be wondering what the difference between these two is. if_else() is stricter than ifelse() and does not do type conversion. Compare the two next lines:

ifelse(1 == 1, "0", 1)

## [1] "0"

if_else(1 == 1, "0", 1)

Error: `false` must be type string, not double

Type conversion, especially without a warning, is very dangerous. if_else()’s behaviour, which consists in failing as soon as possible, avoids a lot of pain and suffering, especially when programming non-interactively. if_else() also accepts an optional argument that allows you to specify what should be returned in case of NA:

if_else(1 <= NA, 0, 1, 999)

## [1] 999

# Or

if_else(1 <= NA, 0, 1, NA_real_)

## [1] NA

case_when() can be seen as a generalization of if_else().
Whenever you want to use multiple if_else()s, that’s when you know you should use case_when() (I’m adding the filter at the end for the same reason as before, to see the output better):

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(region = case_when(
    country %in% c("france", "italy", "turkey", "greece", "spain") ~ "mediterranean",
    country %in% c("germany", "austria", "switzerl", "belgium", "netherla") ~ "central europe",
    country %in% c("canada", "u.s.a.", "u.k.", "ireland") ~ "anglosphere",
    country %in% c("denmark", "norway", "sweden") ~ "nordic",
    country %in% c("japan") ~ "asia")) %>%
  filter(year == 1960)

## # A tibble: 18 × 7
## country year lgaspcar lincomep lrpmg lcarpcap region
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 central europe
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 central europe
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 anglosphere
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 nordic
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 mediterranean
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 central europe
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 mediterranean
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 anglosphere
## 9 italy 1960 4.05 -6.73 0.165 -10.1 mediterranean
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 asia
## 11 netherla 1960 4.65 -6.22 -0.201 -10.0 central europe
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 nordic
## 13 spain 1960 4.75 -6.17 1.13 -11.6 mediterranean
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 nordic
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 central europe
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 mediterranean
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 anglosphere
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 anglosphere

If all you want is to recode values, you can use recode(). For example, the Netherlands is written as “NETHERLA” in the gasoline data, which is quite ugly. Same for Switzerland:

gasoline <- gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(country = recode(country,
                          "netherla" = "netherlands",
                          "switzerl" = "switzerland"))

I saved the data with these changes as they will become useful in the future. Let’s take a look at the data:

gasoline %>%
  filter(country %in% c("netherlands", "switzerland"), year == 1960)

## # A tibble: 2 × 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 netherlands 1960 4.65 -6.22 -0.201 -10.0
## 2 switzerland 1960 4.40 -6.16 -0.823 -9.26

4.6.2 lead() and lag()

lead() and lag() are especially useful in econometrics. When I was doing my masters, in 4 B.d. (Before dplyr), lagging variables in panel data was quite tricky.
Now, with {dplyr} it’s really very easy:

gasoline %>%
  group_by(country) %>%
  mutate(lag_lgaspcar = lag(lgaspcar)) %>%
  mutate(lead_lgaspcar = lead(lgaspcar)) %>%
  filter(year %in% seq(1960, 1963))

## # A tibble: 72 × 8
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap lag_lgaspcar lead_lgaspcar
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 NA 4.10
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 4.17 4.07
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 4.10 4.06
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 4.07 4.04
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 NA 4.12
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 4.16 4.08
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 4.12 4.00
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 4.08 3.99
## 9 canada 1960 4.86 -5.89 -0.972 -8.38 NA 4.83
## 10 canada 1961 4.83 -5.88 -0.972 -8.35 4.86 4.85
## # … with 62 more rows

To lag every variable, remember that you can use mutate_if():

gasoline %>%
  group_by(country) %>%
  mutate_if(is.double, lag) %>%
  filter(year %in% seq(1960, 1963))

## `mutate_if()` ignored the following grouping variables:
## • Column `country`

## # A tibble: 72 × 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1961 4.83 -5.88 -0.972 -8.35
## # … with 62 more rows

you can replace lag() with lead(), but just keep in mind that the columns get transformed in place.

4.6.3 ntile()

The last helper function I will discuss is ntile(). There are some others, so do read mutate()’s documentation with help(mutate)! If you need quantiles, you need ntile(). Let’s see how it works:

gasoline %>%
  mutate(quintile = ntile(lgaspcar, 5)) %>%
  mutate(decile = ntile(lgaspcar, 10)) %>%
  select(country, year, lgaspcar, quintile, decile)

## # A tibble: 342 × 5
## country year lgaspcar quintile decile
## <chr> <dbl> <dbl> <int> <int>
## 1 austria 1960 4.17 3 6
## 2 austria 1961 4.10 3 6
## 3 austria 1962 4.07 3 5
## 4 austria 1963 4.06 3 5
## 5 austria 1964 4.04 3 5
## 6 austria 1965 4.03 3 5
## 7 austria 1966 4.05 3 5
## 8 austria 1967 4.05 3 5
## 9 austria 1968 4.05 3 5
## 10 austria 1969 4.05 3 5
## # … with 332 more rows

quintile and decile do not hold the values themselves, but the quantile the value lies in. If you want to have a column that contains the median for instance, you can use good ol’ quantile():

gasoline %>%
  group_by(country) %>%
  mutate(median = quantile(lgaspcar, 0.5)) %>% # quantile(x, 0.5) is equivalent to median(x)
  filter(year == 1960) %>%
  select(country, year, median)

## # A tibble: 18 × 3
## # Groups: country [18]
## country year median
## <chr> <dbl> <dbl>
## 1 austria 1960 4.05
## 2 belgium 1960 3.88
## 3 canada 1960 4.86
## 4 denmark 1960 4.16
## 5 france 1960 3.81
## 6 germany 1960 3.89
## 7 greece 1960 4.89
## 8 ireland 1960 4.22
## 9 italy 1960 3.74
## 10 japan 1960 4.52
## 11 netherlands 1960 3.99
## 12 norway 1960 4.08
## 13 spain 1960 3.99
## 14 sweden 1960 4.00
## 15 switzerland 1960 4.26
## 16 turkey 1960 5.72
## 17 u.k. 1960 3.98
## 18 u.s.a. 1960 4.81

4.6.4 arrange()

arrange() re-orders the whole tibble according to the values of the supplied variable:

gasoline %>%
  arrange(lgaspcar)

## # A tibble: 342 × 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 italy 1977 3.38 -6.10 0.164 -8.15
## 2 italy 1978 3.39 -6.08 0.0348 -8.11
## 3 italy 1976 3.43 -6.12 0.103 -8.17
## 4 italy 1974 3.50 -6.13 -0.223 -8.26
## 5 italy 1975 3.52 -6.17 -0.0327 -8.22
## 6 spain 1978 3.62 -5.29 0.621 -8.63
## 7 italy 1972 3.63 -6.21 -0.215 -8.38
## 8 italy 1971 3.65 -6.22 -0.148 -8.47
## 9 spain 1977 3.65 -5.30 0.526 -8.73
## 10 italy 1973 3.65 -6.16 -0.325 -8.32
## # … with 332 more rows

If you want to re-order the tibble in descending order of the variable:

gasoline %>%
  arrange(desc(lgaspcar))

## # A tibble: 342 × 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 turkey 1966 6.16 -7.51 -0.356 -13.0
## 2 turkey 1960 6.13 -7.80 -0.253 -13.5
## 3 turkey 1961 6.11 -7.79 -0.343 -13.4
## 4 turkey 1962 6.08 -7.84 -0.408 -13.2
## 5 turkey 1968 6.08 -7.42 -0.365 -12.8
## 6 turkey 1963 6.08 -7.63 -0.225 -13.3
## 7 turkey 1964 6.06 -7.63 -0.252 -13.2
## 8 turkey 1967 6.04 -7.46 -0.335 -12.8
## 9 japan 1960 6.00 -6.99 -0.145 -12.2
## 10 turkey 1965 5.82 -7.62 -0.293 -12.9
## # … with 332 more rows

arrange()’s documentation alerts the user that re-ordering by group is only possible by explicitly specifying an option:

gasoline %>%
  filter(year %in% seq(1960, 1963)) %>%
  group_by(country) %>%
  arrange(desc(lgaspcar), .by_group = TRUE)

## # A tibble: 72 × 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1962 4.85 -5.84 -0.979 -8.32
## # … with 62 more rows

This is especially useful for plotting. We’ll see this in Chapter 5.

4.6.5 tally() and count()

tally() and count() count the number of observations in your data. I believe count() is the more useful of the two, as it counts the number of observations within a group that you can provide:

gasoline %>%
  count(country)

## # A tibble: 18 × 2
## country n
## <chr> <int>
## 1 austria 19
## 2 belgium 19
## 3 canada 19
## 4 denmark 19
## 5 france 19
## 6 germany 19
## 7 greece 19
## 8 ireland 19
## 9 italy 19
## 10 japan 19
## 11 netherlands 19
## 12 norway 19
## 13 spain 19
## 14 sweden 19
## 15 switzerland 19
## 16 turkey 19
## 17 u.k. 19
## 18 u.s.a. 19

There’s also add_count() which adds the column to the data:

gasoline %>%
  add_count(country)

## # A tibble: 342 × 7
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows

add_count() is a shortcut for the following code:

gasoline %>%
  group_by(country) %>%
  mutate(n = n())

## # A tibble: 342 × 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows

where n() is a {dplyr} function that can only be used within summarise(), mutate() and filter().

4.7 Special packages for special kinds of data: {forcats}, {lubridate}, and {stringr}

4.7.1 🐱🐱🐱🐱

Factor variables are very useful but not very easy to manipulate. forcats contains very useful functions that make working on factor variables painless. In my opinion, the four following functions, fct_recode(), fct_relevel(), fct_reorder() and fct_relabel(), are the ones you must know, so that’s what I’ll be showing. Remember in chapter 3 when I very quickly explained what factor variables were? In this section, we are going to work a little bit with this type of variable. factors are very useful, and the forcats package includes some handy functions to work with them. First, let’s load the forcats package:

library(forcats)

as an example, we are going to work with the gss_cat dataset that is included in forcats. Let’s load the data:

data(gss_cat)

head(gss_cat)

## # A tibble: 6 × 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12
## 2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA
## 3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2
## 4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4
## 5 2000 Divorced 25 White Not applicable Not str de… None Not … 1
## 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA

as you can see, marital, race, rincome and partyid are all factor variables. Let’s take a closer look at marital:

str(gss_cat$marital)

## Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...

and let’s see rincome:

str(gss_cat$rincome)

## Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...

factor variables have different levels and the forcats package includes functions that allow you to recode, collapse and do all sorts of things on these levels.
For example, using forcats::fct_recode() you can recode levels:

gss_cat <- gss_cat %>%
  mutate(marital = fct_recode(marital,
                              refuse = "No answer",
                              never_married = "Never married",
                              divorced = "Separated",
                              divorced = "Divorced",
                              widowed = "Widowed",
                              married = "Married"))

gss_cat %>%
  tabyl(marital)

## marital n percent
## refuse 17 0.0007913234
## never_married 5416 0.2521063166
## divorced 4126 0.1920588372
## widowed 1807 0.0841130196
## married 10117 0.4709305032

Using fct_recode(), I was able to recode the levels and collapse Separated and Divorced to a single category called divorced. As you can see, refuse and widowed are less than 10%, so maybe you’d want to lump these categories together:

gss_cat <- gss_cat %>%
  mutate(marital = fct_lump(marital, prop = 0.10, other_level = "other"))

gss_cat %>%
  tabyl(marital)

## marital n percent
## never_married 5416 0.25210632
## divorced 4126 0.19205884
## married 10117 0.47093050
## other 1824 0.08490434

fct_reorder() is especially useful for plotting. We will explore plotting in the next chapter, but to show you why fct_reorder() is so useful, I will create a barplot, first without using fct_reorder() to re-order the factors, then with reordering. Do not worry if you don’t understand all the code for now:

gss_cat %>%
  tabyl(marital) %>%
  ggplot() +
  geom_col(aes(y = n, x = marital)) +
  coord_flip()

It would be much better if the categories were ordered by frequency. This is easy to do with fct_reorder():

gss_cat %>%
  tabyl(marital) %>%
  mutate(marital = fct_reorder(marital, n, .desc = FALSE)) %>%
  ggplot() +
  geom_col(aes(y = n, x = marital)) +
  coord_flip()

Much better! In Chapter 5, we are going to learn about {ggplot2}.

The last family of functions I’d like to mention are the fct_lump*() functions. These make it possible to lump several levels of a factor into a new other level:

gss_cat %>%
  mutate(
    # Description of the different functions taken from help(fct_lump)
    # lumps together the least frequent levels, ensuring that "other" is still the smallest level
    denom_lowfreq = fct_lump_lowfreq(denom),
    # lumps levels that appear fewer than min times
    denom_min = fct_lump_min(denom, min = 10),
    # lumps all levels except for the n most frequent (or least frequent if n < 0)
    denom_n = fct_lump_n(denom, n = 3),
    # lumps levels that appear fewer than prop * n times
    denom_prop = fct_lump_prop(denom, prop = 0.10)
  )

## # A tibble: 21,483 × 13
## year marital age race rincome partyid relig denom tvhours denom…¹ denom…²
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int> <fct> <fct>
## 1 2000 never_… 26 White $8000 … Ind,ne… Prot… Sout… 12 Southe… Southe…
## 2 2000 divorc… 48 White $8000 … Not st… Prot… Bapt… NA Baptis… Baptis…
## 3 2000 other 67 White Not ap… Indepe… Prot… No d… 2 No den… No den…
## 4 2000 never_… 39 White Not ap… Ind,ne… Orth… Not … 4 Not ap… Not ap…
## 5 2000 divorc… 25 White Not ap… Not st… None Not … 1 Not ap… Not ap…
## 6 2000 married 25 White $20000… Strong… Prot… Sout… NA Southe… Southe…
## 7 2000 never_… 36 White $25000… Not st… Chri… Not … 3 Not ap… Not ap…
## 8 2000 divorc… 44 White $7000 … Ind,ne… Prot… Luth… NA Luther… Luther…
## 9 2000 married 44 White $25000… Not st… Prot… Other 0 Other Other
## 10 2000 married 47 White $25000… Strong… Prot… Sout… 3 Southe… Southe…
## # … with 21,473 more rows, 2 more variables: denom_n <fct>, denom_prop <fct>,
## # and abbreviated variable names ¹​denom_lowfreq, ²​denom_min

There are many others, so I’d advise you to go through the package’s function reference.
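Of the four functions listed at the start of this section, fct_relevel() has not appeared yet, so here is a minimal sketch before moving on. It uses the recoded marital column from above and moves "married" to the front, which matters when the first level serves as the reference category in a model:

gss_cat %>%
  mutate(marital = fct_relevel(marital, "married")) %>%
  pull(marital) %>%
  levels()
# "married" is now the first level; the others keep their relative order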
4.7.2 Get your dates right with {lubridate}

{lubridate} is yet another tidyverse package that makes dealing with dates or durations (and intervals) as painless as possible. I do not use every function contained in the package daily, and as such will only focus on some of the functions. However, if you have to deal with dates often, you might want to explore the package thoroughly.

4.7.2.1 Defining dates, the tidy way

Let’s load a new dataset, called independence, from the datasets folder:

independence <- readRDS("datasets/independence.rds")

This dataset was scraped from the following Wikipedia page. It shows when African countries gained independence and from which colonial powers. In Chapter 10, I will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents of the dataset:

independence

## # A tibble: 54 × 6
## country colonial_name colon…¹ indep…² first…³ indep…⁴
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Liberia Liberia United… 26 Jul… Joseph… Liberi…
## 2 South Africa Cape Colony Colony of Natal O… United… 31 May… Louis … South …
## 3 Egypt Sultanate of Egypt United… 28 Feb… Fuad I Egypti…
## 4 Eritrea Italian Eritrea Italy 10 Feb… Haile … -
## 5 Libya British Military Administration… United… 24 Dec… Idris -
## 6 Sudan Anglo-Egyptian Sudan United… 1 Janu… Ismail… -
## 7 Tunisia French Protectorate of Tunisia France 20 Mar… Muhamm… -
## 8 Morocco French Protectorate in Morocco … France… 2 Marc… Mohamm… Ifni W…
## 9 Ghana Gold Coast United… 6 Marc… Kwame … Gold C…
## 10 Guinea French West Africa France 2 Octo… Ahmed … Guinea…
## # … with 44 more rows, and abbreviated variable names ¹​colonial_power,
## # ²​independence_date, ³​first_head_of_state, ⁴​independence_won_through

as you can see, the date of independence is in a format that might make it difficult to answer questions such as Which African countries gained independence before 1960? for two reasons. First of all, the date uses the name of the month instead of the number of the month, and second of all, the type of the independence day column is character and not “date”. So our first task is to correctly define the column as being of type date, while making sure that R understands that January is supposed to be “01”, and so on. There are several helpful functions included in {lubridate} to convert columns to dates. For instance if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd(), for “year-month-day”. If, however, the column is “2012-21-11”, then you would use ydm(). There are a few of these helper functions, and they can handle a lot of different formats for dates. In our case, having the name of the month instead of the number might seem quite problematic, but it turns out that this is a case that {lubridate} handles painlessly:

library(lubridate)

##
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union

independence <- independence %>%
  mutate(independence_date = dmy(independence_date))

## Warning: 5 failed to parse.

Some dates failed to parse, for instance for Morocco. This is because these countries have several independence dates; this means that the string to convert looks like:

"2 March 1956 7 April 1956 10 April 1958 4 January 1969"

which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for simplicity’s sake.
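To make the conversion concrete, here is what dmy() does on a single well-formed string of the same shape as the column above, next to what ymd() would expect (a minimal sketch; the date is taken from the Liberia row):

dmy("26 July 1847") # day-month-year, month spelled out
## [1] "1847-07-26"
ymd("1847-07-26")   # year-month-day, all numeric
## [1] "1847-07-26"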
4.7.2.2 Data manipulation with dates

Let’s take a look at the data now:

independence

## # A tibble: 54 × 6
## country colonial_name colon…¹ independ…² first…³ indep…⁴
## <chr> <chr> <chr> <date> <chr> <chr>
## 1 Liberia Liberia United… 1847-07-26 Joseph… Liberi…
## 2 South Africa Cape Colony Colony of Natal… United… 1910-05-31 Louis … South …
## 3 Egypt Sultanate of Egypt United… 1922-02-28 Fuad I Egypti…
## 4 Eritrea Italian Eritrea Italy 1947-02-10 Haile … -
## 5 Libya British Military Administrat… United… 1951-12-24 Idris -
## 6 Sudan Anglo-Egyptian Sudan United… 1956-01-01 Ismail… -
## 7 Tunisia French Protectorate of Tunis… France 1956-03-20 Muhamm… -
## 8 Morocco French Protectorate in Moroc… France… NA Mohamm… Ifni W…
## 9 Ghana Gold Coast United… 1957-03-06 Kwame … Gold C…
## 10 Guinea French West Africa France 1958-10-02 Ahmed … Guinea…
## # … with 44 more rows, and abbreviated variable names ¹​colonial_power,
## # ²​independence_date, ³​first_head_of_state, ⁴​independence_won_through

As you can see, we now have a date column in the right format. We can now answer questions such as Which countries gained independence before 1960? quite easily, by using the functions year(), month() and day(). Let’s see which countries gained independence before 1960:

independence %>%
  filter(year(independence_date) <= 1960) %>%
  pull(country)

## [1] "Liberia" "South Africa"
## [3] "Egypt" "Eritrea"
## [5] "Libya" "Sudan"
## [7] "Tunisia" "Ghana"
## [9] "Guinea" "Cameroon"
## [11] "Togo" "Mali"
## [13] "Madagascar" "Democratic Republic of the Congo"
## [15] "Benin" "Niger"
## [17] "Burkina Faso" "Ivory Coast"
## [19] "Chad" "Central African Republic"
## [21] "Republic of the Congo" "Gabon"
## [23] "Mauritania"

You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work on it. This is the same for month() or day(). Let’s try to see if countries gained their independence on Christmas Eve:

independence %>%
  filter(month(independence_date) == 12,
         day(independence_date) == 24) %>%
  pull(country)

## [1] "Libya"

Seems like Libya was the only one! You can also operate on dates. For instance, let’s compute the difference between two dates, using the interval() function:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since)

## # A tibble: 54 × 2
## country independent_since
## <chr> <Interval>
## 1 Liberia 1847-07-26 UTC--2022-10-16 UTC
## 2 South Africa 1910-05-31 UTC--2022-10-16 UTC
## 3 Egypt 1922-02-28 UTC--2022-10-16 UTC
## 4 Eritrea 1947-02-10 UTC--2022-10-16 UTC
## 5 Libya 1951-12-24 UTC--2022-10-16 UTC
## 6 Sudan 1956-01-01 UTC--2022-10-16 UTC
## 7 Tunisia 1956-03-20 UTC--2022-10-16 UTC
## 8 Morocco NA--NA
## 9 Ghana 1957-03-06 UTC--2022-10-16 UTC
## 10 Guinea 1958-10-02 UTC--2022-10-16 UTC
## # … with 44 more rows

The independent_since column now contains an interval object that we can convert to years:

independence %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  select(country, independent_since) %>%
  mutate(years_independent = as.numeric(independent_since, "years"))

## # A tibble: 54 × 3
## country independent_since years_independent
## <chr> <Interval> <dbl>
## 1 Liberia 1847-07-26 UTC--2022-10-16 UTC 175.
## 2 South Africa 1910-05-31 UTC--2022-10-16 UTC 112.
## 3 Egypt 1922-02-28 UTC--2022-10-16 UTC 101.
## 4 Eritrea 1947-02-10 UTC--2022-10-16 UTC 75.7
## 5 Libya 1951-12-24 UTC--2022-10-16 UTC 70.8
## 6 Sudan 1956-01-01 UTC--2022-10-16 UTC 66.8
## 7 Tunisia 1956-03-20 UTC--2022-10-16 UTC 66.6
## 8 Morocco NA--NA NA
## 9 Ghana 1957-03-06 UTC--2022-10-16 UTC 65.6
## 10 Guinea 1958-10-02 UTC--2022-10-16 UTC 64.0
## # … with 44 more rows

We can now see for how long the last country to gain independence has been independent. Because the data is not tidy (in some cases, an African country was colonized by two powers, see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

independence %>%
  filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
  mutate(today = lubridate::today()) %>%
  mutate(independent_since = interval(independence_date, today)) %>%
  mutate(years_independent = as.numeric(independent_since, "years")) %>%
  group_by(colonial_power) %>%
  summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))

## # A tibble: 4 × 2
## colonial_power last_colony_independent_for
## <chr> <dbl>
## 1 Belgium 60.3
## 2 France 45.3
## 3 Portugal 46.9
## 4 United Kingdom 46.3

4.7.2.3 Arithmetic with dates

Adding or subtracting days to dates is quite easy:

ymd("2018-12-31") + 16

## [1] "2019-01-16"

It is also possible to be more explicit and use days():

ymd("2018-12-31") + days(16)

## [1] "2019-01-16"

To add years, you can use years():

ymd("2018-12-31") + years(1)

## [1] "2019-12-31"

But you have to be careful with leap years:

ymd("2016-02-29") + years(1)

## [1] NA

Because 2017 is not a leap year, the above computation returns NA. The same goes for months with a different number of days:

ymd("2018-12-31") + months(2)

## [1] NA

The way to solve these issues is to use the special %m+% infix operator:

ymd("2016-02-29") %m+% years(1)

## [1] "2017-02-28"

and for months:

ymd("2018-12-31") %m+% months(2)

## [1] "2019-02-28"

{lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate} is a package that you have to add to your toolbox.

4.7.3 Manipulate strings with {stringr}

{stringr} contains functions to manipulate strings. In Chapter 10, I will teach you about regular expressions, but the functions contained in {stringr} allow you to already do a lot of work on strings, without needing to be a regular expression expert. I will discuss the most common string operations: detecting, locating, matching, searching and replacing, and extracting/removing strings. To introduce these operations, let us use an ALTO file of an issue of The Winchester News from October 31, 1910, which you can find on this link (to see what the newspaper looked like, click here). I re-hosted the file on a public gist for archiving purposes. While working on the book, the original site went down several times… ALTO is an XML schema for the description of text OCR and layout information of pages for digitized material, such as newspapers (source: ALTO Wikipedia page). For more details, you can read my blogpost on the matter, but for our current purposes, it is enough to know that the file contains the text of newspaper articles.
The file looks like this:

<TextLine HEIGHT="138.0" WIDTH="2434.0" HPOS="4056.0" VPOS="5814.0">
  <String STYLEREFS="ID7" HEIGHT="108.0" WIDTH="393.0" HPOS="4056.0" VPOS="5838.0" CONTENT="timore" WC="0.82539684">
    <ALTERNATIVE>timole</ALTERNATIVE>
    <ALTERNATIVE>tlnldre</ALTERNATIVE>
    <ALTERNATIVE>timor</ALTERNATIVE>
    <ALTERNATIVE>insole</ALTERNATIVE>
    <ALTERNATIVE>landed</ALTERNATIVE>
  </String>
  <SP WIDTH="74.0" HPOS="4449.0" VPOS="5838.0"/>
  <String STYLEREFS="ID7" HEIGHT="105.0" WIDTH="432.0" HPOS="4524.0" VPOS="5847.0" CONTENT="market" WC="0.95238096"/>
  <SP WIDTH="116.0" HPOS="4956.0" VPOS="5847.0"/>
  <String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="138.0" HPOS="5073.0" VPOS="5883.0" CONTENT="as" WC="0.96825397"/>
  <SP WIDTH="74.0" HPOS="5211.0" VPOS="5883.0"/>
  <String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="285.0" HPOS="5286.0" VPOS="5877.0" CONTENT="were" WC="1.0">
    <ALTERNATIVE>verc</ALTERNATIVE>
    <ALTERNATIVE>veer</ALTERNATIVE>
  </String>
  <SP WIDTH="68.0" HPOS="5571.0" VPOS="5877.0"/>
  <String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="147.0" HPOS="5640.0" VPOS="5838.0" CONTENT="all" WC="1.0"/>
  <SP WIDTH="83.0" HPOS="5787.0" VPOS="5838.0"/>
  <String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="183.0" HPOS="5871.0" VPOS="5835.0" CONTENT="the" WC="0.95238096">
    <ALTERNATIVE>tll</ALTERNATIVE>
    <ALTERNATIVE>Cu</ALTERNATIVE>
    <ALTERNATIVE>tall</ALTERNATIVE>
  </String>
  <SP WIDTH="75.0" HPOS="6054.0" VPOS="5835.0"/>
  <String STYLEREFS="ID3" HEIGHT="132.0" WIDTH="351.0" HPOS="6129.0" VPOS="5814.0" CONTENT="cattle" WC="0.95238096"/>
</TextLine>

We are interested in the strings after CONTENT=, and we are going to use functions from the {stringr} package to get them. In Chapter 10, we are going to explore this file again, but using complex regular expressions to get all the content in one go.

4.7.3.1 Getting text data into RStudio

First of all, let us read in the file:

winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt")

Even though the file is an XML file, I still read it in using read_lines() and not read_xml() from the {xml2} package. This is for the purposes of the current exercise, and also because I always have trouble with XML files, and prefer to treat them as simple text files, and use regular expressions to get what I need. Now that the ALTO file is read in and saved in the winchester variable, you might want to print the whole thing in the console. Before that, take a look at the structure:

str(winchester)

## chr [1:43] "" ...

So the winchester variable is a character atomic vector with 43 elements. So first, we need to understand what these elements are. Let’s start with the first one:

winchester[1]

## [1] ""

Ok, so it seems like the first element is part of the header of the file. What about the second one?
winchester[2]

## [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"><base href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id=\"bN015htcoyT__google-cache-hdr\"><div><span>This is Google's cache of <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml</a>.</span>&nbsp;<span>It is a snapshot of the page as it appeared on 21 Jan 2019 05:18:18 GMT.</span>&nbsp;<span>The <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">current page</a> could have changed in the meantime.</span>&nbsp;<a href=\"http://support.google.com/websearch/bin/answer.py?hl=en&amp;p=cached&amp;answer=1687222\"><span>Learn more</span>.</a></div><div><span style=\"display:inline-block;margin-top:8px;margin-right:104px;white-space:nowrap\"><span style=\"margin-right:28px\"><span style=\"font-weight:bold\">Full version</span></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=1&amp;vwsrc=0\"><span>Text-only version</span></a></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&amp;hl=en&amp;gl=lu&amp;strip=0&amp;vwsrc=1\"><span>View source</span></a></span></span></div><span style=\"display:inline-block;margin-top:8px;color:#717171\"><span>Tip: To quickly find your search term on this page, press <b>Ctrl+F</b> or <b>⌘-F</b> (Mac) and use the find bar.</span></span></div><div style=\"position:relative;\"><?xml version=\"1.0\" encoding=\"UTF-8\"?>"

Same. So where is the content? The file is very large, so if you print it in the console, it will take quite some time to print, and you will not really be able to make out anything. The best way would be to try to detect the string CONTENT and work from there.

4.7.3.2 Detecting, getting the position and locating strings

When confronted with an atomic vector of strings, you might want to know inside which elements you can find certain strings.
For example, to know which elements of winchester contain the string CONTENT, use str_detect():

winchester %>%
  str_detect("CONTENT")

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE

This returns a boolean atomic vector of the same length as winchester. If the string CONTENT is nowhere to be found, the result will equal FALSE; if not, it will equal TRUE. Here it is easy to see that the last element contains the string CONTENT. But what if instead of having 43 elements, the vector had 24192 elements? And hundreds of them contained the string CONTENT? It would be easier to instead have the indices of the vector where one can find the word CONTENT. This is possible with str_which():

winchester %>%
  str_which("CONTENT")

## [1] 43

Here, the result is 43, meaning that the 43rd element of winchester contains the string CONTENT somewhere. If we need more precision, we can use str_locate() and str_locate_all(). To explain how both these functions work, let’s create a very small example:

ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius")

Now suppose I am interested in philosophers whose name ends in us. Let us use str_locate() first:

ancient_philosophers %>%
  str_locate("us")

## start end
## [1,] NA NA
## [2,] NA NA
## [3,] 8 9
## [4,] NA NA
## [5,] 7 8
## [6,] 5 6

You can interpret the result as follows: in the rows, the index of the vector where the string us is found. So the 3rd, 5th and 6th philosopher have us somewhere in their name. The result also has two columns: start and end. These give the position of the string. So the string us can be found starting at position 8 of the 3rd element of the vector, and ends at position 9. Same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both ending with us. However, str_locate() only shows the position of the us in Marcus. To get both us strings, you need to use str_locate_all():

ancient_philosophers %>%
  str_locate_all("us")

## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 8 9
##
## [[4]]
## start end
##
## [[5]]
## start end
## [1,] 7 8
##
## [[6]]
## start end
## [1,] 5 6
## [2,] 14 15

Now we get the position of the two us in Marcus Aurelius. Doing this on the winchester vector will give us the position of the CONTENT string, but this is not really important right now. What matters is that you know how str_locate() and str_locate_all() work.
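These start and end positions pair naturally with str_sub(), which extracts the part of a string lying between two positions. A minimal sketch, reusing the positions str_locate_all() just returned for Marcus Aurelius:

# positions 5-6 cover the "us" in "marcus", positions 14-15 the one in "aurelius"
str_sub("marcus aurelius", 5, 6)
## [1] "us"
str_sub("marcus aurelius", 14, 15)
## [1] "us"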
So now that we know what interests us in the 43rd element of winchester, let’s take a closer look at it:

winchester[43]

As you can see, it’s a mess:

<TextLine HEIGHT=\"126.0\" WIDTH=\"1731.0\" HPOS=\"17160.0\" VPOS=\"21252.0\"><String HEIGHT=\"114.0\" WIDTH=\"354.0\" HPOS=\"17160.0\" VPOS=\"21264.0\" CONTENT=\"0tV\" WC=\"0.8095238\"/><SP WIDTH=\"131.0\" HPOS=\"17514.0\" VPOS=\"21264.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"111.0\" WIDTH=\"474.0\" HPOS=\"17646.0\" VPOS=\"21258.0\" CONTENT=\"BATES\" WC=\"1.0\"/><SP WIDTH=\"140.0\" HPOS=\"18120.0\" VPOS=\"21258.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"114.0\" WIDTH=\"630.0\" HPOS=\"18261.0\" VPOS=\"21252.0\" CONTENT=\"President\" WC=\"1.0\"><ALTERNATIVE>Prcideht</ALTERNATIVE><ALTERNATIVE>Pride</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"153.0\" WIDTH=\"1689.0\" HPOS=\"17145.0\" VPOS=\"21417.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"258.0\" HPOS=\"17145.0\" VPOS=\"21439.0\" CONTENT=\"WM\" WC=\"0.82539684\"><TextLine HEIGHT=\"120.0\" WIDTH=\"2211.0\" HPOS=\"16788.0\" VPOS=\"21870.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"102.0\" HPOS=\"16788.0\" VPOS=\"21894.0\" CONTENT=\"It\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"16890.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"93.0\" HPOS=\"16962.0\" VPOS=\"21885.0\" CONTENT=\"is\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17055.0\" VPOS=\"21885.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"102.0\" WIDTH=\"417.0\" HPOS=\"17136.0\" VPOS=\"21879.0\" CONTENT=\"seldom\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17553.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"267.0\" HPOS=\"17634.0\" VPOS=\"21873.0\" CONTENT=\"hard\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"17901.0\" VPOS=\"21873.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"87.0\" WIDTH=\"111.0\" HPOS=\"17982.0\" VPOS=\"21879.0\" CONTENT=\"to\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"18093.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"219.0\" HPOS=\"18174.0\" VPOS=\"21870.0\" CONTENT=\"find\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18393.0\" VPOS=\"21870.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"66.0\" HPOS=\"18471.0\" VPOS=\"21894.0\" CONTENT=\"a\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18537.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"78.0\" WIDTH=\"384.0\" HPOS=\"18615.0\" VPOS=\"21888.0\" CONTENT=\"succes\" WC=\"0.82539684\"><ALTERNATIVE>success</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"126.0\" WIDTH=\"2316.0\" HPOS=\"16662.0\" VPOS=\"22008.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"75.0\" WIDTH=\"183.0\" HPOS=\"16662.0\" VPOS=\"22059.0\" CONTENT=\"sor\" WC=\"1.0\"><ALTERNATIVE>soar</ALTERNATIVE></String><SP WIDTH=\"72.0\" HPOS=\"16845.0\" VPOS=\"22059.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"90.0\" WIDTH=\"168.0\" HPOS=\"16917.0\" VPOS=\"22035.0\" CONTENT=\"for\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"17085.0\" VPOS=\"22035.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"267.0\" HPOS=\"17157.0\" VPOS=\"22050.0\" CONTENT=\"even\" WC=\"1.0\"><ALTERNATIVE>cen</ALTERNATIVE><ALTERNATIVE>cent</ALTERNATIVE></String><SP WIDTH=\"77.0\" HPOS=\"17434.0\" VPOS=\"22050.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"66.0\" WIDTH=\"63.0\" HPOS=\"17502.0\" VPOS=\"22044.0\"

The file was imported without any newlines. So we need to insert them ourselves, by splitting the string in a clever way.

4.7.3.3 Splitting strings

There are two functions included in {stringr} to split strings, str_split() and str_split_fixed(). Let’s go back to our ancient philosophers. Two of them, Seneca the Younger and Marcus Aurelius, have something else in common apart from both being Roman Stoic philosophers. Their names are composed of several words. If we want to split their names at the space character, we can use str_split() like this:

ancient_philosophers %>%
  str_split(" ")

## [[1]]
## [1] "aristotle"
##
## [[2]]
## [1] "plato"
##
## [[3]]
## [1] "epictetus"
##
## [[4]]
## [1] "seneca" "the" "younger"
##
## [[5]]
## [1] "epicurus"
##
## [[6]]
## [1] "marcus" "aurelius"

str_split() also has a simplify = TRUE option:

ancient_philosophers %>%
  str_split(" ", simplify = TRUE)

## [,1] [,2] [,3]
## [1,] "aristotle" "" ""
## [2,] "plato" "" ""
## [3,] "epictetus" "" ""
## [4,] "seneca" "the" "younger"
## [5,] "epicurus" "" ""
## [6,] "marcus" "aurelius" ""

This time, the returned object is a matrix. What about str_split_fixed()? The difference is that here you can specify the number of pieces to return. For example, you could consider the name "Aurelius" to be the middle name of Marcus Aurelius, and "the younger" to be the middle name of Seneca the younger. This means that you would want to split the name only at the first space character, and not at all of them. This is easily achieved with str_split_fixed():

ancient_philosophers %>%
  str_split_fixed(" ", 2)

## [,1] [,2]
## [1,] "aristotle" ""
## [2,] "plato" ""
## [3,] "epictetus" ""
## [4,] "seneca" "the younger"
## [5,] "epicurus" ""
## [6,] "marcus" "aurelius"

This gives the expected result. So how does this help in our case? Well, if you look at what the ALTO file looks like, at the beginning of this section, you will notice that every line ends with the ">" character. So let’s split at that character!

winchester_text <- winchester[43] %>%
  str_split(">")

Let’s take a closer look at winchester_text:

str(winchester_text)

## List of 1
## $ : chr [1:19706] "</processingStepSettings" "<processingSoftware" "<softwareCreator" "iArchives</softwareCreator" ...

So this is a list of length one, and the first, and only, element of that list is an atomic vector with 19706 elements. Since this is a list of only one element, we can simplify it by saving the atomic vector in a variable:

winchester_text <- winchester_text[[1]]

Let’s now look at some lines:

winchester_text[1232:1245]

## [1] "<SP WIDTH=\"66.0\" HPOS=\"5763.0\" VPOS=\"9696.0\"/"
## [2] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"612.0\" HPOS=\"5829.0\" VPOS=\"9693.0\" CONTENT=\"Louisville\" WC=\"1.0\""
## [3] "<ALTERNATIVE"
## [4] "Loniile</ALTERNATIVE"
## [5] "<ALTERNATIVE"
## [6] "Lenities</ALTERNATIVE"
## [7] "</String"
## [8] "</TextLine"
## [9] "<TextLine HEIGHT=\"150.0\" WIDTH=\"2520.0\" HPOS=\"4032.0\" VPOS=\"9849.0\""
## [10] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"510.0\" HPOS=\"4032.0\" VPOS=\"9861.0\" CONTENT=\"Tobacco\" WC=\"1.0\"/"
## [11] "<SP WIDTH=\"113.0\" HPOS=\"4542.0\" VPOS=\"9861.0\"/"
## [12] "<String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"696.0\" HPOS=\"4656.0\" VPOS=\"9861.0\" CONTENT=\"Warehouse\" WC=\"1.0\""
## [13] "<ALTERNATIVE"
## [14] "WHrchons</ALTERNATIVE"

This now looks easier to handle.
We can narrow it down to the lines that only contain the string we are interested in, “CONTENT”. First, let’s get the indices: content_winchester_index <- winchester_text %>% str_which("CONTENT") How many lines contain the string “CONTENT”? length(content_winchester_index) ## [1] 4462 As you can see, this reduces the amount of data we have to work with. Let us save this in a new variable: content_winchester <- winchester_text[content_winchester_index] 4.7.3.4 Matching strings Matching strings is useful, but only in combination with regular expressions. As stated at the beginning of this section, we are going to learn about regular expressions in Chapter 10, but in order to make this section useful, we are going to learn the easiest, but perhaps most useful, regular expression: .*. Let’s go back to our ancient philosophers, use str_match() and see what happens. Let’s match the “us” string: ancient_philosophers %>% str_match("us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "us" ## [4,] NA ## [5,] "us" ## [6,] "us" Not very useful, but what about the regular expression .*? How could it help? ancient_philosophers %>% str_match(".*us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "epictetus" ## [4,] NA ## [5,] "epicurus" ## [6,] "marcus aurelius" That’s already very interesting! So how does .* work? To understand, let’s first start by using . alone: ancient_philosophers %>% str_match(".us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "tus" ## [4,] NA ## [5,] "rus" ## [6,] "cus" This also matched whatever symbol comes just before the “u” from “us”. What if we use two . instead? ancient_philosophers %>% str_match("..us") ## [,1] ## [1,] NA ## [2,] NA ## [3,] "etus" ## [4,] NA ## [5,] "urus" ## [6,] "rcus" This time, we get the two symbols that immediately precede “us”. Instead of continuing like this, we now use *, which matches the preceding symbol (here .) zero or more times. So by combining * and ., we can match any symbol repeatedly, until there is nothing more to match. Note that there is also +, which works similarly to *, but it matches the preceding symbol one or more times. There is also a str_match_all(): ancient_philosophers %>% str_match_all(".*us") ## [[1]] ## [,1] ## ## [[2]] ## [,1] ## ## [[3]] ## [,1] ## [1,] "epictetus" ## ## [[4]] ## [,1] ## ## [[5]] ## [,1] ## [1,] "epicurus" ## ## [[6]] ## [,1] ## [1,] "marcus aurelius" In this particular case it does not change the end result, but keep it in mind for cases like this one: c("haha", "huhu") %>% str_match("ha") ## [,1] ## [1,] "ha" ## [2,] NA and: c("haha", "huhu") %>% str_match_all("ha") ## [[1]] ## [,1] ## [1,] "ha" ## [2,] "ha" ## ## [[2]] ## [,1] What if we want to match names containing the letter “t”? Easy: ancient_philosophers %>% str_match(".*t.*") ## [,1] ## [1,] "aristotle" ## [2,] "plato" ## [3,] "epictetus" ## [4,] "seneca the younger" ## [5,] NA ## [6,] NA So how does this help us with our historical newspaper? Let’s try to get the strings that come after “CONTENT”: winchester_content <- winchester_text %>% str_match("CONTENT.*") Let’s use our faithful str() function to take a look: winchester_content %>% str ## chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ... Hmm, there are a lot of NA values! This is because a lot of the lines from the file did not have the string “CONTENT”, so there is no match possible. Let’s remove all these NAs. Because the result is a matrix, we cannot use the filter() function from {dplyr}. 
So we need to convert it to a tibble first: winchester_content <- winchester_content %>% as.tibble() %>% filter(!is.na(V1)) ## Warning: `as.tibble()` was deprecated in tibble 2.0.0. ## Please use `as_tibble()` instead. ## The signature and semantics have changed, see `?as_tibble`. ## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0. ## Using compatibility `.name_repair`. Because matrix columns do not have names, when a matrix gets converted into a tibble, the first column automatically gets called V1. This is why I filter on this column. Let’s take a look at the data: head(winchester_content) ## # A tibble: 6 × 1 ## V1 ## <chr> ## 1 "CONTENT=\\"J\\" WC=\\"0.8095238\\"/" ## 2 "CONTENT=\\"a\\" WC=\\"0.8095238\\"/" ## 3 "CONTENT=\\"Ira\\" WC=\\"0.95238096\\"/" ## 4 "CONTENT=\\"mj\\" WC=\\"0.8095238\\"/" ## 5 "CONTENT=\\"iI\\" WC=\\"0.8095238\\"/" ## 6 "CONTENT=\\"tE1r\\" WC=\\"0.8095238\\"/" 4.7.3.5 Searching and replacing strings We are getting close to the final result. We still need to do some cleaning, however. Since our data is inside a nice tibble, we might as well stick with it. So let’s first rename the column and change all the strings to lowercase: winchester_content <- winchester_content %>% mutate(content = tolower(V1)) %>% select(-V1) Let’s take a look at the result: head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "content=\\"j\\" wc=\\"0.8095238\\"/" ## 2 "content=\\"a\\" wc=\\"0.8095238\\"/" ## 3 "content=\\"ira\\" wc=\\"0.95238096\\"/" ## 4 "content=\\"mj\\" wc=\\"0.8095238\\"/" ## 5 "content=\\"ii\\" wc=\\"0.8095238\\"/" ## 6 "content=\\"te1r\\" wc=\\"0.8095238\\"/" The second part of the string, “wc=….”, is not really interesting. Let’s search and replace this with an empty string, using str_replace(): winchester_content <- winchester_content %>% mutate(content = str_replace(content, "wc.*", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "content=\\"j\\" " ## 2 "content=\\"a\\" " ## 3 "content=\\"ira\\" " ## 4 "content=\\"mj\\" " ## 5 "content=\\"ii\\" " ## 6 "content=\\"te1r\\" " We need to use the regular expression from before to replace “wc” and every character that follows. 
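A quick aside of my own (not from the original chapter): str_replace() only replaces the first occurrence of the pattern in each string. To replace every occurrence, there is str_replace_all(), which we will need below to remove the quote characters. A toy example makes the difference clear: str_replace("banana", "a", "o") ## [1] "bonana" str_replace_all("banana", "a", "o") ## [1] "bonono" 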
The same can be used to remove “content=”: winchester_content <- winchester_content %>% mutate(content = str_replace(content, "content=", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "\\"j\\" " ## 2 "\\"a\\" " ## 3 "\\"ira\\" " ## 4 "\\"mj\\" " ## 5 "\\"ii\\" " ## 6 "\\"te1r\\" " We are almost done, but some cleaning is still necessary: 4.7.3.6 Extracting or removing strings Now, because I know the ALTO spec, I know how to find words that are split across two lines: winchester_content %>% filter(str_detect(content, "hyppart")) ## # A tibble: 64 × 1 ## content ## <chr> ## 1 "\\"aver\\" subs_type=\\"hyppart1\\" subs_content=\\"average\\" " ## 2 "\\"age\\" subs_type=\\"hyppart2\\" subs_content=\\"average\\" " ## 3 "\\"considera\\" subs_type=\\"hyppart1\\" subs_content=\\"consideration\\" " ## 4 "\\"tion\\" subs_type=\\"hyppart2\\" subs_content=\\"consideration\\" " ## 5 "\\"re\\" subs_type=\\"hyppart1\\" subs_content=\\"resigned\\" " ## 6 "\\"signed\\" subs_type=\\"hyppart2\\" subs_content=\\"resigned\\" " ## 7 "\\"install\\" subs_type=\\"hyppart1\\" subs_content=\\"installed\\" " ## 8 "\\"ed\\" subs_type=\\"hyppart2\\" subs_content=\\"installed\\" " ## 9 "\\"be\\" subs_type=\\"hyppart1\\" subs_content=\\"before\\" " ## 10 "\\"fore\\" subs_type=\\"hyppart2\\" subs_content=\\"before\\" " ## # … with 54 more rows For instance, the word “average” was split over two lines: the first part of the word, “aver”, on the first line, and the second part, “age”, on the second line. We want to keep what comes after “subs_content”. Let’s extract the word “average” using str_extract_all(). However, because only some words were split between two lines, we first need to detect where the string “hyppart1” is located, and only then can we extract what comes after “subs_content”. Thus, we need to combine str_detect() to first detect the string, and then str_extract_all() to extract what comes after “subs_content”: winchester_content <- winchester_content %>% mutate(content = if_else(str_detect(content, "hyppart1"), str_extract_all(content, "content=.*", simplify = TRUE), content)) Let’s take a look at the result: winchester_content %>% filter(str_detect(content, "content")) ## # A tibble: 64 × 1 ## content ## <chr> ## 1 "content=\\"average\\" " ## 2 "\\"age\\" subs_type=\\"hyppart2\\" subs_content=\\"average\\" " ## 3 "content=\\"consideration\\" " ## 4 "\\"tion\\" subs_type=\\"hyppart2\\" subs_content=\\"consideration\\" " ## 5 "content=\\"resigned\\" " ## 6 "\\"signed\\" subs_type=\\"hyppart2\\" subs_content=\\"resigned\\" " ## 7 "content=\\"installed\\" " ## 8 "\\"ed\\" subs_type=\\"hyppart2\\" subs_content=\\"installed\\" " ## 9 "content=\\"before\\" " ## 10 "\\"fore\\" subs_type=\\"hyppart2\\" subs_content=\\"before\\" " ## # … with 54 more rows We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”, which are not needed now: winchester_content <- winchester_content %>% mutate(content = str_replace(content, "content=", "")) %>% mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content)) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "\\"j\\" " ## 2 "\\"a\\" " ## 3 "\\"ira\\" " ## 4 "\\"mj\\" " ## 5 "\\"ii\\" " ## 6 "\\"te1r\\" " Almost done! 
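A small aside on the if_else() call above (my own note, not from the original text): {dplyr}’s if_else() is stricter than base R’s ifelse(), as both branches must be of the same type. Since content is a character column, we spell out NA_character_ rather than a plain NA, which is of type logical and may be rejected with an error depending on your version of {dplyr}. For example: if_else(c(TRUE, FALSE), NA_character_, "kept") ## [1] NA "kept" 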
We only need to remove the \" characters: winchester_content <- winchester_content %>% mutate(content = str_replace_all(content, "\\"", "")) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 "j " ## 2 "a " ## 3 "ira " ## 4 "mj " ## 5 "ii " ## 6 "te1r " Let’s remove the remaining whitespace with str_trim(): winchester_content <- winchester_content %>% mutate(content = str_trim(content)) head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 j ## 2 a ## 3 ira ## 4 mj ## 5 ii ## 6 te1r To finish off this section, let’s remove stop words (words that do not add any meaning to a sentence, such as “as”, “and”…) and words that are composed of 3 characters or fewer. You can find a dataset with stopwords inside the {stopwords} package: library(stopwords) data(data_stopwords_stopwordsiso) eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en) winchester_content <- winchester_content %>% anti_join(eng_stopwords) %>% filter(nchar(content) > 3) ## Joining, by = "content" head(winchester_content) ## # A tibble: 6 × 1 ## content ## <chr> ## 1 te1r ## 2 jilas ## 3 edition ## 4 winchester ## 5 news ## 6 injuries That’s it for this section! You now know how to work with strings, but in Chapter 10 we are going one step further by learning about regular expressions, which offer much more power. 4.7.4 Tidy data frames with {tibble} We have already seen and used several functions from the {tibble} package. Let’s now go through some more useful functions. 4.7.4.1 Creating tibbles tribble() makes it easy to create a tibble row by row, manually; we will use it again below to recreate the survey_data dataset. It is also possible to create a tibble from a named list with as_tibble(): as_tibble(list("combustion" = c("oil", "diesel", "oil", "electric"), "doors" = c(3, 5, 5, 5))) ## # A tibble: 4 × 2 ## combustion doors ## <chr> <dbl> ## 1 oil 3 ## 2 diesel 5 ## 3 oil 5 ## 4 electric 5 If the elements of the named list have different lengths, you can use enframe(), which converts the list into a two-column tibble of names and values: enframe(list("combustion" = c(1,2), "doors" = c(1,2,4), "cylinders" = c(1,8,9,10))) ## # A tibble: 3 × 2 ## name value ## <chr> <list> ## 1 combustion <dbl [2]> ## 2 doors <dbl [3]> ## 3 cylinders <dbl [4]> 4.8 List-columns To learn about list-columns, let’s first focus on a single character of the starwars dataset: data(starwars) starwars %>% filter(name == "Luke Skywalker") %>% glimpse() ## Rows: 1 ## Columns: 14 ## $ name <chr> "Luke Skywalker" ## $ height <int> 172 ## $ mass <dbl> 77 ## $ hair_color <chr> "blond" ## $ skin_color <chr> "fair" ## $ eye_color <chr> "blue" ## $ birth_year <dbl> 19 ## $ sex <chr> "male" ## $ gender <chr> "masculine" ## $ homeworld <chr> "Tatooine" ## $ species <chr> "Human" ## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return … ## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike"> ## $ starships <list> <"X-wing", "Imperial shuttle"> We see that the columns films, vehicles and starships (at the bottom) are all lists, and in the case of films, it lists all the films where Luke Skywalker has appeared. What if you want to take a closer look at films where Luke Skywalker appeared? starwars %>% filter(name == "Luke Skywalker") %>% pull(films) ## [[1]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" pull() is a {dplyr} function that extracts (pulls) the column you’re interested in. It is quite useful when you want to inspect a column. 
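As an aside (my own illustration, not from the original text), pull() returns the underlying vector itself, so it is essentially a pipe-friendly equivalent of the $ and [[ operators: identical(starwars %>% pull(films), starwars[["films"]]) ## [1] TRUE 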
Instead of just looking at Luke Skywalker’s films, let’s pull the complete films column instead: starwars %>% head() %>% # let's just look at the first six rows pull(films) ## [[1]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" ## ## [[2]] ## [1] "The Empire Strikes Back" "Attack of the Clones" ## [3] "The Phantom Menace" "Revenge of the Sith" ## [5] "Return of the Jedi" "A New Hope" ## ## [[3]] ## [1] "The Empire Strikes Back" "Attack of the Clones" ## [3] "The Phantom Menace" "Revenge of the Sith" ## [5] "Return of the Jedi" "A New Hope" ## [7] "The Force Awakens" ## ## [[4]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## ## [[5]] ## [1] "The Empire Strikes Back" "Revenge of the Sith" ## [3] "Return of the Jedi" "A New Hope" ## [5] "The Force Awakens" ## ## [[6]] ## [1] "Attack of the Clones" "Revenge of the Sith" "A New Hope" Let’s stop here a moment. As you see, the films column contains several items in it. How is it possible that a single cell contains more than one film? This is because what is actually contained in the cell is not several films as separate character strings, but a single atomic vector that happens to have several elements. It is still only one vector. Zooming into the data frame helps to understand: In the picture above we see three columns. The first two, name and sex, are what you’re used to seeing: just one element defining the character’s name and sex respectively. The last one also contains only one element for each character; it just so happens to be a complete vector of characters. Because what is inside the cells of a list-column can be very different things (as lists can contain anything), you have to think a bit about it in order to extract insights from such columns. List-columns may seem arcane, but they are extremely powerful once you master them. As an example, suppose we want to create a numerical variable which counts the number of movies in which the characters have appeared. For this we need to compute the length of the list, or count the number of elements this list has. Let’s try with length(), a base R function: starwars %>% filter(name == "Luke Skywalker") %>% pull(films) %>% length() ## [1] 1 This might be surprising, but remember that a list with only one element has a length of 1: length( list(words) # this creates a list with one element; this element is a vector of 980 words ) ## [1] 1 Even though words contains 980 words, if we put this very long vector inside a list as its only element, length(list(words)) computes the length of the list, which is 1. Let’s see what happens if we create a more complex list: numbers <- seq(1, 5) length( list(words, # the first element is the vector of 980 words numbers) # numbers contains numbers 1 through 5 ) ## [1] 2 list(words, numbers) is now a list of two elements, words and numbers. If we want to compute the length of words and numbers, we need to learn about another powerful concept called higher-order functions. We are going to learn about this in greater detail in Chapter 8. 
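(A quick aside of my own before we move on: base R also has lengths(), with an s, which returns the length of each element of a list: lengths(list(words, numbers)) ## [1] 980 5 But the higher-order functions of Chapter 8 are much more general, so that is the approach we will focus on.) 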
For now, let’s use the fact that our list films is contained inside a data frame, and use a convenience function included in {dplyr} to handle situations like this: starwars <- starwars %>% rowwise() %>% # <- Apply the next steps for each row individually mutate(n_films = length(films)) dplyr::rowwise() is useful when working with list-columns because whatever instructions follow get run on each row individually, and thus on the single element contained in the list for that row. The picture below illustrates this: Let’s take a look at the characters and the number of films they have appeared in: starwars %>% select(name, films, n_films) ## # A tibble: 87 × 3 ## # Rowwise: ## name films n_films ## <chr> <list> <int> ## 1 Luke Skywalker <chr [5]> 5 ## 2 C-3PO <chr [6]> 6 ## 3 R2-D2 <chr [7]> 7 ## 4 Darth Vader <chr [4]> 4 ## 5 Leia Organa <chr [5]> 5 ## 6 Owen Lars <chr [3]> 3 ## 7 Beru Whitesun lars <chr [3]> 3 ## 8 R5-D4 <chr [1]> 1 ## 9 Biggs Darklighter <chr [1]> 1 ## 10 Obi-Wan Kenobi <chr [6]> 6 ## # … with 77 more rows Now we can, for example, create a factor variable that groups characters by asking whether they appeared only in 1 movie, or more: starwars <- starwars %>% mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie", n_films > 1 ~ "More than 1 movie")) You can also create list-columns with your own datasets, by using tidyr::nest(). Remember the fake survey_data I created to illustrate pivot_longer() and pivot_wider()? Let’s go back to that dataset again: survey_data <- tribble( ~id, ~variable, ~value, 1, "var1", 1, 1, "var2", 0.2, NA, "var3", 0.3, 2, "var1", 1.4, 2, "var2", 1.9, 2, "var3", 4.1, 3, "var1", 0.1, 3, "var2", 2.8, 3, "var3", 8.9, 4, "var1", 1.7, NA, "var2", 1.9, 4, "var3", 7.6 ) print(survey_data) ## # A tibble: 12 × 3 ## id variable value ## <dbl> <chr> <dbl> ## 1 1 var1 1 ## 2 1 var2 0.2 ## 3 NA var3 0.3 ## 4 2 var1 1.4 ## 5 2 var2 1.9 ## 6 2 var3 4.1 ## 7 3 var1 0.1 ## 8 3 var2 2.8 ## 9 3 var3 8.9 ## 10 4 var1 1.7 ## 11 NA var2 1.9 ## 12 4 var3 7.6 nested_data <- survey_data %>% group_by(id) %>% nest() glimpse(nested_data) ## Rows: 5 ## Columns: 2 ## Groups: id [5] ## $ id <dbl> 1, NA, 2, 3, 4 ## $ data <list> [<tbl_df[2 x 2]>], [<tbl_df[2 x 2]>], [<tbl_df[3 x 2]>], [<tbl_df… This creates a new tibble, with columns id and data. data is a list-column that contains tibbles; each tibble contains the variable and value columns for one individual: nested_data %>% filter(id == "1") %>% pull(data) ## [[1]] ## # A tibble: 2 × 2 ## variable value ## <chr> <dbl> ## 1 var1 1 ## 2 var2 0.2 As you can see, for individual 1, the column data contains a 2x2 tibble with columns variable and value. Because group_by() followed by nest() is so useful, there is a wrapper around these two functions called group_nest(): survey_data %>% group_nest(id) ## # A tibble: 5 × 2 ## id data ## <dbl> <list<tibble[,2]>> ## 1 1 [2 × 2] ## 2 2 [3 × 2] ## 3 3 [3 × 2] ## 4 4 [2 × 2] ## 5 NA [2 × 2] You might be wondering why this is useful, because this seems to introduce an unnecessary layer of complexity. The usefulness of list-columns will become apparent in the next chapters, where we are going to learn how to repeat actions over, say, individuals. So if you’ve reached the end of this section and still didn’t really grok list-columns, go take some fresh air and come back to this section again later on. 4.9 Going beyond descriptive statistics and data manipulation The {tidyverse} collection of packages can do much more than simply data manipulation and descriptive statistics. 
You can use the principles we have covered and the functions you now know to do much more. For instance, you can use a few {tidyverse} functions to do Monte Carlo simulations, for example to estimate \\(\\pi\\). Draw the unit circle inside the unit square; the ratio of the area of the circle to the area of the square is \\(\\pi/4\\). If you then shoot K arrows at the square, roughly \\(K*\\pi/4\\) of them should fall inside the circle. So if you shoot N arrows at the square, and M fall inside the circle, you have the approximate relationship \\(M = N*\\pi/4\\). You can thus compute \\(\\pi\\) like so: \\(\\pi = 4*M/N\\). The more arrows N you throw at the square, the better approximation of \\(\\pi\\) you’ll have. Let’s try to do this with a tidy Monte Carlo simulation. First, let’s randomly pick some points inside the unit square: library(tidyverse) n <- 5000 set.seed(2019) points <- tibble("x" = runif(n), "y" = runif(n)) Now, to know if a point is inside the unit circle, we need to check whether \\(x^2 + y^2 < 1\\). Let’s add a new column to the points tibble, called inside, equal to 1 if the point is inside the unit circle and 0 if not: points <- points %>% mutate(inside = map2_dbl(.x = x, .y = y, ~ifelse(.x**2 + .y**2 < 1, 1, 0))) %>% rowid_to_column("N") Let’s take a look at points: points ## # A tibble: 5,000 × 4 ## N x y inside ## <int> <dbl> <dbl> <dbl> ## 1 1 0.770 0.984 0 ## 2 2 0.713 0.0107 1 ## 3 3 0.303 0.133 1 ## 4 4 0.618 0.0378 1 ## 5 5 0.0505 0.677 1 ## 6 6 0.0432 0.0846 1 ## 7 7 0.820 0.727 0 ## 8 8 0.00961 0.0758 1 ## 9 9 0.102 0.373 1 ## 10 10 0.609 0.676 1 ## # … with 4,990 more rows Now, I can compute the estimation of \\(\\pi\\) at each row, by computing the cumulative sum of the 1’s in the inside column and dividing that by the current value of the N column: points <- points %>% mutate(estimate = 4*cumsum(inside)/N) cumsum(inside) is the M from the formula. Now, we can finish by plotting the result: ggplot(points) + geom_line(aes(y = estimate, x = N)) + geom_hline(yintercept = pi) In the next chapter, we are going to learn all about {ggplot2}, the package I used in the lines above to create this plot. As the number of tries grows, the estimation of \\(\\pi\\) gets better. Using a data frame as a structure to hold our simulated points and the results makes it very easy to avoid loops, and thus write code that is more concise and easier to follow. If you studied a quantitative field in university, you might have done a similar exercise at the time, very likely by defining a matrix to hold your points, and an empty vector to hold whether a particular point was inside the unit circle. Then you wrote a loop to compute whether each point was inside the unit circle, saved this result in the previously defined empty vector and then computed the estimation of \\(\\pi\\). Again, I take this opportunity here to stress that there is nothing wrong with this approach per se, but R is better suited for a workflow where lists or data frames are the central objects and where the analyst operates over them with functional programming techniques. 4.10 Exercises Exercise 1 Combine mutate() with across() to exponentiate every column of type double of the gasoline dataset. To obtain the gasoline dataset, run the following lines: data(Gasoline, package = "plm") gasoline <- as_tibble(Gasoline) gasoline <- gasoline %>% mutate(country = tolower(country)) Exponentiate columns starting with the character \"l\" of the gasoline dataset. Convert all columns’ classes into the character class. 
Exercise 2 Load the LaborSupply dataset from the {Ecdat} package and answer the following questions: Compute the average annual hours worked by year (plus standard deviation) What age group worked the most hours in the year 1982? Create a variable, n_years, that equals the number of years an individual stays in the panel. Is the panel balanced? Which are the individuals that do not have any kids during the whole period? Create a variable, no_kids, that flags these individuals (1 = no kids, 0 = kids) Using the no_kids variable from before compute the average wage, standard deviation and number of observations in each group for the year 1980 (no kids group vs kids group). Create the lagged logarithm of hours worked and wages. Remember that this is a panel. Exercise 3 What does the following code do? Copy and paste it in an R interpreter to find out! LaborSupply %>% group_by(id) %>% mutate(across(starts_with("l"), tibble::lst(lag, lead))) Using summarise() and across(), compute the mean, standard deviation and number of observations of lnhr and lnwg for each individual. Chapter 5 Graphs By default, it is possible to make a lot of graphs with R without the need for any external packages. However, in this chapter, we are going to learn how to make graphs using {ggplot2}, which is a very powerful package that produces amazing graphs. There is an entry cost to {ggplot2} as it works in a very different way than what you would expect, especially if you know how to make plots with the basic R functions already. But the resulting graphs are well worth the effort and once you know more about {ggplot2} you will see that in a lot of situations it is actually faster and easier. Another advantage is that making plots with {ggplot2} is consistent, so you do not need to learn anything specific to make, say, density plots. There are a lot of extensions to {ggplot2}, such as {ggridges} to create so-called ridge plots and {gganimate} to create animated plots. By the end of this chapter you will know how to do basic plots with {ggplot2} and also how to use these two extensions. 5.1 Resources Before showing some examples and the general functionality of {ggplot2}, I list here some online resources that I keep coming back to: Data Visualization for Social Science R Graphics Cookbook R graph gallery Tufte in R ggplot2 extensions ggthemes function reference ggplot2 cheatsheet When I first started using {ggplot2}, I had a cookbook approach to it; I tried to find examples online that looked like what I needed, copied and pasted the code and then adapted it to my case. The above resources are the ones I consulted and keep consulting in these situations (I also go back to past code I’ve written, of course). Don’t hesitate to skim these resources for inspiration and to learn more about some extensions to {ggplot2}. In the next subsections I am going to show you how to draw the most common plots, as well as show you how to customize your plots with {ggthemes}, a package that contains pre-defined themes for {ggplot2}. 5.2 Examples I think that the best way to learn how to use {ggplot2} is to jump right into it. Let’s first start with barplots. 5.2.1 Barplots To follow the examples below, load the following libraries: library(ggplot2) library(ggthemes) {ggplot2} is an implementation of the Grammar of Graphics by Wilkinson (2006), but you don’t need to read the book to start using it. 
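(A side note of my own: if you load the whole {tidyverse} collection with library(tidyverse), as we did in the previous chapters, {ggplot2} gets attached automatically; {ggthemes}, however, always needs to be loaded separately.) 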
If we go back to the Star Wars data (contained in dplyr), and wish to draw a barplot of the gender, the following lines are enough: ggplot(starwars, aes(gender)) + geom_bar() The first argument of the function is the data (called starwars in this example), and then the function aes(). This function is where you list the variables that you want to map to the aesthetics of the geom functions. On the second line, you see that we use the geom_bar() function. This function creates a barplot of the gender variable. You can get different kinds of plots by using different geom_ functions. You can also provide the aes() argument to the geom_*() function: ggplot(starwars) + geom_bar(aes(gender)) The difference between these two approaches is that when you specify the aesthetics in the ggplot() function, all the geom_*() functions that follow will inherit these aesthetics. This is useful if you want to avoid writing the same code over and over again, but can be problematic if you need to specify different aesthetics to different geom_*() functions. This will become clear in a later example. You can add options to your plots, for instance, you can change the coordinate system in your barplot: ggplot(starwars, aes(gender)) + geom_bar() + coord_flip() This is the basic recipe to create plots using {ggplot2}: start with a call to ggplot() where you specify the data you want to plot, and optionally the aesthetics. Then, use the geom_*() function you need; if you did not specify the aesthetics in the call to the ggplot() function, do it here. Then, you can add different options, such as changing the coordinate system, changing the theme, the colour palette used, changing the position of the legend and much, much more. This chapter will only give you an overview of the capabilities of {ggplot2}. 5.2.2 Scatter plots Scatter plots are very useful, especially if you are trying to figure out the relationship between two variables. For instance, let’s make a scatter plot of height vs mass of Star Wars characters: ggplot(starwars) + geom_point(aes(height, mass)) As you can see there is an outlier; a very heavy character! Star Wars fans already guessed it: it’s Jabba the Hutt. To make the plot easier to read, let’s remove this outlier: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass)) There is a positive correlation between height and mass, which we can visualize by adding geom_smooth() with the option method = \"lm\": starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot(aes(height, mass)) + geom_point(aes(height, mass)) + geom_smooth(method = "lm") ## `geom_smooth()` using formula 'y ~ x' I’ve moved the aes(height, mass) up to the ggplot() function because both geom_point() and geom_smooth() need them, and as explained in the beginning of this section, the aesthetics listed in ggplot() get passed down to the other geoms. If you omit method = \"lm\", you get a non-parametric curve: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot(aes(height, mass)) + geom_point(aes(height, mass)) + geom_smooth() ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' 5.2.3 Density Use geom_density() to get density plots: ggplot(starwars, aes(height)) + geom_density() ## Warning: Removed 6 rows containing non-finite values (stat_density). Let’s go into more detail now; what if you would like to plot the densities for feminines and masculines only (removing the droids from the data first)? 
This can be done by first filtering the data using dplyr and then separating the dataset by gender: starwars %>% filter(gender %in% c("feminine", "masculine")) The above lines do the filtering; only keep gender if gender is in the vector \"feminine\", \"masculine\". This is much easier than having to write gender == \"feminine\" | gender == \"masculine\". Then, we pipe this dataset to ggplot: starwars %>% filter(gender %in% c("feminine", "masculine")) %>% ggplot(aes(height, fill = gender)) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). Let’s take a closer look at the aes() function: I’ve added fill = gender. This means that there will be one density plot for each gender in the data, and each will be coloured accordingly. This is where {ggplot2} might be confusing; there is no need to write explicitly (even if it is possible) that you want the feminine density to be red and the masculine density to be blue. You just map the variable gender to this particular aesthetic. You conclude the plot by adding geom_density(), which in this case is the plot you want. We will see later how to change the colours of your plot. An alternative way to write this code is first to save the filtered data in a variable, and define the aesthetics inside the geom_density() function: filtered_data <- starwars %>% filter(gender %in% c("feminine", "masculine")) ggplot(filtered_data) + geom_density(aes(height, fill = gender)) ## Warning: Removed 5 rows containing non-finite values (stat_density). 5.2.4 Line plots For the line plots, we are going to use official unemployment data (the same as in the previous chapter, but with all the available years). Get it from here (downloaded from the website of the Luxembourgish national statistical institute). Let’s plot the unemployment for the canton of Luxembourg only: unemp_lux_data <- import("datasets/unemployment/all/unemployment_lux_all.csv") unemp_lux_data %>% filter(division == "Luxembourg") %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = 1)) + geom_line() Because line plots are 2D, you need to specify the y and x axes. There is also another option you need to add, group = 1. This is to tell aes() that the dots have to be connected with a single line. What if you want to plot more than one commune? unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette")) %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) + geom_line() This time, I’ve specified group = division, which means that there has to be one line per commune in the variable division. I do the same for the colours. I think the next example illustrates how {ggplot2} is actually brilliant; if you need to add a third commune, there is no need to specify anything else; no need to add anything to the legend, no need to specify a third colour, etc.: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(x = year, y = unemployment_rate_in_percent, group = division, colour = division)) + geom_line() The three communes get mapped to the colour aesthetic, so whatever the number of communes, as long as there are enough colours, the communes will each get mapped to one of these colours. 5.2.5 Facets In some cases you have a factor variable that separates the data you wish to plot into different categories. If you want to have a plot per category you can use the facet_grid() function. 
Careful though, this function does not take a variable as an argument, but a formula, hence the ~ symbol in the code below: starwars %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(. ~ human) + #<--- this is a formula geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). I first created a factor variable that specifies if a Star Wars character is human or not, and then used it for faceting. By changing the formula, you change how the faceting is done: starwars %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(human ~ .) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). Recall the categorical variable more_1 that we computed in the previous chapter? Let’s use it as a faceting variable: starwars %>% rowwise() %>% mutate(n_films = length(films)) %>% mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie", n_films != 1 ~ "More than 1 movie")) %>% mutate(human = case_when(species == "Human" ~ "Human", species != "Human" ~ "Not Human")) %>% filter(gender %in% c("feminine", "masculine"), !is.na(human)) %>% ggplot(aes(height, fill = gender)) + facet_grid(human ~ more_1) + geom_density() ## Warning: Removed 5 rows containing non-finite values (stat_density). 5.2.6 Pie Charts I am not a huge fan of pie charts, but sometimes this is what you have to do. So let’s see how you can create pie charts. First, let’s create a mock dataset with the function tibble::tribble(), which allows you to create a dataset line by line: test_data <- tribble( ~id, ~var1, ~var2, ~var3, ~var4, ~var5, "a", 26.5, 38, 30, 32, 34, "b", 30, 30, 28, 32, 30, "c", 34, 32, 30, 28, 26.5 ) This data is in the wide format though; we need to have it in the long format for it to work with {ggplot2}. For this, let’s use tidyr::gather() as seen in the previous chapter: test_data_long <- test_data %>% gather(variable, value, starts_with("var")) Now, let’s plot this data, first by creating 3 bar plots: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") In the code above, I introduce a new option, called stat = \"identity\". By default, geom_bar() counts the number of observations of each category that is plotted, which is a statistical transformation. By adding stat = \"identity\", I force the statistical transformation to be the identity function, and thus plot the data as is. To create the pie chart, first we need to compute the share of each variable, var1, var2, etc., within each id. To do this, we first group by id, then compute the total. Then we use a new function, ungroup(). After using ungroup() all the computations are done on the whole dataset instead of by group, which is what we need to compute the share: test_data_long <- test_data_long %>% group_by(id) %>% mutate(total = sum(value)) %>% ungroup() %>% mutate(share = value/total) Let’s take a look to see if this is what we wanted: print(test_data_long) ## # A tibble: 15 × 5 ## id variable value total share ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 a var1 26.5 160. 0.165 ## 2 b var1 30 150 0.2 ## 3 c var1 34 150. 0.226 ## 4 a var2 38 160. 0.237 ## 5 b var2 30 150 0.2 ## 6 c var2 32 150. 0.213 ## 7 a var3 30 160. 
0.187 ## 8 b var3 28 150 0.187 ## 9 c var3 30 150. 0.199 ## 10 a var4 32 160. 0.199 ## 11 b var4 32 150 0.213 ## 12 c var4 28 150. 0.186 ## 13 a var5 34 160. 0.212 ## 14 b var5 30 150 0.2 ## 15 c var5 26.5 150. 0.176 If you didn’t understand what ungroup() did, rerun the last few lines without it and inspect the output. To plot the pie chart, we create a barplot again, but specify polar coordinates: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(y = share, x = "", fill = variable), stat = "identity") + theme() + coord_polar("y", start = 0) As you can see, this typical pie chart is not very easy to read; compared to the barplots above, it is not easy to distinguish the shares of the different variables. You can change the look of the pie chart, for example by specifying variable as the x: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(y = share, x = variable, fill = variable), stat = "identity") + theme() + coord_polar("x", start = 0) But as a general rule, avoid pie charts if possible. I find that pie charts are only interesting if you need to show proportions that are hugely unequal, to really emphasize the difference between said proportions. 5.2.7 Adding text to plots Sometimes you might want to add some text to your plots. This is possible with geom_text(): ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) You can put anything after label =, but in general what you want are the values, so that’s what I put there. But you can also refine it; imagine the values are actually in euros: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = paste(value, "€"))) You can also achieve something similar with geom_label(): ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_label(aes(variable, value + 1.5, label = paste(value, "€"))) 5.3 Customization Every plot you’ve seen until now was made with the default look of {ggplot2}. If you want to change the look, you can apply a theme, and a colour scheme. Let’s take a look at themes first by using the ones found in the package {ggthemes}. But first, let’s learn how to change the names of the axes and how to title a plot. 5.3.1 Changing titles, axes labels, options, mixing geoms and changing themes The name of this subsection is quite long, but this is because everything is kind of linked. Let’s start by learning what the labs() function does. To change the title of the plot, and of the axes, you need to pass the names to the labs() function: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() What if you want to make the lines thicker? unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line(size = 2) Each geom_*() function has its own options. Notice that the size = 2 argument is not inside an aes() function. 
This is because I do not want to map a variable of the data to the size of the line; in other words, I do not want to make the size of the line proportional to a certain variable in the data. Recall the scatter plot we did earlier, where we showed that height and mass of Star Wars characters increased together? Let’s take this plot again, but make the size of the dots proportional to the birth year of the character: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year)) Making the size proportional to the birth year (the age would have been more informative) allows us to see a third dimension. It is also possible to “see” a fourth dimension, the gender for instance, by changing the colour of the dots: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) As I promised above, we are now going to learn how to add a regression line to this scatter plot: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass), method = "lm") ## `geom_smooth()` using formula 'y ~ x' geom_smooth() adds a regression line, but only if you specify method = \"lm\" (“lm” stands for “linear model”). What happens if you remove this option? starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass)) ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' By default, geom_smooth() does a non-parametric regression called LOESS (locally estimated scatterplot smoothing), which is more flexible. It is also possible to have one regression line by gender: starwars %>% filter(!str_detect(name, "Jabba")) %>% ggplot() + geom_point(aes(height, mass, size = birth_year, colour = gender)) + geom_smooth(aes(height, mass, colour = gender)) ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' Because there are only a few observations for feminine characters, and because of the NAs, the regression lines are not very informative, but this was only an example to show you some options of geom_smooth(). Let’s go back to the unemployment line plots. For now, let’s keep the base {ggplot2} theme, but modify it a bit. For example, the legend placement is actually a feature of the theme. This means that if you want to change where the legend is placed you need to modify this feature from the theme. This is done with the function theme(): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom") + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() What I also like to do is remove the title of the legend, because it is often superfluous: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() The legend title has to be an element_text object. element_text objects are used with theme() to specify how text should be displayed. element_blank() draws nothing and assigns no space (not even blank space). 
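The same mechanism works for the other text elements of the theme; for instance (a quick sketch of my own, not from the original text), to rotate the labels on the x-axis you would set axis.text.x to an element_text object: ggplot(starwars, aes(gender)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) 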
If you want to keep the legend title but change it, you need to use element_text(): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom", legend.title = element_text(colour = "red")) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() If you want to change the word “division” to something else, you can do so by providing the colour argument to the labs() function: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme(legend.position = "bottom") + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate", colour = "Administrative division") + geom_line() You could modify every feature of the theme like that, but there are built-in themes that you can use: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() For example in the code above, I have used theme_minimal(), which I like quite a lot. You can also use themes from the {ggthemes} package, which even contains a Stata theme, if you like it: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_stata() + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() As you can see, theme_stata() has the legend on the bottom by default, because this is how the legend position is defined within the theme. However, the legend title is still there. Let’s remove it: unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_stata() + theme(legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() {ggthemes} even features an Excel 2003 theme (don’t use it though): unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_excel() + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() You can create your own theme by using a simple theme, such as theme_minimal(), as a base and then adding your options. We are going to create one theme after we learn how to create our own functions, in Chapter 7. Then, we are going to create a package to share this theme with the world, and we are going to learn how to make packages in Chapter 9. 5.3.2 Colour schemes You can also change colour schemes, by specifying either scale_colour_*() or scale_fill_*() functions. scale_colour_*() functions modify the colour aesthetic (used by points and lines for example), while scale_fill_*() functions modify the fill aesthetic (used by bars, so for barplots for example). A colour scheme I like is the Highcharts colour scheme. 
unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + scale_colour_hc() + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() An example with a barplot: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) + theme_minimal() + scale_fill_hc() It is also possible to define and use your own palette. To use your own colours you can use scale_colour_manual() and scale_fill_manual() and specify the hex codes of the colours you want to use. unemp_lux_data %>% filter(division %in% c("Luxembourg", "Esch-sur-Alzette", "Wiltz")) %>% ggplot(aes(year, unemployment_rate_in_percent, group = division, colour = division)) + theme_minimal() + scale_colour_manual(values = c("#FF336C", "#334BFF", "#2CAE00")) + theme(legend.position = "bottom", legend.title = element_blank()) + labs(title = "Unemployment in Luxembourg, Esch/Alzette and Wiltz", x = "Year", y = "Rate") + geom_line() To get the hex codes of colours you can use this online tool. There is also a very nice package, called {colourpicker}, that allows you to pick colours from within RStudio. Also, you do not even need to load it to use it, since it comes with an RStudio Addin: For a barplot you would do the same: ggplot(test_data_long) + facet_wrap(~id) + geom_bar(aes(variable, value, fill = variable), stat = "identity") + geom_text(aes(variable, value + 1.5, label = value)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank()) + scale_fill_manual(values = c("#FF336C", "#334BFF", "#2CAE00", "#B3C9C6", "#765234")) For continuous variables, things are a bit different. Let’s first create a plot where we map a continuous variable to the colour argument of aes(): ggplot(diamonds) + geom_point(aes(carat, price, colour = depth)) To change the colour, we need to use scale_color_gradient() and specify a colour for low values of the variable, and a colour for high values of the variable. For example, using the colours of the theme I made for my blog: ggplot(diamonds) + geom_point(aes(carat, price, colour = depth)) + scale_color_gradient(low = "#bec3b8", high = "#ad2c6c") 5.4 Saving plots to disk There are two ways to save plots to disk; one through the Plots pane in RStudio and another using the ggsave() function. Using RStudio, navigate to the Plots pane and click on Export. You can then choose where to save the plot and other various options: This is fine if you only generate one or two plots, but if you generate a large number of them, it is less tedious to use the ggsave() function: my_plot1 <- ggplot(my_data) + geom_bar(aes(variable)) ggsave("path/you/want/to/save/the/plot/to/my_plot1.pdf", my_plot1) There are other options that you can specify such as the width and height, resolution, units, etc… 5.5 Exercises Exercise 1 Load the Bwages dataset from the Ecdat package. Your first task is to create a new variable, educ_level, which is a factor variable that equals: “Primary school” if educ == 1 “High school” if educ == 2 “Some university” if educ == 3 “Master’s degree” if educ == 4 “Doctoral degree” if educ == 5 Use case_when() for this. Then, plot a scatter plot of wages on experience, by education level. Add a theme that you like, and remove the title of the legend. 
The scatter plot is not very useful, because you cannot make anything out. Instead, use another geom that shows you a non-parametric fit with confidence bands. References Chapter 6 Statistical models In this chapter, we will not learn about all the models out there that you may or may not need. Instead, I will show you how you can use what you have learned until now and apply these concepts to modeling. Also, as you read in the beginning of the book, R has many, many packages. So the model you need is most probably already implemented in some package and you will very likely not need to write your own from scratch. In the first section, I will discuss the terminology used in this book. Then I will discuss linear regression; showing how linear regression works illustrates very well how other models work too, without loss of generality. Then I will introduce the concept of hyper-parameters with ridge regression. This chapter will then finish with an introduction to cross-validation as a way to tune the hyper-parameters of models that feature them. 6.1 Terminology Before we continue discussing statistical models and model fitting, it is worthwhile to discuss terminology a little bit. Depending on your background, you might call an explanatory variable a feature or the dependent variable the target. These are the same objects. The matrix of features is usually called a design matrix, and what statisticians call the intercept is what machine learning engineers call the bias. Referring to the intercept as bias is unfortunate, as bias also has a very different meaning; bias is what we call the error in a model that may cause biased estimates. To finish up, the estimated parameters of the model may be called coefficients or weights. Here again, I don’t like using weight, as weight has a very different meaning in statistics. So, in the remainder of this chapter, and book, I will use the terminology from the statistical literature, using dependent and explanatory variables (y and x), and calling the estimated parameters coefficients and the intercept… well the intercept (the \\(\\beta\\)s of the model). However, I will talk of training a model, instead of estimating a model. 6.2 Fitting a model to data Suppose you have a variable y that you wish to explain using a set of other variables x1, x2, x3, etc. Let’s take a look at the Housing dataset from the Ecdat package: library(Ecdat) data(Housing) You can read a description of the dataset by running: ?Housing Housing package:Ecdat R Documentation Sales Prices of Houses in the City of Windsor Description: a cross-section from 1987 _number of observations_ : 546 _observation_ : goods _country_ : Canada Usage: data(Housing) Format: A dataframe containing : price: sale price of a house lotsize: the lot size of a property in square feet bedrooms: number of bedrooms bathrms: number of full bathrooms stories: number of stories excluding basement driveway: does the house has a driveway ? recroom: does the house has a recreational room ? fullbase: does the house has a full finished basement ? gashw: does the house uses gas for hot water heating ? airco: does the house has central air conditioning ? 
garagepl: number of garage places prefarea: is the house located in the preferred neighbourhood of the city ? Source: Anglin, P.M. and R. Gencay (1996) “Semiparametric estimation of a hedonic price function”, _Journal of Applied Econometrics_, *11(6)*, 633-648. References: Verbeek, Marno (2004) _A Guide to Modern Econometrics_, John Wiley and Sons, chapter 3. Journal of Applied Econometrics data archive : <URL: http://qed.econ.queensu.ca/jae/>. See Also: ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’, ‘Index.Observations’ or by looking for Housing in the help pane of RStudio. Usually, you would take a look at the data before doing any modeling: glimpse(Housing) ## Rows: 546 ## Columns: 12 ## $ price <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 83800… ## $ lotsize <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800, 5500, 7… ## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4, 1, 2, 3… ## $ bathrms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1… ## $ stories <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1, 2… ## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, ye… ## $ recroom <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, no, no, n… ## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, no, no, … ## $ gashw <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… ## $ airco <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no, no, no… ## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1… ## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n… Housing prices depend on a set of variables such as the number of bedrooms, the area it is located in, and so on. If you believe that housing prices depend linearly on a set of explanatory variables, you will want to estimate a linear model. To estimate a linear model, you will need to use the built-in lm() function: model1 <- lm(price ~ lotsize + bedrooms, data = Housing) lm() takes a formula as an argument, which defines the model you want to estimate. In this case, I ran the following regression: \\[ \\text{price} = \\beta_0 + \\beta_1 * \\text{lotsize} + \\beta_2 * \\text{bedrooms} + \\varepsilon \\] where \\(\\beta_0, \\beta_1\\) and \\(\\beta_2\\) are three parameters to estimate. To take a look at the results, you can use the summary() method (not to be confused with dplyr::summarise()): summary(model1) ## ## Call: ## lm(formula = price ~ lotsize + bedrooms, data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -65665 -12498 -2075 8970 97205 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.613e+03 4.103e+03 1.368 0.172 ## lotsize 6.053e+00 4.243e-01 14.265 < 2e-16 *** ## bedrooms 1.057e+04 1.248e+03 8.470 2.31e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 21230 on 543 degrees of freedom ## Multiple R-squared: 0.3703, Adjusted R-squared: 0.3679 ## F-statistic: 159.6 on 2 and 543 DF, p-value: < 2.2e-16 If you wish to remove the intercept (\\(\\beta_0\\) in the above equation) from your model, you can do so with -1: model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing) summary(model2) ## ## Call: ## lm(formula = price ~ -1 + lotsize + bedrooms, data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -67229 -12342 -1333 9627 95509 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## lotsize 6.283 0.390 16.11 <2e-16 *** ## bedrooms 11968.362 713.194 16.78 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 21250 on 544 degrees of freedom ## Multiple R-squared: 0.916, Adjusted R-squared: 0.9157 ## F-statistic: 2965 on 2 and 544 DF, p-value: < 2.2e-16 Or, if you want to use all the columns inside Housing, replace the column names by .: model3 <- lm(price ~ ., data = Housing) summary(model3) ## ## Call: ## lm(formula = price ~ ., data = Housing) ## ## Residuals: ## Min 1Q Median 3Q Max ## -41389 -9307 -591 7353 74875 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4038.3504 3409.4713 -1.184 0.236762 ## lotsize 3.5463 0.3503 10.124 < 2e-16 *** ## bedrooms 1832.0035 1047.0002 1.750 0.080733 . ## bathrms 14335.5585 1489.9209 9.622 < 2e-16 *** ## stories 6556.9457 925.2899 7.086 4.37e-12 *** ## drivewayyes 6687.7789 2045.2458 3.270 0.001145 ** ## recroomyes 4511.2838 1899.9577 2.374 0.017929 * ## fullbaseyes 5452.3855 1588.0239 3.433 0.000642 *** ## gashwyes 12831.4063 3217.5971 3.988 7.60e-05 *** ## aircoyes 12632.8904 1555.0211 8.124 3.15e-15 *** ## garagepl 4244.8290 840.5442 5.050 6.07e-07 *** ## prefareayes 9369.5132 1669.0907 5.614 3.19e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 15420 on 534 degrees of freedom ## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6664 ## F-statistic: 99.97 on 11 and 534 DF, p-value: < 2.2e-16 You can access different elements of model3 with $, because the result of lm() is a list (you can check this claim with typeof(model3)): print(model3$coefficients) ## (Intercept) lotsize bedrooms bathrms stories drivewayyes ## -4038.350425 3.546303 1832.003466 14335.558468 6556.945711 6687.778890 ## recroomyes fullbaseyes gashwyes aircoyes garagepl prefareayes ## 4511.283826 5452.385539 12831.406266 12632.890405 4244.829004 9369.513239 but I prefer to use the {broom} package, and more specifically the tidy() function, which converts model3 into a neat data.frame: results3 <- broom::tidy(model3) glimpse(results3) ## Rows: 12 ## Columns: 5 ## $ term <chr> "(Intercept)", "lotsize", "bedrooms", "bathrms", "stories", … ## $ estimate <dbl> -4038.350425, 3.546303, 1832.003466, 14335.558468, 6556.9457… ## $ std.error <dbl> 3409.4713, 0.3503, 1047.0002, 1489.9209, 925.2899, 2045.2458… ## $ statistic <dbl> -1.184451, 10.123618, 1.749764, 9.621691, 7.086369, 3.269914… ## $ p.value <dbl> 2.367616e-01, 3.732442e-22, 8.073341e-02, 2.570369e-20, 4.37… I explicitly write broom::tidy() because tidy() is a popular function name. For instance, it is also a function from the {yardstick} package, which does not do the same thing at all. Since I will also be using {yardstick}, I prefer to explicitly write broom::tidy() to avoid conflicts. Using broom::tidy() is useful, because you can then work on the results easily, for example if you wish to only keep results that are significant at the 5% level: results3 %>% filter(p.value < 0.05) ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 lotsize 3.55 0.350 10.1 3.73e-22 ## 2 bathrms 14336. 1490. 9.62 2.57e-20 ## 3 stories 6557. 925. 7.09 4.37e-12 ## 4 drivewayyes 6688. 2045. 3.27 1.15e- 3 ## 5 recroomyes 4511. 1900. 2.37 1.79e- 2 ## 6 fullbaseyes 5452. 1588. 3.43 6.42e- 4 ## 7 gashwyes 12831. 3218. 3.99 7.60e- 5 ## 8 aircoyes 12633. 1555. 8.12 3.15e-15 ## 9 garagepl 4245. 841.
5.05 6.07e- 7 ## 10 prefareayes 9370. 1669. 5.61 3.19e- 8 You can even add new columns, such as the confidence intervals: results3 <- broom::tidy(model3, conf.int = TRUE, conf.level = 0.95) print(results3) ## # A tibble: 12 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4038. 3409. -1.18 2.37e- 1 -10736. 2659. ## 2 lotsize 3.55 0.350 10.1 3.73e-22 2.86 4.23 ## 3 bedrooms 1832. 1047. 1.75 8.07e- 2 -225. 3889. ## 4 bathrms 14336. 1490. 9.62 2.57e-20 11409. 17262. ## 5 stories 6557. 925. 7.09 4.37e-12 4739. 8375. ## 6 drivewayyes 6688. 2045. 3.27 1.15e- 3 2670. 10705. ## 7 recroomyes 4511. 1900. 2.37 1.79e- 2 779. 8244. ## 8 fullbaseyes 5452. 1588. 3.43 6.42e- 4 2333. 8572. ## 9 gashwyes 12831. 3218. 3.99 7.60e- 5 6511. 19152. ## 10 aircoyes 12633. 1555. 8.12 3.15e-15 9578. 15688. ## 11 garagepl 4245. 841. 5.05 6.07e- 7 2594. 5896. ## 12 prefareayes 9370. 1669. 5.61 3.19e- 8 6091. 12648. Going back to model estimation, you can of course use lm() in a pipe workflow: Housing %>% select(-driveway, -stories) %>% lm(price ~ ., data = .) %>% broom::tidy() ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3025. 3263. 0.927 3.54e- 1 ## 2 lotsize 3.67 0.363 10.1 4.52e-22 ## 3 bedrooms 4140. 1036. 3.99 7.38e- 5 ## 4 bathrms 16443. 1546. 10.6 4.29e-24 ## 5 recroomyes 5660. 2010. 2.82 5.05e- 3 ## 6 fullbaseyes 2241. 1618. 1.38 1.67e- 1 ## 7 gashwyes 13568. 3411. 3.98 7.93e- 5 ## 8 aircoyes 15578. 1597. 9.75 8.53e-21 ## 9 garagepl 4232. 883. 4.79 2.12e- 6 ## 10 prefareayes 10729. 1753. 6.12 1.81e- 9 The first . in the lm() function is used to indicate that we wish to use all the data from Housing (minus driveway and stories, which I removed using select() and the - sign), and the second . indicates where the result of the two dplyr instructions that precede lm() is to be placed. You have to specify this because, by default, when using %>%, the left hand side argument gets passed as the first argument of the function on the right hand side. Since version 4.2, R also natively includes a placeholder, _: Housing |> select(-driveway, -stories) |> lm(price ~ ., data = _) |> broom::tidy() ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3025. 3263. 0.927 3.54e- 1 ## 2 lotsize 3.67 0.363 10.1 4.52e-22 ## 3 bedrooms 4140. 1036. 3.99 7.38e- 5 ## 4 bathrms 16443. 1546. 10.6 4.29e-24 ## 5 recroomyes 5660. 2010. 2.82 5.05e- 3 ## 6 fullbaseyes 2241. 1618. 1.38 1.67e- 1 ## 7 gashwyes 13568. 3411. 3.98 7.93e- 5 ## 8 aircoyes 15578. 1597. 9.75 8.53e-21 ## 9 garagepl 4232. 883. 4.79 2.12e- 6 ## 10 prefareayes 10729. 1753. 6.12 1.81e- 9 For the example above, I’ve also switched from %>% to |>, or else I can’t use the _ placeholder. The advantage of the _ placeholder is that it disambiguates .: here, . is a placeholder for all the variables in the dataset, and _ is a placeholder for the dataset itself. 6.3 Diagnostics Diagnostics are useful metrics to assess model fit.
You can read some of these diagnostics, such as the \\(R^2\\), at the bottom of the summary (when running summary(my_model)), but if you want to do more than simply reading these diagnostics from RStudio, you can put them in a data.frame too, using broom::glance(): glance(model3) ## # A tibble: 1 × 12 ## r.squared adj.r.…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.673 0.666 15423. 100. 6.18e-122 11 -6034. 12094. 12150. 1.27e11 ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹​adj.r.squared, ²​statistic, ³​deviance You can also plot the usual diagnostics plots using ggfortify::autoplot(), which uses the {ggplot2} package under the hood: library(ggfortify) autoplot(model3, which = 1:6) + theme_minimal() which = 1:6 is an additional option that shows you all six diagnostics plots. If you omit this option, you will only get four of them. You can also get the residuals of the regression in two ways; either you grab them directly from the model fit: resi3 <- residuals(model3) or you can augment the original data with a residuals column, using broom::augment(): housing_aug <- augment(model3) Let’s take a look at housing_aug: glimpse(housing_aug) ## Rows: 546 ## Columns: 18 ## $ price <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 838… ## $ lotsize <dbl> 5850, 4000, 3060, 6650, 6360, 4160, 3880, 4160, 4800, 5500,… ## $ bedrooms <dbl> 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 4, 1, 2,… ## $ bathrms <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,… ## $ stories <dbl> 2, 1, 1, 2, 1, 1, 2, 3, 1, 4, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1,… ## $ driveway <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, … ## $ recroom <fct> no, no, no, yes, no, yes, no, no, yes, yes, no, no, no, no,… ## $ fullbase <fct> yes, no, no, no, no, yes, yes, no, yes, no, yes, no, no, no… ## $ gashw <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,… ## $ airco <fct> no, no, no, no, no, yes, no, no, no, yes, yes, no, no, no, … ## $ garagepl <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0, 1,… ## $ prefarea <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,… ## $ .fitted <dbl> 66037.98, 41391.15, 39889.63, 63689.09, 49760.43, 66387.12,… ## $ .resid <dbl> -24037.9757, -2891.1515, 9610.3699, -3189.0873, 11239.5735,… ## $ .hat <dbl> 0.013477335, 0.008316321, 0.009893730, 0.021510891, 0.01033… ## $ .sigma <dbl> 15402.01, 15437.14, 15431.98, 15437.02, 15429.89, 15437.64,… ## $ .cooksd <dbl> 2.803214e-03, 2.476265e-05, 3.265481e-04, 8.004787e-05, 4.6… ## $ .std.resid <dbl> -1.56917096, -0.18823924, 0.62621736, -0.20903274, 0.732539… A few columns have been added to the original data, among them .resid, which contains the residuals. Let’s plot them: ggplot(housing_aug) + geom_density(aes(.resid)) Fitted values are also added to the original data, under the variable .fitted. It would also have been possible to get the fitted values with: fit3 <- fitted(model3) but I prefer using augment(), because the columns get merged to the original data, which then makes it easier to find specific individuals. For example, you might want to know for how many housing units the model underestimates the price: total_pos <- housing_aug %>% filter(.resid > 0) %>% summarise(total = n()) %>% pull(total) We find 261 individuals where the residuals are positive.
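Because augment() keeps the residuals next to the original rows, listing the houses the model gets most wrong is a one-liner. A minimal sketch, assuming the housing_aug object created above:

```r
# List the five houses whose prices model3 underestimates the most,
# i.e. the rows with the largest positive residuals.
housing_aug %>%
  filter(.resid > 0) %>%
  arrange(desc(.resid)) %>%
  select(price, .fitted, .resid) %>%
  head(5)
```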
It is also easier to extract outliers: housing_aug %>% mutate(prank = cume_dist(.cooksd)) %>% filter(prank > 0.99) %>% glimpse() ## Rows: 6 ## Columns: 19 ## $ price <dbl> 163000, 125000, 132000, 175000, 190000, 174500 ## $ lotsize <dbl> 7420, 4320, 3500, 9960, 7420, 7500 ## $ bedrooms <dbl> 4, 3, 4, 3, 4, 4 ## $ bathrms <dbl> 1, 1, 2, 2, 2, 2 ## $ stories <dbl> 2, 2, 2, 2, 3, 2 ## $ driveway <fct> yes, yes, yes, yes, yes, yes ## $ recroom <fct> yes, no, no, no, no, no ## $ fullbase <fct> yes, yes, no, yes, no, yes ## $ gashw <fct> no, yes, yes, no, no, no ## $ airco <fct> yes, no, no, no, yes, yes ## $ garagepl <dbl> 2, 2, 2, 2, 2, 3 ## $ prefarea <fct> no, no, no, yes, yes, yes ## $ .fitted <dbl> 94826.68, 77688.37, 85495.58, 108563.18, 115125.03, 118549.… ## $ .resid <dbl> 68173.32, 47311.63, 46504.42, 66436.82, 74874.97, 55951.00 ## $ .hat <dbl> 0.02671105, 0.05303793, 0.05282929, 0.02819317, 0.02008141,… ## $ .sigma <dbl> 15144.70, 15293.34, 15298.27, 15159.14, 15085.99, 15240.66 ## $ .cooksd <dbl> 0.04590995, 0.04637969, 0.04461464, 0.04616068, 0.04107317,… ## $ .std.resid <dbl> 4.480428, 3.152300, 3.098176, 4.369631, 4.904193, 3.679815 ## $ prank <dbl> 0.9963370, 1.0000000, 0.9945055, 0.9981685, 0.9926740, 0.99… prank is a variable I created with cume_dist(), which is a dplyr function that returns the proportion of all values less than or equal to the current rank. For example: example <- c(5, 4.6, 2, 1, 0.8, 0, -1) cume_dist(example) ## [1] 1.0000000 0.8571429 0.7142857 0.5714286 0.4285714 0.2857143 0.1428571 By filtering on prank > 0.99, we get the top 1% of outliers according to Cook’s distance. 6.4 Interpreting models Model interpretation is essential in the social sciences, but it is also getting very important in machine learning. As usual, the terminology is different; in machine learning, we speak about explainability. There is a very important distinction that one has to understand when it comes to interpretability/explainability: the one between classical, parametric models and black-box models. This is very well explained in Breiman (2001), an absolute must-read. The gist of the paper is that there are two cultures of statistical modeling; one culture relies on modeling the data generating process, for instance by considering that a variable y (the dependent variable, or target) is a linear combination of input variables x (the explanatory variables, or features) plus some noise. The other culture uses complex algorithms (random forests, neural networks) to model the relationship between y and x. The author argues that most statisticians have relied for too long on modeling data generating processes and do not use all the potential offered by these complex algorithms. I think that a lot of things have changed since then, and that nowadays any practitioner who uses data is open to using any type of model or algorithm, as long as it does the job. However, the paper is very interesting, and the discussion of the trade-off between the simplicity of a model and its interpretability/explainability is still relevant today. In this section, I will explain how one can go about interpreting or explaining models from these two cultures. Also, it is important to note here that the discussion that follows will be heavily influenced by my econometrics background.
I will focus on marginal effects as a way to interpret parametric models (models from the first culture described above), but depending on the field, practitioners might use something else (for instance, by computing odds ratios in a logistic regression). I will start with the interpretability of classical statistical models. 6.4.1 Marginal effects If one wants to know the effect of variable x on the dependent variable y, so-called marginal effects have to be computed. This is easily done in R with the {marginaleffects} package. Formally, marginal effects are the partial derivatives of the regression equation with respect to the variable we want to look at. library(marginaleffects) effects_model3 <- marginaleffects(model3) summary(effects_model3) ## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lotsize dY/dX 3.546 0.3503 10.124 < 2.22e-16 2.86 4.233 ## 2 bedrooms dY/dX 1832.003 1047.0056 1.750 0.08016056 -220.09 3884.097 ## 3 bathrms dY/dX 14335.558 1489.9557 9.621 < 2.22e-16 11415.30 17255.818 ## 4 stories dY/dX 6556.946 925.2943 7.086 1.3771e-12 4743.40 8370.489 ## 5 driveway yes - no 6687.779 2045.2459 3.270 0.00107580 2679.17 10696.387 ## 6 recroom yes - no 4511.284 1899.9577 2.374 0.01757689 787.44 8235.132 ## 7 fullbase yes - no 5452.386 1588.0239 3.433 0.00059597 2339.92 8564.855 ## 8 gashw yes - no 12831.406 3217.5970 3.988 6.6665e-05 6525.03 19137.781 ## 9 airco yes - no 12632.890 1555.0211 8.124 4.5131e-16 9585.11 15680.676 ## 10 garagepl dY/dX 4244.829 840.5965 5.050 4.4231e-07 2597.29 5892.368 ## 11 prefarea yes - no 9369.513 1669.0906 5.614 1.9822e-08 6098.16 12640.871 ## ## Model type: lm ## Prediction type: response Let’s go through this: summary(effects_model3) shows the average marginal effects for each of the explanatory variables that were used in model3. The way to interpret them is as follows: everything else held constant (often you’ll read the Latin ceteris paribus for this), a unit increase in lotsize increases the price by 3.546 units, on average. The everything held constant part is crucial; with marginal effects, you’re looking at just the effect of one variable at a time. For discrete variables, like driveway, this is simpler: imagine two houses that are exactly the same, except that one has a driveway and the other doesn’t. The one with the driveway is 6687 units more expensive, on average. Now it turns out that in the case of a linear model, the average marginal effects are exactly equal to the coefficients. Just compare summary(model3) to effects_model3 to see it (and remember, I told you that marginal effects are the partial derivatives of the regression equation with respect to the variable of interest; so the derivative of \\(\\alpha*X_1 + ....\\) with respect to \\(X_1\\) is \\(\\alpha\\)). But in the case of a more complex, non-linear model, this is not so obvious. This is where {marginaleffects} will make your life much easier. It is also possible to plot the results: plot(effects_model3) effects_model3 is a data frame containing the effects for each house in the data set.
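You can verify the claim that, for a linear model, the average marginal effects equal the coefficients with a quick check; a minimal sketch, assuming the model3 object created above:

```r
# The partial derivative of beta0 + beta1 * lotsize + ... with respect to
# lotsize is just beta1, so this coefficient should match the average
# marginal effect of lotsize reported by summary(effects_model3) (about 3.546).
coef(model3)["lotsize"]
```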
For example, let’s take a look at the first house: effects_model3 %>% filter(rowid == 1) ## rowid type term contrast dydx std.error statistic ## 1 1 response lotsize dY/dX 3.546303 0.3502195 10.125944 ## 2 1 response bedrooms dY/dX 1832.003466 1046.1608842 1.751168 ## 3 1 response bathrms dY/dX 14335.558468 1490.4827945 9.618064 ## 4 1 response stories dY/dX 6556.945711 925.4764870 7.084940 ## 5 1 response driveway yes - no 6687.778890 2045.2460319 3.269914 ## 6 1 response recroom yes - no 4511.283826 1899.9577182 2.374413 ## 7 1 response fullbase yes - no 5452.385539 1588.0237538 3.433441 ## 8 1 response gashw yes - no 12831.406266 3217.5971931 3.987885 ## 9 1 response airco yes - no 12632.890405 1555.0207045 8.123937 ## 10 1 response garagepl dY/dX 4244.829004 840.8930857 5.048001 ## 11 1 response prefarea yes - no 9369.513239 1669.0904968 5.613544 ## p.value conf.low conf.high predicted predicted_hi predicted_lo ## 1 4.238689e-24 2.859885 4.232721 66037.98 66043.14 66037.98 ## 2 7.991698e-02 -218.434189 3882.441121 66037.98 66038.89 66037.98 ## 3 6.708200e-22 11414.265872 17256.851065 66037.98 66042.28 66037.98 ## 4 1.391042e-12 4743.045128 8370.846295 66037.98 66039.94 66037.98 ## 5 1.075801e-03 2679.170328 10696.387452 66037.98 66037.98 59350.20 ## 6 1.757689e-02 787.435126 8235.132526 66037.98 70549.26 66037.98 ## 7 5.959723e-04 2339.916175 8564.854903 66037.98 66037.98 60585.59 ## 8 6.666508e-05 6525.031651 19137.780882 66037.98 78869.38 66037.98 ## 9 4.512997e-16 9585.105829 15680.674981 66037.98 78670.87 66037.98 ## 10 4.464572e-07 2596.708842 5892.949167 66037.98 66039.25 66037.98 ## 11 1.982240e-08 6098.155978 12640.870499 66037.98 75407.49 66037.98 ## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw airco ## 1 42000 5850 3 1 2 yes no yes no no ## 2 42000 5850 3 1 2 yes no yes no no ## 3 42000 5850 3 1 2 yes no yes no no ## 4 42000 5850 3 1 2 yes no yes no no ## 5 42000 5850 3 1 2 yes no yes no no ## 6 42000 5850 3 1 2 yes no yes no no ## 7 42000 5850 3 1 2 yes no yes no no ## 8 42000 5850 3 1 2 yes no yes no no ## 9 42000 5850 3 1 2 yes no yes no no ## 10 42000 5850 3 1 2 yes no yes no no ## 11 42000 5850 3 1 2 yes no yes no no ## garagepl prefarea eps ## 1 1 no 1.4550 ## 2 1 no 0.0005 ## 3 1 no 0.0003 ## 4 1 no 0.0003 ## 5 1 no NA ## 6 1 no NA ## 7 1 no NA ## 8 1 no NA ## 9 1 no NA ## 10 1 no 0.0003 ## 11 1 no NA rowid is a column that identifies the houses in the original data set, so rowid == 1 keeps only the first house. This shows you the marginal effects (column dydx) computed for this house. But remember: since we’re dealing with a linear model, the values of the marginal effects are constant. If you don’t see the point of this discussion, don’t fret, the next example should make things clearer. Let’s estimate a logit model and compute the marginal effects. You might know logit models as logistic regression. Logit models can be estimated using the glm() function, which stands for generalized linear models. As an example, we are going to use the Participation data, also from the {Ecdat} package: data(Participation) ?Participation Participation package:Ecdat R Documentation Labor Force Participation Description: a cross-section _number of observations_ : 872 _observation_ : individuals _country_ : Switzerland Usage: data(Participation) Format: A dataframe containing : lfp labour force participation ?
lnnlinc the log of nonlabour income age age in years divided by 10 educ years of formal education nyc the number of young children (younger than 7) noc number of older children foreign foreigner ? Source: Gerfin, Michael (1996) “Parametric and semiparametric estimation of the binary response”, _Journal of Applied Econometrics_, *11(3)*, 321-340. References: Davidson, R. and James G. MacKinnon (2004) _Econometric Theory and Methods_, New York, Oxford University Press, <URL: http://www.econ.queensu.ca/ETM/>, chapter 11. Journal of Applied Econometrics data archive : <URL: http://qed.econ.queensu.ca/jae/>. See Also: ‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’, ‘Index.Observations’ The variable of interest is lfp: whether the individual participates in the labour force or not. To know which variables are relevant in the decision to participate in the labour force, one could train a logit model, using glm(): logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial") broom::tidy(logit_participation) ## # A tibble: 7 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.4 2.17 4.79 1.69e- 6 ## 2 lnnlinc -0.815 0.206 -3.97 7.31e- 5 ## 3 age -0.510 0.0905 -5.64 1.72e- 8 ## 4 educ 0.0317 0.0290 1.09 2.75e- 1 ## 5 nyc -1.33 0.180 -7.39 1.51e-13 ## 6 noc -0.0220 0.0738 -0.298 7.66e- 1 ## 7 foreignyes 1.31 0.200 6.56 5.38e-11 From the results above, one can only interpret the sign of the coefficients. To know how much a variable influences the labour force participation, one has to use marginaleffects(): effects_logit_participation <- marginaleffects(logit_participation) summary(effects_logit_participation) ## Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lnnlinc dY/dX -0.169940 0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858 ## 2 age dY/dX -0.106407 0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193 ## 3 educ dY/dX 0.006616 0.00604 1.0954 0.27335 -0.005222 0.01845 ## 4 nyc dY/dX -0.277463 0.03325 -8.3436 < 2.22e-16 -0.342642 -0.21229 ## 5 noc dY/dX -0.004584 0.01538 -0.2981 0.76563 -0.034725 0.02556 ## 6 foreign yes - no 0.283377 0.03984 7.1129 1.1361e-12 0.205292 0.36146 ## ## Model type: glm ## Prediction type: response As you can see, the average marginal effects here are not equal to the estimated coefficients of the model. 
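To see where these numbers come from, recall that a logit model predicts \\(p = \\Lambda(x'\\beta)\\), where \\(\\Lambda\\) is the logistic function, and the derivative of the predicted probability with respect to a continuous variable \\(x_k\\) is \\(\\beta_k \\, p(1-p)\\). Averaging this quantity over all individuals gives the average marginal effect. Here is a hand-rolled check for lnnlinc (a sketch, assuming the logit_participation object created above):

```r
# For a logistic model, dp/dx_k = beta_k * p * (1 - p); averaging this over
# the sample should come very close to the average marginal effect of
# lnnlinc reported above (about -0.17).
p_hat <- predict(logit_participation, type = "response")
coef(logit_participation)["lnnlinc"] * mean(p_hat * (1 - p_hat))
```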
Let’s take a look at the first row of the data: Participation[1, ] ## lfp lnnlinc age educ nyc noc foreign ## 1 no 10.7875 3 8 1 1 no and let’s now look at rowid == 1 in the marginal effects data frame: effects_logit_participation %>% filter(rowid == 1) ## rowid type term contrast dydx std.error statistic ## 1 1 response lnnlinc dY/dX -0.156661756 0.038522800 -4.0667282 ## 2 1 response age dY/dX -0.098097148 0.020123709 -4.8747052 ## 3 1 response educ dY/dX 0.006099266 0.005367036 1.1364310 ## 4 1 response nyc dY/dX -0.255784406 0.029367783 -8.7096942 ## 5 1 response noc dY/dX -0.004226368 0.014167283 -0.2983189 ## 6 1 response foreign yes - no 0.305630005 0.045174828 6.7654935 ## p.value conf.low conf.high predicted predicted_hi predicted_lo lfp ## 1 4.767780e-05 -0.232165056 -0.08115846 0.2596523 0.2595710 0.2596523 no ## 2 1.089711e-06 -0.137538892 -0.05865540 0.2596523 0.2596111 0.2596523 no ## 3 2.557762e-01 -0.004419931 0.01661846 0.2596523 0.2596645 0.2596523 no ## 4 3.046958e-18 -0.313344203 -0.19822461 0.2596523 0.2595755 0.2596523 no ## 5 7.654598e-01 -0.031993732 0.02354100 0.2596523 0.2596497 0.2596523 no ## 6 1.328556e-11 0.217088969 0.39417104 0.2596523 0.5652823 0.2596523 no ## lnnlinc age educ nyc noc foreign eps ## 1 10.7875 3 8 1 1 no 0.0005188749 ## 2 10.7875 3 8 1 1 no 0.0004200000 ## 3 10.7875 3 8 1 1 no 0.0020000000 ## 4 10.7875 3 8 1 1 no 0.0003000000 ## 5 10.7875 3 8 1 1 no 0.0006000000 ## 6 10.7875 3 8 1 1 no NA Let’s focus on the first row, where term is lnnlinc. What we see here is the effect of an infinitesimal increase in the variable lnnlinc on the participation, for an individual who has the following characteristics: lnnlinc = 10.7875, age = 3, educ = 8, nyc = 1, noc = 1 and foreign = no, which are the characteristics of this first individual in our data. So let’s look at the value of dydx for every individual: dydx_lnnlinc <- effects_logit_participation %>% filter(term == "lnnlinc") head(dydx_lnnlinc) ## rowid type term contrast dydx std.error statistic p.value ## 1 1 response lnnlinc dY/dX -0.15666176 0.03852280 -4.066728 4.767780e-05 ## 2 2 response lnnlinc dY/dX -0.20013939 0.05124543 -3.905507 9.402813e-05 ## 3 3 response lnnlinc dY/dX -0.18493932 0.04319729 -4.281271 1.858287e-05 ## 4 4 response lnnlinc dY/dX -0.05376281 0.01586468 -3.388837 7.018964e-04 ## 5 5 response lnnlinc dY/dX -0.18709356 0.04502973 -4.154890 3.254439e-05 ## 6 6 response lnnlinc dY/dX -0.19586185 0.04782143 -4.095692 4.209096e-05 ## conf.low conf.high predicted predicted_hi predicted_lo lfp lnnlinc age ## 1 -0.23216506 -0.08115846 0.25965227 0.25957098 0.25965227 no 10.78750 3.0 ## 2 -0.30057859 -0.09970018 0.43340025 0.43329640 0.43340025 yes 10.52425 4.5 ## 3 -0.26960445 -0.10027418 0.34808777 0.34799181 0.34808777 no 10.96858 4.6 ## 4 -0.08485701 -0.02266862 0.07101902 0.07099113 0.07101902 no 11.10500 3.1 ## 5 -0.27535020 -0.09883692 0.35704926 0.35695218 0.35704926 no 11.10847 4.4 ## 6 -0.28959014 -0.10213356 0.40160949 0.40150786 0.40160949 yes 11.02825 4.2 ## educ nyc noc foreign eps ## 1 8 1 1 no 0.0005188749 ## 2 8 0 1 no 0.0005188749 ## 3 9 0 0 no 0.0005188749 ## 4 11 2 0 no 0.0005188749 ## 5 12 0 2 no 0.0005188749 ## 6 12 0 1 no 0.0005188749 dydx_lnnlinc is a data frame with all the individual marginal effects for the variable lnnlinc. What if we compute the mean of this column? dydx_lnnlinc %>% summarise(mean(dydx)) ## mean(dydx) ## 1 -0.1699405 Let’s compare this to the average marginal effects: summary(effects_logit_participation) ## Term Contrast Effect Std.
Error z value Pr(>|z|) 2.5 % 97.5 % ## 1 lnnlinc dY/dX -0.169940 0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858 ## 2 age dY/dX -0.106407 0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193 ## 3 educ dY/dX 0.006616 0.00604 1.0954 0.27335 -0.005222 0.01845 ## 4 nyc dY/dX -0.277463 0.03325 -8.3436 < 2.22e-16 -0.342642 -0.21229 ## 5 noc dY/dX -0.004584 0.01538 -0.2981 0.76563 -0.034725 0.02556 ## 6 foreign yes - no 0.283377 0.03984 7.1129 1.1361e-12 0.205292 0.36146 ## ## Model type: glm ## Prediction type: response Yep, it’s the same! This is why we speak of average marginal effects. Now that we know why these are called average marginal effects, let’s go back to interpreting them. This time, let’s plot them, because why not: plot(effects_logit_participation) So an infinitesimal increase in, say, non-labour income (lnnlinc) of 0.001 is associated with a decrease of the probability of labour force participation of about 0.017 percentage points: the average marginal effect is roughly -0.17, so an increase of 0.001 translates into -0.17 × 0.001 = -0.00017 in probability, i.e. 0.017 percentage points. This is just scratching the surface of interpreting these kinds of models. There are many more types of effects that you can compute and look at. I highly recommend you read the documentation of {marginaleffects}. The author of the package, Vincent Arel-Bundock, writes a lot of very helpful documentation for his packages, so if model interpretation is important for your job, definitely take a look. 6.4.2 Explainability of black-box models Just read Christoph Molnar’s Interpretable Machine Learning. Seriously, I cannot add anything meaningful to it. His book is brilliant. 6.5 Comparing models Consider this section more as an illustration of what is possible with the knowledge you acquired at this point. Imagine that the task at hand is to compare two models. We would like to select the one which has the best fit to the data. Let’s first estimate another model on the same data; prices are only positive, so a linear regression might not be the best model, because the model could predict negative prices. Let’s look at the distribution of prices: ggplot(Housing) + geom_density(aes(price)) It looks like modeling the log of price might provide a better fit: model_log <- lm(log(price) ~ ., data = Housing) result_log <- broom::tidy(model_log) print(result_log) ## # A tibble: 12 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.0 0.0472 212. 0 ## 2 lotsize 0.0000506 0.00000485 10.4 2.91e-23 ## 3 bedrooms 0.0340 0.0145 2.34 1.94e- 2 ## 4 bathrms 0.168 0.0206 8.13 3.10e-15 ## 5 stories 0.0923 0.0128 7.20 2.10e-12 ## 6 drivewayyes 0.131 0.0283 4.61 5.04e- 6 ## 7 recroomyes 0.0735 0.0263 2.79 5.42e- 3 ## 8 fullbaseyes 0.0994 0.0220 4.52 7.72e- 6 ## 9 gashwyes 0.178 0.0446 4.00 7.22e- 5 ## 10 aircoyes 0.178 0.0215 8.26 1.14e-15 ## 11 garagepl 0.0508 0.0116 4.36 1.58e- 5 ## 12 prefareayes 0.127 0.0231 5.50 6.02e- 8 Let’s take a look at the diagnostics: glance(model_log) ## # A tibble: 1 × 12 ## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.677 0.670 0.214 102. 3.67e-123 11 73.9 -122.
-65.8 24.4 ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹​adj.r.squared, ²​statistic, ³​deviance Let’s compare these to the ones from the previous model: diag_lm <- glance(model3) diag_lm <- diag_lm %>% mutate(model = "lin-lin model") diag_log <- glance(model_log) diag_log <- diag_log %>% mutate(model = "log-lin model") diagnostics_models <- full_join(diag_lm, diag_log) %>% select(model, everything()) # put the `model` column first ## Joining, by = c("r.squared", "adj.r.squared", "sigma", "statistic", "p.value", ## "df", "logLik", "AIC", "BIC", "deviance", "df.residual", "nobs", "model") print(diagnostics_models) ## # A tibble: 2 × 13 ## model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 lin-li… 0.673 0.666 1.54e+4 100. 6.18e-122 11 -6034. 12094. 12150. ## 2 log-li… 0.677 0.670 2.14e-1 102. 3.67e-123 11 73.9 -122. -65.8 ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>, and ## # abbreviated variable names ¹​r.squared, ²​adj.r.squared, ³​statistic I saved the diagnostics in two different data.frame objects using the glance() function and added a model column to indicate which model the diagnostics come from. Then I merged both datasets using full_join(), a {dplyr} function. Using this approach, we can easily build a data frame with the diagnostics of several models and compare them. The model using the logarithm of prices has lower AIC and BIC (and thus a higher likelihood), so if you’re worried about selecting the model with the better fit to the data, you’d go for this model. 6.6 Using a model for prediction Once you have estimated a model, you might want to use it for prediction. This is easily done using the predict() function, which works with most models. Prediction is also useful as a way to test the accuracy of your model: split your data into a training set (used for training) and a testing set (used for the pseudo-prediction), and see if your model overfits the data. We are going to see how to do that in a later section; for now, let’s just get acquainted with predict() and other functions. I insist: keep in mind that this section is only to get acquainted with these functions. We are going to explore prediction, overfitting and tuning of models in a later section. Let’s go back to the models we trained in the previous section, model3 and model_log. Let’s also take a subsample of the data, which we will be using for prediction: set.seed(1234) pred_set <- Housing %>% sample_n(20) In order to always get the same pred_set, I set the random seed first.
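If you have never used set.seed() before, here is a quick illustration of what it buys you (a sketch with arbitrary numbers):

```r
# Setting the same seed before each call makes the pseudo-random draws
# identical, which is what makes pred_set reproducible across runs.
set.seed(1234)
first_draw <- sample(1:10, 3)
set.seed(1234)
second_draw <- sample(1:10, 3)
identical(first_draw, second_draw) # TRUE
```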
Let’s take a look at the data: print(pred_set) ## price lotsize bedrooms bathrms stories driveway recroom fullbase gashw ## 284 45000 6750 2 1 1 yes no no no ## 101 57000 4500 3 2 2 no no yes no ## 400 85000 7231 3 1 2 yes yes yes no ## 98 59900 8250 3 1 1 yes no yes no ## 103 125000 4320 3 1 2 yes no yes yes ## 326 99000 8880 3 2 2 yes no yes no ## 79 55000 3180 2 2 1 yes no yes no ## 270 59000 4632 4 1 2 yes no no no ## 382 112500 6550 3 1 2 yes no yes no ## 184 63900 3510 3 1 2 yes no no no ## 4 60500 6650 3 1 2 yes yes no no ## 212 42000 2700 2 1 1 no no no no ## 195 33000 3180 2 1 1 yes no no no ## 511 70000 4646 3 1 2 yes yes yes no ## 479 88000 5450 4 2 1 yes no yes no ## 510 64000 4040 3 1 2 yes no no no ## 424 62900 2880 3 1 2 yes no no no ## 379 84000 7160 3 1 1 yes no yes no ## 108 58500 3680 3 2 2 yes no no no ## 131 35000 4840 2 1 2 yes no no no ## airco garagepl prefarea ## 284 no 0 no ## 101 yes 0 no ## 400 yes 0 yes ## 98 no 3 no ## 103 no 2 no ## 326 yes 1 no ## 79 no 2 no ## 270 yes 0 no ## 382 yes 0 yes ## 184 no 0 no ## 4 no 0 no ## 212 no 0 no ## 195 no 0 no ## 511 no 2 no ## 479 yes 0 yes ## 510 no 1 no ## 424 no 0 yes ## 379 no 2 yes ## 108 no 0 no ## 131 no 0 no If we wish to use it for prediction, this is easily done with predict(): predict(model3, pred_set) ## 284 101 400 98 103 326 79 270 ## 51143.48 77286.31 93204.28 76481.82 77688.37 103751.72 66760.79 66486.26 ## 382 184 4 212 195 511 479 510 ## 86277.96 48042.41 63689.09 30093.18 38483.18 70524.34 91987.65 54166.78 ## 424 379 108 131 ## 55177.75 77741.03 62980.84 50926.99 This returns a vector of predicted prices. This can then be used to compute the Root Mean Squared Error, for instance. Let’s do it within a tidyverse pipeline: rmse <- pred_set %>% mutate(predictions = predict(model3, .)) %>% summarise(rmse = sqrt(sum((predictions - price)**2)/n())) This gives the root mean squared error of model3: we square the differences between predictions and observed prices, sum them, divide by the number of observations and take the square root. I also used the n() function, which returns the number of observations in a group (or all the observations, if the data is not grouped). Let’s compute model_log’s RMSE in the same way: rmse2 <- pred_set %>% mutate(predictions = exp(predict(model_log, .))) %>% summarise(rmse = sqrt(sum((predictions - price)**2)/n())) Don’t forget to exponentiate the predictions; remember, you’re dealing with a log-linear model! Comparing the two values tells you which model predicts this subsample better. However, keep in mind that both models were trained on the whole data, and then the prediction quality was assessed using a subsample of the data the models were trained on… so actually we can’t really say whether model_log’s predictions are very useful. Of course, this is the same for model3. In a later section we are going to learn how to do cross-validation to avoid this issue. Just as a side note, notice that I had to copy and paste basically the same lines twice to compute the predictions for both models. That’s not much, but if I wanted to compare 10 models, copy and paste mistakes could have sneaked in. Instead, it would have been nice to have a function that computes the RMSE and then use it on my models. We are going to learn how to write our own functions and use them just as if they were built-in R functions. 6.7 Beyond linear regression R has a lot of other built-in functions for regression, such as glm() (for Generalized Linear Models) and nls() (for Nonlinear Least Squares). There are also functions and additional packages for time series, panel data, machine learning, Bayesian and nonparametric methods.
Presenting everything here would take too much space, and would be pretty useless as you can find whatever you need using an internet search engine. What you have learned until now is quite general and should work on many types of models. To help you out, here is a list of methods and the recommended packages that you can use:

| Model | Package | Quick example |
|-------|---------|---------------|
| Robust Linear Regression | MASS | rlm(y ~ x, data = mydata) |
| Nonlinear Least Squares | stats | nls(y ~ x1 / (1 + x2), data = mydata) |
| Logit | stats | glm(y ~ x, data = mydata, family = "binomial") |
| Probit | stats | glm(y ~ x, data = mydata, family = binomial(link = "probit")) |
| K-Means | stats | kmeans(data, n) |
| PCA | stats | prcomp(data, scale = TRUE, center = TRUE) |
| Multinomial Logit | mlogit | Requires several steps of data pre-processing and formula definition; refer to the vignette for more details. |
| Cox PH | survival | coxph(Surv(y_time, y_status) ~ x, data = mydata) |
| Time series | Several, depending on your needs. | Time series in R is a vast subject that would require a very thick book to cover. You can get started with the series of blog articles Tidy time-series, parts 1 to 4. |
| Panel data | plm | plm(y ~ x, data = mydata, model = "within") (or model = "random") |
| Machine learning | Several, depending on your needs. | R is a very popular programming language for machine learning. This book is a must-read if you need to do machine learning with R. |
| Nonparametric regression | np | Several functions and options available; refer to the vignette for more details. |

This table is far from complete. Should you be a Bayesian, you’d want to look at packages such as {rstan}, which uses Stan, an external piece of software that must be installed on your system. It is also possible to train models using Bayesian inference without the need for external tools, with the {bayesm} package, which estimates the usual micro-econometric models. There really are a lot of packages available for Bayesian inference, and you can find them all in the related CRAN Task View. 6.8 Hyper-parameters Hyper-parameters are parameters of the model that cannot be directly learned from the data. A linear regression does not have any hyper-parameters, but a random forest for instance has several. You might have heard of ridge regression, lasso and elasticnet. These are extensions of linear models that avoid over-fitting by penalizing large models. These extensions of the linear regression have hyper-parameters that the practitioner has to tune. There are several ways one can tune these parameters, for example by doing a grid search, a random search over the grid, or using more elaborate methods. To introduce hyper-parameters, let’s get to know ridge regression, also called Tikhonov regularization. 6.8.1 Ridge regression Ridge regression is used when the data you are working with has a lot of explanatory variables, or when there is a risk that a simple linear regression might overfit the training data, because, for example, your explanatory variables are collinear. If you are training a linear model and then notice that it generalizes very badly to new, unseen data, it is very likely that the linear model you trained overfit the data. In this case, ridge regression might prove useful. The way ridge regression works might seem counter-intuitive: it boils down to fitting a worse model to the training data, but in return, this worse model will generalize better to new data.
The closed form solution of the ordinary least squares estimator is defined as: \\[ \\widehat{\\beta} = (X'X)^{-1}X'Y \\] where \\(X\\) is the design matrix (the matrix made up of the explanatory variables) and \\(Y\\) is the dependent variable. For ridge regression, this closed form solution changes a little bit: \\[ \\widehat{\\beta} = (X'X + \\lambda I_p)^{-1}X'Y \\] where \\(\\lambda \\in \\mathbb{R}\\) is a hyper-parameter and \\(I_p\\) is the identity matrix of dimension \\(p\\) (\\(p\\) is the number of explanatory variables). The formula above is the closed form solution to the following optimisation program: \\[ \\min_{\\beta} \\sum_{i=1}^n \\left(y_i - \\sum_{j=1}^p x_{ij}\\beta_j\\right)^2 \\] such that: \\[ \\sum_{j=1}^p(\\beta_j)^2 < c \\] for any strictly positive \\(c\\). The glmnet() function from the {glmnet} package can be used for ridge regression, by setting the alpha argument to 0 (setting it to 1 would do LASSO, and setting it to a number between 0 and 1 would do elasticnet). But in order to compare linear regression and ridge regression, let me first divide the data into a training set and a testing set: index <- 1:nrow(Housing) set.seed(12345) train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE) test_index <- setdiff(index, train_index) train_x <- Housing[train_index, ] %>% select(-price) train_y <- Housing[train_index, ] %>% pull(price) test_x <- Housing[test_index, ] %>% select(-price) test_y <- Housing[test_index, ] %>% pull(price) I do the train/test split this way because glmnet() requires a design matrix as input, and not a formula. Design matrices can be created using the model.matrix() function: library("glmnet") train_matrix <- model.matrix(train_y ~ ., data = train_x) test_matrix <- model.matrix(test_y ~ ., data = test_x) Let’s now run a linear regression, by setting the penalty to 0: model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0) The model above provides the same result as a linear regression, because I set lambda to 0. Let’s compare the coefficients between the two: coef(model_lm_ridge) ## 13 x 1 sparse Matrix of class "dgCMatrix" ## s0 ## (Intercept) -2667.542863 ## (Intercept) . ## lotsize 3.397596 ## bedrooms 2081.087654 ## bathrms 13294.192823 ## stories 6400.454580 ## drivewayyes 6530.644895 ## recroomyes 5389.856794 ## fullbaseyes 4899.099463 ## gashwyes 12575.611265 ## aircoyes 13078.144146 ## garagepl 4155.249461 ## prefareayes 10260.781753 and now the coefficients of the linear regression (because I provide a design matrix, I have to use lm.fit() instead of lm(), which requires a formula, not a matrix): coef(lm.fit(x = train_matrix, y = train_y)) ## (Intercept) lotsize bedrooms bathrms stories drivewayyes ## -2667.052098 3.397629 2081.344118 13293.707725 6400.416730 6529.972544 ## recroomyes fullbaseyes gashwyes aircoyes garagepl prefareayes ## 5388.871137 4899.024787 12575.970220 13077.988867 4155.269629 10261.056772 As you can see, the coefficients are essentially the same. Let’s compute the RMSE for the unpenalized linear regression: preds_lm <- predict(model_lm_ridge, test_matrix) rmse_lm <- sqrt(mean((preds_lm - test_y)^2)) This gives the RMSE for the unpenalized linear regression on the test set.
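Since we are about to compute this quantity again for the ridge model, wrapping it in a small helper avoids copy-and-paste mistakes. A minimal sketch (rmse_glmnet() is a name I made up for this illustration; it assumes the objects created above):

```r
# Compute the RMSE of a {glmnet} model on a given design matrix and response.
rmse_glmnet <- function(model, x, y){
  preds <- predict(model, x)
  sqrt(mean((preds - y)^2))
}

rmse_glmnet(model_lm_ridge, test_matrix, test_y)
```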
Let’s now run a ridge regression, with lambda equal to 100, and see if the RMSE is smaller: model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100) and let’s compute the RMSE again: preds <- predict(model_ridge, test_matrix) rmse <- sqrt(mean((preds - test_y)^2)) In this example, the RMSE for the penalized regression turns out to be slightly smaller than before. But which value of lambda gives the smallest RMSE? To find out, one must run the model over a grid of lambda values and pick the model with the lowest RMSE. This procedure is available in the cv.glmnet() function, which picks the best value for lambda: best_model <- cv.glmnet(train_matrix, train_y) # lambda that minimises the MSE best_model$lambda.min ## [1] 61.42681 According to cv.glmnet() the best value for lambda is 61.4268056. In the next section, we will implement cross-validation ourselves, in order to find the hyper-parameters of a random forest. 6.9 Training, validating, and testing models Cross-validation is an important procedure which is used to compare models but also to tune the hyper-parameters of a model. In this section, we are going to use several packages from the {tidymodels} collection of packages, namely {recipes}, {rsample} and {parsnip}, to train a random forest the tidy way. I will also use {mlrMBO} to tune the hyper-parameters of the random forest. 6.9.1 Set up Let’s load the needed packages: library("tidyverse") library("recipes") library("rsample") library("parsnip") library("yardstick") library("brotools") library("mlbench") Load the data, which is included in the {mlbench} package: data("BostonHousing2") I will train a random forest to predict the housing prices, which are in the cmedv column: head(BostonHousing2) ## town tract lon lat medv cmedv crim zn indus chas nox ## 1 Nahant 2011 -70.9550 42.2550 24.0 24.0 0.00632 18 2.31 0 0.538 ## 2 Swampscott 2021 -70.9500 42.2875 21.6 21.6 0.02731 0 7.07 0 0.469 ## 3 Swampscott 2022 -70.9360 42.2830 34.7 34.7 0.02729 0 7.07 0 0.469 ## 4 Marblehead 2031 -70.9280 42.2930 33.4 33.4 0.03237 0 2.18 0 0.458 ## 5 Marblehead 2032 -70.9220 42.2980 36.2 36.2 0.06905 0 2.18 0 0.458 ## 6 Marblehead 2033 -70.9165 42.3040 28.7 28.7 0.02985 0 2.18 0 0.458 ## rm age dis rad tax ptratio b lstat ## 1 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 ## 2 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 ## 3 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 ## 4 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 ## 5 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 ## 6 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 Only keep relevant columns: boston <- BostonHousing2 %>% select(-medv, -tract, -lon, -lat) %>% rename(price = cmedv) I remove tract, lat and lon because the information contained in the column town is enough. To train and evaluate the model’s performance, I split the data in two. One data set, called the training set, will be further split into two down below. I won’t touch the second data set, the test set, until the very end, to finally assess the model’s performance. train_test_split <- initial_split(boston, prop = 0.9) housing_train <- training(train_test_split) housing_test <- testing(train_test_split) initial_split(), training() and testing() are functions from the {rsample} package. I will train a random forest on the training data. But which random forest? Random forests have several hyper-parameters, and, as explained in the intro, these hyper-parameters cannot be directly learned from the data. So which values should we choose?
We could train 6 random forests for instance and compare their performance, but why only 6? Why not 16? In order to find the right hyper-parameters, the practitioner can use values from the literature that seemed to have worked well (as is done in macro-econometrics), or can further split the train set into two, create a grid of hyper-parameters, train the model on one part of the data for all values of the grid, and compare the predictions of the models on the second part of the data. You then stick with the model that performed best, for example the model with the lowest RMSE. The thing is, you can’t estimate the true value of the RMSE with only one value. It’s as if you wanted to estimate the average height of a population by drawing one single observation from that population. You need a bit more observations. To approach the true value of the RMSE for a given set of hyper-parameters, instead of doing one split, let’s do 30. Then we compute the average RMSE, which implies training 30 models for each combination of the values of the hyper-parameters. First, let’s split the training data again, using the mc_cv() function from the {rsample} package. This function implements Monte Carlo cross-validation: validation_data <- mc_cv(housing_train, prop = 0.9, times = 30) What does validation_data look like? validation_data ## # Monte Carlo cross-validation (0.9/0.1) with 30 resamples ## # A tibble: 30 × 2 ## splits id ## <list> <chr> ## 1 <split [409/46]> Resample01 ## 2 <split [409/46]> Resample02 ## 3 <split [409/46]> Resample03 ## 4 <split [409/46]> Resample04 ## 5 <split [409/46]> Resample05 ## 6 <split [409/46]> Resample06 ## 7 <split [409/46]> Resample07 ## 8 <split [409/46]> Resample08 ## 9 <split [409/46]> Resample09 ## 10 <split [409/46]> Resample10 ## # … with 20 more rows Let’s look further down: validation_data$splits[[1]] ## <Analysis/Assess/Total> ## <409/46/455> The first value is the number of rows of the first set, the second value that of the second set, and the third is the original number of rows in the training data, before splitting again. How should we call these two new data sets? The author of {rsample}, Max Kuhn, talks about the analysis and the assessment sets, and I’m going to use this terminology as well. Now, in order to continue, I need to pre-process the data. I will do this in three steps. The first and second steps are used to center and scale the numeric variables, and the third step converts character and factor variables to dummy variables. This is needed because I will train a random forest, which cannot handle factor variables directly. Let’s define a recipe to do that, and start by pre-processing the testing set. I write a wrapper function around the recipe, because I will need to apply this recipe to various data sets: simple_recipe <- function(dataset){ recipe(price ~ ., data = dataset) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) %>% step_dummy(all_nominal()) } We have not yet learned about writing functions, and will do so in the next chapter. However, for now, you only need to know that you can write your own functions, and that these functions can take any arguments you need. In the case of the above function, which we called simple_recipe(), we only need one argument, a dataset, which we called dataset. Once the recipe is defined, I can use the prep() function, which estimates, from the data, the parameters that are needed to process the data.
For example, for centering, prep() estimates the mean, which will then be subtracted from the variables. With bake() the estimates are then applied to the data: testing_rec <- prep(simple_recipe(housing_test), training = housing_test) test_data <- bake(testing_rec, new_data = housing_test) It is important to split the data before using prep() and bake(), because if not, you will use observations from the test set in the prep() step, and thus introduce knowledge from the test set into the training data. This is called data leakage, and must be avoided. This is why it is necessary to first split the training data into an analysis and an assessment set, and then also pre-process these sets separately. However, the validation_data object cannot be used with recipe() directly, because it is not a dataframe. No worries, I simply need to write a function that extracts the analysis and assessment sets from the validation_data object, applies the pre-processing, trains the model, and returns the RMSE. This will be a big function, at the center of the analysis. But before that, let’s run a simple linear regression, as a benchmark. For the linear regression, I will not use any CV, so let’s pre-process the training set: trainlm_rec <- prep(simple_recipe(housing_train), training = housing_train) trainlm_data <- bake(trainlm_rec, new_data = housing_train) linreg_model <- lm(price ~ ., data = trainlm_data) broom::augment(linreg_model, newdata = test_data) %>% yardstick::rmse(price, .fitted) ## Warning in predict.lm(x, newdata = newdata, na.action = na.pass, ...): ## prediction from a rank-deficient fit may be misleading ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.439 broom::augment() adds the predictions to the test_data in a new column, .fitted. I won’t use this trick with the random forest, because there is no augment() method for random forests from the {ranger} package, which I’ll use. I’ll add the predictions to the data myself. Ok, now let’s go back to the random forest and write the big function: my_rf <- function(mtry, trees, split, id){ analysis_set <- analysis(split) analysis_prep <- prep(simple_recipe(analysis_set), training = analysis_set) analysis_processed <- bake(analysis_prep, new_data = analysis_set) model <- rand_forest(mode = "regression", mtry = mtry, trees = trees) %>% set_engine("ranger", importance = 'impurity') %>% fit(price ~ ., data = analysis_processed) assessment_set <- assessment(split) assessment_prep <- prep(simple_recipe(assessment_set), training = assessment_set) assessment_processed <- bake(assessment_prep, new_data = assessment_set) tibble::tibble("id" = id, "truth" = assessment_processed$price, "prediction" = unlist(predict(model, new_data = assessment_processed))) } The rand_forest() function is available in the {parsnip} package. This package provides a unified interface to a lot of other machine learning packages. This means that instead of having to learn the syntax of ranger() and randomForest() and so on, you can simply use the rand_forest() function and change the engine argument to the one you want (ranger, randomForest, etc.).
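To make the point about the unified interface concrete, here is a small sketch (not needed for the rest of the analysis) showing how the same specification can be sent to different backends:

```r
# The model specification stays the same; only the engine changes.
rf_spec <- rand_forest(mode = "regression", mtry = 3, trees = 200)

rf_spec %>% set_engine("ranger")        # fit via the {ranger} package
rf_spec %>% set_engine("randomForest")  # fit via the {randomForest} package
```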
Let’s now try the my_rf() function: results_example <- map2_df(.x = validation_data$splits, .y = validation_data$id, ~my_rf(mtry = 3, trees = 200, split = .x, id = .y)) head(results_example) ## # A tibble: 6 × 3 ## id truth prediction ## <chr> <dbl> <dbl> ## 1 Resample01 -0.328 -0.0274 ## 2 Resample01 1.06 0.686 ## 3 Resample01 1.04 0.726 ## 4 Resample01 -0.418 -0.0190 ## 5 Resample01 0.909 0.642 ## 6 Resample01 0.0926 -0.134 I can now compute the RMSE when mtry = 3 and trees = 200: results_example %>% group_by(id) %>% yardstick::rmse(truth, prediction) %>% summarise(mean_rmse = mean(.estimate)) %>% pull ## [1] 0.6305034 The random forest already has a lower RMSE than the linear regression. The goal now is to lower this RMSE further by tuning the mtry and trees hyper-parameters. For this, I will use Bayesian optimization methods implemented in the {mlrMBO} package. 6.9.2 Bayesian hyperparameter optimization I will re-use the code from above, and define a function that does everything from pre-processing to returning the metric I want to minimize by tuning the hyper-parameters, the RMSE: tuning <- function(param, validation_data){ mtry <- param[1] trees <- param[2] results <- purrr::map2_df(.x = validation_data$splits, .y = validation_data$id, ~my_rf(mtry = mtry, trees = trees, split = .x, id = .y)) results %>% group_by(id) %>% yardstick::rmse(truth, prediction) %>% summarise(mean_rmse = mean(.estimate)) %>% pull } This is exactly the code from before, but it now returns the RMSE. Let’s try the function with the values from before: tuning(c(3, 200), validation_data) ## [1] 0.6319843 I now follow the code that can be found in the arXiv paper describing {mlrMBO} to run the optimization. A simpler model, called the surrogate model, is used to look for promising points and to evaluate the value of the function at these points. This seems somewhat similar (in spirit) to the Indirect Inference method as described in Gourieroux, Monfort, Renault. If you don’t really get what follows, no worries, it is not really important as such. The idea is simply to look for hyper-parameters in an efficient way, and Bayesian optimisation provides this efficient way. However, you could use another method, for example a grid search. This would not change the general approach. So I will not spend too much time explaining what is going on below; you can read the details in the paper cited above, as well as in the package’s documentation. The focus here is not on this particular method, but rather on showing you how you can use various packages to solve a data science problem. Let’s first load the package and create the function to optimize: library("mlrMBO") fn <- makeSingleObjectiveFunction(name = "tuning", fn = tuning, par.set = makeParamSet(makeIntegerParam("x1", lower = 3, upper = 8), makeIntegerParam("x2", lower = 100, upper = 500))) This function is based on the function I defined before. The parameters to optimize are also defined, as are their bounds. I will look for mtry between the values of 3 and 8, and trees between 100 and 500.
We still need to define some other objects before continuing: # Create initial random Latin Hypercube Design of 10 points library(lhs) # for randomLHS des <- generateDesign(n = 5L * 2L, getParamSet(fn), fun = randomLHS) Then we choose the surrogate model, a random forest too: # Specify the surrogate model: a random forest with standard error estimation surrogate <- makeLearner("regr.ranger", predict.type = "se", keep.inbag = TRUE) Here I define some options: # Set general controls ctrl <- makeMBOControl() ctrl <- setMBOControlTermination(ctrl, iters = 10L) ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI()) And this is the optimization part: # Start optimization result <- mbo(fn, des, surrogate, ctrl, more.args = list("validation_data" = validation_data)) result ## Recommended parameters: ## x1=8; x2=314 ## Objective: y = 0.484 ## ## Optimization path ## 10 + 10 entries in total, displaying last 10 (or less): ## x1 x2 y dob eol error.message exec.time ei error.model ## 11 8 283 0.4855415 1 NA <NA> 7.353 -3.276847e-04 <NA> ## 12 8 284 0.4852047 2 NA <NA> 7.321 -3.283713e-04 <NA> ## 13 8 314 0.4839817 3 NA <NA> 7.703 -3.828517e-04 <NA> ## 14 8 312 0.4841398 4 NA <NA> 7.633 -2.829713e-04 <NA> ## 15 8 318 0.4841066 5 NA <NA> 7.692 -2.668354e-04 <NA> ## 16 8 314 0.4845221 6 NA <NA> 7.574 -1.382333e-04 <NA> ## 17 8 321 0.4843018 7 NA <NA> 7.693 -3.828924e-05 <NA> ## 18 8 318 0.4868457 8 NA <NA> 7.696 -8.692828e-07 <NA> ## 19 8 310 0.4862687 9 NA <NA> 7.594 -1.061185e-07 <NA> ## 20 8 313 0.4878694 10 NA <NA> 7.628 -5.153015e-07 <NA> ## train.time prop.type propose.time se mean ## 11 0.011 infill_ei 0.450 0.0143886864 0.5075765 ## 12 0.011 infill_ei 0.427 0.0090265872 0.4971003 ## 13 0.012 infill_ei 0.443 0.0062693960 0.4916927 ## 14 0.012 infill_ei 0.435 0.0037308971 0.4878950 ## 15 0.012 infill_ei 0.737 0.0024446891 0.4860699 ## 16 0.013 infill_ei 0.442 0.0012713838 0.4850705 ## 17 0.012 infill_ei 0.444 0.0006371109 0.4847248 ## 18 0.013 infill_ei 0.467 0.0002106381 0.4844576 ## 19 0.014 infill_ei 0.435 0.0002182254 0.4846214 ## 20 0.013 infill_ei 0.748 0.0002971160 0.4847383 So the recommended parameters are 8 for mtry and 314 for trees. The user can access these recommended parameters with result$x$x1 and result$x$x2. The value of the RMSE is lower than before, and equals 0.4839817. It can be accessed with result$y. Let’s now train the random forest on the training data with these values.
First, I pre-process the training data: training_rec <- prep(simple_recipe(housing_train), testing = housing_train) train_data <- bake(training_rec, new_data = housing_train) Let’s now train our final model and predict the prices: final_model <- rand_forest(mode = "regression", mtry = result$x$x1, trees = result$x$x2) %>% set_engine("ranger", importance = 'impurity') %>% fit(price ~ ., data = train_data) price_predict <- predict(final_model, new_data = select(test_data, -price)) Let’s transform the data back and compare the predicted prices to the true ones visually: cbind(price_predict * sd(housing_train$price) + mean(housing_train$price), housing_test$price) ## .pred housing_test$price ## 1 16.76938 13.5 ## 2 27.59510 30.8 ## 3 23.14952 24.7 ## 4 21.92390 21.2 ## 5 21.35030 20.0 ## 6 23.15809 22.9 ## 7 23.00947 23.9 ## 8 25.74268 26.6 ## 9 24.13122 22.6 ## 10 34.97671 43.8 ## 11 19.30543 18.8 ## 12 18.09146 15.7 ## 13 18.82922 19.2 ## 14 18.63397 13.3 ## 15 19.14438 14.0 ## 16 17.05549 15.6 ## 17 23.79491 27.0 ## 18 20.30125 17.4 ## 19 22.99200 23.6 ## 20 32.77092 33.3 ## 21 31.66258 34.6 ## 22 28.79583 34.9 ## 23 39.02755 50.0 ## 24 23.53336 21.7 ## 25 24.66551 24.3 ## 26 24.91737 24.0 ## 27 25.11847 25.1 ## 28 24.42518 23.7 ## 29 24.59139 23.7 ## 30 24.91760 26.2 ## 31 38.73875 43.5 ## 32 29.71848 35.1 ## 33 36.89490 46.0 ## 34 24.04041 26.4 ## 35 20.91349 20.3 ## 36 21.18602 23.1 ## 37 22.57069 22.2 ## 38 25.21751 23.9 ## 39 28.55841 50.0 ## 40 14.38216 7.2 ## 41 12.76573 8.5 ## 42 11.78237 9.5 ## 43 13.29279 13.4 ## 44 14.95076 16.4 ## 45 15.79182 19.1 ## 46 18.26510 19.6 ## 47 14.84985 13.3 ## 48 16.01508 16.7 ## 49 24.09930 25.0 ## 50 20.75357 21.8 ## 51 19.49487 19.7 Let’s now compute the RMSE: tibble::tibble("truth" = test_data$price, "prediction" = unlist(price_predict)) %>% yardstick::rmse(truth, prediction) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.425 As I mentioned above, the whole part about looking for hyper-parameters could be swapped for something else. The general approach, though, remains as I have described it, and can be applied to any model that has hyper-parameters. References "],["defining-your-own-functions.html", "Chapter 7 Defining your own functions 7.1 Control flow 7.2 Writing your own functions 7.3 Exercises 7.4 Functions that take functions as arguments: writing your own higher-order functions 7.5 Functions that return functions 7.6 Functions that take columns of data as arguments 7.7 Functions that use loops 7.8 Anonymous functions 7.9 Exercises", " Chapter 7 Defining your own functions In this chapter we are going to learn some advanced concepts that are going to make you into a full-fledged R programmer. Before this chapter you only used whatever R came with, as well as the functions contained in packages. We did define some functions ourselves in Chapter 6 already, but without going into many details. In this chapter, we will learn about building functions ourselves, and do so in greater detail than before. 7.1 Control flow Knowing about control flow is essential to build your own functions. Without control flow statements, such as if-else statements or loops (or, in the case of pure functional programming languages, recursion), programming languages would be very limited. 7.1.1 If-else Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical problem that requires the if ... else ... construct.
For instance: a <- 4 b <- 5 Suppose that if a > b then f should be equal to 20, else f should be equal to 10. Using if ... else ... you can achieve this like so: if (a > b) { f <- 20 } else { f <- 10 } Obviously, here f = 10. Another way to achieve this is by using the ifelse() function: f <- ifelse(a > b, 20, 10) if...else... and ifelse() might seem interchangeable, but they’re not. ifelse() is vectorized, while if...else... is not. Let’s try the following: ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no") ## [1] "no" "yes" "yes" The result is a vector. Now, let’s see what happens if we use if...else... instead of ifelse(): if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") > Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") : the condition has length > 1 This results in an error (in previous R versions, only the first element of the vector would get used). We have already discussed this in Chapter 2, remember? If you want to make sure that such an expression evaluates to TRUE, then you need to use all(): ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater") ## [1] "not all elements are greater" You may also remember the any() function: ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater") ## [1] "at least one element is greater" These are the basics. But sometimes, you might need to test for more complex conditions, which can lead to using nested if...else... constructs. These, however, can get messy: if (10 %% 3 == 0) { print("10 is divisible by 3") } else if (10 %% 2 == 0) { print("10 is divisible by 2") } ## [1] "10 is divisible by 2" Since 10 is divisible by 2 but not by 3, it is the second sentence that will be printed. The %% operator is the modulus operator, which gives the remainder of the division of 10 by 2. In such cases, it is easier to use dplyr::case_when(): case_when(10 %% 3 == 0 ~ "10 is divisible by 3", 10 %% 2 == 0 ~ "10 is divisible by 2") ## [1] "10 is divisible by 2" We have already encountered this function in Chapter 4, inside a dplyr::mutate() call to create a new column. Let’s now discuss loops. 7.1.2 For loops For loops make it possible to repeat a set of instructions a given number of times. For example, try the following: for (i in 1:10){ print("hello") } ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" It is also possible to do computations using for loops. Let’s compute the sum of the first 100 integers: result <- 0 for (i in 1:100){ result <- result + i } print(result) ## [1] 5050 result is equal to 5050, the expected result. What happened in that loop? First, we defined a variable called result and set it to 0. Then, when the loop starts, i equals 1, so we add i to result, which gives 1. Then, i equals 2, and again, we add i to result. But this time, result equals 1 and i equals 2, so now result equals 3, and we repeat this until i equals 100. If you know a programming language like C, this probably looks familiar. However, R is not C, and you should, if possible, avoid writing code that looks like this. You should always ask yourself the following questions: Is there a built-in function to achieve what I need? In this case we have sum(), so we could use sum(seq(1, 100)). Is there a way to use matrix algebra? This can sometimes make things easier, but it depends how comfortable you are with matrix algebra. This would be the solution with matrix algebra: rep(1, 100) %*% seq(1, 100).
Is there a way to use building blocks that are already available? For instance, suppose that sum() would not be a function available in R. Another way to solve this issue would be to use the following building blocks: +, which computes the sum of two numbers, and Reduce(), which reduces a list of elements using an operator. Sounds complicated? Let’s see how Reduce() works. First, let me show you how I combine these two functions to achieve the same result as when using sum(): Reduce(`+`, seq(1, 100)) ## [1] 5050 We will see how Reduce() works in greater detail in the next chapter, but what happened was something like this: Reduce(`+`, seq(1, 100)) = 1 + Reduce(`+`, seq(2, 100)) = 1 + 2 + Reduce(`+`, seq(3, 100)) = 1 + 2 + 3 + Reduce(`+`, seq(4, 100)) = .... If you ask yourself these questions, it turns out that you only rarely actually need to write loops. Loops are still important though, because sometimes there simply isn’t an alternative; I refer you to the following section of Hadley Wickham’s Advanced R for an in-depth discussion of situations where loops make more sense than using functions such as Reduce(). 7.1.3 While loops While loops are very similar to for loops. The instructions inside a while loop are repeated while a certain condition holds true. Let’s consider the sum of the first 100 integers again: result <- 0 i <- 1 while (i<=100){ result = result + i i = i + 1 } print(result) ## [1] 5050 Here, we first set result to 0 and i to 1. Then, while i is less than, or equal to, 100, we add i to result. Notice that there is one more line than in the for loop version of this code: we need to increment the value of i at each iteration; if not, i would stay equal to 1, the condition would always be fulfilled, and the loop would run forever (not really, only until your computer runs out of memory, or until the heat death of the universe, whichever comes first). Now that we know how to write loops, and know about if...else... constructs, we have (almost) all the ingredients to write our own functions. 7.2 Writing your own functions As you have seen by now, R includes a very large number of built-in functions, and many more functions are available in packages. However, there will be a lot of situations where you will need to write your own. In this section we are going to learn how to write our own functions. 7.2.1 Declaring functions in R Suppose you want to create the following function: \\(f(x) = \\dfrac{1}{\\sqrt{x}}\\). Writing this in R is quite simple: my_function <- function(x){ 1/sqrt(x) } The argument of the function, x, gets passed to the function() function, and the body of the function (more on that in the next chapter) contains the function definition. Of course, you could define functions that use more than one input: my_function <- function(x, y){ 1/sqrt(x + y) } or inputs with names longer than one character: my_function <- function(argument1, argument2){ 1/sqrt(argument1 + argument2) } Functions written by the user get called just the same way as functions included in R: my_function(1, 10) ## [1] 0.3015113 It is also possible to provide default values to the function’s arguments, which are values that are used if the user omits them: my_function <- function(argument1, argument2 = 10){ 1/sqrt(argument1 + argument2) } my_function(1) ## [1] 0.3015113 This is especially useful for functions with many arguments.
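Note that a default can still be overridden by naming the argument explicitly; continuing with my_function() from just above (the value 24 is only an illustration):

my_function(1, argument2 = 24) # overrides the default of 10: 1/sqrt(1 + 24)
## [1] 0.2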
Consider also the following example, where the function has a default method: my_function <- function(argument1, argument2, method = "foo"){ x <- argument1 + argument2 if(method == "foo"){ 1/sqrt(x) } else if (method == "bar"){ "this is a string" } } my_function(10, 11) ## [1] 0.2182179 my_function(10, 11, "bar") ## [1] "this is a string" As you see, depending on the “method” chosen, the returned result is either a numeric or a string. What happens if the user provides a “method” that is neither “foo” nor “bar”? my_function(10, 11, "spam") As you can see, nothing happens: since no branch matches, the function invisibly returns NULL. It is possible to add safeguards to your function to avoid such situations: my_function <- function(argument1, argument2, method = "foo"){ if(!(method %in% c("foo", "bar"))){ return("Method must be either 'foo' or 'bar'") } x <- argument1 + argument2 if(method == "foo"){ 1/sqrt(x) } else if (method == "bar"){ "this is a string" } } my_function(10, 11) ## [1] 0.2182179 my_function(10, 11, "bar") ## [1] "this is a string" my_function(10, 11, "foobar") ## [1] "Method must be either 'foo' or 'bar'" Notice that I have used return() inside my first if statement. This is to immediately stop evaluation of the function and return a value. If I had omitted it, evaluation would have continued, as it is always the last expression that gets evaluated. Remove return() and run the function again, and see what happens. Later, we are going to learn how to add better safeguards to your functions and to avoid runtime errors. While, in general, it is a good idea to add comments to your functions to explain what they do, I would avoid adding comments to functions that do things that are very obvious, such as with this one. Function names should be of the form: function_name(). Always give your functions very explicit names! In mathematics it is standard to give functions just one letter as a name, but I would advise against doing that in your code. Functions that you write are not special in any way; this means that R will treat them the same way, and they will work in conjunction with any other function just as if they were built into R. They have one limitation though (which is shared with R’s native functions): just like in math, they can only return one value. However, sometimes, you may need to return more than one value. To be able to do this, you can put your values in a vector or a list, and return that single object. For example: average_and_sd <- function(x){ c(mean(x), sd(x)) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 You’re still returning a single object, but it’s a vector. You can also return a named list: average_and_sd <- function(x){ list("mean_x" = mean(x), "sd_x" = sd(x)) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## $mean_x ## [1] 7.166667 ## ## $sd_x ## [1] 4.262237 As described before, you can use return() at the end of your functions: average_and_sd <- function(x){ result <- c(mean(x), sd(x)) return(result) } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 But this is only needed if you need to return a value early: average_and_sd <- function(x){ if(any(is.na(x))){ return(NA) } else { c(mean(x), sd(x)) } } average_and_sd(c(1, 3, 8, 9, 10, 12)) ## [1] 7.166667 4.262237 average_and_sd(c(1, 3, NA, 9, 10, 12)) ## [1] NA If you need to use a function from a package inside your function, use the :: operator: my_sum <- function(a_vector){ purrr::reduce(a_vector, `+`) } However, if you need to use more than one function, this can become tedious.
A quick and dirty way of doing that is to use library(package_name) inside the function: my_sum <- function(a_vector){ library(purrr) reduce(a_vector, `+`) } Loading the library inside the function has the advantage that you will be sure that the package upon which your function depends will be loaded. If the package is already loaded, it will not be loaded again, and thus will not impact performance, but if you forgot to load it at the beginning of your script, then, no worries, your function will load it the first time you use it! However, you should avoid doing this, because the resulting function is now not pure. It has a side effect, which is loading a library. This could result in problems, especially if several functions load several different packages that have functions with the same name. Depending on which of your functions runs first, the version of a given function that is available will be the one from the package that was loaded last, which masks the ones with the same name from the other packages. The very best way would be to write your own package and declare the packages upon which your functions depend as dependencies. This is something we are going to explore in Chapter 9. You can put a lot of instructions inside a function, such as loops. Let’s create a function that returns Fibonacci numbers. 7.2.2 Fibonacci numbers The Fibonacci sequence is the following: \\[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\\] Each subsequent number is the sum of the two preceding ones. In R, it is possible to define a function that returns the \\(n^{th}\\) Fibonacci number: my_fibo <- function(n){ a <- 0 b <- 1 for (i in 1:n){ temp <- b b <- a a <- a + temp } a } Inside the loop, we defined a variable called temp. Defining temporary variables is usually very useful. Let’s try to understand what happens inside this loop: First, we assign the value 0 to variable a and value 1 to variable b. We start a loop, that goes from 1 to n. We assign the value inside of b to a temporary variable, called temp. b becomes a. We assign the sum of a and temp to a. When the loop is finished, we return a. What happens if we want the 3rd Fibonacci number? At i = 1 we have first a = 0 and b = 1, then temp = 1, b = 0 and a = 0 + 1. Then i = 2. Now b = 0 and temp = 0. The previous result, a = 0 + 1, is now assigned to b, so b = 1. Then, a = 1 + 0. Finally, i = 3. temp = 1 (because b = 1), the previous result a = 1 is assigned to b and finally, a = 1 + 1. So the third Fibonacci number equals 2. Reading this might be a bit confusing; I strongly advise you to run the algorithm on a sheet of paper, step by step. The above algorithm is called an iterative algorithm, because it uses a loop to compute the result. Let’s look at another way to think about the problem, with a so-called recursive function: fibo_recur <- function(n){ if (n == 0 || n == 1){ return(n) } else { fibo_recur(n-1) + fibo_recur(n-2) } } This algorithm should be easier to understand: if n = 0 or n = 1 the function should return n (0 or 1). If n is strictly bigger than 1, fibo_recur() should return the sum of fibo_recur(n-1) and fibo_recur(n-2). This version of the function is very much the same as the mathematical definition of the Fibonacci sequence. So why not use only recursive algorithms then? Try to run the following: system.time(my_fibo(30)) ## user system elapsed ## 0.007 0.000 0.007 The result should be printed very fast (the system.time() function returns the time that it took to execute my_fibo(30)).
Let’s try with the recursive version: system.time(fibo_recur(30)) ## user system elapsed ## 1.460 0.037 1.498 It takes much longer to execute! Recursive algorithms are very CPU-demanding, so if speed is critical, it’s best to avoid recursive algorithms. Also, try removing the line if (n == 0 || n == 1) from fibo_recur(), then run fibo_recur(5) and see what happens. You should get an error: this is because for recursive algorithms you need a stopping condition, or else the recursion would run forever. This is not the case for iterative algorithms, because the stopping condition is the last step of the loop. So as you can see, for recursive relationships, for or while loops are the way to go in R, whether you’re writing these loops inside functions or not. 7.3 Exercises Exercise 1 In this exercise, you will write a function to compute the sum of the n first integers. Combine the algorithm we saw in the section about while loops and what you learned about functions in this section. Exercise 2 Write a function called my_fact() that computes the factorial of a number n. Do it using a loop, using a recursive function, and using a functional. Exercise 3 Write a function to find the roots of quadratic functions. Your function should take 3 arguments, a, b and c, and return the two roots. Only consider the case where there are two real roots (delta > 0). 7.4 Functions that take functions as arguments: writing your own higher-order functions Functions that take functions as arguments are very powerful and useful tools. Two very important functions that we will discuss in Chapter 8 are purrr::map() and purrr::reduce(). But you can also write your own! A very simple example would be the following: my_func <- function(x, func){ func(x) } my_func() is a very simple function that takes x and func() as arguments and that simply executes func(x). This might not seem very useful (after all, you could simply call func(x) directly!), but this is just for illustration purposes; in practice, your functions would be more useful than that! Let’s try to use my_func(): my_func(c(1, 8, 1, 0, 8), mean) ## [1] 3.6 As expected, this returns the mean of the given vector. But now suppose the following: my_func(c(1, 8, 1, NA, 8), mean) ## [1] NA Because one element of the vector is NA, the whole mean is NA. mean() has a na.rm argument that you can set to TRUE to ignore the NAs in the vector. However, here, there is no way to provide this argument to the function mean()! Let’s see what happens when we try to: my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) : unused argument (na.rm = TRUE) So what you could do is pass the value TRUE to the na.rm argument of mean() from your own function: my_func <- function(x, func, remove_na){ func(x, na.rm = remove_na) } my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE) ## [1] 4.5 This is one solution, but mean() also has another argument called trim. What if some other user needs this argument? Should you also add it to your function? Surely there’s a way to avoid this problem? Yes, there is, and it is by using the dots. The ... simply means “any other arguments as needed”, and it’s very easy to use: my_func <- function(x, func, ...){ func(x, ...) } my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) ## [1] 4.5 or, now, if you need the trim argument: my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1) ## [1] 4.5 The ...
are very useful when writing higher-order functions such as my_func(), because they allow you to pass arguments down to the underlying functions. 7.5 Functions that return functions The example from before, my_func(), took three arguments: some x, a function func, and ... (dots). my_func() was a kind of wrapper that evaluated func on its arguments x and .... But sometimes this is not quite what you need or want. It is sometimes useful to write a function that returns a modified function. This type of function is called a function factory, as it builds functions. For instance, suppose that we want to time how long functions take to run. An idea would be to proceed like this: tic <- Sys.time() very_slow_function(x) toc <- Sys.time() running_time <- toc - tic but if you want to time several functions, this gets very tedious. It would be much easier if functions would time themselves. We could achieve this by writing a wrapper, like this: timed_very_slow_function <- function(...){ tic <- Sys.time() result <- very_slow_function(...) toc <- Sys.time() running_time <- toc - tic list("result" = result, "running_time" = running_time) } The problem here is that we would have to write such a wrapper for each function we need to time. But thanks to the concept of function factories, we can write a function that does this for us: time_f <- function(.f, ...){ function(...){ tic <- Sys.time() result <- .f(...) toc <- Sys.time() running_time <- toc - tic list("result" = result, "running_time" = running_time) } } time_f() is a function that returns a function, a function factory. Calling it on a function returns, as expected, a function: t_mean <- time_f(mean) t_mean ## function(...){ ## ## tic <- Sys.time() ## result <- .f(...) ## toc <- Sys.time() ## ## running_time <- toc - tic ## ## list("result" = result, ## "running_time" = running_time) ## ## } ## <environment: 0x5572990788f8> This function can now be used like any other function: output <- t_mean(seq(-500000, 500000)) output is a list of two elements, the first being simply the result of mean(seq(-500000, 500000)), and the other being the running time. This approach is super flexible. For instance, imagine that there is an NA in the vector. This would result in the mean of this vector being NA: t_mean(c(NA, seq(-500000, 500000))) ## $result ## [1] NA ## ## $running_time ## Time difference of 0.006829977 secs But because we use the ... in the definition of time_f(), we can now simply pass mean()’s options down to it: t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE) ## $result ## [1] 0 ## ## $running_time ## Time difference of 0.01427937 secs 7.6 Functions that take columns of data as arguments 7.6.1 The enquo() - !!() approach In many situations, you will want to write functions that look similar to this: my_function(my_data, one_column_inside_data) Such a function would be useful in situations where you have to apply a certain number of operations to columns for different data frames. For example, if you need to create tables of descriptive statistics or graphs periodically, it can be very useful to put these operations inside a function and then call the function whenever you need it, on each fresh batch of data.
However, if you try to write something like that, something that might seem unexpected at first will happen: data(mtcars) simple_function <- function(dataset, col_name){ dataset %>% group_by(col_name) %>% summarise(mean_speed = mean(speed)) } simple_function(cars, "dist") Error: unknown variable to group by : col_name The variable col_name is passed to simple_function() as a string, but group_by() requires a variable name. So why not try to convert col_name to a name? simple_function <- function(dataset, col_name){ col_name <- as.name(col_name) dataset %>% group_by(col_name) %>% summarise(mean_speed = mean(speed)) } simple_function(cars, "dist") Error: unknown variable to group by : col_name This is because R is literally looking for the variable \"dist\" somewhere in the global environment, and not as a column of the data. R does not understand that you are referring to the column \"dist\" that is inside the dataset. So how can we make R understand what you mean? To be able to do that, we need to use a framework that was introduced in the {tidyverse}, called tidy evaluation. This framework can be used by installing the {rlang} package. {rlang} is quite a technical package, so I will spare you the details. But you should at the very least take a look at the following documents here and here. The discussion can get complicated, but you don’t need to know everything about {rlang}. As you will see, knowing some of the capabilities {rlang} provides can be incredibly useful. Take a look at the code below: simple_function <- function(dataset, col_name){ col_name <- enquo(col_name) dataset %>% group_by(!!col_name) %>% summarise(mean_mpg = mean(mpg)) } simple_function(mtcars, cyl) ## # A tibble: 3 × 2 ## cyl mean_mpg ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1 As you can see, the previous idea we had, which was using as.name(), was not very far away from the solution. The solution, with {rlang}, consists in using enquo(), which (for our purposes) does something similar to as.name(). Now that col_name is (R programmers call it) quoted, or defused, we need to tell group_by() to evaluate the input as is. This is done with !!(), called the injection operator, which is another {rlang} function. I say it again; don’t worry if you don’t understand everything. Just remember to use enquo() on your column names and then !!() inside the {dplyr} function you want to use. Let’s see some other examples: simple_function <- function(dataset, col_name, value){ col_name <- enquo(col_name) dataset %>% filter((!!col_name) == value) %>% summarise(mean_cyl = mean(cyl)) } simple_function(mtcars, am, 1) ## mean_cyl ## 1 5.076923 Notice that I’ve written: filter((!!col_name) == value) and not: filter(!!col_name == value) I have enclosed !!col_name inside parentheses. This is because operators such as == have precedence over !!, so you have to be explicit. Also, notice that I didn’t have to quote 1. This is because it’s a standard value, not a column inside the dataset. Let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable? simple_function <- function(dataset, filter_col, mean_col, value){ filter_col <- enquo(filter_col) mean_col <- enquo(mean_col) dataset %>% filter((!!filter_col) == value) %>% summarise(mean((!!mean_col))) } simple_function(mtcars, am, cyl, 1) ## mean(cyl) ## 1 5.076923 Notice that I had to quote mean_col too. Using the ...
that we discovered in the previous section, we can pass more than one column: simple_function <- function(dataset, ...){ col_vars <- quos(...) dataset %>% summarise_at(vars(!!!col_vars), funs(mean, sd)) } Because these dots contain more than one variable, you have to use quos() instead of enquo(). This will put the arguments provided via the dots in a list. Then, because we have a list of columns, we have to use summarise_at(), which you should know if you did the exercises of Chapter 4. So if you didn’t do them, go back to them and finish them first. Doing the exercises will also teach you what vars() and funs() are. The last thing you have to pay attention to is to use !!!() if you used quos(). So three ! instead of only two. This allows you to then do things like this: simple_function(mtcars, am, cyl, mpg) ## Warning: `funs()` was deprecated in dplyr 0.8.0. ## Please use a list of either functions or lambdas: ## ## # Simple named list: ## list(mean = mean, median = median) ## ## # Auto named with `tibble::lst()`: ## tibble::lst(mean, median) ## ## # Using lambdas ## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE)) ## am_mean cyl_mean mpg_mean am_sd cyl_sd mpg_sd ## 1 0.40625 6.1875 20.09062 0.4989909 1.785922 6.026948 Using ... with !!!() allows you to write very flexible functions. If you need to be even more general, you can also provide the summary functions as arguments of your function, but you have to rewrite your function a little bit: simple_function <- function(dataset, cols, funcs){ dataset %>% summarise_at(vars(!!!cols), funs(!!!funcs)) } You might be wondering where the quos() went. Well, because now we are passing two lists, a list of columns that we have to quote, and a list of functions that we also have to quote, we need to use quos() when calling the function: simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum)) ## am_mean cyl_mean mpg_mean am_sd cyl_sd mpg_sd am_sum cyl_sum mpg_sum ## 1 0.40625 6.1875 20.09062 0.4989909 1.785922 6.026948 13 198 642.9 This works, but I don’t think you’ll need to have that much flexibility; either the columns are variables, or the functions, but rarely both at the same time. To conclude this section, I should also talk about as_label(), which allows you to recover the name of a quoted variable as a string, for instance if you want to call the resulting column mean_mpg when you compute the mean of the mpg column: simple_function <- function(dataset, filter_col, mean_col, value){ filter_col <- enquo(filter_col) mean_col <- enquo(mean_col) mean_name <- paste0("mean_", as_label(mean_col)) dataset %>% filter((!!filter_col) == value) %>% summarise(!!(mean_name) := mean((!!mean_col))) } Pay attention to the := operator in the last line. This is needed because the name of the new column is itself stored in a variable; the regular = operator would not work here. 7.6.2 Curly Curly, a simplified approach to enquo() and !!() The previous section might have been a bit difficult to grasp, but there is a simplified way of doing it, which consists in using {{}}, introduced in {rlang} version 0.4.0. The suggested pronunciation of {{}} is curly-curly, but there is no consensus yet.
Let’s suppose that I need to write a function that takes a data frame, as well as a column from this data frame as arguments, just like before: how_many_na <- function(dataframe, column_name){ dataframe %>% filter(is.na(column_name)) %>% count() } Let’s try this function out on the starwars data: data(starwars) head(starwars) ## # A tibble: 6 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo ## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi… ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color, ## # ³​eye_color, ⁴​birth_year, ⁵​homeworld As you can see, there are missing values in the hair_color column. Let’s try to count how many missing values are in this column: how_many_na(starwars, hair_color) Error: object 'hair_color' not found Just as expected, this does not work. The issue is that the column is inside the dataframe, but when calling the function with hair_color as the second argument, R is looking for a variable called hair_color that does not exist. What about trying with \"hair_color\"? how_many_na(starwars, "hair_color") ## # A tibble: 1 × 1 ## n ## <int> ## 1 0 Now we get something, but something wrong! This is because filter() is testing whether the literal string \"hair_color\" is NA, which is never the case, so no rows are counted. One way to solve this issue is to not use the filter() function, and instead rely on base R: how_many_na_base <- function(dataframe, column_name){ na_index <- is.na(dataframe[, column_name]) nrow(dataframe[na_index, column_name]) } how_many_na_base(starwars, "hair_color") ## [1] 5 This works, but not using the {tidyverse} at all is not always an option. For instance, the next function, which uses a grouping variable, would be difficult to implement without the {tidyverse}: summarise_groups <- function(dataframe, grouping_var, column_name){ dataframe %>% group_by(grouping_var) %>% summarise(mean(column_name, na.rm = TRUE)) } Calling this function results in the following error message, as expected: Error: Column `grouping_var` is unknown In the previous section, we solved the issue like so: summarise_groups <- function(dataframe, grouping_var, column_name){ grouping_var <- enquo(grouping_var) column_name <- enquo(column_name) mean_name <- paste0("mean_", as_label(column_name)) dataframe %>% group_by(!!grouping_var) %>% summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE)) } The core of the function remained very similar to the version from before, but now one has to use the enquo()-!! syntax. Now this can be simplified using the new {{}} syntax: summarise_groups <- function(dataframe, grouping_var, column_name){ dataframe %>% group_by({{grouping_var}}) %>% summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE)) } Much easier and cleaner! You still have to use the := operator instead of = for the column name, however, and if you want to modify the column names, for instance in this case to return \"mean_height\" instead of height, you have to keep using the enquo()-!! syntax. 7.7 Functions that use loops It is entirely possible to put a loop inside a function.
For example, consider the following function that returns the square root of a number using Newton’s algorithm: sqrt_newton <- function(a, init = 1, eps = 0.01){ stopifnot(a >= 0) while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) } init } This function contains a while loop inside its body. Let’s see if it works: sqrt_newton(16) ## [1] 4.000001 In the definition of the function, I wrote init = 1 and eps = 0.01, which means that these arguments can be omitted and will take the provided values (1 and 0.01) as defaults. You can then use this function as any other, for example with map(): map(c(16, 7, 8, 9, 12), sqrt_newton) ## [[1]] ## [1] 4.000001 ## ## [[2]] ## [1] 2.645767 ## ## [[3]] ## [1] 2.828469 ## ## [[4]] ## [1] 3.000092 ## ## [[5]] ## [1] 3.464616 This is what I meant before with “your functions are nothing special”. Once the function is defined, you can use it like any other base R function. Notice the use of stopifnot() inside the body of the function. This is a way to return an error in case a condition is not fulfilled. We are going to learn more about this type of function in the next chapter. 7.8 Anonymous functions As the name implies, anonymous functions are functions that do not have a name. These are useful inside functions that have functions as arguments, such as purrr::map() or purrr::reduce(): map(c(1,2,3,4), function(x){1/sqrt(x)}) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 These anonymous functions get defined in a very similar way to regular functions; you just skip the name and that’s it. {tidyverse} functions also support formulas; these get converted to anonymous functions: map(c(1,2,3,4), ~{1/sqrt(.)}) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 Using a formula instead of an anonymous function is less verbose; you use ~ instead of function(x) and a single dot . instead of x. What if you need an anonymous function that requires more than one argument? This is not a problem: map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), function(x, y){(x**2)/y}) ## [[1]] ## [1] 0.1111111 ## ## [[2]] ## [1] 0.5 ## ## [[3]] ## [1] 1.285714 ## ## [[4]] ## [1] 2.666667 ## ## [[5]] ## [1] 5 or, using a formula: map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y}) ## [[1]] ## [1] 0.1111111 ## ## [[2]] ## [1] 0.5 ## ## [[3]] ## [1] 1.285714 ## ## [[4]] ## [1] 2.666667 ## ## [[5]] ## [1] 5 Because you now have two arguments, a single dot would not work, so instead you use .x and .y to avoid confusion. In version 4.1, R introduced a short-hand for defining anonymous functions: map(c(1,2,3,4), \\(x)(1/sqrt(x))) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 0.7071068 ## ## [[3]] ## [1] 0.5773503 ## ## [[4]] ## [1] 0.5 \\(x) is supposed to look like the notation \\(\\lambda(x)\\). This notation comes from lambda calculus, where functions are defined like this: \\[\\lambda x.\\ 1/\\sqrt{x}\\] which is equivalent to \\(f(x) = 1/\\sqrt{x}\\). You can use \\(x) or function(x) interchangeably. You now know a lot about writing your own functions. In the next chapter, we are going to learn about functional programming, the programming paradigm I described in the introduction of this book. 7.9 Exercises Exercise 1 Create the following vector: \\[a = (1,6,7,8,8,9,2)\\] Using a for loop and a while loop, compute the sum of its elements. To avoid issues, use i as the counter inside the for loop, and j as the counter for the while loop.
How would you achieve that with a functional (a function that takes a function as an argument)? Exercise 2 Let’s use a loop to get the matrix product of two matrices A and B. Follow these steps to create the loop: Create matrix A: \\[A = \\left( \\begin{array}{ccc} 9 & 4 & 12 \\\\ 5 & 0 & 7 \\\\ 2 & 6 & 8 \\\\ 9 & 2 & 9 \\end{array} \\right) \\] Create matrix B: \\[B = \\left( \\begin{array}{cccc} 5 & 4 & 2 & 5 \\\\ 2 & 7 & 2 & 1 \\\\ 8 & 3 & 2 & 6 \\\\ \\end{array} \\right) \\] Create a matrix C, with dimension 4x4, that will hold the result. Use this command: C <- matrix(rep(0, 16), nrow = 4) Using a for loop, loop over the rows of A first: for(i in 1:nrow(A)) Inside this loop, loop over the columns of B: for(j in 1:ncol(B)) Again, inside this loop, loop over the rows of B: for(k in 1:nrow(B)) Inside this last loop, compute the result and save it inside C: C[i,j] <- C[i,j] + A[i,k] * B[k,j] Now write a function that takes two matrices as arguments, and returns their product. R has a built-in operator to compute the product of two matrices. Which is it? Exercise 3 Fizz Buzz: Print integers from 1 to 100. If a number is divisible by 3, print the word \"Fizz\"; if it’s divisible by 5, print \"Buzz\". Use a for loop and if statements. Write a function that takes an integer as argument, and prints \"Fizz\" or \"Buzz\" up to that integer. Exercise 4 Fizz Buzz 2: Same as above, but now add this third condition: if a number is divisible by both 3 and 5, print \"FizzBuzz\". Write a function that takes an integer as argument, and prints Fizz, Buzz or FizzBuzz up to that integer. "],["functional-programming.html", "Chapter 8 Functional programming 8.1 Function definitions 8.2 Properties of functions 8.3 Functional programming with {purrr} 8.4 List-based workflows for efficiency 8.5 Exercises", " Chapter 8 Functional programming Functional programming is a paradigm that I find very suitable for data science. In functional programming, your code is organised into functions that perform the operations you need. Your scripts will only be a sequence of calls to these functions, making them easier to understand. R is not a pure functional programming language, so we need some self-discipline to apply pure functional programming principles. However, these efforts are worth it, because pure functions are easier to debug, extend and document. In this chapter, we are going to learn about functional programming principles that you can adopt and start using to make your code better. 8.1 Function definitions You should now be familiar with function definitions in R. Let’s suppose you want to write a function to compute the square root of a number and want to do so using Newton’s algorithm: sqrt_newton <- function(a, init, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) } init } You can then use this function to get the square root of a number: sqrt_newton(16, 2) ## [1] 4.00122 We are using a while loop inside the body of the function. The body of a function is the set of instructions that define the function. You can get the body of a function with body(some_func). In pure functional programming languages, like Haskell, loops do not exist. How can you program without loops, you may ask? In functional programming, loops are replaced by recursion, which we already discussed in the previous chapter.
Let’s rewrite our little example above with recursion: sqrt_newton_recur <- function(a, init, eps = 0.01){ if(abs(init**2 - a) < eps){ result <- init } else { init <- 1/2 * (init + a/init) result <- sqrt_newton_recur(a, init, eps) } result } sqrt_newton_recur(16, 2) ## [1] 4.00122 R is not a pure functional programming language though, so we can still use loops (be it while or for loops) in the bodies of our functions. As discussed in the previous chapter, it is actually better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized. I won’t go into the details of what tail-call optimization is, but just remember that if performance is important, a loop will be faster. However, sometimes, it is easier to write a function using recursion. I personally tend to avoid loops if performance is not important, because I find that code that avoids loops is easier to read and debug. However, knowing that you can use loops is reassuring, and encapsulating loops inside functions gives you the benefits of both using functions, and loops. In the coming sections I will show you some built-in functions that make it possible to avoid writing loops and that don’t rely on recursion, so performance won’t be penalized. 8.2 Properties of functions Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way. For example, the following function: increment <- function(x){ x + 1 } is a referentially transparent function. We always get the same result for any x that we give to this function. This: increment(10) ## [1] 11 will always produce 11. However, this one: increment_opaque <- function(x){ x + spam } is not a referentially transparent function, because its value depends on the global variable spam. spam <- 1 increment_opaque(10) ## [1] 11 will produce 11 if spam = 1. But what if spam = 19? spam <- 19 increment_opaque(10) ## [1] 29 To make increment_opaque() a referentially transparent function, it is enough to make spam an argument: increment_not_opaque <- function(x, spam){ x + spam } Now even if there is a global variable called spam, this will not influence our function: spam <- 19 increment_not_opaque(10, 34) ## [1] 44 This is because the spam used inside the function is the function’s argument, a local variable. It could have been called anything else, really. Avoiding opaque functions makes our life easier. Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program. For example this function (which is not a referentially transparent function): count_iter <- 0 sqrt_newton_side_effect <- function(a, init, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the } # RHS value in a variable inside the global environment init } If you look in the environment pane, you will see that count_iter equals 0. Now call this function with the following arguments: sqrt_newton_side_effect(16000, 2) ## [1] 126.4911 print(count_iter) ## [1] 9 If you check the value of count_iter now, you will see that it increased! This is a side effect, because the function changed something outside of its scope. It changed a value in the global environment. In general, it is good practice to avoid side-effects.
For example, we could make the above function not have any side effects like this: sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){ while(abs(init**2 - a) > eps){ init <- 1/2 *(init + a/init) count_iter <- count_iter + 1 } c(init, count_iter) } Now, this function returns a vector with two elements, the result, and the number of iterations it took to get the result: sqrt_newton_count(16000, 2) ## [1] 126.4911 9.0000 Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided since you want to write to disk. Just remember: try to avoid having functions changing variables in the global environment unless you have a very good reason for doing so. Very long scripts that don’t use functions and instead rely on loops changing the values of many global variables are a nightmare to debug. If something goes wrong, it might be very difficult to pinpoint where the problem is. Is there an error in one of the loops? Is your code running for a particular value of a particular variable in the global environment, but not for other values? Which values? And of which variables? It can be very difficult to know what is wrong with such a script. With functional programming, you can avoid a lot of this pain for free (well not entirely for free, it still requires some effort, since R is not a pure functional language). Writing functions also makes it easier to parallelize your code. We are going to learn about that later in this chapter too. Finally, another property of mathematical functions is that they do one single thing. Functional programming purists also program their functions to do one single task. This has benefits, but can complicate things. The function we wrote previously does two things: it computes the square root of a number and also returns the number of iterations it took to compute the result. However, this is not a bad thing; the function is doing two tasks, but these tasks are related to each other and it makes sense to have them together. My piece of advice: avoid having functions that do many unrelated things. This makes debugging harder. In conclusion: you should strive for referential transparency, try to avoid side effects unless you have a good reason to have them, and try to keep your functions short and doing as few tasks as possible. This makes testing and debugging easier, as you will see in the next chapter, but also improves readability and maintainability of your code. 8.3 Functional programming with {purrr} I mentioned it several times already, but R is not a pure functional programming language. It is possible to write R code using the functional programming paradigm, but some effort is required. The {purrr} package extends R’s base functional programming capabilities with some very interesting functions. We have already seen map() and reduce(), which we are going to see in more detail now. Then, we are going to learn about some other functions included in {purrr} that make functional programming easier in R. 8.3.1 Doing away with loops: the map*() family of functions Instead of using loops, pure functional programming languages use functions that achieve the same result. These functions are often called Map or Reduce (also called Fold). R comes with the *apply() family of functions (which are implementations of Map), as well as Reduce() for functional programming.
Within this family, you can find lapply(), sapply(), vapply(), tapply(), mapply(), rapply(), eapply() and apply() (I might have forgotten one or the other, but that’s not important). Each version of an *apply() function has a different purpose, but it is not very easy to remember which does what exactly. To add even more confusion, the arguments are sometimes different between each of these. In the {purrr} package, these functions are replaced by the map*() family of functions. As you will shortly see, they are very consistent, and thus easier to use. The names of these functions all start with map_, and the suffix tells you what the function is going to return. For example, if you want doubles out, you would use map_dbl(). If you are working on data frames and want a data frame back, you would use map_df(). Let’s start with the basic map() function. The following gif (source: Wikipedia) illustrates what map() does fairly well: \\(X\\) is a vector composed of the following scalars: \\((0, 5, 8, 3, 2, 1)\\). The function we want to map to each element of \\(X\\) is \\(f(x) = x + 1\\). \\(X'\\) is the result of this operation. Using R, we would do the following: library("purrr") numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- map(numbers, plus_one) my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 Using a loop, you would write: numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- vector("list", 6) for(number in seq_along(numbers)){ my_results[[number]] <- plus_one(numbers[number]) } my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 Now I don’t know about you, but I prefer the first option. Using functional programming, you don’t need to create an empty list to hold your results, and the code is more concise. Plus, it is less error prone. I had to try several times to get the loop right (and I’ve been using R for almost 10 years now). Why? Well, first of all I used %in% instead of in. Then, I forgot about seq_along(). After that, I made a typo, plos_one() instead of plus_one() (ok, that one is unrelated to the loop). Let’s also see how this works using base R: numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- lapply(numbers, plus_one) my_results ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 9 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 3 ## ## [[6]] ## [1] 2 So what is the added value of using {purrr}, you might ask. Well, imagine that instead of a list, I need an atomic vector of numerics. This is fairly easy with {purrr}: library("purrr") numbers <- c(0, 5, 8, 3, 2, 1) plus_one <- function(x) (x + 1) my_results <- map_dbl(numbers, plus_one) my_results ## [1] 1 6 9 4 3 2 We’re going to discuss these functions below, but know that in base R, outputting something else involves more effort. Let’s go back to our sqrt_newton() function. This function has more than one parameter. Often, we would like to map functions with more than one parameter to a list, while holding constant some of the function’s parameters.
This is easily achieved like so: library("purrr") numbers <- c(7, 8, 19, 64) map(numbers, sqrt_newton, init = 1) ## [[1]] ## [1] 2.645767 ## ## [[2]] ## [1] 2.828469 ## ## [[3]] ## [1] 4.358902 ## ## [[4]] ## [1] 8.000002 It is also possible to use a formula: library("purrr") numbers <- c(7, 8, 19, 64) map(numbers, ~sqrt_newton(., init = 1)) ## [[1]] ## [1] 2.645767 ## ## [[2]] ## [1] 2.828469 ## ## [[3]] ## [1] 4.358902 ## ## [[4]] ## [1] 8.000002 Another function that is similar to map() is rerun(). You guessed it, this one simply reruns an expression: rerun(10, "hello") ## [[1]] ## [1] "hello" ## ## [[2]] ## [1] "hello" ## ## [[3]] ## [1] "hello" ## ## [[4]] ## [1] "hello" ## ## [[5]] ## [1] "hello" ## ## [[6]] ## [1] "hello" ## ## [[7]] ## [1] "hello" ## ## [[8]] ## [1] "hello" ## ## [[9]] ## [1] "hello" ## ## [[10]] ## [1] "hello" rerun() simply runs an expression (which can be arbitrarily complex) n times, whereas map() maps a function to a list of inputs, so to achieve the same with map(), you need to map the print() function to a vector of characters: map(rep("hello", 10), print) ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [1] "hello" ## [[1]] ## [1] "hello" ## ## [[2]] ## [1] "hello" ## ## [[3]] ## [1] "hello" ## ## [[4]] ## [1] "hello" ## ## [[5]] ## [1] "hello" ## ## [[6]] ## [1] "hello" ## ## [[7]] ## [1] "hello" ## ## [[8]] ## [1] "hello" ## ## [[9]] ## [1] "hello" ## ## [[10]] ## [1] "hello" rep() is a function that creates a vector by repeating something, in this case the string “hello”, as many times as needed, here 10. The output here is a bit different than before, though, because first you will see “hello” printed 10 times and then the list where each element is “hello”. This is because the print() function has a side effect, which is, well, printing to the console. We see this side effect 10 times, and then the list created with map(). rerun() is useful if you want to run simulations. For instance, let’s suppose that I perform a simulation where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance: mean_var_throws <- function(n){ throws <- sample(1:6, n, replace = TRUE) mean_throws <- mean(throws) var_throws <- var(throws) tibble::tribble(~mean_throws, ~var_throws, mean_throws, var_throws) } mean_var_throws(5) ## # A tibble: 1 × 2 ## mean_throws var_throws ## <dbl> <dbl> ## 1 2.2 1.7 mean_var_throws() returns a tibble object with the mean and the variance of the points. Now suppose I want to compute the expected value of the distribution of throwing dice. We know from theory that it should be equal to \\(3.5 (= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6)\\). Let’s rerun the simulation 50 times: simulations <- rerun(50, mean_var_throws(5)) Let’s see what the simulations object is made of: str(simulations) ## List of 50 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2 ## ..$ var_throws : num 3 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 0.2 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 0.7 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables: ## ..$ mean_throws: num 2.8 ## ..$ var_throws : num 1.7 ..... simulations is a list of 50 data frames.
We can easily combine them into a single data frame, and compute the mean of the means, which should return something close to the expected value of 3.5: bind_rows(simulations) %>% summarise(expected_value = mean(mean_throws)) ## # A tibble: 1 × 1 ## expected_value ## <dbl> ## 1 3.44 Pretty close! Now of course, one could have simply done something like this: mean(sample(1:6, 1000, replace = TRUE)) ## [1] 3.481 but the point was to illustrate that rerun() can run any arbitrarily complex expression, and that it is good practice to put the result in a data frame or list, for easier further manipulation. You now know the standard map() function, and also rerun(), which both return lists, but there are a number of variants of map(). map_dbl() returns an atomic vector of doubles, as we’ve seen before. A little reminder below: map_dbl(numbers, sqrt_newton, init = 1) ## [1] 2.645767 2.828469 4.358902 8.000002 In a similar fashion, map_chr() returns an atomic vector of strings: map_chr(numbers, sqrt_newton, init = 1) ## [1] "2.645767" "2.828469" "4.358902" "8.000002" map_lgl() returns an atomic vector of TRUE or FALSE: divisible <- function(x, y){ if_else(x %% y == 0, TRUE, FALSE) } map_lgl(seq(1:100), divisible, 3) ## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [25] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [37] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [49] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [61] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [73] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [85] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE ## [97] FALSE FALSE TRUE FALSE There are also other interesting variants, such as map_if(): a <- seq(1,10) map_if(a, (function(x) divisible(x, 2)), sqrt) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ## ## [[3]] ## [1] 3 ## ## [[4]] ## [1] 2 ## ## [[5]] ## [1] 5 ## ## [[6]] ## [1] 2.44949 ## ## [[7]] ## [1] 7 ## ## [[8]] ## [1] 2.828427 ## ## [[9]] ## [1] 9 ## ## [[10]] ## [1] 3.162278 I used map_if() to take the square root of only those numbers in vector a that are divisible by 2, by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()). map_at() is similar to map_if() but maps the function at positions specified by the user: map_at(numbers, c(1, 3), sqrt) ## [[1]] ## [1] 2.645751 ## ## [[2]] ## [1] 8 ## ## [[3]] ## [1] 4.358899 ## ## [[4]] ## [1] 64 or if you have a named list: recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10) map_at(recipe, "bacon", `*`, 2) ## $spam ## [1] 1 ## ## $eggs ## [1] 3 ## ## $bacon ## [1] 20 I used map_at() to double the quantity of bacon in the recipe (by using the `*` function and specifying its second argument, 2; try the following in the command prompt: `*`(3, 4)).
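As an aside, the predicate given to map_if() can also be written as a formula, which avoids wrapping divisible() in an anonymous function; a small sketch, assuming a and divisible() as defined above:

# Same result as the map_if() call above, with a formula as predicate
map_if(a, ~divisible(.x, 2), sqrt)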
map2() is the equivalent of mapply() and pmap() is the generalisation of map2() for more than 2 arguments: print(a) ## [1] 1 2 3 4 5 6 7 8 9 10 b <- seq(1, 2, length.out = 10) print(b) ## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778 ## [9] 1.888889 2.000000 map2(a, b, `*`) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 2.222222 ## ## [[3]] ## [1] 3.666667 ## ## [[4]] ## [1] 5.333333 ## ## [[5]] ## [1] 7.222222 ## ## [[6]] ## [1] 9.333333 ## ## [[7]] ## [1] 11.66667 ## ## [[8]] ## [1] 14.22222 ## ## [[9]] ## [1] 17 ## ## [[10]] ## [1] 20 Each element of a gets multiplied by the element of b that is in the same position. Let’s see what pmap() does. Can you guess from the code below what is going on? I will print a and b again for clarity: a ## [1] 1 2 3 4 5 6 7 8 9 10 b ## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667 1.777778 ## [9] 1.888889 2.000000 n <- seq(1:10) pmap(list(a, b, n), rnorm) ## [[1]] ## [1] -0.1758315 ## ## [[2]] ## [1] -0.2162863 1.1033912 ## ## [[3]] ## [1] 4.5731231 -0.3743379 6.8130737 ## ## [[4]] ## [1] 0.8933089 4.1930837 7.5276030 -2.3575522 ## ## [[5]] ## [1] 2.1814981 -1.7455750 5.0548288 2.7848458 0.9230675 ## ## [[6]] ## [1] 2.806217 5.667499 -5.032922 6.741065 -2.757928 12.414101 ## ## [[7]] ## [1] -3.314145 -7.912019 -3.865292 4.307842 18.022049 1.278158 1.083208 ## ## [[8]] ## [1] 6.2629161 2.1213552 0.3543566 2.1041606 -0.2643654 8.7600450 3.3616206 ## [8] -7.7446668 ## ## [[9]] ## [1] -7.609538 5.472267 -4.869374 -11.943063 4.707929 -7.730088 13.431771 ## [8] 1.606800 -6.578745 ## ## [[10]] ## [1] -9.101480 4.404571 -16.071437 1.110689 7.168097 15.848579 ## [7] 16.710863 1.998482 -17.856521 -2.021087 Let’s take a closer look at what a, b and n look like, when they are placed next to each other: cbind(a, b, n) ## a b n ## [1,] 1 1.000000 1 ## [2,] 2 1.111111 2 ## [3,] 3 1.222222 3 ## [4,] 4 1.333333 4 ## [5,] 5 1.444444 5 ## [6,] 6 1.555556 6 ## [7,] 7 1.666667 7 ## [8,] 8 1.777778 8 ## [9,] 9 1.888889 9 ## [10,] 10 2.000000 10 rnorm() gets first called with the parameters from the first line, meaning rnorm(a[1], b[1], n[1]). The second time rnorm() gets called, you guessed it, it is with the parameters on the second line of the array above, rnorm(a[2], b[2], n[2]), etc. There are other functions in the map() family of functions, but we will discover them in the exercises! The map() family of functions does not have any more secrets for you. Let’s now take a look at the reduce() family of functions. 8.3.2 Reducing with purrr Reducing is another important concept in functional programming. It allows going from a list of elements, to a single element, by somehow combining the elements into one. For instance, using the base R Reduce() function, you can sum the elements of a list like so: Reduce(`+`, seq(1:100)) ## [1] 5050 using purrr::reduce(), this becomes: reduce(seq(1:100), `+`) ## [1] 5050 If you don’t really get what is happening, don’t worry. Things should get clearer once I introduce another version of reduce(), called accumulate(), which we will see below. Sometimes, the direction from which we start to reduce is quite important. You can “start from the end” of the list by using the .dir argument: reduce(seq(1:100), `+`, .dir = "backward") ## [1] 5050 Of course, for commutative operations, direction does not matter. But it does matter for non-commutative operations: reduce(seq(1:100), `-`) ## [1] -5048 reduce(seq(1:100), `-`, .dir = "backward") ## [1] -50 (the forward reduction computes ((((1 - 2) - 3) - 4) - … - 100), which equals -5048, while the backward reduction computes 1 - (2 - (3 - … - (99 - 100))), which equals -50). Let’s now take a look at accumulate().
accumulate() is very similar to reduce(), but keeps the intermediate results. Which intermediate results? Let’s try and see what happens: a <- seq(1, 10) accumulate(a, `-`) ## [1] 1 -1 -4 -8 -13 -19 -26 -34 -43 -53 accumulate() illustrates pretty well what is happening; the first element, 1, is simply the first element of seq(1, 10). The second element of the result however, is the difference between 1 and 2, -1. The next element in a is 3. Thus the next result is -1-3, -4, and so on until we run out of elements in a. The below illustration shows the algorithm step-by-step: (1-2-3-4-5-6-7-8-9-10) ((1)-2-3-4-5-6-7-8-9-10) ((1-2)-3-4-5-6-7-8-9-10) ((-1-3)-4-5-6-7-8-9-10) ((-4-4)-5-6-7-8-9-10) ((-8-5)-6-7-8-9-10) ((-13-6)-7-8-9-10) ((-19-7)-8-9-10) ((-26-8)-9-10) ((-34-9)-10) (-43-10) -53 reduce() only shows the final result of all these operations. accumulate() and reduce() also have an .init argument, which makes it possible to start the reducing procedure from an initial value that is different from the first element of the vector: reduce(a, `+`, .init = 1000) accumulate(a, `-`, .init = 1000, .dir = "backward") ## [1] 1055 ## [1] 995 -994 996 -993 997 -992 998 -991 999 -990 1000 reduce() generalizes functions that only take two arguments. If you were to write a function that returns the minimum between two numbers: my_min <- function(a, b){ if(a < b){ return(a) } else { return(b) } } You could use reduce() to get the minimum of a list of numbers: numbers2 <- c(3, 1, -8, 9) reduce(numbers2, my_min) ## [1] -8 map() and reduce() are arguably the most useful higher-order functions, and perhaps also the most famous ones, true ambassadors of functional programming. You might have read about MapReduce, a programming model for processing big data in parallel. The way MapReduce works is inspired by both these map() and reduce() functions, which are always included in functional programming languages. This illustrates that the functional programming paradigm is very well suited to parallel computing. Something else that is very important to understand at this point: up until now, we only used these functions on lists, or atomic vectors, of numbers. However, map() and reduce(), and other higher-order functions for that matter, do not care about the contents of the list. What these functions do is take another function, and make it do something to the elements of the list. It does not matter if it’s a list of numbers, of characters, of data frames, even of models. All that matters is that the function that will be applied to these elements can operate on them. So if you have a list of fitted models, you can map summary() on this list to get summaries of each model. Or if you have a list of data frames, you can map a function that performs several cleaning steps. This will be explored in a future section, but it is important to keep this in mind. 8.3.3 Error handling with safely() and possibly() safely() and possibly() are very useful functions. Consider the following situation: a <- list("a", 4, 5) sqrt(a) Error in sqrt(a) : non-numeric argument to mathematical function Using map() or Map() will result in a similar error. safely() is a higher-order function that takes one function as an argument and executes it… safely, meaning the execution of the function will not stop if there is an error. The error message gets captured alongside valid results.
a <- list("a", 4, 5) safe_sqrt <- safely(sqrt) map(a, safe_sqrt) ## [[1]] ## [[1]]$result ## NULL ## ## [[1]]$error ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## ## [[2]] ## [[2]]$result ## [1] 2 ## ## [[2]]$error ## NULL ## ## ## [[3]] ## [[3]]$result ## [1] 2.236068 ## ## [[3]]$error ## NULL possibly() works similarly, but also allows you to specify a return value in case of an error: possible_sqrt <- possibly(sqrt, otherwise = NA_real_) map(a, possible_sqrt) ## [[1]] ## [1] NA ## ## [[2]] ## [1] 2 ## ## [[3]] ## [1] 2.236068 Of course, in this particular example, the same effect could be obtained way more easily: sqrt(as.numeric(a)) ## Warning: NAs introduced by coercion ## [1] NA 2.000000 2.236068 However, in some situations, this trick does not work as intended (or at all). possibly() and safely() allow the programmer to model errors explicitly, and to then provide a consistent way of dealing with them. For instance, consider the following example: data(mtcars) write.csv(mtcars, "my_data/mtcars.csv") Error in file(file, ifelse(append, "a", "w")) : cannot open the connection In addition: Warning message: In file(file, ifelse(append, "a", "w")) : cannot open file 'my_data/mtcars.csv': No such file or directory The folder my_data/ does not exist, and as such this code produces an error. You might want to catch this error, and create the directory for instance: possibly_write.csv <- possibly(write.csv, otherwise = NULL) if(is.null(possibly_write.csv(mtcars, "my_data/mtcars.csv"))) { print("Creating folder...") dir.create("my_data/") print("Saving file...") write.csv(mtcars, "my_data/mtcars.csv") } [1] "Creating folder..." [1] "Saving file..." Warning message: In file(file, ifelse(append, "a", "w")) : cannot open file 'my_data/mtcars.csv': No such file or directory The warning message comes from the first time we try to write the .csv, inside the if statement. Because this fails, we create the directory and then actually save the file. In the exercises, you’ll discover quietly(), which also captures warnings and messages. To conclude this section: remember function factories? Turns out that safely(), possibly() and quietly() are function factories.
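Since quietly() was just mentioned, here is a small taste of it (you will explore it properly in the exercises); a minimal sketch:

quiet_log <- quietly(log)
quiet_log(-1)

quietly() wraps a function so that, instead of being emitted, printed output, warnings and messages get captured in a list alongside the result: here the returned list contains $result (NaN), $output, $warnings (where the usual “NaNs produced” warning ends up) and $messages.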
8.3.4 Partial applications with partial() Consider the following simple function: add <- function(a, b) a+b It is possible to create a new function, where one of the parameters is fixed, for instance, where a = 10: add_to_10 <- partial(add, a = 10) add_to_10(12) ## [1] 22 This is equivalent to the following: add_to_10_2 <- function(b){ add(a = 10, b) } Using partial() is much less verbose, however, and allows you to define new functions very quickly: head10 <- partial(head, n = 10) head10(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 8.3.5 Function composition using compose Function composition is another handy tool, which makes chaining functions much more elegant: compose(sqrt, log10, exp)(10) ## [1] 2.083973 You can read this expression as sqrt() after log10() after exp(), and it is equivalent to: sqrt(log10(exp(10))) ## [1] 2.083973 It is also possible to reverse the order in which the functions get called by using the .dir option: compose(sqrt, log10, exp, .dir = "forward")(10) ## [1] 1.648721 One could also use the %>% operator to achieve the same result: 10 %>% sqrt %>% log10 %>% exp ## [1] 1.648721 but strictly speaking, this is not function composition. 8.3.6 «Transposing lists» Another interesting function is transpose(). It is not an alternative to the function t() from base R, but it has a similar effect. transpose() works on lists. Let’s take a look at the example from before: safe_sqrt <- safely(sqrt, otherwise = NA_real_) map(a, safe_sqrt) ## [[1]] ## [[1]]$result ## [1] NA ## ## [[1]]$error ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## ## [[2]] ## [[2]]$result ## [1] 2 ## ## [[2]]$error ## NULL ## ## ## [[3]] ## [[3]]$result ## [1] 2.236068 ## ## [[3]]$error ## NULL The output is a list where each element is itself a list containing a result and an error. One might want to have all the results in a single list, and all the error messages in another list. This is possible with transpose(): purrr::transpose(map(a, safe_sqrt)) ## $result ## $result[[1]] ## [1] NA ## ## $result[[2]] ## [1] 2 ## ## $result[[3]] ## [1] 2.236068 ## ## ## $error ## $error[[1]] ## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function> ## ## $error[[2]] ## NULL ## ## $error[[3]] ## NULL I explicitly call purrr::transpose() because there is also a data.table::transpose(), which is not the same function. You have to be careful about that sort of thing, because it can cause errors in your programs and debugging this type of error is a nightmare. Now that we are familiar with functional programming, let’s try to apply some of its principles to data manipulation.
8.4 List-based workflows for efficiency You can use your own functions in pipe workflows: double_number <- function(x){ x+x } mtcars %>% head() %>% mutate(double_mpg = double_number(mpg)) ## mpg cyl disp hp drat wt qsec vs am gear carb double_mpg ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 42.0 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 42.0 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 45.6 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 42.8 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 37.4 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 36.2 It is important to understand that your functions, functions that are built into R, and functions that come from packages, are exactly the same thing. Every function is a first-class object in R, no matter where it comes from. The consequence of functions being first-class objects is that functions can take functions as arguments, functions can return functions (the function factories from the previous chapter) and can be assigned to any variable: plop <- sqrt plop(4) ## [1] 2 bacon <- function(.f){ message("Bacon is tasty") .f } bacon(sqrt) # `bacon` is a function factory, as it returns a function (alongside an informative message) ## Bacon is tasty ## function (x) .Primitive("sqrt") # To actually call it: bacon(sqrt)(4) ## Bacon is tasty ## [1] 2 Now, let’s step back for a bit and think about what we learned up until now, and especially the map() family of functions. Let’s read the list of datasets from the previous chapter: paths <- Sys.glob("datasets/unemployment/*.csv") all_datasets <- import_list(paths) str(all_datasets) ## List of 4 ## $ unemp_2013:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ... ## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ... ## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ... ## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ... ## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ... ## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ... ## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ... ## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ... ## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ... ## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ... ## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ... ## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ... ## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 118 obs. of 8 variables: ## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ... ## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ... ## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ... ## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ... ## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ... ## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ... ## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" all_datasets is a list with 4 elements, each of which is a data.frame. The first thing we are going to do is use a function to clean the names of the datasets. These names are not very easy to work with; there are spaces, and it would be better if the names of the columns were all lowercase. For this we are going to use the function clean_names() from the janitor package. For a single dataset, I would write this: library(janitor) one_dataset <- one_dataset %>% clean_names() and I would get a dataset with column names in lowercase and spaces replaced by _ (and other corrections). How can I apply, or map, this function to each dataset in the list? To do this I need to use purrr::map(), which we’ve seen in the previous section: library(purrr) all_datasets <- all_datasets %>% map(clean_names) all_datasets %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ... ## ..$ of_which_wage_earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ... ## ..$ of_which_non_wage_earners : int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ... ## ..$ unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ... ## ..$ active_population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.95 5.67 6.27 2.88 4.92 ... ## ..$ year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of_which_wage_earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ... ## ..$ of_which_non_wage_earners : int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ... ## ..$ unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ... ## ..$ active_population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.81 5.54 6.46 2.2 4.2 ... ## ..$ year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ... ## ..$ of_which_wage_earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ... ## ..$ of_which_non_wage_earners : int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ... ## ..$ unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ... ## ..$ active_population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.46 5.12 5.62 3.23 4.73 ... ## ..$ year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 118 obs. of 8 variables: ## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ... ## ..$ total_employed_population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ... ## ..$ of_which_wage_earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ... ## ..$ of_which_non_wage_earners : int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ... ## ..$ unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ... ## ..$ active_population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ... ## ..$ unemployment_rate_in_percent: num [1:118] 7.15 5.04 4.84 3.01 4.3 ... ## ..$ year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ... ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" Remember that map(list, function) simply applies function to each element of list. So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this: one_dataset %>% filter(unemployment_rate_in_percent < 3) but since we’re dealing with a list of data sets, we cannot simply use filter() on it. This is because filter() expects a data frame, not a list of data frames. The way around this is to use map().
all_datasets %>% map(~filter(., unemployment_rate_in_percent < 3)) ## $unemp_2013 ## commune total_employed_population of_which_wage_earners ## 1 Garnich 844 750 ## 2 Leudelange 1064 937 ## 3 Bech 526 463 ## of_which_non_wage_earners unemployed active_population ## 1 94 25 869 ## 2 127 32 1096 ## 3 63 16 542 ## unemployment_rate_in_percent year ## 1 2.876870 2013 ## 2 2.919708 2013 ## 3 2.952030 2013 ## ## $unemp_2014 ## commune total_employed_population of_which_wage_earners ## 1 Garnich 845 757 ## 2 Leudelange 1102 965 ## 3 Bech 543 476 ## 4 Flaxweiler 879 789 ## of_which_non_wage_earners unemployed active_population ## 1 88 19 864 ## 2 137 34 1136 ## 3 67 15 558 ## 4 90 27 906 ## unemployment_rate_in_percent year ## 1 2.199074 2014 ## 2 2.992958 2014 ## 3 2.688172 2014 ## 4 2.980132 2014 ## ## $unemp_2015 ## commune total_employed_population of_which_wage_earners ## 1 Bech 520 450 ## 2 Bous 750 680 ## of_which_non_wage_earners unemployed active_population ## 1 70 14 534 ## 2 70 22 772 ## unemployment_rate_in_percent year ## 1 2.621723 2015 ## 2 2.849741 2015 ## ## $unemp_2016 ## commune total_employed_population of_which_wage_earners ## 1 Reckange-sur-Mess 980 850 ## 2 Bech 520 450 ## 3 Betzdorf 1500 1350 ## 4 Flaxweiler 910 820 ## of_which_non_wage_earners unemployed active_population ## 1 130 30 1010 ## 2 70 11 531 ## 3 150 45 1545 ## 4 90 24 934 ## unemployment_rate_in_percent year ## 1 2.970297 2016 ## 2 2.071563 2016 ## 3 2.912621 2016 ## 4 2.569593 2016 map() needs a function to map to each element of the list. all_datasets is the list to which I want to map the function. But what function? filter() is the function I need, so why doesn’t: all_datasets %>% map(filter(unemployment_rate_in_percent < 3)) work? This is what happens if we try it: Error in filter(unemployment_rate_in_percent < 3) : object 'unemployment_rate_in_percent' not found This is because filter() needs both the data set, and a so-called predicate (a predicate is an expression that evaluates to TRUE or FALSE). You need to be more explicit about which is the dataset and which is the predicate, because here, filter() thinks that the dataset is unemployment_rate_in_percent. The way to do this is to use an anonymous function (discussed in Chapter 7), which allows you to explicitly state which is the dataset, and which is the predicate. As we’ve seen, there are three ways to define anonymous functions: Using a formula (only works within {tidyverse} functions): all_datasets %>% map(~filter(., unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs.
of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" (notice the . in the formula, which makes explicit that the dataset is the first argument to filter()) or an anonymous function (using the function(x) keyword): all_datasets %>% map(function(x)filter(x, unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs.
of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" or, since R 4.1, using the shorthand \\(x): all_datasets %>% map(\\(x)filter(x, unemployment_rate_in_percent < 3)) %>% glimpse() ## List of 4 ## $ unemp_2013:'data.frame': 3 obs. of 8 variables: ## ..$ commune : chr [1:3] "Garnich" "Leudelange" "Bech" ## ..$ total_employed_population : int [1:3] 844 1064 526 ## ..$ of_which_wage_earners : int [1:3] 750 937 463 ## ..$ of_which_non_wage_earners : int [1:3] 94 127 63 ## ..$ unemployed : int [1:3] 25 32 16 ## ..$ active_population : int [1:3] 869 1096 542 ## ..$ unemployment_rate_in_percent: num [1:3] 2.88 2.92 2.95 ## ..$ year : int [1:3] 2013 2013 2013 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2013.csv" ## $ unemp_2014:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Garnich" "Leudelange" "Bech" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 845 1102 543 879 ## ..$ of_which_wage_earners : int [1:4] 757 965 476 789 ## ..$ of_which_non_wage_earners : int [1:4] 88 137 67 90 ## ..$ unemployed : int [1:4] 19 34 15 27 ## ..$ active_population : int [1:4] 864 1136 558 906 ## ..$ unemployment_rate_in_percent: num [1:4] 2.2 2.99 2.69 2.98 ## ..$ year : int [1:4] 2014 2014 2014 2014 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2014.csv" ## $ unemp_2015:'data.frame': 2 obs. of 8 variables: ## ..$ commune : chr [1:2] "Bech" "Bous" ## ..$ total_employed_population : int [1:2] 520 750 ## ..$ of_which_wage_earners : int [1:2] 450 680 ## ..$ of_which_non_wage_earners : int [1:2] 70 70 ## ..$ unemployed : int [1:2] 14 22 ## ..$ active_population : int [1:2] 534 772 ## ..$ unemployment_rate_in_percent: num [1:2] 2.62 2.85 ## ..$ year : int [1:2] 2015 2015 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2015.csv" ## $ unemp_2016:'data.frame': 4 obs. of 8 variables: ## ..$ commune : chr [1:4] "Reckange-sur-Mess" "Bech" "Betzdorf" "Flaxweiler" ## ..$ total_employed_population : int [1:4] 980 520 1500 910 ## ..$ of_which_wage_earners : int [1:4] 850 450 1350 820 ## ..$ of_which_non_wage_earners : int [1:4] 130 70 150 90 ## ..$ unemployed : int [1:4] 30 11 45 24 ## ..$ active_population : int [1:4] 1010 531 1545 934 ## ..$ unemployment_rate_in_percent: num [1:4] 2.97 2.07 2.91 2.57 ## ..$ year : int [1:4] 2016 2016 2016 2016 ## ..- attr(*, "filename")= chr "datasets/unemployment/unemp_2016.csv" As you see, everything is starting to come together: lists, to hold complex objects, over which anonymous functions are mapped using higher-order functions. Let’s continue cleaning this dataset. Before merging these datasets together, we would need them to have a year column indicating the year the data was measured in each data frame. It would also be helpful if we gave names to these datasets, meaning converting the list to a named list.
For this task, we can use purrr::set_names(): all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016))) Let’s take a look at the list now: str(all_datasets) As you can see, each data.frame object contained in the list has been renamed. You can thus access them with the $ operator. Using map() we now know how to apply a function to each dataset of a list. But maybe it would be easier to merge all the datasets first, and then manipulate them? This can be the case sometimes, but not always. As long as you provide a function and a list of elements to reduce(), you will get a single output. So how could reduce() help us with merging all the datasets that are in the list? dplyr comes with a lot of functions to merge two datasets. Remember that I said before that reduce() allows you to generalize a function of two arguments? Let’s try it with our list of datasets: unemp_lux <- reduce(all_datasets, full_join) ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year") ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year") ## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year") glimpse(unemp_lux) ## Rows: 472 ## Columns: 8 ## $ commune <chr> "Grand-Duche de Luxembourg", "Canton Cape… ## $ total_employed_population <int> 223407, 17802, 1703, 844, 1431, 4094, 214… ## $ of_which_wage_earners <int> 203535, 15993, 1535, 750, 1315, 3800, 187… ## $ of_which_non_wage_earners <int> 19872, 1809, 168, 94, 116, 294, 272, 113,… ## $ unemployed <int> 19287, 1071, 114, 25, 74, 261, 98, 45, 66… ## $ active_population <int> 242694, 18873, 1817, 869, 1505, 4355, 224… ## $ unemployment_rate_in_percent <dbl> 7.947044, 5.674773, 6.274078, 2.876870, 4… ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013,… full_join() is one of the dplyr functions that merge data. There are others that might be useful depending on the kind of join operation you need. Let’s write this data to disk as we’re going to keep using it for the next chapters: export(unemp_lux, "datasets/unemp_lux.csv") 8.4.1 Functional programming and plotting In this section, we are going to learn how to use the possibilities offered by the purrr package and how it can work together with ggplot2 to generate many plots. This is a more advanced topic, but what comes next is also what makes R, and the functional programming paradigm, so powerful. For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot per commune: unemp_lux_data %>% filter(division == "Luxembourg") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") + geom_line() and then you would write the same for “Esch-sur-Alzette” and also for “Wiltz”.
If you only have to make these 3 plots, copy and pasting the above lines is no big deal: unemp_lux_data %>% filter(division == "Esch-sur-Alzette") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") + geom_line() unemp_lux_data %>% filter(division == "Wiltz") %>% ggplot(aes(year, unemployment_rate_in_percent, group = division)) + theme_minimal() + labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") + geom_line() But copy and pasting is error prone. Can you spot the copy-paste mistake I made? And what if you have to create the above plots for all 108 Luxembourgish communes? That’s a lot of copy pasting. What if, once you are done copy pasting, you have to change something, for example, the theme? You could use the search and replace function of RStudio, true, but sometimes search and replace can also introduce bugs and typos. You can avoid all these issues by using purrr::map(). What do you need to map over? The commune names. So let’s create a vector of commune names: communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz") Now we can create the graphs using map(), or map2() to be exact: plots_tibble <- unemp_lux_data %>% filter(division %in% communes) %>% group_by(division) %>% nest() %>% mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) + theme_minimal() + geom_line(aes(year, unemployment_rate_in_percent, group = 1)) + labs(title = paste("Unemployment in", .y)))) Let’s study this line by line: the first line is easy, we simply use filter() to keep only the communes we are interested in. Then we group by division and use tidyr::nest(). As a refresher, let’s take a look at what this does: unemp_lux_data %>% filter(division %in% communes) %>% group_by(division) %>% nest() ## # A tibble: 3 × 2 ## # Groups: division [3] ## division data ## <chr> <list> ## 1 Esch-sur-Alzette <tibble [15 × 7]> ## 2 Luxembourg <tibble [15 × 7]> ## 3 Wiltz <tibble [15 × 7]> This creates a tibble with two columns, division and data, where each individual (or commune in this case) is another tibble with all the original variables. This is very useful, because now we can pass these tibbles to map2(), to generate the plots. But why map2() and what’s the difference with map()? map2() works the same way as map(), but maps over two inputs: numbers1 <- list(1, 2, 3, 4, 5) numbers2 <- list(9, 8, 7, 6, 5) map2(numbers1, numbers2, `*`) ## [[1]] ## [1] 9 ## ## [[2]] ## [1] 16 ## ## [[3]] ## [1] 21 ## ## [[4]] ## [1] 24 ## ## [[5]] ## [1] 25 In our example with the graphs, the two inputs are the data, and the names of the communes. This is useful to create the title with labs(title = paste("Unemployment in", .y)), where .y is the second input of map2(), the commune names contained in variable division. So what happened? We now have a tibble called plots_tibble that looks like this: print(plots_tibble) ## # A tibble: 3 × 3 ## # Groups: division [3] ## division data plot ## <chr> <list> <list> ## 1 Esch-sur-Alzette <tibble [15 × 7]> <gg> ## 2 Luxembourg <tibble [15 × 7]> <gg> ## 3 Wiltz <tibble [15 × 7]> <gg> This tibble contains three columns, division, data and now a new one called plot, which we created before using the last line mutate(plot = ...) (remember that mutate() adds columns to tibbles). plot is a list-column, with elements… being plots! Yes you read that right, the elements of the column plot are literally plots. This is what I meant by list-columns.
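If list-columns still feel abstract, here is a minimal sketch of one, detached from our plotting example (the names here are made up for illustration):

library(tibble)
minimal <- tibble(id = 1:2, stuff = list(1:3, c("a", "b")))

The column stuff is a list-column: its first cell holds an integer vector, its second cell a character vector. Any R object can live inside a cell, which is why a column whose cells are ggplot objects is perfectly legal.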
Let’s see what is inside the data and the plot columns exactly: plots_tibble %>% pull(data) ## [[1]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 11.3 665 10.1 10.8 561 4.95 ## 2 2002 11.7 677 10.3 11.0 696 5.96 ## 3 2003 11.7 674 10.2 10.9 813 6.94 ## 4 2004 12.2 659 10.6 11.3 899 7.38 ## 5 2005 11.9 654 10.3 11.0 952 7.97 ## 6 2006 12.2 657 10.5 11.2 1.07 8.71 ## 7 2007 12.6 634 10.9 11.5 1.03 8.21 ## 8 2008 12.9 638 11.0 11.6 1.28 9.92 ## 9 2009 13.2 652 11.0 11.7 1.58 11.9 ## 10 2010 13.6 638 11.2 11.8 1.73 12.8 ## 11 2011 13.9 630 11.5 12.1 1.77 12.8 ## 12 2012 14.3 684 11.8 12.5 1.83 12.8 ## 13 2013 14.8 694 12.0 12.7 2.05 13.9 ## 14 2014 15.2 703 12.5 13.2 2.00 13.2 ## 15 2015 15.3 710 12.6 13.3 2.03 13.2 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent ## ## [[2]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 34.4 2.89 30.4 33.2 1.14 3.32 ## 2 2002 34.8 2.94 30.3 33.2 1.56 4.5 ## 3 2003 35.2 3.03 30.1 33.2 2.04 5.78 ## 4 2004 35.6 3.06 30.1 33.2 2.39 6.73 ## 5 2005 35.6 3.13 29.8 33.0 2.64 7.42 ## 6 2006 35.5 3.12 30.3 33.4 2.03 5.72 ## 7 2007 36.1 3.25 31.1 34.4 1.76 4.87 ## 8 2008 37.5 3.39 31.9 35.3 2.23 5.95 ## 9 2009 37.9 3.49 31.6 35.1 2.85 7.51 ## 10 2010 38.6 3.54 32.1 35.7 2.96 7.66 ## 11 2011 40.3 3.66 33.6 37.2 3.11 7.72 ## 12 2012 41.8 3.81 34.6 38.4 3.37 8.07 ## 13 2013 43.4 3.98 35.5 39.5 3.86 8.89 ## 14 2014 44.6 4.11 36.7 40.8 3.84 8.6 ## 15 2015 45.2 4.14 37.5 41.6 3.57 7.9 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent ## ## [[3]] ## # A tibble: 15 × 7 ## year active_population of_which_non_wage_e…¹ of_wh…² total…³ unemp…⁴ unemp…⁵ ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2001 2.13 223 1.79 2.01 122 5.73 ## 2 2002 2.14 220 1.78 2.00 134 6.27 ## 3 2003 2.18 223 1.79 2.02 163 7.48 ## 4 2004 2.24 227 1.85 2.08 156 6.97 ## 5 2005 2.26 229 1.85 2.08 187 8.26 ## 6 2006 2.20 206 1.82 2.02 181 8.22 ## 7 2007 2.27 198 1.88 2.08 197 8.67 ## 8 2008 2.30 200 1.90 2.10 201 8.75 ## 9 2009 2.36 201 1.94 2.15 216 9.14 ## 10 2010 2.42 195 1.97 2.17 256 10.6 ## 11 2011 2.48 190 2.02 2.21 269 10.9 ## 12 2012 2.59 188 2.10 2.29 301 11.6 ## 13 2013 2.66 195 2.15 2.34 318 12.0 ## 14 2014 2.69 185 2.19 2.38 315 11.7 ## 15 2015 2.77 180 2.27 2.45 321 11.6 ## # … with abbreviated variable names ¹​of_which_non_wage_earners, ## # ²​of_which_wage_earners, ³​total_employed_population, ⁴​unemployed, ## # ⁵​unemployment_rate_in_percent each element of data is a tibble for the specific commune with columns year, active_population, etc., the original columns. But obviously, there is no division column. So to plot the data, and join all the dots together, we need to add group = 1 in the call to ggplot() (whereas if you plot multiple lines in the same graph, you need to write group = division). But more interestingly, how can you actually see the plots?
If you want to simply look at them, it is enough to use pull(): plots_tibble %>% pull(plot) ## [[1]] ## ## [[2]] ## ## [[3]] And if we want to save these plots, we can do so using map2(): map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave) Saving 7 x 5 in image Saving 6.01 x 3.94 in image Saving 6.01 x 3.94 in image This was probably the most advanced topic we have studied yet; but you probably agree with me that it is among the most useful ones. This section is a perfect illustration of the power of functional programming; you can mix and match functions as long as you give them the correct arguments. You can pass data to functions that use data and then pass these functions to other functions that use functions as arguments, such as map().7 map() does not care if the functions you pass to it produce tables, graphs or even another function. map() will simply map this function to a list of inputs, and as long as these inputs are correct arguments to the function, map() will do its magic. If you combine this with list-columns, you can even use map() alongside dplyr functions and map your function by first grouping, filtering, etc… 8.4.2 Modeling with functional programming As written just above, map() simply applies a function to a list of inputs, and in the previous section we mapped ggplot() to generate many plots at once. This approach can also be used to map any modeling function, for instance lm(), to a list of datasets. For instance, suppose that you wish to perform a Monte Carlo simulation. Suppose that you are dealing with a binary choice problem; usually, you would use a logistic regression for this. However, in certain disciplines, especially in the social sciences, the so-called Linear Probability Model is often used as well. The LPM is a simple linear regression, but unlike the standard setting of a linear regression, the dependent variable, or target, is a binary variable, and not a continuous variable. Before you yell “Wait, that’s illegal”, you should know that in practice LPMs do a good job of estimating marginal effects, which is what social scientists and econometricians are often interested in. Marginal effects are another way of interpreting models, telling you how the outcome (or the target) changes given a change in an independent variable (or a feature). For instance, a marginal effect of 0.10 for age would mean that the probability of success would increase by 10 percentage points for each added year of age. We already discussed marginal effects in Chapter 6. There has been a lot of discussion on logistic regression vs LPMs, and there are pros and cons of using LPMs. Micro-econometricians are still fond of LPMs, even though the arguments in favour of LPMs are not really convincing. However, quoting Angrist and Pischke: “While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little” (source: Mostly Harmless Econometrics) so LPMs are still used for estimating marginal effects. Let us check this assessment with one example. First, we simulate some data, then run a logistic regression and compute the marginal effects, and then compare with a LPM: set.seed(1234) x1 <- rnorm(100) x2 <- rnorm(100) z <- .5 + 2*x1 + 4*x2 p <- 1/(1 + exp(-z)) y <- rbinom(100, 1, p) df <- tibble(y = y, x1 = x1, x2 = x2) This data generating process generates data from a binary choice model.
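Before fitting anything, a short reminder of the standard result we will rely on below: in the logit model, the probability of success is \\(P(y = 1 | x) = \\Lambda(x'\\beta) = 1/(1 + e^{-x'\\beta})\\), and so the marginal effect of a regressor \\(x_j\\) is \\[\\frac{\\partial P(y = 1 | x)}{\\partial x_j} = \\lambda(x'\\beta)\\beta_j\\] where \\(\\lambda\\) is the logistic density, available in R as dlogis(). Averaging this quantity over the observations yields the average marginal effect, which is essentially what the meffects() function defined below computes.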
Fitting the model using a logistic regression allows us to recover the structural parameters: logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit")) Let’s see a summary of the model fit: summary(logistic_regression) ## ## Call: ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.91941 -0.44872 0.00038 0.42843 2.55426 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.0960 0.3293 0.292 0.770630 ## x1 1.6625 0.4628 3.592 0.000328 *** ## x2 3.6582 0.8059 4.539 5.64e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 138.629 on 99 degrees of freedom ## Residual deviance: 60.576 on 97 degrees of freedom ## AIC: 66.576 ## ## Number of Fisher Scoring iterations: 7 We do recover the parameters that generated the data, but what about the marginal effects? We can get the marginal effects easily using the {margins} package: library(margins) margins(logistic_regression) ## Average marginal effects ## glm(formula = y ~ ., family = binomial(link = "logit"), data = df) ## x1 x2 ## 0.1598 0.3516 Or, even better, we can compute the true marginal effects, since we know the data generating process: meffects <- function(dataset, coefs){ X <- dataset %>% select(-y) %>% as.matrix() dydx_x1 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[2]) dydx_x2 <- mean(dlogis(X%*%c(coefs[2], coefs[3]))*coefs[3]) tribble(~term, ~true_effect, "x1", dydx_x1, "x2", dydx_x2) } (true_meffects <- meffects(df, c(0.5, 2, 4))) ## # A tibble: 2 × 2 ## term true_effect ## <chr> <dbl> ## 1 x1 0.175 ## 2 x2 0.350 Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects? lpm <- lm(y ~ ., data = df) summary(lpm) ## ## Call: ## lm(formula = y ~ ., data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.83953 -0.31588 -0.02885 0.28774 0.77407 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.51340 0.03587 14.314 < 2e-16 *** ## x1 0.16771 0.03545 4.732 7.58e-06 *** ## x2 0.31250 0.03449 9.060 1.43e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3541 on 97 degrees of freedom ## Multiple R-squared: 0.5135, Adjusted R-squared: 0.5034 ## F-statistic: 51.18 on 2 and 97 DF, p-value: 6.693e-16 It’s not too bad, but perhaps with more observations, or for a different set of structural parameters, the results of the LPM would have been closer. The LPM estimates the marginal effect of x1 to be 0.1677134 vs 0.1597956 for the logistic regression and for x2, the LPM estimation is 0.3124966 vs 0.351607. The true marginal effects are 0.1750963 and 0.3501926 for x1 and x2 respectively. Just as data scientists perform cross-validation to assess the accuracy of a model, a Monte Carlo study can be performed to assess how close the estimation of the marginal effects using a LPM is to the marginal effects derived from a logistic regression. This will allow us to test datasets of different sizes, generated using different structural parameters. First, let’s write a function that generates data.
The function below generates 10 datasets of size 100 (the code is inspired by this StackExchange answer): generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10){ generate_one_dataset <- function(coefs, sample_size){ x1 <- rnorm(sample_size) x2 <- rnorm(sample_size) z <- coefs[1] + coefs[2]*x1 + coefs[3]*x2 p <- 1/(1 + exp(-z)) y <- rbinom(sample_size, 1, p) df <- tibble(y = y, x1 = x1, x2 = x2) } simulations <- rerun(.n = repeats, generate_one_dataset(coefs, sample_size)) tibble("coefs" = list(coefs), "sample_size" = sample_size, "repeats" = repeats, "simulations" = list(simulations)) } Let’s first generate one dataset: one_dataset <- generate_datasets(repeats = 1) Let’s take a look at one_dataset: one_dataset ## # A tibble: 1 × 4 ## coefs sample_size repeats simulations ## <list> <dbl> <dbl> <list> ## 1 <dbl [3]> 100 1 <list [1]> As you can see, the tibble with the simulated data is inside a list-column called simulations. Let’s take a closer look: str(one_dataset$simulations) ## List of 1 ## $ :List of 1 ## ..$ : tibble [100 × 3] (S3: tbl_df/tbl/data.frame) ## .. ..$ y : int [1:100] 0 1 1 1 0 1 1 0 0 1 ... ## .. ..$ x1: num [1:100] 0.437 1.06 0.452 0.663 -1.136 ... ## .. ..$ x2: num [1:100] -2.316 0.562 -0.784 -0.226 -1.587 ... The structure is quite complex, and it’s important to understand this, because it will have an impact on the next lines of code; it is a list, containing a list, containing a dataset! No worries though, we can still map over the datasets directly, by using modify_depth() instead of map(). Now, let’s fit a LPM and compare the estimation of the marginal effects with the true marginal effects. In order to have some confidence in our results, we will not simply run a linear regression on that single dataset, but will instead simulate hundreds, then thousands, and tens of thousands of datasets, get the marginal effects and compare them to the true ones (but here I won’t simulate more than 500 datasets). Let’s first generate 10 datasets: many_datasets <- generate_datasets() Now comes the tricky part. I have this object, many_datasets, which looks like this: many_datasets ## # A tibble: 1 × 4 ## coefs sample_size repeats simulations ## <list> <dbl> <dbl> <list> ## 1 <dbl [3]> 100 10 <list [10]> I would like to fit LPMs to the 10 datasets. For this, I will need to use all the power of functional programming and the {tidyverse}. I will be adding columns to this data frame using mutate() and mapping over the simulations list-column using modify_depth(). The list of data frames is at the second level (remember, it’s a list containing a list containing data frames). I’ll start by fitting the LPMs, then using broom::tidy() I will get a nice data frame of the estimated parameters. I will then only select what I need, and then bind the rows of all the data frames. I will do the same for the true marginal effects. I highly suggest that you run the following lines, one after another. It is complicated to understand what’s going on if you are not used to such workflows. However, I hope to convince you that once it clicks, it’ll be much more intuitive than doing all this inside a loop.
Here’s the code: results <- many_datasets %>% mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% mutate(lpm = map(lpm, bind_rows)) %>% mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% mutate(true_effect = map(true_effect, bind_rows)) This is what results looks like: results ## # A tibble: 1 × 6 ## coefs sample_size repeats simulations lpm true_effect ## <list> <dbl> <dbl> <list> <list> <list> ## 1 <dbl [3]> 100 10 <list [10]> <tibble [20 × 2]> <tibble [20 × 2]> Let’s take a closer look at the lpm and true_effect columns: results$lpm ## [[1]] ## # A tibble: 20 × 2 ## term estimate ## <chr> <dbl> ## 1 x1 0.228 ## 2 x2 0.353 ## 3 x1 0.180 ## 4 x2 0.361 ## 5 x1 0.165 ## 6 x2 0.374 ## 7 x1 0.182 ## 8 x2 0.358 ## 9 x1 0.125 ## 10 x2 0.345 ## 11 x1 0.171 ## 12 x2 0.331 ## 13 x1 0.122 ## 14 x2 0.309 ## 15 x1 0.129 ## 16 x2 0.332 ## 17 x1 0.102 ## 18 x2 0.374 ## 19 x1 0.176 ## 20 x2 0.410 results$true_effect ## [[1]] ## # A tibble: 20 × 2 ## term true_effect ## <chr> <dbl> ## 1 x1 0.183 ## 2 x2 0.366 ## 3 x1 0.166 ## 4 x2 0.331 ## 5 x1 0.174 ## 6 x2 0.348 ## 7 x1 0.169 ## 8 x2 0.339 ## 9 x1 0.167 ## 10 x2 0.335 ## 11 x1 0.173 ## 12 x2 0.345 ## 13 x1 0.157 ## 14 x2 0.314 ## 15 x1 0.170 ## 16 x2 0.340 ## 17 x1 0.182 ## 18 x2 0.365 ## 19 x1 0.161 ## 20 x2 0.321 Let’s join the estimated and true effects, and compute the difference between them: simulation_results <- results %>% mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% mutate(difference = map(difference, ~select(., term, difference))) %>% pull(difference) %>% .[[1]] ## Joining, by = "term" Let’s take a look at the simulation results: simulation_results %>% group_by(term) %>% summarise(mean = mean(difference), sd = sd(difference)) ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.0122 0.0368 ## 2 x2 -0.0141 0.0311 Already with only 10 simulated datasets, the difference in means is not significant. Let’s rerun the analysis, but for different sizes.
In order to make things easier, we can put all the code into a nifty function: monte_carlo <- function(coefs, sample_size, repeats){ many_datasets <- generate_datasets(coefs, sample_size, repeats) results <- many_datasets %>% mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% mutate(lpm = map(lpm, bind_rows)) %>% mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% mutate(true_effect = map(true_effect, bind_rows)) simulation_results <- results %>% mutate(difference = map2(.x = lpm, .y = true_effect, full_join)) %>% mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% mutate(difference = map(difference, ~select(., term, difference))) %>% pull(difference) %>% .[[1]] simulation_results %>% group_by(term) %>% summarise(mean = mean(difference), sd = sd(difference)) } And now, let’s run the simulation for different parameters and sizes: monte_carlo(c(.5, 2, 4), 100, 10) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00826 0.0318 ## 2 x2 -0.00732 0.0421 monte_carlo(c(.5, 2, 4), 100, 100) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.00360 0.0408 ## 2 x2 0.00517 0.0459 monte_carlo(c(.5, 2, 4), 100, 500) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00152 0.0388 ## 2 x2 -0.000701 0.0462 monte_carlo(c(pi, 6, 9), 100, 10) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 -0.00829 0.0421 ## 2 x2 0.00178 0.0397 monte_carlo(c(pi, 6, 9), 100, 100) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.0107 0.0576 ## 2 x2 0.00831 0.0772 monte_carlo(c(pi, 6, 9), 100, 500) ## Joining, by = "term" ## # A tibble: 2 × 3 ## term mean sd ## <chr> <dbl> <dbl> ## 1 x1 0.00879 0.0518 ## 2 x2 0.0113 0.0687 We see that, at least for this set of parameters, the LPM does a good job of estimating marginal effects. Now, this study might in itself not be very interesting to you, but I believe the general approach is quite useful and flexible enough to be adapted to all kinds of use-cases. 8.5 Exercises Exercise 1 Suppose you have an Excel workbook that contains data on three sheets. Create a function that reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one sheet (download the example Excel workbook, example_workbook.xlsx, from the assets folder on the book’s Github). Exercise 2 Use one of the map() functions to combine two lists into one.
Exercise 2

Use one of the map() functions to combine two lists into one. Consider the following two lists:

```r
mediterranean <- list("starters" = list("humous", "lasagna"),
                      "dishes" = list("sardines", "olives"))

continental <- list("starters" = list("pea soup", "terrine"),
                    "dishes" = list("frikadelle", "sauerkraut"))
```

The result we'd like to have would look like this:

```
$starters
$starters[[1]]
[1] "humous"

$starters[[2]]
[1] "lasagna"

$starters[[3]]
[1] "pea soup"

$starters[[4]]
[1] "terrine"


$dishes
$dishes[[1]]
[1] "sardines"

$dishes[[2]]
[1] "olives"

$dishes[[3]]
[1] "frikadelle"

$dishes[[4]]
[1] "sauerkraut"
```

Functions that have other functions as input are called higher order functions↩︎
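For Exercise 2, one way to get there (a sketch, not necessarily the intended solution) is map2(), which, with {purrr} loaded, walks over both lists in parallel and concatenates the elements that share a position:

```r
# One possible solution: combine the two menus element-wise
map2(mediterranean, continental, c)
```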
Chapter 9 Package development

9.1 Why you need to write your own package

One of the reasons you might have tried R in the first place is the abundance of packages. As I'm writing these lines (in November 2020) 16523 packages are available on CRAN (in August 2019, there were 14762, and in August 2016, when I first wrote the number of packages down for my first ebook, it was 8922 packages). This is a staggering amount of packages, and to help you look for the right ones, you can check out CRAN Task Views.

You might wonder why the heck you should write your own packages. After all, with so many packages you're sure to find something that suits your needs, right? Well, it depends. Of course, you will not need to write your own function to perform non-linear regression, or to train a neural network. But as time goes by, you will start writing your own functions, functions that fit your needs, and that you use daily. It may be functions that prepare and shape data that you use at work for analysis. Or maybe you want to deliver an analysis to a client, with data and source code, so you decide to deliver a package that contains everything (something I've already done in the past). Maybe you want to develop a Shiny application using the {golem} framework, which allows you to build apps as packages.

Ok, but is it necessary to write a package? Why not just write functions inside some scripts and then simply run or share these scripts (and in the case of Shiny, you don't have to use {golem})? This seems like a valid solution at first. However, it quickly becomes tedious, especially if you have multiple scripts scattered around your computer or inside different subfolders. You'll also have to write the documentation in separate files, and these can easily get lost or become outdated. Relying on scripts does not scale well; even if you are not sharing your code outside of your computer (maybe you're working on super secret projects at NASA), you always have to think about future you. And in general, future you thinks that past you is an asshole, exactly because you put 0 effort into documenting, testing and making your code easy to use. Having everything inside a package takes care of these headaches for you, and will make future you proud of past you. And if you have to share your code, or deliver to a client, believe me, it will make things a thousand times easier.

Code that is inside packages is very easy to document and test, especially if you're using Rstudio. It also makes it possible to use the wonderful {covr} package, which tells you which lines in which functions are called by your tests. If some lines are missing, write tests that invoke them and increase the coverage of your tests! Documenting and testing your code is very important; it gives you assurance that the code you're writing works, but most importantly, it gives others assurance that what you wrote works. And I include future you among these others too.

In order to share this package with these others we are going to use Git. If you're familiar with Git, great, you'll be able to skip some sections. If not, then buckle up, you're in for a wild ride.

As I mentioned in the introduction, if you want to learn much more than I'll show about packages, read Wickham (2015). I will only show you the basics, but it should be enough to get you productive.

9.2 Starting easy: creating a package to share data

We will start a package from scratch, in order to share data with the world. For this, we are first going to scrape a table off Wikipedia, prepare the data and then include it in a package. To make distributing this package easy, we're going to put it up on Github, so you'll need a Github account. Let's start by creating one.

9.2.1 Setting up a Github account

Setting up a Github account is very easy; just go over to https://github.com/ and simply sign up! Then you will need to generate an ssh key on your computer. This is a way for you to securely interact with your Github account, and push your code to the repository without having to always type your password. I will assume you have never created any ssh keys before; if you already did, you can skip these steps. I will also assume that you are on a GNU+Linux or macOS system; if you're using Windows, the instructions are very similar, but you'll first need to install Git, available here. Git is available by default on any GNU+Linux system, and as far as I know also on macOS, but I might be wrong and you might need to install it there too (in which case the instructions are the same whether you're using GNU+Linux or macOS). If you have trouble installing git, read the following section from the Pro Git book.

Then, open a terminal (or the git command line on Windows) and type the following:

```
ssh-keygen
```

This command will generate several files in the .ssh directory inside your HOME directory. Look for the file that ends with the .pub extension, and copy its contents. You will need to paste these contents on Github. So now sign in to Github; once you are signed in, go to settings and then SSH and GPG keys. In the screenshot above, you see my ssh key associated with my account; this will be empty for you. Click on the top right, New SSH key. Give your key a name, and paste the key you generated before. You're done!

You can now configure git a bit more by telling it who you are. Open a terminal, adapt and type the following commands:

```
git config --global user.name "Harold Zurcher"
git config --global user.email harold.zurcher@madisonbus.com
```

You're ready to go! You can now push code to Github to share it with the world. Or, if you do not want to share your package (for confidentiality reasons, for instance), you can still benefit from using git, as it is possible to have an internal git server managed by your company's IT team. There is also the possibility to set up corporate, and thus private, git servers by buying the service from Github, or from other providers such as Gitlab.

9.2.2 Starting your package

To start writing a package, the easiest way is to load up Rstudio and start a new project, under the File menu.
If you're starting from scratch, just choose the first option, New Directory, and then R package. Give a name to your package, for example arcade (you'll see why in a bit), and you can also choose to use git for version control. Now if you check the folder where you chose to save your package, you will see a folder with the same name as your package, and inside this folder a lot of new files and other folders. The most important folder for now is the R folder. This is the folder that will hold your .R source code files. You can also see these files and folders inside the Files panel from within Rstudio. Rstudio will also have hello.R opened, which is a single demo source file inside the R folder. You can get rid of this file, or keep it and edit it. I would advise you to keep it and even distribute it inside your package; you can save it in a special directory called data-raw. You don't need to manually create this folder now, we will do so in a bit. For now, just follow along.

Now, to start working on your package, the best way is to use a package called {usethis}. {usethis} is a package that makes writing packages very easy; it includes functions that create the required subfolders and necessary template files, so that you do not need to constantly check where file so-and-so should be placed or how it should be named. Let's start by adding a readme file. This is easily achieved by using the following function from {usethis}:

```r
usethis::use_readme_md()
```

This creates a template README.md file in the root directory of your package. You can now edit this file accordingly, and that's it.

The next step could be setting up your package to work with {roxygen2}, which will help write the documentation of your package:

```r
usethis::use_roxygen_md()
```

The output tells you to run devtools::document(); we will do this later.

Since you have learned about the tidyverse by reading this book, I am willing to bet that you will want to use the %>% operator inside the functions contained in your package. To do this without issues, which will become apparent later, use the following command:

```r
usethis::use_pipe()
```

This will make the %>% operator available internally to your package's functions, but also to the user that will load the package.

We are almost done setting up the package. If you plan on distributing data with your package, you might want to also share the code that prepared the data. For instance, if you receive the data from your finance department, but this data needs some cleaning before being useful, you could write a script to do so and then distribute this script with the package too, for reproducibility purposes. These scripts, while not central to the package, could still be of interest to the users. The directory to place them in is called data-raw:

```r
usethis::use_data_raw()
```

One final folder is inst. You can add files to this folder, and they will be available to the users that install the package. Users can find the files in the folder where packages get installed. On GNU+Linux systems, that would be somewhere like: /home/user/R/amd64-linux-gnu-library/3.6. There, you will find the installation folders of all the packages. If the package you make is called {spam}, you will find the files you put inside the inst folder at the root of the installation folder of spam. You can simply create the inst folder yourself, or use the following command:

```r
usethis::use_directory("inst")
```
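As an aside, users do not have to hunt for the installation folder by hand: base R's system.file() returns the full path to a file shipped in inst/. A small illustration, where the file name is purely hypothetical:

```r
# Illustration: retrieve the path to a file you shipped in inst/
# ("highscores.csv" is a made-up example file name)
system.file("highscores.csv", package = "arcade")
```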
Finally, the last step is to give your package a license; this is only useful if you plan on distributing it to the world. If you are writing your own package for yourself, or for purposes internal to your company, this is probably superfluous. I won't discuss the particularities of licenses, so let's just say that for the sake of this example package we are writing, we are going to use the MIT license:

```r
usethis::use_mit_license()
```

This again creates the right file at the right spot. There are other interesting functions inside the {usethis} package, and we will come back to it later.

9.3 Including data inside the package

Many packages include data, and we are going to learn how to do it. I'll assume that we already have a dataset on hand that we have to share. This is quite simple to do: first, let's simply load the data:

```r
arcade <- readr::read_csv("~/path/to/data/arcade.csv")
```

and then, once again, {usethis} comes to our rescue:

```r
usethis::use_data(arcade, compress = "xz")
```

and that's it! Well, almost. We still need to write a little script that will allow users of your package to load the data. This script is simply called data.R and contains the following lines:

```r
#' List of highest-grossing games
#'
#' Source: https://en.wikipedia.org/wiki/Arcade_game#List_of_highest-grossing_games
#'
#' @format A data frame with 6 variables: \code{game}, \code{release_year},
#' \code{hardware_units_sold}, \code{comment_hardware}, \code{estimated_gross_revenue},
#' \code{comment_revenue}
#' \describe{
#' \item{game}{The name of the game}
#' \item{release_year}{The year the game was released}
#' \item{hardware_units_sold}{The amount of hardware units sold}
#' \item{comment_hardware}{Comment accompanying the amount of hardware units sold}
#' \item{estimated_gross_revenue}{Estimated gross revenue in US$ with 2019 inflation}
#' \item{comment_revenue}{Comment accompanying the estimated gross revenue}
#' }
"arcade"
```

Basically, this is a description of the data, and the name with which the user will invoke the data.

To conclude this part: remember the data-raw folder? If you used a script to scrape/get the data from somewhere, or if you had to write code to prepare the data to make it fit for sharing, this is where you can put that script. I have written such a script; I will discuss it in the next chapter, where I'll show you how to scrape data from the internet. You can also save the file where you wrote all your calls to {usethis} functions, if you want.
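Just to fix ideas, here is a sketch of what the user side would look like once the package is installed (assuming the steps above went through):

```r
library(arcade)

data("arcade")  # with LazyData: true, simply typing `arcade` should also work
head(arcade)
```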
9.4 Adding functions to your package

Functions will be added inside the R folder of your package. In there, you will find the hello.R file. You can edit this file if you kept it, or you can create a new script. A script can hold one function, or several functions. Let's start with the simplest case: one function inside one script.

9.4.1 One function inside one script

Create a new R script, or edit the hello.R file, and add in the following code:

```r
#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame
#' @return A data frame with descriptive statistics. If you are only interested in certain columns
#' you can add these columns.
#' @import dplyr
#' @importFrom tidyr gather
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' describe(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){

  if (nargs() > 1) df <- select(df, ...)

  df %>%
    select_if(is.numeric) %>%
    gather(variable, value) %>%
    group_by(variable) %>%
    summarise_all(list(mean = ~mean(., na.rm = TRUE),
                       sd = ~sd(., na.rm = TRUE),
                       nobs = ~length(.),
                       min = ~min(., na.rm = TRUE),
                       max = ~max(., na.rm = TRUE),
                       q05 = ~quantile(., 0.05, na.rm = TRUE),
                       q25 = ~quantile(., 0.25, na.rm = TRUE),
                       mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
                       median = ~quantile(., 0.5, na.rm = TRUE),
                       q75 = ~quantile(., 0.75, na.rm = TRUE),
                       q95 = ~quantile(., 0.95, na.rm = TRUE),
                       n_missing = ~sum(is.na(.)))) %>%
    mutate(type = "Numeric")
}
```

Save the script under the name describe.R.

This function shows you pretty much everything you need to know when writing functions for packages. First, there are the comment lines, which start with #' and not with #. These lines will be converted into the function's documentation, which you and your package's users will be able to read in Rstudio's Help pane. Notice the keywords that start with @. These are quite important:

- @param: used to define the function's parameters;
- @return: used to define the object returned by the function;
- @import: if the function needs functions from another package, in the present case {dplyr}, then this is where you would declare them. Separate several packages with a space;
- @importFrom: if the function only needs one function from a package, define it here. Read it as from tidyr import gather, very similar to how it is done in Python;
- @export: makes the function available to the users. If you omit this, the function will not be available to the users, only internally to the other functions of the package. Not making functions available to users can be useful if you need to write functions that are used by other functions but should never be used by anyone directly. It is still possible to access these internal, private, functions by using :::, as in package:::private_function();
- @examples: lists examples in the documentation. The \dontrun{} tag is used when you do not want these examples to run when building the package.

As explained before, if the function depends on functions from other packages, then @import or @importFrom must be used. But it is also possible to use the package::function() syntax, like I did on the following line:

```r
mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
```

This uses the sample_mode() function from my {brotools} package. Since it is the only function that I am using from it, I don't import the whole package with @import. I could have done the same for gather() from {tidyr} instead of using @importFrom, but I wanted to showcase @importFrom, which can also be used to import several functions:

```r
@importFrom package function_1 function_2 function_3
```

The way I'm doing this, however, is not optimal. If your package depends on many functions from other packages that are not available on CRAN, but rather on Github, you might want to do that in a cleaner way. The cleaner way is to add a "Remotes" field in the package's DESCRIPTION file (dependencies on CRAN packages also get declared inside the DESCRIPTION file, which we will cover in the next section). I won't cover the "Remotes" field here, but you can read more about it here. Because I'm doing this in a hacky way instead, my {brotools} package should be installed beforehand:

```r
devtools::install_github("b-rodrigues/brotools")
```

Again, I want to emphasize that this is not the best way of doing it.
However, using the "Remotes" field as described in the document I linked above is not complicated.

Now comes the function itself. The function is written in pretty much the same way as usual, but there are some particularities. First of all, the second argument of the function is ..., which was already covered in Chapter 7. I want to give my users the option to specify any number of columns, so that only these columns get summarised, instead of all of them, which is the default behaviour. Because I cannot know beforehand how many columns the user will want to summarize, and also because I do not want to limit the user to 2 or 3 columns, I use the .... But what if the user wants to summarize all the columns? This is taken care of in this line:

```r
if (nargs() > 1) df <- select(df, ...)
```

nargs() counts the number of arguments of the function. If the user calls the function like so:

```r
describe_numeric(mtcars)
```

nargs() will return 1. If, instead, the user calls the function with one or more columns:

```r
describe_numeric(mtcars, hp, mpg)
```

then nargs() will return 3 (in this case), and thus this piece of code will be executed:

```r
df <- select(df, ...)
```

which selects the columns hp and mpg from the mtcars dataset. This reduced data set is then the one that is being summarized.
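A quick way to convince yourself of how nargs() counts arguments is a throwaway function (f() below is just for illustration):

```r
f <- function(df, ...) nargs()

f(mtcars)           # returns 1: only one argument was supplied
f(mtcars, hp, mpg)  # returns 3: arguments passed through ... are counted too
```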
9.4.2 Many functions inside a script

If you need more functions, you can add more to the same script, or create one script per function. The advantage of writing more than one function per script is that you can keep functions that are conceptually similar in the same place. For instance, if you want to add a function called describe_character() to your package, adding it to the same script where describe_numeric() lives might be a good idea, so let's do just that:

```r
#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame
#' @return A data frame with descriptive statistics. If you are only interested in certain columns
#' you can add these columns.
#' @import dplyr
#' @importFrom tidyr pivot_longer
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' describe(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){

  if (nargs() > 1) df <- select(df, ...)

  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(across(everything(),
                     tibble::lst(mean = ~mean(., na.rm = TRUE),
                                 sd = ~sd(., na.rm = TRUE),
                                 nobs = ~length(.),
                                 min = ~min(., na.rm = TRUE),
                                 max = ~max(., na.rm = TRUE),
                                 q05 = ~quantile(., 0.05, na.rm = TRUE),
                                 q25 = ~quantile(., 0.25, na.rm = TRUE),
                                 mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
                                 median = ~quantile(., 0.5, na.rm = TRUE),
                                 q75 = ~quantile(., 0.75, na.rm = TRUE),
                                 q95 = ~quantile(., 0.95, na.rm = TRUE),
                                 n_missing = ~sum(is.na(.))))) %>%
    mutate(type = "Numeric")
}

#' Compute descriptive statistics for the character or factor columns of a data frame.
#' @param df The data frame to summarise.
#' @return A data frame with a description of the character or factor columns.
#' @import dplyr
#' @importFrom tidyr pivot_longer
describe_character_or_factors <- function(df, type){
  df %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(mode = brotools::sample_mode(value, na.rm = TRUE),
              nobs = length(value),
              n_missing = sum(is.na(value)),
              n_unique = length(unique(value))) %>%
    mutate(type = type)
}

#' Compute descriptive statistics for the character columns of a data frame.
#' @param df The data frame to summarise.
#' @return A data frame with a description of the character columns.
#' @import dplyr
#' @export
#' @examples
#' \dontrun{
#' describe(dataset)
#' }
describe_character <- function(df){
  df %>%
    select(where(is.character)) %>%
    describe_character_or_factors(type = "Character")
}
```

Let's now continue on to the next section, where we will learn to document the package.

9.5 Documenting your package

There are several files that you must edit to fully document the package; for now, only the functions are documented. The first of these files is the DESCRIPTION file.

9.5.1 Description

By default, the DESCRIPTION file, which you can find in the root of your package project, contains the following lines:

```
Package: arcade
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Author: Who wrote it
Maintainer: The package maintainer <yourself@somewhere.net>
Description: More about what it does (maybe more than one line)
    Use four spaces when indenting paragraphs within the Description.
License: What license is it under?
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
```

Each section is quite self-explanatory. This is how it could look once you're done editing it:

```
Package: arcade
Type: Package
Title: List of highest-grossing Arcade Games
Version: 0.1.0
Author: person("Harold", "Zurcher", email = "harold.zurcher@madisonbus.com", role = c("aut", "cre"))
Description: This package contains data about the highest-grossing arcade games
    from the 70's until 2010's. Also contains some functions to summarize data.
License: CC0
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
```

The Author and Maintainer fields need some further explanation: I have added Harold Zurcher as the author and creator with the role = c("aut", "cre") bit. "cre" also stands for maintainer, which is why I removed the Maintainer line.

9.6 Unit testing your package

Chapter 10 Further topics

This chapter is a collection of short sections that show some of the very nice things you can use R for. These sections are based on past blog posts.

10.1 Using Python from R with {reticulate}

There is a lot of discussion online about the benefits of Python over R and vice versa. When it comes to data science, they are for the most part interchangeable. I would say that R has an advantage over Python when it comes to offering specialized packages for certain topics such as econometrics, bioinformatics, actuarial science, etc., while Python seems to offer more possibilities when it comes to integrating a machine learning model into an app. However, if most of your work is data analysis/machine learning, both languages are practically interchangeable. But it can happen that you need access to a very specific library with no R equivalent.
Well, in that case, no need to completely switch to Python, as you can call Python code from R using the {reticulate} package. {reticulate} allows you to seamlessly call Python functions from an R session. An easy way to use {reticulate} is to start a new notebook, but you can also use {reticulate} and the included functions interactively. However, I find that Rstudio notebooks work very well for this particular use-case, because you can mix R and Python chunks, and thus keep the lines of code of each language clearly separated. Let's see how this works. First of all, you might need to specify the path to your Python executable; in my case, because I've installed Python using Anaconda, I need to specify it:

```r
# This is an R chunk
use_python("~/miniconda3/bin/python")
```
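As a small taste of what {reticulate} makes possible, here is a minimal sketch, assuming Python and NumPy are available on your system:

```r
# This is an R chunk; numpy is assumed to be installed on the Python side
library(reticulate)

np <- import("numpy")      # import a Python module into the R session
np$median(c(1, 3, 4, 10))  # call its functions with R objects as inputs
```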
10.2 Generating Pdf or Word reports with R

10.3 Scraping the internet

10.4 Regular expressions

10.5 Setting up a blog with {blogdown}

6.2 Fitting a model to data

Suppose you have a variable y that you wish to explain using a set of other variables x1, x2, x3, etc. Let's take a look at the Housing dataset from the Ecdat package:

```r
library(Ecdat)

data(Housing)
```

You can read a description of the dataset by running:

```r
?Housing
```

```
Housing                 package:Ecdat                  R Documentation

Sales Prices of Houses in the City of Windsor
```

or by looking for Housing in the help pane of RStudio. Usually, you would take a look at the data before doing any modeling:

```r
glimpse(Housing)
## Rows: 546
## Columns: 12
## $ price    <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 83800…
```

To fit a linear model, you will need to use the built-in lm() function:

```r
model1 <- lm(price ~ lotsize + bedrooms, data = Housing)
```

lm() takes a formula as an argument, which defines the model you want to estimate. In this case, I ran the following regression:

\[
price = \beta_0 + \beta_1 \times lotsize + \beta_2 \times bedrooms + \varepsilon
\]

where \(\beta_0, \beta_1\) and \(\beta_2\) are three parameters to estimate. To take a look at the results, you can use the summary() method (not to be confused with dplyr::summarise()):

```r
summary(model1)
## 
## Call:
## lm(formula = price ~ lotsize + bedrooms, data = Housing)
```

if you wish to remove the intercept (\(\beta_0\) in the above equation) from your model, you can do so with -1:

```r
model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing)

summary(model2)
## 
## Call:
## lm(formula = price ~ -1 + lotsize + bedrooms, data = Housing)
```

or if you want to use all the columns inside Housing, replacing the column names by .:

```r
model3 <- lm(price ~ ., data = Housing)

summary(model3)
## 
## Call:
## lm(formula = price ~ ., data = Housing)
```

You can access different elements of model3 with $, because the result of lm() is a list (you can check this claim with typeof(model3)):

```r
print(model3$coefficients)
##  (Intercept)      lotsize     bedrooms      bathrms      stories  drivewayyes 
## -4038.350425     3.546303  1832.003466 14335.558468  6556.945711  6687.778890 
##   recroomyes  fullbaseyes     gashwyes     aircoyes     garagepl  prefareayes 
##  4511.283826  5452.385539 12831.406266 12632.890405  4244.829004  9369.513239
```

but I prefer to use the {broom} package, and more specifically the tidy() function, which converts model3 into a neat data.frame:

```r
results3 <- broom::tidy(model3)

glimpse(results3)
## Rows: 12
## Columns: 5
## $ term      <chr> "(Intercept)", "lotsize", "bedrooms", "bathrms", "stories", …
```

Since other packages I use, such as {yardstick}, provide functions with the same names, I prefer to explicitly write broom::tidy() to avoid conflicts.

Using broom::tidy() is useful, because you can then work on the results easily, for example if you wish to only keep results that are significant at the 5% level:

```r
results3 %>%
  filter(p.value < 0.05)
## # A tibble: 10 × 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
```

You can even add new columns, such as the confidence intervals:

```r
results3 <- broom::tidy(model3, conf.int = TRUE, conf.level = 0.95)

print(results3)
## # A tibble: 12 × 7
##    term        estimate std.error statistic  p.value  conf.low conf.high
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
```

Going back to model estimation, you can of course use lm() in a pipe workflow:

```r
Housing %>%
  select(-driveway, -stories) %>%
  lm(price ~ ., data = .) %>%
  broom::tidy()
## # A tibble: 10 × 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
```

Notice the data = . in the lm() call. You have to specify this because, by default, when using %>%, the left-hand side argument gets passed as the first argument of the function on the right-hand side, and the first argument of lm() is the formula, not the data.

Since version 4.2, R now also natively includes a placeholder, _, for its native pipe |>:

```r
Housing |>
  select(-driveway, -stories) |>
  lm(price ~ ., data = _) |>
  broom::tidy()
## # A tibble: 10 × 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
```

6.3 Diagnostics

You can read some diagnostics, such as the \(R^2\), at the bottom of the summary (when running summary(my_model)), but if you want to do more than simply read these diagnostics off the console, you can put them in a data.frame too, using broom::glance():

```r
glance(model3)
## # A tibble: 1 × 12
##   r.squared adj.r.…¹  sigma stati…²   p.value    df logLik    AIC    BIC devia…³
##       <dbl>    <dbl>  <dbl>   <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
```

You can also plot the usual diagnostics plots, using the {ggfortify} package, which provides autoplot() methods for lm objects:

```r
library(ggfortify)

autoplot(model3, which = 1:6) + theme_minimal()
```

which = 1:6 is an additional option that shows you all the diagnostics plots. If you omit this option, you will only get 4 of them.

You can also get the residuals of the regression in two ways; either you grab them directly from the model fit:

```r
resi3 <- residuals(model3)
```

or you can augment the original data with a residuals column, using broom::augment():

```r
housing_aug <- augment(model3)
```

Let's take a look at housing_aug:

```r
glimpse(housing_aug)
## Rows: 546
## Columns: 18
## $ price      <dbl> 42000, 38500, 49500, 60500, 61000, 66000, 66000, 69000, 838…
```

Several columns have been added to the original data; for example, you can now easily plot the density of the residuals:

```r
ggplot(housing_aug) +
  geom_density(aes(.resid))
```

Fitted values are also added to the original data, under the variable .fitted. It would also have been possible to get the fitted values with:

```r
fit3 <- fitted(model3)
```

but I prefer using augment(), because the columns get merged to the original data, which makes it easier to find specific individuals. For example, you might want to know for how many housing units the model underestimates the price:

```r
total_pos <- housing_aug %>%
  filter(.resid > 0) %>%
  summarise(total = n()) %>%
  pull(total)
```

we find 261 individuals where the residuals are positive. It is also easier to extract outliers:

```r
housing_aug %>%
  mutate(prank = cume_dist(.cooksd)) %>%
  filter(prank > 0.99) %>%
  glimpse()
## Rows: 6
## Columns: 19
## $ price      <dbl> 163000, 125000, 132000, 175000, 190000, 174500
```

prank is a column I computed with cume_dist(), a {dplyr} function that returns the proportion of all values less than or equal to the current rank. For example:

```r
example <- c(5, 4.6, 2, 1, 0.8, 0, -1)
cume_dist(example)
## [1] 1.0000000 0.8571429 0.7142857 0.5714286 0.4285714 0.2857143 0.1428571
```

by filtering prank > 0.99 we get the top 1% of outliers according to Cook's distance.

6.4.1 Marginal effects

```r
library(marginaleffects)

effects_model3 <- marginaleffects(model3)

summary(effects_model3)
##        Term Contrast    Effect Std. Error z value   Pr(>|z|)    2.5 %    97.5 %
## 1   lotsize    dY/dX     3.546     0.3503  10.124 < 2.22e-16     2.86     4.233
## 2  bedrooms    dY/dX  1832.003  1047.0056   1.750 0.08016056  -220.09  3884.097
```

For a simple linear model, the marginal effect of a variable is just its estimated coefficient (the \(\alpha\) of that variable). But in the case of a more complex, non-linear model, this is not so obvious. This is where {marginaleffects} will make your life much easier.

It is also possible to plot the results:

```r
plot(effects_model3)
```

effects_model3 is a data frame containing the effects for each house in the data set. For example, let's take a look at the first house:

```r
effects_model3 %>%
  filter(rowid == 1)
##    rowid     type     term contrast         dydx    std.error statistic
## 1      1 response  lotsize    dY/dX     3.546303    0.3502195 10.125944
## 2      1 response bedrooms    dY/dX  1832.003466 1046.1608842  1.751168
```

Marginal effects get more interesting for non-linear models, so let's estimate one, using the Participation dataset, also from the {Ecdat} package:

```r
data(Participation)

?Participation
```

```
Participation              package:Ecdat               R Documentation

Labor Force Participation
```

Let's estimate a logit model explaining lfp, labour force participation, with all the other variables:

```r
logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial")

broom::tidy(logit_participation)
## # A tibble: 7 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
```

Let's now compute the marginal effects of this model:

```r
effects_logit_participation <- marginaleffects(logit_participation)

summary(effects_logit_participation)
##      Term Contrast    Effect Std. Error z value   Pr(>|z|)     2.5 %   97.5 %
## 1 lnnlinc    dY/dX -0.169940    0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858
## 2     age    dY/dX -0.106407    0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193
```

Let's take a look at the first observation of the dataset:

```r
Participation[1, ]
##   lfp lnnlinc age educ nyc noc foreign
## 1  no 10.7875   3    8   1   1      no
```

and let's now look at rowid == 1 in the marginal effects data frame:

```r
effects_logit_participation %>%
  filter(rowid == 1)
##   rowid     type    term contrast         dydx   std.error  statistic
## 1     1 response lnnlinc    dY/dX -0.156661756 0.038522800 -4.0667282
## 2     1 response     age    dY/dX -0.098097148 0.020123709 -4.8747052
```

Let's keep only the marginal effects of lnnlinc, the log of non-labour income:

```r
dydx_lnnlinc <- effects_logit_participation %>%
  filter(term == "lnnlinc")

head(dydx_lnnlinc)
##   rowid     type    term contrast        dydx  std.error statistic      p.value
## 1     1 response lnnlinc    dY/dX -0.15666176 0.03852280 -4.066728 4.767780e-05
## 2     2 response lnnlinc    dY/dX -0.20013939 0.05124543 -3.905507 9.402813e-05
```

We can compute the average of these individual marginal effects:

```r
dydx_lnnlinc %>%
  summarise(mean(dydx))
##   mean(dydx)
## 1 -0.1699405
```

Let's compare this to the average marginal effects:

```r
summary(effects_logit_participation)
##      Term Contrast    Effect Std. Error z value   Pr(>|z|)     2.5 %   97.5 %
## 1 lnnlinc    dY/dX -0.169940    0.04151 -4.0939 4.2416e-05 -0.251300 -0.08858
## 2     age    dY/dX -0.106407    0.01759 -6.0492 1.4560e-09 -0.140884 -0.07193
```

The Effect column of the summary contains exactly the average we just computed by hand. It is also possible to plot the effects:

```r
plot(effects_logit_participation)
```

So an infinitesimal increase, in say, non-labour income (lnnlinc) of 0.001 is associated with a decrease of the probability of labour force participation by 0.001*17 percentage points.
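To see where these numbers come from, here is the arithmetic spelled out, using the average marginal effect of lnnlinc reported in the summary above:

```r
ame_lnnlinc <- -0.169940   # average marginal effect of lnnlinc from the summary
delta <- 0.001             # a small increase in non-labour income

ame_lnnlinc * delta * 100  # change in percentage points: about -0.017
```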

6.5 Comparing models

Let's take a look at the distribution of price:

```r
ggplot(Housing) +
  geom_density(aes(price))
```

it looks like modeling the log of price might provide a better fit:

```r
model_log <- lm(log(price) ~ ., data = Housing)

result_log <- broom::tidy(model_log)

print(result_log)
## # A tibble: 12 × 5
##    term          estimate  std.error statistic  p.value
##    <chr>            <dbl>      <dbl>     <dbl>    <dbl>
```

Let's take a look at the diagnostics of this new model:

```r
glance(model_log)
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
```

Let's put the diagnostics of both models into a single data frame, which makes the comparison easier:

```r
diag_lm <- glance(model3)

diag_lm <- diag_lm %>%
  mutate(model = "lin-lin model")

diag_log <- glance(model_log)

diag_log  <- diag_log %>%
  mutate(model = "log-lin model")

diagnostics_models <- full_join(diag_lm, diag_log) %>%
  select(model, everything()) # put the `model` column first
## Joining, by = c("r.squared", "adj.r.squared", "sigma", "statistic", "p.value",
## "df", "logLik", "AIC", "BIC", "deviance", "df.residual", "nobs", "model")

print(diagnostics_models)
## # A tibble: 2 × 13
##   model   r.squ…¹ adj.r…²   sigma stati…³   p.value    df  logLik    AIC     BIC
##   <chr>     <dbl>   <dbl>   <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>   <dbl>
```

6.6 Using a model for prediction

We are going to explore prediction, overfitting and tuning of models in a later section. For now, let's go back to the models we trained in the previous section, model3 and model_log. Let's also take a subsample of data, which we will be using for prediction:

```r
set.seed(1234)

pred_set <- Housing %>%
  sample_n(20)
```

In order to always get the same pred_set, I set the random seed first. Let's take a look at the data:

```r
print(pred_set)
##      price lotsize bedrooms bathrms stories driveway recroom fullbase gashw
## 284  45000    6750        2       1       1      yes      no       no    no
## 101  57000    4500        3       2       2       no      no      yes    no
```

If we wish to use it for prediction, this is easily done with predict():

```r
predict(model3, pred_set)
##       284       101       400        98       103       326        79       270 
##  51143.48  77286.31  93204.28  76481.82  77688.37 103751.72  66760.79  66486.26 
##       382       184         4       212       195       511       479       510 
##  55177.75  77741.03  62980.84  50926.99
```

This returns a vector of predicted prices. This can then be used to compute the Root Mean Squared Error, for instance. Let's do it within a tidyverse pipeline:

```r
rmse <- pred_set %>%
  mutate(predictions = predict(model3, .)) %>%
  summarise(sqrt(sum((predictions - price)^2)/n()))
```

The root mean square error of model3 is 3646.0817347.

I also used the n() function, which returns the number of observations in a group (or all the observations, if the data is not grouped). Let's compare model3's RMSE with the one from model_log:

```r
rmse2 <- pred_set %>%
  mutate(predictions = exp(predict(model_log, .))) %>%
  summarise(sqrt(sum((predictions - price)^2)/n()))
```

Don't forget to exponentiate the predictions; remember you're dealing with a log-linear model! model_log's RMSE is 12125.133, which is higher than model3's. However, keep in mind that the model was trained on the whole data, and that the prediction quality was then assessed using a subsample of the very data it was trained on.

6.8.1 Ridge regression

Let's start by splitting the data into a training set and a testing set:

```r
index <- 1:nrow(Housing)

set.seed(12345)
train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE)

test_index <- setdiff(index, train_index)

train_x <- Housing[train_index, ] %>% 
    select(-price)

train_y <- Housing[train_index, ] %>% 
    pull(price)

test_x <- Housing[test_index, ] %>% 
    select(-price)

test_y <- Housing[test_index, ] %>% 
    pull(price)
```

I do the train/test split this way, because glmnet() requires a design matrix as input, and not a formula. Design matrices can be created using the model.matrix() function:

```r
library("glmnet")

train_matrix <- model.matrix(train_y ~ ., data = train_x)

test_matrix <- model.matrix(test_y ~ ., data = test_x)
```

Let's now run a linear regression, by setting the penalty to 0:

```r
model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0)
```

The model above provides the same result as a linear regression, because I set lambda to 0. Let's compare the coefficients between the two:

```r
coef(model_lm_ridge)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -2667.542863
```

and with the coefficients of an unpenalized linear regression, estimated with lm.fit():

```r
coef(lm.fit(x = train_matrix, y = train_y))
##  (Intercept)      lotsize     bedrooms      bathrms      stories  drivewayyes 
## -2667.052098     3.397629  2081.344118 13293.707725  6400.416730  6529.972544 
##   recroomyes  fullbaseyes     gashwyes     aircoyes     garagepl  prefareayes 
##  5388.871137  4899.024787 12575.970220 13077.988867  4155.269629 10261.056772
```

as you can see, the coefficients are practically the same. Let's compute the RMSE for the unpenalized linear regression:

```r
preds_lm <- predict(model_lm_ridge, test_matrix)

rmse_lm <- sqrt(mean((preds_lm - test_y)^2))
```

The RMSE for the linear unpenalized regression is equal to 1731.5553157.

Let's now run a ridge regression, with lambda equal to 100, and see if the RMSE is smaller:

```r
model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100)
```

and let's compute the RMSE again:

```r
preds <- predict(model_ridge, test_matrix)

rmse <- sqrt(mean((preds - test_y)^2))
```

The RMSE for the linear penalized regression is equal to 1726.7632312, which is smaller than before. But which value of lambda gives the smallest RMSE? To find out, one must run the model over a grid of lambda values and pick the model with the lowest RMSE. This procedure is available in the cv.glmnet() function, which picks the best value for lambda:

```r
best_model <- cv.glmnet(train_matrix, train_y)
# lambda that minimises the MSE
best_model$lambda.min
## [1] 61.42681
```

According to cv.glmnet() the best value for lambda is 61.4268056. In the next section, we will implement cross-validation ourselves, in order to find the hyper-parameters of a random forest.
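As a side note, the fitted cv.glmnet object can be used for prediction directly; passing s = "lambda.min" uses the lambda found during cross-validation (a small sketch, using the objects defined above):

```r
# Predict on the test set with the cross-validated lambda
preds_best <- predict(best_model, newx = test_matrix, s = "lambda.min")
```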

    6.9 Training, validating, and tes

    6.9.1 Set up

    Let’s load the needed packages:

    -
    library("tidyverse")
    -library("recipes")
    -library("rsample")
    -library("parsnip")
    -library("yardstick")
    -library("brotools")
    -library("mlbench")
    +
    library("tidyverse")
    +library("recipes")
    +library("rsample")
    +library("parsnip")
    +library("yardstick")
    +library("brotools")
    +library("mlbench")

    Load the data which is included in the {mlrbench} package:

    -
    data("BostonHousing2")
    +
    data("BostonHousing2")

I will train a random forest to predict the housing prices, which is the cmedv column:

```r
head(BostonHousing2)
##         town tract      lon     lat medv cmedv    crim zn indus chas   nox
## 1     Nahant  2011 -70.9550 42.2550 24.0  24.0 0.00632 18  2.31    0 0.538
## 2 Swampscott  2021 -70.9500 42.2875 21.6  21.6 0.02731  0  7.07    0 0.469
```

Let's only keep the columns we need, and rename cmedv to price:

```r
boston <- BostonHousing2 %>% 
    select(-medv, -tract, -lon, -lat) %>% 
    rename(price = cmedv)
```

I remove tract, lat and lon because the information contained in the column town is enough.

To train and evaluate the model's performance, I split the data in two. One data set, called the training set, will be further split into two down below. I won't touch the second data set, the test set, until the very end, to finally assess the model's performance.

```r
train_test_split <- initial_split(boston, prop = 0.9)

housing_train <- training(train_test_split)

housing_test <- testing(train_test_split)
```

initial_split(), training() and testing() are functions from the {rsample} package.

I will train a random forest on the training data; but the question is, which random forest? Because random forests have several hyper-parameters, and as explained in the intro these hyper-parameters cannot be estimated directly from the data, we need a strategy to choose good values for them.

To look for good hyper-parameter values, I am going to use Monte Carlo cross-validation, via the mc_cv() function from {rsample}:

```r
validation_data <- mc_cv(housing_train, prop = 0.9, times = 30)
```

What does validation_data look like?

```r
validation_data
## # Monte Carlo cross-validation (0.9/0.1) with 30 resamples  
## # A tibble: 30 × 2
##    splits           id        
```

Let's take a closer look at one of these splits:

```r
validation_data$splits[[1]]
## <Analysis/Assess/Total>
## <409/46/455>
```

The first value is the number of rows of the analysis set, the second the number of rows of the assessment set, and the third the total number of rows of the split.
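You can pull these two data sets out of a split with {rsample}'s analysis() and assessment() functions, which we will use below. A quick check, using the counts reported above:

```r
# Extract the two data sets contained in a single split
dim(analysis(validation_data$splits[[1]]))    # 409 rows: the part used for training
dim(assessment(validation_data$splits[[1]]))  # 46 rows: the part used for validation
```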

    6.9.1 Set up
    simple_recipe <- function(dataset){
    -    recipe(price ~ ., data = dataset) %>%
    -        step_center(all_numeric()) %>%
    -        step_scale(all_numeric()) %>%
    -        step_dummy(all_nominal())
    -}
    +
    simple_recipe <- function(dataset){
    +    recipe(price ~ ., data = dataset) %>%
    +        step_center(all_numeric()) %>%
    +        step_scale(all_numeric()) %>%
    +        step_dummy(all_nominal())
    +}

    We have not learned yet about writing functions, and will do so in the next chapter. However, for now, you only need to know that you can write your own functions, and that these functions can take any arguments you need. In the case of the above function, which we called simple_recipe(), @@ -1647,9 +1646,9 @@

Once the recipe is defined, prep() and bake() apply it to a data set; here, to the test set:

```r
testing_rec <- prep(simple_recipe(housing_test), testing = housing_test)

test_data <- bake(testing_rec, new_data = housing_test)
```

It is important to split the data before using prep() and bake(), because if not, you will use observations from the test set in the prep() step, and thus introduce knowledge from the test set into the training data. This is called data leakage, and must be avoided. This is why it is important to first split the data, and only then pre-process each part separately.

Before tuning the random forest, let's train a linear regression as a benchmark:

```r
trainlm_rec <- prep(simple_recipe(housing_train), testing = housing_train)

trainlm_data <- bake(trainlm_rec, new_data = housing_train)

linreg_model <- lm(price ~ ., data = trainlm_data)

broom::augment(linreg_model, newdata = test_data) %>% 
    yardstick::rmse(price, .fitted)
## Warning in predict.lm(x, newdata = newdata, na.action = na.pass, ...):
## prediction from a rank-deficient fit may be misleading
## # A tibble: 1 × 3
```

Let's now write the function that, for a given split, pre-processes the analysis set, trains a random forest on it, and returns the truth and the predictions for the assessment set:

```r
my_rf <- function(mtry, trees, split, id){
    
    analysis_set <- analysis(split)
    
    analysis_prep <- prep(simple_recipe(analysis_set), training = analysis_set)
    
    analysis_processed <- bake(analysis_prep, new_data = analysis_set)
    
    model <- rand_forest(mode = "regression", mtry = mtry, trees = trees) %>%
        set_engine("ranger", importance = 'impurity') %>%
        fit(price ~ ., data = analysis_processed)

    assessment_set <- assessment(split)
    
    assessment_prep <- prep(simple_recipe(assessment_set), testing = assessment_set)
    
    assessment_processed <- bake(assessment_prep, new_data = assessment_set)

    tibble::tibble("id" = id,
        "truth" = assessment_processed$price,
        "prediction" = unlist(predict(model, new_data = assessment_processed)))
}
```

The rand_forest() function is available in the {parsnip} package. This package provides a unified interface to a lot of other machine learning packages. This means that instead of having to learn the syntax of ranger() and randomForest() and so on, you can simply use the rand_forest() function and change the engine argument to the one you want (ranger, randomForest, etc.).
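To make this concrete, here is a small sketch of two model specifications that differ only in their engine (the object names are made up for the illustration):

```r
# Same model specification, different backend: only set_engine() changes
rf_spec_ranger <- rand_forest(mode = "regression", mtry = 3, trees = 200) %>%
    set_engine("ranger")

rf_spec_rf <- rand_forest(mode = "regression", mtry = 3, trees = 200) %>%
    set_engine("randomForest")
```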

    Let’s try this function:

results_example <- map2_df(.x = validation_data$splits,
                           .y = validation_data$id,
                           ~my_rf(mtry = 3, trees = 200, split = .x, id = .y))

head(results_example)
## # A tibble: 6 × 3
##   id           truth prediction
##   <chr>        <dbl>      <dbl>
results_example %>%
    group_by(id) %>%
    yardstick::rmse(truth, prediction) %>%
    summarise(mean_rmse = mean(.estimate)) %>%
    pull
    ## [1] 0.6305034

The random forest already has a lower RMSE than the linear regression. The goal now is to lower this RMSE further by tuning the mtry and trees hyperparameters. For this, I will use Bayesian optimization.

6.9.2 Bayesian hyperparameter optimization

I will re-use the code from above and define a function that does everything, from pre-processing to returning the RMSE, the metric I want to minimize by tuning the hyperparameters:

tuning <- function(param, validation_data){

    mtry <- param[1]
    trees <- param[2]

    results <- purrr::map2_df(.x = validation_data$splits,
                       .y = validation_data$id,
                       ~my_rf(mtry = mtry, trees = trees, split = .x, id = .y))

    results %>%
        group_by(id) %>%
        yardstick::rmse(truth, prediction) %>%
        summarise(mean_rmse = mean(.estimate)) %>%
        pull
}

This is exactly the code from above, except that it now returns the RMSE. Let’s try the function with the same values as before:

tuning(c(3, 200), validation_data)
    ## [1] 0.6319843

I now follow the code that can be found in the arxiv paper to run the optimization. A simpler model, called the surrogate model, is used to look for promising hyperparameter values; more details can be found in the paper and in the package’s documentation. The focus here is not on this particular method, but rather on showing you how you can use various packages to solve a data science problem.
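
To build some intuition first, here is a self-contained toy sketch of the surrogate idea (my own illustration in base R; this is not how {mlrMBO} works internally, and it skips the exploration step a real method needs):

set.seed(123)
objective <- function(x) (x - 3)^2 + rnorm(1, sd = 0.1) # expensive to evaluate in practice

evaluated <- data.frame(x = runif(5, 0, 8)) # small initial design
evaluated$y <- sapply(evaluated$x, objective)

for (i in 1:10) {
    # Fit a cheap surrogate model to the points evaluated so far
    surrogate <- lm(y ~ poly(x, 2), data = evaluated)
    candidates <- data.frame(x = seq(0, 8, length.out = 200))
    candidates$pred <- predict(surrogate, newdata = candidates)
    # Evaluate the expensive objective at the most promising candidate;
    # a real implementation balances exploration and exploitation,
    # for instance with the expected improvement criterion used below
    best <- candidates$x[which.min(candidates$pred)]
    evaluated <- rbind(evaluated, data.frame(x = best, y = objective(best)))
}

evaluated[which.min(evaluated$y), ] # best point found so far

{mlrMBO} implements this loop properly, with a surrogate that also estimates its own uncertainty and an infill criterion that makes use of it.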

    Let’s first load the package and create the function to optimize:

library("mlrMBO")

fn <- makeSingleObjectiveFunction(name = "tuning",
                                 fn = tuning,
                                 par.set = makeParamSet(makeIntegerParam("x1", lower = 3, upper = 8),
                                                        makeIntegerParam("x2", lower = 100, upper = 500)))

This function is based on the function I defined before. The parameters to optimize are also defined, as are their bounds: I will look for mtry between the values of 3 and 8, and trees between 100 and 500.

    We still need to define some other objects before continuing:

# Create the initial random Latin Hypercube design of 10 points
library(lhs) # for randomLHS
des <- generateDesign(n = 5L * 2L, getParamSet(fn), fun = randomLHS)

    Then we choose the surrogate model, a random forest too:

# Specify the random forest surrogate model with standard error estimation
surrogate <- makeLearner("regr.ranger", predict.type = "se", keep.inbag = TRUE)

Here I define some options: the optimization will stop after 10 iterations, and will use expected improvement as the infill criterion:

# Set general controls
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10L)
ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI())

    And this is the optimization part:

# Start optimization
result <- mbo(fn, des, surrogate, ctrl, more.args = list("validation_data" = validation_data))

result
## Recommended parameters:
## x1=8; x2=314
## Objective: y = 0.484
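
The recommended hyperparameters and the objective value they achieve can be read off the returned object; a quick sketch (the values are the ones printed above):

result$x # recommended hyperparameters, here list(x1 = 8, x2 = 314)
result$y # best mean RMSE reached, here 0.484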

Let’s now train the random forest on the training data with these values. First, I pre-process the training data:

training_rec <- prep(simple_recipe(housing_train), training = housing_train)

train_data <- bake(training_rec, new_data = housing_train)

    Let’s now train our final model and predict the prices:

final_model <- rand_forest(mode = "regression", mtry = result$x$x1, trees = result$x$x2) %>%
        set_engine("ranger", importance = 'impurity') %>%
        fit(price ~ ., data = train_data)

price_predict <- predict(final_model, new_data = select(test_data, -price))

Let’s transform the predictions back to the original scale, undoing the scaling from the recipe, and compare the predicted prices to the true ones visually:

cbind(price_predict * sd(housing_train$price) + mean(housing_train$price),
      housing_test$price)
##       .pred housing_test$price
## 1  16.76938               13.5
## 2  27.59510               30.8
## 50 20.75357               21.8
## 51 19.49487               19.7

    Let’s now compute the RMSE:

tibble::tibble("truth" = test_data$price,
        "prediction" = unlist(price_predict)) %>%
    yardstick::rmse(truth, prediction)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
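
As a side note, yardstick::rmse() computes the familiar square root of the mean of squared errors; a base R equivalent (my sketch, reusing the objects from above) would be:

sqrt(mean((test_data$price - unlist(price_predict))^2))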
    diff --git a/index.Rmd b/index.Rmd
    index 6edb1ec..df247b8 100644
    --- a/index.Rmd
    +++ b/index.Rmd
    @@ -190,7 +190,10 @@ both pieces of software are available free of charge (paid options for RStudio e
     that need technical support). Installation is simple, but operating system dependent. To download
     and install R for Windows, follow [this link](https://cloud.r-project.org/bin/windows/base/).
     For macOS, follow [this one](https://cloud.r-project.org/bin/macosx/). If you run a GNU+Linux
    -distribution, you can install R using the system's package manager. On Ubuntu, install `r-base`.
    +distribution, you can install R using the system's package manager. If you're running Ubuntu, you
    +might want to take a look at [r2u](https://github.com/eddelbuettel/r2u), which provides very
    +fast installation of packages, full integration with `apt` (so dependencies get solved automatically)
    +and covers the entirety of CRAN.
     
     For RStudio, look for your operating system [here](https://www.rstudio.com/products/rstudio/download/#download).