08-functional_programming.Rmd

# Functional programming

Functional programming is a paradigm that I find very suitable for data science. In functional
programming, your code is organised into functions that perform the operations you need. Your scripts
will only be a sequence of calls to these functions, making them easier to understand. R is not a pure
functional programming language, so we need some self-discipline to apply pure functional programming
principles. However, these efforts are worth it, because pure functions are easier to debug, extend
and document. In this chapter, we are going to learn about functional programming principles that you
can adopt and start using to make your code better.

## Function definitions

You should now be familiar with function definitions in R. Let's suppose you want to write a function
to compute the square root of a number and want to do so using Newton's algorithm:

```{r square_root_loop}
sqrt_newton <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    init
}
```

You can then use this function to get the square root of a number:

```{r}
sqrt_newton(16, 2)
```

We are using a `while` loop inside the body of the function. The *body* of a function is the set of
instructions that define the function. You can get the body of a function with `body(some_func)`.
In *pure* functional programming languages, like Haskell, loops do not exist. How can you
program without loops, you may ask? In functional programming, loops are replaced by recursion,
which we already discussed in the previous chapter. Let's rewrite our little example above
with recursion:

```{r square_root_recur}
sqrt_newton_recur <- function(a, init, eps = 0.01){
    if(abs(init**2 - a) < eps){
        result <- init
    } else {
        init <- 1/2 * (init + a/init)
        result <- sqrt_newton_recur(a, init, eps)
    }
    result
}
```

```{r}
sqrt_newton_recur(16, 2)
```

R is not a pure functional programming language though, so we can still use loops (be it `while` or
`for` loops) in the bodies of our functions. As discussed in the previous chapter, it is actually
better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized.
I won't go into the details of what tail-call optimization is; just remember that if
performance is important, a loop will be faster. However, sometimes, it is easier to write a
function using recursion. I personally tend to avoid loops if performance is not important,
because I find that code that avoids loops is easier to read and debug. However, knowing that
you can use loops is reassuring, and encapsulating loops inside functions gives you the benefits of
both functions and loops. In the coming sections I will show you some built-in functions
that make it possible to avoid writing loops and that don't rely on recursion, so performance
won't be penalized.
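
To make the trade-off a bit more concrete, here is a minimal sketch that computes the sum of the
first `n` integers both ways. The two helper functions, `sum_to_loop()` and `sum_to_recur()`, are
made up for this illustration:

```{r}
# Hypothetical helpers, written only for this illustration
sum_to_loop <- function(n){
    total <- 0
    for(i in seq_len(n)){
        total <- total + i
    }
    total
}

sum_to_recur <- function(n){
    if(n == 0){
        0
    } else {
        n + sum_to_recur(n - 1)
    }
}

sum_to_loop(100)
sum_to_recur(100)
```

Both calls return the same value. For a large enough `n`, however, the recursive version should
eventually stop with an error, because every recursive call adds a level of nesting that R has to
keep track of, whereas the loop keeps working.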

## Properties of functions

Mathematical functions have a nice property: we always get the same output for a given input. This
is called referential transparency and we should aim to write our R functions in such a way.
For example, the following function:

```{r}
increment <- function(x){
    x + 1
}
```

is a referentially transparent function. We always get the same result for any `x` that we give to
this function.

This:

```{r}
increment(10)
```

will always produce `11`.

However, this one:

```{r}
increment_opaque <- function(x){
    x + spam
}
```

is not a referentially transparent function, because its value depends on the global variable `spam`.

```{r}
spam <- 1

increment_opaque(10)
```

will produce `11` if `spam = 1`. But what if `spam = 19`?

```{r}
spam <- 19

increment_opaque(10)
```

To make `increment_opaque()` a referentially transparent function, it is enough to make `spam` an
argument:

```{r}
increment_not_opaque <- function(x, spam){
    x + spam
}
```

Now even if there is a global variable called `spam`, this will not influence our function:

```{r}
spam <- 19

increment_not_opaque(10, 34)
```

This is because the `spam` used in the body of the function is now an argument of the function, and
thus a local variable. It could have been called anything else, really. Avoiding opaque functions
makes our life easier.

Another property that adepts of functional programming value is that functions should have no, or
very limited, side-effects. This means that functions should not change the state of your program.

For example, consider this function (which is not a referentially transparent function):

```{r square_root_loop_side_effects}
count_iter <- 0

sqrt_newton_side_effect <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the
    }                                 # RHS value in a variable inside the global environment
    init
}
```

If you look in the environment pane, you will see that `count_iter` equals 0. Now call this
function with the following arguments:

```{r}
sqrt_newton_side_effect(16000, 2)

print(count_iter)
```

If you check the value of `count_iter` now, you will see that it increased! This is a side effect,
because the function changed something outside of its scope. It changed a value in the global
environment. In general, it is good practice to avoid side-effects. For example, we could make the
above function not have any side effects like this:

```{r square_root_loop_not_more_side_effects}
sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <- count_iter + 1
    }
    c(init, count_iter)
}
```

Now, this function returns an atomic vector with two elements: the result, and the number of
iterations it took to get the result:

```{r}
sqrt_newton_count(16000, 2)
```
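
`c()` builds an atomic vector, so the two values above are only identified by their position. If
you prefer to access them by name, one possible variant returns a named list instead. The function
name `sqrt_newton_list()` and the element names are my own choices for this sketch:

```{r}
# Hypothetical variant that returns a named list instead of a vector
sqrt_newton_list <- function(a, init, eps = 0.01){
    count_iter <- 0
    while(abs(init**2 - a) > eps){
        init <- 1/2 * (init + a/init)
        count_iter <- count_iter + 1
    }
    list(result = init, iterations = count_iter)
}

newton_out <- sqrt_newton_list(16000, 2)

newton_out$result
newton_out$iterations
```

The counter now lives entirely inside the function, so nothing in the global environment is
read or modified.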

Writing to disk is also considered a side effect, because the function changes something (a file)
outside its scope. But this cannot be avoided since you *want* to write to disk.
Just remember: try to avoid having functions change variables in the global environment unless
you have a very good reason for doing so.

Very long scripts that don't use functions and use a lot of global variables with loops changing
the values of global variables are a nightmare to debug. If something goes wrong, it might be very 
difficult to pinpoint where the problem is. Is there an error in one of the loops? 
Is your code running for a particular value of a particular variable in the global environment, but
not for other values? Which values? And of which variables? It can be very difficult to know what 
is wrong with such a script.
With functional programming, you can avoid a lot of this pain for free (well not entirely for free,
it still requires some effort, since R is not a pure functional language). Writing functions also
makes it easier to parallelize your code. We are going to learn about that later in this chapter too.

Finally, another property of mathematical functions is that they do one single thing. Functional
programming purists also program their functions to do one single task. This has benefits, but
can complicate things. The function we wrote previously does two things: it computes the square
root of a number and also returns the number of iterations it took to compute the result. However,
this is not a bad thing; the function is doing two tasks, but these tasks are related to each other
and it makes sense to have them together. My piece of advice: avoid having functions that do
many *unrelated* things. This makes debugging harder.

In conclusion: you should strive for referential transparency, try to avoid side effects unless you
have a good reason to have them, and try to keep your functions short and have them do as few tasks
as possible. This makes testing and debugging easier, as you will see in the next chapter, but also
improves readability and maintainability of your code.

## Functional programming with `{purrr}`

I mentioned it several times already, but R is not a pure functional programming language. It is
possible to write R code using the functional programming paradigm, but some effort is required.
The `{purrr}` package extends R's base functional programming capabilities with some very interesting
functions. We have already seen `map()` and `reduce()`, which we are going to see in more detail now.
Then, we are going to learn about some other functions included in `{purrr}` that make functional
programming easier in R.

### Doing away with loops: the `map*()` family of functions

Instead of using loops, pure functional programming languages use functions that achieve
the same result. These functions are often called `Map` or `Reduce` (also called `Fold`). R comes
with the `*apply()` family of functions (which are implementations of `Map`), 
as well as `Reduce()` for functional programming.

Within this family, you can find `lapply()`, `sapply()`, `vapply()`, `tapply()`, `mapply()`, `rapply()`,
`eapply()` and `apply()` (I might have forgotten one or the other, but that's not important).
Each version of an `*apply()` function has a different purpose, but it is not very easy to
remember which does what exactly. To add even more confusion, the arguments are sometimes different between
each of these.

In the `{purrr}` package, these functions are replaced by the `map*()` family of functions. As you will
shortly see, they are very consistent, and thus easier to use.
The names of these functions all start with `map_`, and the second part of the name tells you what
the function is going to return. For example, if you want `double`s out, you would use `map_dbl()`.
If you are working on data frames and want a data frame back, you would use `map_df()`. Let's start 
with the basic `map()` function. The following gif 
(source: [Wikipedia](https://en.wikipedia.org/wiki/Map_(higher-order_function))) illustrates
what `map()` does fairly well:

```{r, echo=FALSE}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/0/06/Mapping-steps-loillibe-new.gif")
```

$X$ is a vector composed of the following scalars: $(0, 5, 8, 3, 2, 1)$. The function we want to
map to each element of $X$ is $f(x) = x + 1$. $X'$ is the result of this operation. Using R, we
would do the following:

```{r}
library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map(numbers, plus_one)

my_results
```

Using a loop, you would write:

```{r}
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- vector("list", length(numbers))

# Loop over the positions, and apply plus_one() to the element at each position
for(i in seq_along(numbers)){
  my_results[[i]] <- plus_one(numbers[[i]])
}

my_results
```

Now I don't know about you, but I prefer the first option. Using functional programming, you don't
need to create an empty list to hold your results, and the code is more concise. Plus,
it is less error prone. I had to try several times to get the loop right
(and I've been using R for almost 10 years now). Why? Well, first of all I used `%in%` instead of `in`.
Then, I forgot about `seq_along()`. After that, I made a typo, `plos_one()` instead of `plus_one()`
(ok, that one is unrelated to the loop). Let's also see how this works using base R:

```{r}
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- lapply(numbers, plus_one)

my_results
```

So what is the added value of using `{purrr}`, you might ask. Well, imagine that instead of a list,
I need an atomic vector of `numeric`s. This is fairly easy with `{purrr}`:

```{r}
library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)

plus_one <- function(x) (x + 1)

my_results <- map_dbl(numbers, plus_one)

my_results
```

We're going to discuss these functions below, but know that in base R, outputting something else 
involves more effort.
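
For comparison, here is a sketch of how the same atomic vector could be obtained in base R with
`vapply()`. The object names `vapply_numbers` and `add_one()` are only used for this illustration:

```{r}
vapply_numbers <- c(0, 5, 8, 3, 2, 1)

add_one <- function(x) x + 1

# vapply() needs a template describing each result: here, a single double
vapply(vapply_numbers, add_one, FUN.VALUE = numeric(1))

# sapply() is shorter, but the type of its output depends on the results it gets
sapply(vapply_numbers, add_one)
```

With `vapply()`, the type of the output is stated through `FUN.VALUE`, whereas the `map_*()`
functions encode it directly in the name of the function.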

Let's go back to our `sqrt_newton()` function. This function has more than one parameter. Often,
we would like to map functions with more than one parameter to a list, while holding constant
some of the function's parameters. This is easily achieved like so:

```{r}
library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, sqrt_newton, init = 1)
```

It is also possible to use a formula:

```{r}
library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, ~sqrt_newton(., init = 1))
```

Another function that is similar to `map()` is `rerun()`. You guessed it, this one simply
reruns an expression:

```{r}
rerun(10, "hello")
```

`rerun()` simply runs an expression (which can be arbitrarily complex) `n` times, whereas `map()`
maps a function to a list of inputs, so to achieve the same with `map()`, you need to map the `print()`
function to a vector of characters:

```{r}
map(rep("hello", 10), print)
```

`rep()` is a function that creates a vector by repeating something, in this case the string "hello",
as many times as needed, here 10. The output here is a bit different than before though, because first
you will see "hello" printed 10 times and then the list where each element is "hello".
This is because the `print()` function has a side effect, which is, well, printing to the console.
We see this side effect 10 times, and then the list created with `map()`.

`rerun()` is useful if you want to run simulations. For instance, let's suppose that I perform a simulation
where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance:

```{r}
mean_var_throws <- function(n){
  throws <- sample(1:6, n, replace = TRUE)

  mean_throws <- mean(throws)
  var_throws <- var(throws)

  tibble::tribble(~mean_throws, ~var_throws,
                   mean_throws, var_throws)
}

mean_var_throws(5)
```

`mean_var_throws()` returns a `tibble` object with the mean and the variance of the points. Now suppose
I want to compute the expected value of a die throw. We know from theory that it should
be equal to $3.5 (= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6)$.

Let's rerun the simulation 50 times:

```{r}
simulations <- rerun(50, mean_var_throws(5))
```

Let's see what the `simulations` object is made of:

```{r, eval = FALSE}
str(simulations)
```

```
## List of 50
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2
##   ..$ var_throws : num 3
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 0.2
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 0.7
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##   ..$ mean_throws: num 2.8
##   ..$ var_throws : num 1.7
.....
```

`simulations` is a list of 50 data frames. We can easily combine them into a single data frame, and compute the
mean of the means, which should return something close to the expected value of 3.5:

```{r}
bind_rows(simulations) %>%
  summarise(expected_value = mean(mean_throws))
```

Pretty close! Now of course, one could have simply done something like this:

```{r}
mean(sample(1:6, 1000, replace = TRUE))
```

but the point was to illustrate that `rerun()` can run any arbitrarily complex expression, and that it is good
practice to put the result in a data frame or list, for easier further manipulation.
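
Note that recent versions of `{purrr}` (1.0.0 and later) deprecate `rerun()`, so it may emit a
warning depending on the version you have installed. A possible alternative is to map over a
sequence of indices. The sketch below is simplified: it only keeps the mean of each run, and
`sim_means` is a name chosen for this illustration:

```{r}
library("purrr")

set.seed(1234) # only to make this illustration reproducible

# One simulated mean per run; the index i is deliberately unused
sim_means <- map_dbl(seq_len(50), function(i) mean(sample(1:6, 5, replace = TRUE)))

length(sim_means)

mean(sim_means)
```

The anonymous function ignores its argument: the vector of indices only serves to tell `map_dbl()`
how many times to run the simulation.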

You now know the standard `map()` function, and also `rerun()`, both of which return lists, but
there are a number of variants of `map()`. `map_dbl()` returns an atomic vector of doubles, as
we've seen before. A little reminder below:

```{r}
map_dbl(numbers, sqrt_newton, init = 1)
```

In a similar fashion, `map_chr()` returns an atomic vector of strings:

```{r}
map_chr(numbers, sqrt_newton, init = 1)
```

`map_lgl()` returns an atomic vector of `TRUE` or `FALSE`:

```{r}
divisible <- function(x, y){
  # The comparison already evaluates to TRUE or FALSE
  x %% y == 0
}

map_lgl(1:100, divisible, 3)
```

There are also other interesting variants, such as `map_if()`:

```{r}
a <- seq(1,10)

map_if(a, (function(x) divisible(x, 2)), sqrt)
```

I used `map_if()` to take the square root of only those numbers in vector `a` that are divisible by 2,
by using an anonymous function that checks if a number is divisible by 2 (by wrapping `divisible()`).

`map_at()` is similar to `map_if()` but maps the function at a position specified by the user:

```{r}
map_at(numbers, c(1, 3), sqrt)
```

or if you have a named list:

```{r}
recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)

map_at(recipe, "bacon", `*`, 2)
```

I used `map_at()` to double the quantity of bacon in the recipe (by using the `*` function, and specifying
its second argument, `2`. Try the following in the command prompt: `` `*`(3, 4) ``).

`map2()` is the equivalent of `mapply()` and `pmap()` is the generalisation of `map2()` for more
than 2 arguments:

```{r}
print(a)

b <- seq(1, 2, length.out = 10)

print(b)

map2(a, b, `*`)
```

Each element of `a` gets multiplied by the element of `b` that is in the same position.
Let's see what `pmap()` does. Can you guess from the code below what is going on? I will print
`a` and `b` again for clarity:

```{r}
a

b

n <- seq(1:10)

pmap(list(a, b, n), rnorm)
```

Let's take a closer look at what `a`, `b` and `n` look like, when they are placed next to each other:

```{r}
cbind(a, b, n)
```

`rnorm()` first gets called with the parameters from the first row, meaning
`rnorm(a[1], b[1], n[1])`. The second time `rnorm()` gets called, you guessed it,
it is with the parameters on the second row of the array above,
`rnorm(a[2], b[2], n[2])`, etc.
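
Because the matching is done by position, it is easy to lose track of which vector ends up as
which argument. As far as I know, `pmap()` matches by name when the elements of the list are named,
so one option is to name them after the arguments of `rnorm()` (`n`, `mean` and `sd`). The values
below are made up for this sketch:

```{r}
library("purrr")

# Naming the list elements after rnorm()'s arguments makes the matching explicit
rnorm_args <- list(n = c(1, 2, 3), mean = c(0, 10, 100), sd = c(1, 1, 1))

draws <- pmap(rnorm_args, rnorm)

# The first call draws 1 number, the second 2 numbers, the third 3 numbers
lengths(draws)
```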

There are other functions in the `map()` family of functions, but we will discover them in the 
exercises!

The `map()` family of functions does not have any more secrets for you. Let's now take a look at
the `reduce()` family of functions.

### Reducing with `purrr`

Reducing is another important concept in functional programming. It allows going from a list of 
elements, to a single element, by somehow *combining* the elements into one. For instance, using 
the base R `Reduce()` function, you can sum the elements of a list like so:

```{r}
Reduce(`+`, seq(1:100))
```

using `purrr::reduce()`, this becomes:

```{r}
reduce(seq(1:100), `+`)
```

If you don't really get what is happening, don't worry. Things should get clearer once I introduce
another version of `reduce()`, called `accumulate()`, which we will see below.

Sometimes, the direction from which we start to reduce is quite important. You can "start from the
end" of the list by using the `.dir` argument:

```{r}
reduce(seq(1:100), `+`, .dir = "backward")
```

Of course, for associative operations such as addition, direction does not matter. But it does
matter for non-associative operations such as subtraction:


```{r}
reduce(seq(1:100), `-`)

reduce(seq(1:100), `-`, .dir = "backward")
```
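
With a shorter vector, the two directions can be checked by hand. The sketch below assumes the
behaviour of the `.dir` argument as introduced in `{purrr}` 0.3.0:

```{r}
library("purrr")

# Forward: ((1 - 2) - 3) - 4
reduce(1:4, `-`)

# Backward: 1 - (2 - (3 - 4))
reduce(1:4, `-`, .dir = "backward")
```

The first call returns `-8` and the second `-2`: the elements are the same, but the parentheses
are nested from the other end.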

Let's now take a look at `accumulate()`. `accumulate()` is very similar to `reduce()`, but keeps the
intermediary results. Which intermediary results? Let's try and see what happens:

```{r, eval=FALSE}
a <- seq(1, 10)

accumulate(a, `-`)
```


```{r, echo=FALSE}
a <- seq(1, 10)

purrr::accumulate(a, `-`)
```


`accumulate()` illustrates pretty well what is happening; the first element, `1`, is simply the 
first element of `seq(1, 10)`. The second element of the result however, is the difference between
`1` and `2`, `-1`. The next element in `a` is `3`. Thus the next result is `-1-3`, `-4`, and so 
on until we run out of elements in `a`.

The below illustration shows the algorithm step-by-step:

```
(1-2-3-4-5-6-7-8-9-10)
((1)-2-3-4-5-6-7-8-9-10)
((1-2)-3-4-5-6-7-8-9-10)
((-1-3)-4-5-6-7-8-9-10)
((-4-4)-5-6-7-8-9-10)
((-8-5)-6-7-8-9-10)
((-13-6)-7-8-9-10)
((-19-7)-8-9-10)
((-26-8)-9-10)
((-34-9)-10)
(-43-10)
-53
```

`reduce()` only shows the final result of all these operations. `accumulate()` and `reduce()` also
have an `.init` argument, that makes it possible to start the reducing procedure from an initial
value that is different from the first element of the vector:

```{r, eval=FALSE}
reduce(a, `+`, .init = 1000)

accumulate(a, `-`, .init = 1000, .dir = "backward")
```

```{r, echo=FALSE}
reduce(a, `+`, .init = 1000)

purrr::accumulate(a, `-`, .init = 1000, .dir = "backward")
```

`reduce()` generalizes functions that only take two arguments. If you were to write a function that returns
the minimum between two numbers:

```{r}
my_min <- function(a, b){
    if(a < b){
        return(a)
    } else {
        return(b)
    }
}
```

You could use `reduce()` to get the minimum of a list of numbers:

```{r}
numbers2 <- c(3, 1, -8, 9)

reduce(numbers2, my_min)
```

`map()` and `reduce()` are arguably the most useful higher-order functions, and perhaps also the
most famous ones, true ambassadors of functional programming. You might have read about
[MapReduce](https://en.wikipedia.org/wiki/MapReduce), a programming model for processing big
data in parallel. The way MapReduce works is inspired by both these `map()` and `reduce()` functions,
which are always included in functional programming languages. This illustrates that the functional
programming paradigm is very well suited to parallel computing. 

Something else is very important to understand at this point: up until now, we only used these
functions on lists, or atomic vectors, of numbers. However, `map()` and `reduce()`, and other
higher-order functions for that matter, do not care about the contents of the list. What these
functions do is take another function, and make it do something to the elements of the list.
It does not matter if it's a list of numbers, of characters, of data frames, even of models. All that
matters is that the function that will be applied to these elements, can operate on them.
So if you have a list of fitted models, you can map `summary()` on this list to get summaries of
each model. Or if you have a list of data frames, you can map a function that performs several 
cleaning steps. This will be explored in a future section, but it is important to keep this in mind.
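
As a small preview, here is a sketch that uses the built-in `mtcars` dataset. The choice of model
(`mpg ~ wt`) and the object names are only there for illustration:

```{r}
library("purrr")

# Split mtcars by number of cylinders: this gives a list of 3 data frames
cars_by_cyl <- split(mtcars, mtcars$cyl)

# Map a model-fitting function over the list of data frames...
cyl_models <- map(cars_by_cyl, function(df) lm(mpg ~ wt, data = df))

# ...then map another function over the resulting list of models
map_dbl(cyl_models, function(m) summary(m)$r.squared)
```

The first `map()` takes a list of data frames and returns a list of models; `map_dbl()` then takes
this list of models and returns one number per model.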

### Error handling with `safely()` and `possibly()`

`safely()` and `possibly()` are very useful functions. Consider the following situation:

```{r, eval = FALSE}

a <- list("a", 4, 5)

sqrt(a)
```

```
Error in sqrt(a) : non-numeric argument to mathematical function
```

Using `map()` or `Map()` will result in a similar error. `safely()` is a higher-order function that
takes one function as an argument and executes it... *safely*, meaning the execution of the function
will not stop if there is an error. The error message gets captured alongside valid results.

```{r}

a <- list("a", 4, 5)

safe_sqrt <- safely(sqrt)

map(a, safe_sqrt)
```

`possibly()` works similarly, but also allows you to specify a return value in case of an error:

```{r}
possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

map(a, possible_sqrt)
```

Of course, in this particular example, the same effect could be obtained way more easily:

```{r}
sqrt(as.numeric(a))
```

However, in some situations, this trick does not work as intended (or at all). `possibly()` and
`safely()` allow the programmer to model errors explicitly, and to then provide a consistent way
of dealing with them. For instance, consider the following example:

```{r, eval=FALSE}
data(mtcars)

write.csv(mtcars, "my_data/mtcars.csv")
```

```
Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'my_data/mtcars.csv': No such file or directory
```

The folder `my_data/` does not exist, and as such this code produces an error. You might
want to catch this error, and create the directory for instance:

```{r, eval=FALSE}
possibly_write.csv <- possibly(write.csv, otherwise = NULL)

if(is.null(possibly_write.csv(mtcars, "my_data/mtcars.csv"))) {
  print("Creating folder...")
  dir.create("my_data/")
  print("Saving file...")
  write.csv(mtcars, "my_data/mtcars.csv")
}
```

```
[1] "Creating folder..."
[1] "Saving file..."
Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'my_data/mtcars.csv': No such file or directory
```

The warning message comes from the first time we try to write the `.csv`, inside the `if` 
statement. Because this fails, we create the directory and then actually save the file.
In the exercises, you'll discover `quietly()`, which also captures warnings and messages.

To conclude this section: remember function factories? Turns out that `safely()`, `possibly()` and `quietly()` are
function factories.
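
To see what this means in practice, here is a deliberately simplified function factory that mimics
the idea behind `possibly()` using base R's `tryCatch()`. This is only a sketch: `my_possibly()`
is a made-up name, and this is not how `{purrr}` actually implements it:

```{r}
# A simplified, hypothetical stand-in for possibly(); not {purrr}'s actual code
my_possibly <- function(.f, otherwise){
  # The factory returns a new function that wraps .f
  function(...){
    tryCatch(.f(...), error = function(e) otherwise)
  }
}

careful_sqrt <- my_possibly(sqrt, otherwise = NA_real_)

careful_sqrt(16)

careful_sqrt("a")
```

`my_possibly()` takes a function as input and returns a new function as output, which is exactly
what makes it a function factory.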

### Partial applications with `partial()`

Consider the following simple function:

```{r}
add <- function(a, b) a+b
```

It is possible to create a new function, where one of the parameters is fixed, for instance, where
`a = 10`:

```{r}
add_to_10 <- partial(add, a = 10)
```

```{r}
add_to_10(12)
```

This is equivalent to the following:

```{r}
add_to_10_2 <- function(b){
  add(a = 10, b)
}
```

Using `partial()` is much less verbose, however, and allows you to define new functions very quickly:

```{r}
head10 <- partial(head, n = 10)

head10(mtcars)
```

### Function composition using `compose()`

Function composition is another handy tool, which makes chaining functions much more elegant:

```{r}
compose(sqrt, log10, exp)(10)
```

You can read this expression as *`sqrt()` after `log10()` after `exp()`*; it is equivalent to:

```{r}
sqrt(log10(exp(10)))
```

It is also possible to reverse the order the functions get called using the `.dir = ` option:

```{r}
compose(sqrt, log10, exp, .dir = "forward")(10)
```

One could also use the `%>%` operator to achieve the same result:

```{r}
10 %>%
  sqrt %>%
  log10 %>%
  exp
```

but strictly speaking, this is not function composition.
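
Since the order of application is easy to get wrong, here is a small sketch with two helper
functions, `add_three()` and `times_two()`, made up for this illustration, for which the two
directions give different results:

```{r}
library("purrr")

# Two small helper functions, made up for this illustration
add_three <- function(x) x + 3
times_two <- function(x) x * 2

# Default direction: the rightmost function is applied first,
# so this is add_three(times_two(5))
compose(add_three, times_two)(5)

# Forward direction: the leftmost function is applied first,
# so this is times_two(add_three(5))
compose(add_three, times_two, .dir = "forward")(5)
```

The first call returns `13` and the second returns `16`.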

### «Transposing lists»

Another interesting function is `transpose()`. It is not an alternative to the function `t()` from
`base`, but it has a similar effect. `transpose()` works on lists. Let's take a look at the example
from before:

```{r}
safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(a, safe_sqrt)
```

The output is a list in which each element is itself a list with a result and an error message. One
might want to have all the results in a single list, and all the error messages in another list.
This is possible with `transpose()`:

```{r}
purrr::transpose(map(a, safe_sqrt))
```

I explicitly call `purrr::transpose()` because there is also a `data.table::transpose()`, which
is not the same function. You have to be careful about that sort of thing, because it can cause
errors in your programs and debugging this type of error is a nightmare.
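
Once the list is transposed, the results and the errors can be handled separately. Here is a
sketch of what this could look like; the inputs and the object names are chosen for this
illustration only:

```{r}
library("purrr")

sqrt_inputs <- list("a", 4, 9)

sqrt_outputs <- purrr::transpose(map(sqrt_inputs, safely(sqrt, otherwise = NA_real_)))

# All the results, collected in a single atomic vector
unlist(sqrt_outputs$result)

# Which inputs produced an error?
map_lgl(sqrt_outputs$error, function(e) !is.null(e))
```

Only the first input, the string `"a"`, produces an error; its result is the `otherwise` value,
`NA`.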

Now that we are familiar with functional programming, let's try to apply some of its principles
to data manipulation.

## List-based workflows for efficiency

You can use your own functions in pipe workflows:

```{r}
double_number <- function(x){
  x+x
}
```

```{r}
mtcars %>%
  head() %>%
  mutate(double_mpg = double_number(mpg))
```

It is important to understand that your functions, and functions that are built into R, or that
come from packages, are exactly the same thing. Every function is a first-class object in R, no
matter where it comes from. The consequence of functions being first-class objects is that
functions can take functions as arguments, functions can return functions (the function factories
from the previous chapter) and functions can be assigned to any variable:

```{r}
plop <- sqrt

plop(4)
```

```{r}
bacon <- function(.f){

  message("Bacon is tasty")

  .f

}

bacon(sqrt) # `bacon` is a function factory, as it returns a function (alongside an informative message)

# To actually call it:
bacon(sqrt)(4)
```
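
Since functions are objects like any other, they can also be stored in a list and mapped over.
A minimal sketch, with made-up values:

```{r}
library("purrr")

sample_values <- c(1, 2, 3, 10)

# Functions are ordinary objects, so they can be stored in a list...
summary_funs <- list(mean = mean, median = median, max = max)

# ...and mapped over like any other list: each function is called on sample_values
map_dbl(summary_funs, function(f) f(sample_values))
```

Here the roles are reversed compared to the previous examples: the list contains the functions,
and the data is held constant.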

Now, let's step back for a bit and think about what we learned up until now, and especially
the `map()` family of functions.

Let's read the list of datasets from the previous chapter:

```{r}
paths <- Sys.glob("datasets/unemployment/*.csv")

all_datasets <- import_list(paths)

str(all_datasets)
```

`all_datasets` is a list with `r length(all_datasets)` elements, each of them is a `data.frame`.

The first thing we are going to do is use a function to clean the names of the datasets. These
names are not very easy to work with; there are spaces, and it would be better if the names of the
columns were all lowercase. For this we are going to use the function `clean_names()` from the
`janitor` package. For a single dataset, I would write this:

```{r, include=FALSE}
library(janitor)
```

```{r, eval = FALSE}
library(janitor)

one_dataset <- one_dataset %>%
  clean_names()
```

and I would get a dataset with column names in lowercase and spaces replaced by `_` (and other
corrections). How can I apply, or map, this function to each dataset in the list? To do this I need
to use `purrr::map()`, which we've seen in the previous section:

```{r}
library(purrr)

all_datasets <- all_datasets %>%
  map(clean_names)

all_datasets %>%
  glimpse()
```

Remember that `map(list, function)` simply applies `function` to each element of `list`.

So now, what if I want to know, for each dataset, which *communes* have an unemployment rate that is
less than, say, 3%? For a single dataset I would do something like this:

```{r, eval=FALSE}
one_dataset %>%
  filter(unemployment_rate_in_percent < 3)
```

but since we're dealing with a list of data sets, we cannot simply use `filter()` on it. This is because
`filter()` expects a data frame, not a list of data frames. The way around this is to use `map()`.

```{r}
all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3))
```

`map()` needs a function to map to each element of the list. `all_datasets` is the list to which I
want to map the function. But what function? `filter()` is the function I need, so why doesn't:

```{r, eval = FALSE}
all_datasets %>%
  map(filter(unemployment_rate_in_percent < 3))
```

work? This is what happens if we try it:

```
Error in filter(unemployment_rate_in_percent < 3) :
  object 'unemployment_rate_in_percent' not found
```

This is because `filter()` needs both the dataset and a so-called predicate (a predicate
is an expression that evaluates to `TRUE` or `FALSE`). Here, `filter()` is called on its own,
before `map()` gets a chance to pass it anything, so it treats the expression
`unemployment_rate_in_percent < 3` as the dataset and cannot find that object. The way around this
is to use an anonymous function (discussed in Chapter 7), which allows you to explicitly state what
is the dataset, and what is the predicate. As we've seen, there are three ways to define
anonymous functions:

- Using a formula (only works within `{tidyverse}` functions):

```{r}
all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3)) %>% 
  glimpse()
```

(notice the `.` in the formula, making the position of the dataset as the first argument to `filter()` 
explicit) or 

- using an anonymous function (using the `function(x)` keyword):

```{r}
all_datasets %>%
  map(function(x)filter(x, unemployment_rate_in_percent < 3)) %>%
  glimpse()
```

- or, since R 4.1, using the shorthand `\(x)`:


```{r}
all_datasets %>%
  map(\(x)filter(x, unemployment_rate_in_percent < 3)) %>%
  glimpse()
```

As you see, everything is starting to come together: lists, to hold complex objects, over which anonymous
functions are mapped using higher-order functions. Let's continue cleaning this dataset.

Before merging these datasets together, we would need them to have a `year` column indicating the
year the data was measured in each data frame. It would also be helpful if we gave names to these datasets, meaning
converting the list to a named list. For this task, we can use `purrr::set_names()`:

```{r, eval=FALSE}
all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016)))
```

Let's take a look at the list now:

```{r, eval=FALSE}
str(all_datasets)
```

As you can see, each `data.frame` object contained in the list now has a name. You can thus
access them with the `$` operator:

```{r, echo=FALSE}
knitr::include_graphics("pics/all_datasets_names.png")
```

Using `map()` we now know how to apply a function to each dataset of a list. But maybe it would be
easier to merge all the datasets first, and then manipulate them? This can be the case sometimes,
but not always. 
As long as you provide a function and a list of elements to `reduce()`, you will get a single
output. So how could `reduce()` help us with merging all the datasets that are in the list? `dplyr`
comes with a lot of functions to merge *two* datasets. Remember that I said before that `reduce()`
allows you to generalize a function of two arguments? Let's try it with our list of datasets:


```{r}
unemp_lux <- reduce(all_datasets, full_join)

glimpse(unemp_lux)
```
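To see what `reduce()` does with a join, here is a small sketch on made-up data. The data frames
in `toy_list` and the key column `id` are hypothetical; the point is only that `reduce()` joins the
first two elements, then joins that result to the third, and so on:

```{r}
library(dplyr)
library(purrr)

# Hypothetical toy data: three data frames sharing the key column `id`
toy_list <- list(
  tibble(id = c(1, 2), a = c("x", "y")),
  tibble(id = c(2, 3), b = c(10, 20)),
  tibble(id = c(1, 3), d = c(TRUE, FALSE))
)

# reduce() joins the first two data frames, then joins the result to the third
joined_with_reduce <- reduce(toy_list, full_join, by = "id")

# The same thing written out by hand
joined_by_hand <- full_join(
  full_join(toy_list[[1]], toy_list[[2]], by = "id"),
  toy_list[[3]],
  by = "id"
)

all.equal(joined_with_reduce, joined_by_hand)
```

Passing `by = "id"` explicitly is a choice made for this sketch; without it, `full_join()` joins on
all the columns the two data frames have in common and prints a message saying which ones it used.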

`full_join()` is one of the `dplyr` functions that merge data. There are others that might be
useful depending on the kind of join operation you need. Let's write this data to disk as we're
going to keep using it for the next chapters:

```{r}
export(unemp_lux, "datasets/unemp_lux.csv")
```


### Functional programming and plotting

In this section, we are going to learn how to use the possibilities offered by the `purrr` package
and how it can work together with `ggplot2` to generate many plots. This is a more advanced topic,
but what comes next is also what makes R and the functional programming paradigm so powerful.

For example, suppose that instead of wanting a single plot with the unemployment rate of each
commune, you need one unemployment plot per commune:

```{r}
unemp_lux_data %>%
  filter(division == "Luxembourg") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") +
  geom_line()
```

and then you would write the same for "Esch-sur-Alzette" and also for "Wiltz". If you only have to
make these 3 plots, copying and pasting the above lines is no big deal:

```{r}
unemp_lux_data %>%
  filter(division == "Esch-sur-Alzette") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

unemp_lux_data %>%
  filter(division == "Wiltz") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()
```

But copying and pasting is error-prone. Can you spot the copy-paste mistake I made? And what if you
have to create the above plots for all 108 Luxembourgish communes? That's a lot of copying and pasting.
What if, once you are done, you have to change something, for example, the theme? You
could use the search and replace function of RStudio, true, but sometimes search and replace can
also introduce bugs and typos. You can avoid all these issues by using `purrr::map()`. What do you
need to map over? The commune names. So let's create a list of commune names:

```{r}
communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz")
```

Now we can create the graphs using `map()`, or `map2()` to be exact:

```{r}
plots_tibble <- unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest() %>%
  mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) +
       theme_minimal() +
       geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
       labs(title = paste("Unemployment in", .y))))
```

Let's study this line by line: the first line is easy, we simply use `filter()` to keep only the
communes we are interested in. Then we group by `division` and use `tidyr::nest()`. As a refresher,
let's take a look at what this does:

```{r}
unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest()
```

This creates a tibble with two columns, `division` and `data`, where each element of `data` (one
per commune in this case) is another tibble with all the original variables. This is very useful,
because now we can pass these tibbles to `map2()`, to generate the plots. But why `map2()` and
what's the difference with `map()`? `map2()` works the same way as `map()`, but maps over two
inputs:

```{r}
numbers1 <- list(1, 2, 3, 4, 5)

numbers2 <- list(9, 8, 7, 6, 5)

map2(numbers1, numbers2, `*`)
```

In our example with the graphs, the two inputs are the data, and the names of the communes. This is
useful to create the title with `labs(title = paste("Unemployment in", .y))` where `.y` is the
second input of `map2()`, the commune names contained in variable `division`.
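The role of `.x` and `.y` inside the formula can be seen in isolation with a small sketch. The
two lists below, `toy_rates` and `toy_communes`, hold made-up values; `map2_chr()` is the variant
of `map2()` that returns a character vector instead of a list:

```{r}
library(purrr)

# Hypothetical inputs: made-up rates and the communes they belong to
toy_rates <- list(5.1, 4.3)
toy_communes <- list("Luxembourg", "Wiltz")

# .x is taken from the first list, .y from the second, one pair at a time
toy_titles <- map2_chr(toy_rates, toy_communes, ~paste("Unemployment in", .y, "is", .x))

toy_titles
```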

So what happened? We now have a tibble called `plots_tibble` that looks like this:

```{r}
print(plots_tibble)
```

This tibble contains three columns, `division`, `data` and now a new one called `plot`, that we
created before using the last line `mutate(plot = ...)` (remember that `mutate()` adds columns to
tibbles). `plot` is a list-column, with elements... being plots! Yes you read that right, the
elements of the column `plot` are literally plots. This is what I meant by list-columns.
Let's see what is inside the `data` and the `plot` columns exactly:

```{r}
plots_tibble %>%
  pull(data)
```

Each element of `data` is a tibble for the specific commune with columns `year`, `active_population`,
etc, the original columns. But obviously, there is no `division` column. So to plot the data, and
join all the dots together, we need to add `group = 1` in the call to `aes()` (whereas if you
plot multiple lines in the same graph, you need to write `group = division`).

But more interestingly, how can you actually see the plots? If you want to simply look at them, it
is enough to use `pull()`:

```{r}
plots_tibble %>%
  pull(plot)
```

And if we want to save these plots, we can do so using `map2()`:

```{r, eval=FALSE}
map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)
```

```
Saving 7 x 5 in image
Saving 6.01 x 3.94 in image
Saving 6.01 x 3.94 in image
```
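Because `ggsave()` is called here for its side effect (writing a file), `purrr::walk2()` is a
possible alternative to `map2()`: it calls the function in the same way, but returns its first
input invisibly instead of a list of results. Here is a minimal sketch; to keep it self-contained,
it writes made-up text files to a temporary folder with `writeLines()` rather than saving plots:

```{r}
library(purrr)

# Hypothetical file names and contents, written to a temporary folder
toy_paths <- file.path(tempdir(), paste0(c("Luxembourg", "Wiltz"), ".txt"))
toy_contents <- c("first file", "second file")

# walk2() calls writeLines(text, con) once per pair, only for the side effect
walk2(toy_contents, toy_paths, writeLines)

# Check that the files were written
file.exists(toy_paths)
```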

This was probably the most advanced topic we have studied yet; but you probably agree with me that
it is among the most useful ones. This section is a perfect illustration of the power of functional
programming; you can mix and match functions as long as you give them the correct arguments.
You can pass data to functions that use data and then pass these functions to other functions that
use functions as arguments, such as `map()`.^[Functions that have other functions as input are
called *higher-order functions*.] `map()` does not care if the function you pass to it produces tables,
graphs or even another function. `map()` will simply map this function to a list of inputs, and as
long as these inputs are correct arguments to the function, `map()` will do its magic. If you
combine this with list-columns, you can even use `map()` alongside `dplyr` functions and map your
function by first grouping, filtering, etc...

### Modeling with functional programming

As written just above, `map()` simply applies a function to a list of inputs, and in the previous
section we mapped `ggplot()` to generate many plots at once. This approach can also be used to 
map any modeling function, for instance `lm()`, over a list of datasets.

For instance, suppose that you wish to perform a Monte Carlo simulation. Suppose that you are 
dealing with a binary choice problem; usually, you would use a logistic regression for this.

However, in certain disciplines, especially in the social sciences, the so-called Linear Probability 
Model is often used as well. The LPM is a simple linear regression, but unlike the standard setting
of a linear regression, the dependent variable, or target, is a binary variable, and not a continuous
variable. Before you yell "Wait, that's illegal", you should know that in practice LPMs do a good 
job of estimating marginal effects, which is what social scientists and econometricians are often
interested in. Marginal effects are another way of interpreting models, giving how the outcome 
(or the target) changes given a change in an independent variable (or a feature). For instance,
a marginal effect of 0.10 for age would mean that the probability of success would increase by 10
percentage points for each added year of age. We already discussed marginal effects in Chapter 6.

There has been a lot of discussion on logistic regression vs LPMs, and there are pros and cons
of using LPMs. Micro-econometricians are still fond of LPMs, even though the pros of LPMs are 
not really convincing. However, quoting Angrist and Pischke:

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs
(limited dependent variables) more closely than a linear model, when it comes to marginal effects,
this probably matters little" (source: *Mostly Harmless Econometrics*)

so LPMs are still used for estimating marginal effects.

Let us check this assessment with one example. First, we simulate some data, then 
run a logistic regression and compute the marginal effects, and then compare with an LPM:

```{r}
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
  
z <- .5 + 2*x1 + 4*x2

p <- 1/(1 + exp(-z))

y <- rbinom(100, 1, p)

df <- tibble(y = y, x1 = x1, x2 = x2)
```

This data generating process generates data from a binary choice model. Fitting the model using a 
logistic regression allows us to recover the structural parameters:

```{r}
logistic_regression <- glm(y ~ ., data = df, family = binomial(link = "logit"))
```

Let's see a summary of the model fit:

```{r}
summary(logistic_regression)
```

We do recover the parameters that generated the data, but what about the marginal effects? We can
get the marginal effects easily using the `{margins}` package:

```{r}
library(margins)

margins(logistic_regression)
```
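As a rough sketch of what an average marginal effect is for a logit model with continuous
regressors, it can also be computed by hand with base R only. The object names prefixed with
`toy_` are made up for this sketch, and the data is simulated the same way as above. I would
expect the values to be close to what `margins()` reports, but this is an illustration of the
idea, not a reimplementation of the package:

```{r}
# Simulated data, generated the same way as above but with made-up object names
set.seed(1234)
toy_x1 <- rnorm(100)
toy_x2 <- rnorm(100)
toy_y <- rbinom(100, 1, plogis(0.5 + 2 * toy_x1 + 4 * toy_x2))

toy_fit <- glm(toy_y ~ toy_x1 + toy_x2, family = binomial(link = "logit"))

# Linear predictor for every observation
toy_z <- predict(toy_fit, type = "link")

# Average marginal effect: density of the logistic distribution at the
# linear predictor, times the coefficient, averaged over the sample
toy_ame <- sapply(coef(toy_fit)[-1], function(b) mean(dlogis(toy_z) * b))

toy_ame
```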

Or, even better, we can compute the *true* marginal effects, since we know the data 
generating process:

```{r}
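meffects <- function(dataset, coefs){
  X <- dataset %>% 
    select(-y) %>% 
    as.matrix()
  
  # Linear predictor of the data generating process, including the intercept coefs[1]
  z <- coefs[1] + X %*% coefs[2:3]
  
  # Average over the sample of the logistic density times the coefficient
  dydx_x1 <- mean(dlogis(z) * coefs[2])
  dydx_x2 <- mean(dlogis(z) * coefs[3])
  
  tribble(~term, ~true_effect,
          "x1", dydx_x1,
          "x2", dydx_x2)
}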

(true_meffects <- meffects(df, c(0.5, 2, 4)))
```
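For reference, here is the formula behind the *true* marginal effects of the data generating
process above. The notation ($\beta_j$, $\Lambda$, $\lambda$) is introduced here for convenience.
With $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and $P(y = 1 \mid x_1, x_2) = \Lambda(z) = 1/(1 + e^{-z})$,
the marginal effect of $x_j$ is:

$$
\frac{\partial P(y = 1 \mid x_1, x_2)}{\partial x_j} = \beta_j \, \lambda(z), \qquad \lambda(z) = \Lambda(z)\,\bigl(1 - \Lambda(z)\bigr)
$$

Since this depends on $z$, it is different for every observation. Averaging over the $n$
observations of the sample gives the average marginal effect:

$$
\frac{1}{n} \sum_{i=1}^{n} \beta_j \, \lambda(z_i)
$$

In R, $\lambda$ is the density of the logistic distribution, `dlogis()`, and $\Lambda$ is its
distribution function, `plogis()`.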

Ok, so now what about using this infamous Linear Probability Model to estimate the marginal effects?

```{r}
lpm <- lm(y ~ ., data = df)

summary(lpm)
```

It's not too bad, but maybe it could have been better in other circumstances. Perhaps if we had more
observations, or perhaps for a different set of structural parameters the results of the LPM
would have been closer. The LPM estimates the marginal effect of `x1` to be 
`r summary(lpm)$coefficients[2]` vs `r mean(marginal_effects(logistic_regression)$dydx_x1)`
for the logistic regression and for `x2`, the LPM estimation is `r summary(lpm)$coefficients[3]` 
vs `r mean(marginal_effects(logistic_regression)$dydx_x2)`. The *true* marginal effects are 
`r true_meffects$true_effect[1]` and `r true_meffects$true_effect[2]` for `x1` and `x2` respectively.

Just as data scientists perform cross-validation to assess the accuracy of a model, a Monte Carlo
study can be performed to assess how close the estimation of the marginal effects using an LPM is 
to the marginal effects implied by the logistic model that generated the data. It will allow us to
test with datasets of different sizes, and generated using different structural parameters.

First, let's write a function that generates data. The function below generates 10 datasets of size 
100 (the code is inspired by this [StackExchange answer](https://stats.stackexchange.com/a/46525)):

```{r}
generate_datasets <- function(coefs = c(.5, 2, 4), sample_size = 100, repeats = 10){

  generate_one_dataset <- function(coefs, sample_size){
    x1 <- rnorm(sample_size)
    x2 <- rnorm(sample_size)

    z <- coefs[1] + coefs[2]*x1 + coefs[3]*x2

    p <- 1/(1 + exp(-z))

    y <- rbinom(sample_size, 1, p)

    tibble(y = y, x1 = x1, x2 = x2)
  }

  # rerun() is deprecated since purrr 1.0.0; mapping over seq_len(repeats) does the same job
  simulations <- map(seq_len(repeats), \(i) generate_one_dataset(coefs, sample_size))

  tibble("coefs" = list(coefs), "sample_size" = sample_size, "repeats" = repeats, "simulations" = list(simulations))
}
```

Let's first generate one dataset:

```{r}
one_dataset <- generate_datasets(repeats = 1)
```

Let's take a look at `one_dataset`:

```{r}
one_dataset
```

As you can see, the tibble with the simulated data is inside a list-column called `simulations`.
Let's take a closer look:

```{r}
str(one_dataset$simulations)
```

The structure is quite complex, and it's important to understand this, because it will have an
impact on the next lines of code; it is a list, containing a list, containing a dataset! No worries
though, we can still map over the datasets directly, by using `modify_depth()` instead of `map()`.
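To get a feeling for `modify_depth()`, here is a minimal sketch on a made-up nested list,
`toy_nested`. Applying a function at depth 2 amounts to calling `map()` inside `map()`:

```{r}
library(purrr)

# Hypothetical nested list: a list containing lists of numbers
toy_nested <- list(list(1, 2), list(3))

# Apply the function at the second level, keeping the nesting intact
with_modify_depth <- modify_depth(toy_nested, 2, \(v) v * 10)

# The same thing with two nested calls to map()
with_nested_map <- map(toy_nested, \(inner) map(inner, \(v) v * 10))

all.equal(with_modify_depth, with_nested_map)
```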

Now, let's fit an LPM and compare the estimation of the marginal effects with the *true* marginal
effects. In order to have some confidence in our results, 
we will not simply run a linear regression on that single dataset, but will instead simulate hundreds, 
then thousands and tens of thousands of datasets, get the marginal effects and compare 
them to the true ones (but here I won't simulate more than 500 datasets).

Let's first generate 10 datasets:

```{r}
many_datasets <- generate_datasets()
```

Now comes the tricky part. I have this object, `many_datasets` looking like this:

```{r}
many_datasets
```

I would like to fit LPMs to the 10 datasets. For this, I will need to use all the power of functional
programming and the `{tidyverse}`. I will be adding columns to this data frame using `mutate()`
and mapping over the `simulations` list-column using `modify_depth()`. The list of data frames is
at the second level (remember, it's a list containing a list containing data frames).

I'll start by fitting the LPMs, then using `broom::tidy()` I will get a nice data frame of the 
estimated parameters. I will then only select what I need, and then bind the rows of all the 
data frames. I will do the same for the *true* marginal effects.

I highly suggest that you run the following lines, one after another. It is complicated to understand
what's going on if you are not used to such workflows. However, I hope to convince you that once
it clicks, it'll be much more intuitive than doing all this inside a loop. Here's the code:

```{r}
results <- many_datasets %>% 
  mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
  mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
  mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
  mutate(lpm = map(lpm, bind_rows)) %>% 
  mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
  mutate(true_effect = map(true_effect, bind_rows))
```

This is what `results` looks like:

```{r}
results
```

Let's take a closer look at the `lpm` and `true_effect` columns:

```{r}
results$lpm

results$true_effect
```

Let's bind the columns, and compute the difference between the *true* and estimated marginal 
effects:

```{r}
simulation_results <- results %>% 
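  # The rows of lpm and true_effect are in the same order, so bind them side by side;
  # full_join() on term alone would pair every estimate with every true effect
  mutate(difference = map2(.x = lpm, .y = true_effect, ~bind_cols(.x, select(.y, true_effect)))) %>%  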
  mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
  mutate(difference = map(difference, ~select(., term, difference))) %>% 
  pull(difference) %>% 
  .[[1]]
```

Let's take a look at the simulation results:

```{r}
simulation_results %>% 
  group_by(term) %>% 
  summarise(mean = mean(difference), 
            sd = sd(difference))
```

Already with only 10 simulated datasets, the mean difference is small. Let's rerun
the analysis, but for different sizes. In order to make things easier, we can put all the code
into a nifty function:

```{r}
monte_carlo <- function(coefs, sample_size, repeats){
  many_datasets <- generate_datasets(coefs, sample_size, repeats)
  
  results <- many_datasets %>% 
    mutate(lpm = modify_depth(simulations, 2, ~lm(y ~ ., data = .x))) %>% 
    mutate(lpm = modify_depth(lpm, 2, broom::tidy)) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~select(., term, estimate))) %>% 
    mutate(lpm = modify_depth(lpm, 2, ~filter(., term != "(Intercept)"))) %>% 
    mutate(lpm = map(lpm, bind_rows)) %>% 
    mutate(true_effect = modify_depth(simulations, 2, ~meffects(., coefs = coefs[[1]]))) %>% 
    mutate(true_effect = map(true_effect, bind_rows))

  simulation_results <- results %>% 
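    # The rows of lpm and true_effect are in the same order, so bind them side by side;
    # full_join() on term alone would pair every estimate with every true effect
    mutate(difference = map2(.x = lpm, .y = true_effect, ~bind_cols(.x, select(.y, true_effect)))) %>% 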
    mutate(difference = map(difference, ~mutate(., difference = true_effect - estimate))) %>% 
    mutate(difference = map(difference, ~select(., term, difference))) %>% 
    pull(difference) %>% 
    .[[1]]

  simulation_results %>% 
    group_by(term) %>% 
    summarise(mean = mean(difference), 
              sd = sd(difference))
}
```

And now, let's run the simulation for different parameters and numbers of simulated datasets:

```{r}
monte_carlo(c(.5, 2, 4), 100, 10)
monte_carlo(c(.5, 2, 4), 100, 100)
monte_carlo(c(.5, 2, 4), 100, 500)

monte_carlo(c(pi, 6, 9), 100, 10)
monte_carlo(c(pi, 6, 9), 100, 100)
monte_carlo(c(pi, 6, 9), 100, 500)
```

We see that, at least for these sets of parameters, the LPM does a good job of estimating marginal 
effects.

Now, this study might in itself not be very interesting to you, but I believe the general approach
is quite useful and flexible enough to be adapted to all kinds of use-cases.

## Exercises

### Exercise 1 {-}

Suppose you have an Excel workbook that contains data on three sheets. Create a function that
reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one
sheet (download the example Excel workbook, `example_workbook.xlsx`, from the `assets` folder on
the book's GitHub).

### Exercise 2 {-}

Use one of the `map()` functions to combine two lists into one. Consider the following two lists:

```{r, eval = FALSE}
mediterranean <- list("starters" = list("humous", "olives"), "dishes" = list("sardines", "lasagna"))

continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut"))
```

The result we'd like to have would look like this:

```{r, eval = FALSE, include = FALSE}
map2(mediterranean, continental, append)
```


```{r, eval = FALSE}
$starters
$starters[[1]]
[1] "humous"

$starters[[2]]
[1] "olives"

$starters[[3]]
[1] "pea soup"

$starters[[4]]
[1] "terrine"


$dishes
$dishes[[1]]
[1] "sardines"

$dishes[[2]]
[1] "lasagna"

$dishes[[3]]
[1] "frikadelle"

$dishes[[4]]
[1] "sauerkraut"
```