functional programming added

b-rodrigues · Oct 25, 2023 · 98eb5eb · 98eb5eb
1 parent f5fc51f
commit 98eb5eb
Show file tree

Hide file tree

Showing 2 changed files with 279 additions and 0 deletions.
diff --git a/03-functional-programming.qmd b/03-functional-programming.qmd
@@ -0,0 +1,278 @@
+# A primer on functional programming
+
+<div style="text-align:center;">
+```{r, echo = F}
+knitr::include_graphics("img/lambda.png")
+```
+</div>
+
+What you'll have learned by the end of the chapter: writing your own functions,
+functional programming basics (map, reduce, anonymous functions and higher-order
+functions).
+
+## Introduction
+
+Functional programming is a way of writing programs that relies exclusively on
+the evalutation of functions. Mathematical functions have a very neat property:
+for any given input, they ALWAYS return exactly the same output. This is what we
+want to achieve with the functions that we will write. Functions that always
+return the same result are called pure, and a language that only allows writing
+pure functions is called a pure functional programming language. R is not a pure
+functional programming language, so we have to be careful not to write impure
+functions that manipulate the global state.
+
+But what is state? Run the following code in your console:
+
+```{r, eval = F}
+ls()
+```
+
+This will list every object defined in the global environment. Now run the
+following line:
+
+```{r, eval = F}
+
+x <- 1
+
+```
+
+and then `ls()` again. `x` should now be listed alongside the other objects. You
+just manipulated the state of your current R session. Now if you run something
+like:
+
+```{r, eval = F}
+
+x + 1
+
+```
+
+This will produce `2`. We want to avoid pipelines that depend on some definition
+of some global variable somewhere, which could be subject to change, because
+this could mean that 2 different runs of the same pipeline could produce 2
+different results. Notice that I used the verb *avoid* in the sentence before.
+This is sometimes not possible to avoid. Such situations have to be carefully
+documented and controlled.
+
+As a more realistic example, imagine that within the pipeline you set up, some
+random numbers are generated. For example, to generate 10 random draws from a
+normal distribution:
+
+```{r}
+rnorm(n = 10)
+```
+
+Each time you run this line, you will get another set of 10 random numbers. This
+is obviously a good thing in interactive data analysis, but much less so when
+running a pipeline programmatically. R provides a way to fix the random seed,
+which will make sure you always get the same random numbers:
+
+```{r}
+set.seed(1234)
+rnorm(n = 10)
+```
+
+But `set.seed()` only works for one call, so you must call it again if you need
+the random numbers again:
+
+```{r}
+set.seed(1234)
+rnorm(10)
+
+rnorm(10)
+
+set.seed(1234)
+rnorm(10)
+```
+
+The problem with `set.seed()` is that you only partially solve the problem of
+`rnorm()` not being pure; this is because while `rnorm()` now does return the
+same output for the same input, this only works if you manipulate the state of
+your program to change the seed beforehand. Ideally, we would like to have a
+pure version of `rnorm()`, which would be self-contained and not depend on the
+value of the seed defined in the global environment. There is a package
+developped by Posit (the makers of RStudio and the packages from the
+*tidyverse*), called `{withr}` which allows to rewrite our functions in a pure
+way. `{withr}` has several functions, all starting with `with_` that allow users
+to run code with some temporary defined variables, without altering the global
+environment. For example, it is possible to run a `rnorm()` with a seed, using
+`withr::with_seed()`:
+
+```{r}
+library(withr)
+
+with_seed(seed = 1234, {
+  rnorm(10)
+})
+```
+
+But ideally you’d want to go a step further and define a new function
+that is pure. To turn an impure function into a pure function, you usually only
+need to add some arguments to it. This is how we would create a `pure_rnorm()`
+function:
+
+```{r}
+pure_rnorm <- function(..., seed){
+
+  with_seed(seed, rnorm(...))
+}
+
+pure_rnorm(10, seed = 1234)
+```
+
+`pure_rnorm()` is now self-contained, and does not pollute the global
+environment. We’re going to learn how to write functions in just a
+bit, so don’t worry if the code above does not make sense yet.
+
+<div style="text-align:center;">
+```{r, echo = F}
+knitr::include_graphics("img/cat_loops.png")
+```
+</div>
+
+A very practical consequence of using functional programming is that loops are
+not used, because loops are imperative and imperative programming is all about
+manipulating state. However, there are situations where loops are more efficient
+than the alternative (in R at least). So we will still learn and use them, but
+only when absolutely necessary, and we will always encapsulate a loop inside a
+function. Just like with the example above, this ensures that we have a pure,
+self-contained function that we can reason about easily. What I mean by this, is
+that loops are not always very easy to decipher. The concept of loops is simple
+enough: take this instruction, and repeat it N times. But in practice, if
+you’re reading code, it is not possible to understand what a loop is
+doing at first glance. There are only two solutions in this case:
+
+- you’re lucky and there are comments that explain what the loop is doing;
+- you have to let the loop run either in your head or in a console with some examples to really understand whit is going on.
+
+For example, consider the following code:
+
+```{r}
+suppressPackageStartupMessages(library(dplyr))
+
+data(starwars)
+
+sum_humans <- 0
+sum_others <- 0
+n_humans <- 0
+n_others <- 0
+
+for(i in seq_along(1:nrow(starwars))){
+
+  if(!is.na(unlist(starwars[i, "species"])) &
+     unlist(starwars[i, "species"]) == "Human"){
+    if(!is.na(unlist(starwars[i, "height"]))){
+      sum_humans <- sum_humans + unlist(starwars[i, "height"])
+      n_humans <- n_humans + 1
+    } else {
+
+      0
+
+    }
+
+  } else {
+    if(!is.na(unlist(starwars[i, "height"]))){
+      sum_others <- sum_others + unlist(starwars[i, "height"])
+      n_others <- n_others + 1
+    } else {
+      0
+    }
+  }
+}
+
+mean_height_humans <- sum_humans/n_humans
+mean_height_others <- sum_others/n_others
+```
+
+What this does is not immediately obvious. The only hint you get are the two
+last lines, where you can read that we compute the average height for humans and
+non-humans in the sample. And this code could look a lot worse, because I am
+using functions like `is.na()` to test if a value is `NA` or not, and
+I’m using `unlist()` as well. If you compare this mess to a
+functional approach, I hope that I can stop my diatribe against imperative style
+programming here:
+
+```{r}
+starwars %>%
+  group_by(is_human = species == "Human") %>%
+  summarise(mean_height = mean(height, na.rm = TRUE))
+```
+
+Not only is this shorter, it doesn’t even need any comments to
+explain what’s going on. If you’re using functions with
+explicit names, the code becomes self-explanatory.
+
+The other advantage of a functional (also called declarative) programming style
+is that you get function composition for free. Function composition is an
+operation that takes two functions *g* and *f* and returns a new function *h*
+such that $h(x) = g(f(x))$. Formally:
+
+```
+h = g ∘ f such that h(x) = g(f(x))
+```
+
+`∘` is the composition operator. You can read `g ∘ f` as
+*g after f*. When using functional programming, you can compose functions very
+easily, simply by using `|>` or `%>%`:
+
+```{r, eval = F}
+h <- f |> g
+```
+
+`f |> g` can be read as *f then g*, which is equivalent to *g after f*. Function
+composition might not seem like a big deal, but it actually is. If we structure
+our programs in this way, as a sequence of function calls, we get many benefits.
+Functions are easy to test, document, maintain, share and can be composed. This
+allows us to very succintly express complex workflows:
+
+```{r}
+starwars %>%
+  filter(skin_color == "light") %>%
+  select(species, sex, mass) %>%
+  group_by(sex, species) %>%
+  summarise(
+    total_individuals = n(),
+    min_mass = min(mass, na.rm = TRUE),
+    mean_mass = mean(mass, na.rm = TRUE),
+    sd_mass = sd(mass, na.rm = TRUE),
+    max_mass = max(mass, na.rm = TRUE),
+    .groups = "drop"
+  ) %>%
+  select(-species) %>%
+  tidyr::pivot_longer(-sex, names_to = "statistic", values_to = "value")
+```
+
+Needless to say, writing this in an imperative approach would be quite
+complicated.
+
+Another consequence of using functional programming is that our code will live
+in plain text files, and not in Jupyter (or equivalent) notebooks. Not only does
+imperative code have state, but notebooks themselves have a (hidden) state. You
+should avoid notebooks at all costs, even for experimenting.
+
+## Defining your own functions
+
+Let's first learn about actually writing functions. Read [chapter
+7](https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html) of
+my other book.
+
+The most important concepts for this course are discussed in the following
+sections:
+
+- functions that take functions as arguments [(section 7.4)](https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html#functions-that-take-functions-as-arguments-writing-your-own-higher-order-functions)
+- functions that take data (and the data's columns) as arguments [(section 7.6)](https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html#functions-that-take-columns-of-data-as-arguments);
+
+## Functional programming
+
+You should ideally work through the whole of chapter 7, and then tackle [chapter
+8](https://b-rodrigues.github.io/modern_R/functional-programming.html). What's
+important there are:
+
+- `purrr::map()`, `purrr::reduce()` (sections [8.3.1](https://b-rodrigues.github.io/modern_R/functional-programming.html#doing-away-with-loops-the-map-family-of-functions) and [8.3.2](https://b-rodrigues.github.io/modern_R/functional-programming.html#reducing-with-purrr))
+- And list based workflows (section [8.4](https://b-rodrigues.github.io/modern_R/functional-programming.html#functional-programming-and-plotting))
+
+## Further reading
+
+- [Cleaner R Code with Functional Programming](https://towardsdatascience.com/cleaner-r-code-with-functional-programming-adc37931ef7a)
+- [Functional Programming (Chapter from Advanced R)](http://adv-r.had.co.nz/Functional-programming.html)
+- [Why you should(n't) care about Monads if you're an R programmer](https://www.brodrigues.co/blog/2022-04-11-monads/)
+- [Some learnings from functional programming you can use to write safer programs](https://www.brodrigues.co/blog/2022-05-26-safer_programs/)
diff --git a/_quarto.yml b/_quarto.yml
@@ -14,6 +14,7 @@ book:
   chapters:
     - index.qmd
     - 02-intro_R.qmd
+    - 03-functional-programming.qmd
   page-navigation: true
 
 bibliography: references.bib