# Functional programming
Functional programming is a paradigm that I find very suitable for data science. In functional
programming, your code is organised into functions that perform the operations you need. Your scripts
will only be a sequence of calls to these functions, making them easier to understand. R is not a pure
functional programming language, so we need some self-discipline to apply pure functional programming
principles. However, these efforts are worth it, because pure functions are easier to debug, extend
and document. In this chapter, we are going to learn about functional programming principles that you
can adopt and start using to make your code better.
## Function definitions
You should now be familiar with function definitions in R. Let's suppose you want to write a function
to compute the square root of a number and want to do so using Newton's algorithm:
```{r square_root_loop}
sqrt_newton <- function(a, init, eps = 0.01){
  while(abs(init**2 - a) > eps){
    init <- 1/2 *(init + a/init)
  }
  init
}
```
You can then use this function to get the square root of a number:
```{r}
sqrt_newton(16, 2)
```
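To see how the algorithm gets there, it can help to record each intermediate guess. The sketch below
uses a hypothetical helper, `sqrt_newton_trace()` (not part of the original code), that applies the
same update rule but keeps every iterate:

```{r}
# Hypothetical helper: same update rule as sqrt_newton(), but records every guess
sqrt_newton_trace <- function(a, init, eps = 0.01){
  iterates <- init
  while(abs(init**2 - a) > eps){
    init <- 1/2 * (init + a/init)
    iterates <- c(iterates, init)
  }
  iterates
}

sqrt_newton_trace(16, 2)
```

Starting from 2, the guesses are 5, 4.1 and then roughly 4.0012, at which point the squared guess is
within `eps` of 16 and the loop stops.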
We are using a `while` loop inside the body of the function. The *body* of a function is the set of
instructions that define the function. You can get the body of a function with `body(some_func)`.
In *pure* functional programming languages, like Haskell, loops do not exist. How can you
program without loops, you may ask? In functional programming, loops are replaced by recursion,
which we already discussed in the previous chapter. Let's rewrite our little example above
with recursion:
```{r square_root_recur}
sqrt_newton_recur <- function(a, init, eps = 0.01){
  if(abs(init**2 - a) < eps){
    result <- init
  } else {
    init <- 1/2 * (init + a/init)
    result <- sqrt_newton_recur(a, init, eps)
  }
  result
}
```
```{r}
sqrt_newton_recur(16, 2)
```
R is not a pure functional programming language though, so we can still use loops (be it `while` or
`for` loops) in the bodies of our functions. As discussed in the previous chapter, it is actually
better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized.
I won't go into the details of what tail-call optimization is, but just remember that if
performance is important, a loop will be faster. However, sometimes, it is easier to write a
function using recursion. I personally tend to avoid loops if performance is not important,
because I find that code that avoids loops is easier to read and debug. However, knowing that
you can use loops is reassuring, and encapsulating loops inside functions gives you the benefits of
both using functions, and loops. In the coming sections I will show you some built-in functions
that make it possible to avoid writing loops and that don't rely on recursion, so performance
won't be penalized.
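The cost of recursion in R is easy to observe with a deliberately simple example. In the sketch below,
`count_down_recur()` and `count_down_loop()` are hypothetical helpers that do nothing but count down
to zero. The depth at which the recursive version fails, and the exact error message, depend on your
R version and settings, so I wrap the call in `tryCatch()` and only check whether an error occurred:

```{r}
count_down_recur <- function(n){
  if(n == 0) "done" else count_down_recur(n - 1)
}

count_down_loop <- function(n){
  while(n > 0){
    n <- n - 1
  }
  "done"
}

# The loop handles a large n without trouble
count_down_loop(1e5)

# tryCatch() returns the error object instead of stopping if the recursion is too deep
deep_call <- tryCatch(count_down_recur(1e5), error = function(e) e)
inherits(deep_call, "error")
```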
## Properties of functions
Mathematical functions have a nice property: we always get the same output for a given input. This
is called referential transparency and we should aim to write our R functions in such a way.
For example, the following function:
```{r}
increment <- function(x){
x + 1
}
```
is a referentially transparent function. We always get the same result for any `x` that we give to
this function.
This:
```{r}
increment(10)
```
will always produce `11`.
However, this one:
```{r}
increment_opaque <- function(x){
x + spam
}
```
is not a referentially transparent function, because its value depends on the global variable `spam`.
```{r}
spam <- 1
increment_opaque(10)
```
will produce `11` if `spam = 1`. But what if `spam = 19`?
```{r}
spam <- 19
increment_opaque(10)
```
To make `increment_opaque()` a referentially transparent function, it is enough to make `spam` an
argument:
```{r}
increment_not_opaque <- function(x, spam){
x + spam
}
```
Now even if there is a global variable called `spam`, this will not influence our function:
```{r}
spam <- 19
increment_not_opaque(10, 34)
```
This is because the `spam` used in the body of the function refers to the function's argument, which
is a local variable. It could have been called anything else, really. Avoiding opaque functions makes
our life easier.
Another property that adepts of functional programming value is that functions should have no, or
very limited, side-effects. This means that functions should not change the state of your program.
For example, consider this function (which is not referentially transparent):
```{r square_root_loop_side_effects}
count_iter <- 0
sqrt_newton_side_effect <- function(a, init, eps = 0.01){
  while(abs(init**2 - a) > eps){
    init <- 1/2 *(init + a/init)
    # The "<<-" operator assigns the RHS value to a variable in an enclosing
    # environment, here the global environment
    count_iter <<- count_iter + 1
  }
  init
}
```
If you look in the environment pane, you will see that `count_iter` equals 0. Now call this
function with the following arguments:
```{r}
sqrt_newton_side_effect(16000, 2)
print(count_iter)
```
If you check the value of `count_iter` now, you will see that it increased! This is a side effect,
because the function changed something outside of its scope. It changed a value in the global
environment. In general, it is good practice to avoid side-effects. For example, we could make the
above function not have any side effects like this:
```{r square_root_loop_not_more_side_effects}
sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
  while(abs(init**2 - a) > eps){
    init <- 1/2 *(init + a/init)
    count_iter <- count_iter + 1
  }
  c(init, count_iter)
}
```
Now, this function returns a vector with two elements, the result, and the number of iterations it
took to get the result:
```{r}
sqrt_newton_count(16000, 2)
```
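Since `c()` builds an atomic vector, the two values come back without labels. A named list keeps them
apart and makes each one easy to access. This is only a sketch; the function name
`sqrt_newton_count_list()` and the element names `result` and `iterations` are my own choices:

```{r}
sqrt_newton_count_list <- function(a, init, eps = 0.01){
  count_iter <- 0
  while(abs(init**2 - a) > eps){
    init <- 1/2 * (init + a/init)
    count_iter <- count_iter + 1
  }
  # A named list makes it explicit which value is which
  list(result = init, iterations = count_iter)
}

res <- sqrt_newton_count_list(16000, 2)
res$result
res$iterations
```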
Writing to disk is also considered a side effect, because the function changes something (a file)
outside its scope. But this cannot be avoided since you *want* to write to disk.
Just remember: try to avoid having functions change variables in the global environment unless
you have a very good reason for doing so.
Very long scripts that don't use functions and use a lot of global variables with loops changing
the values of global variables are a nightmare to debug. If something goes wrong, it might be very
difficult to pinpoint where the problem is. Is there an error in one of the loops?
Is your code running for a particular value of a particular variable in the global environment, but
not for other values? Which values? And of which variables? It can be very difficult to know what
is wrong with such a script.
With functional programming, you can avoid a lot of this pain for free (well not entirely for free,
it still requires some effort, since R is not a pure functional language). Writing functions also
makes it easier to parallelize your code. We are going to learn about that later in this chapter too.
Finally, another property of mathematical functions is that they do one single thing. Functional
programming purists also program their functions to do one single task. This has benefits, but
can complicate things. The function we wrote previously does two things: it computes the square
root of a number and also returns the number of iterations it took to compute the result. However,
this is not a bad thing; the function is doing two tasks, but these tasks are related to each other
and it makes sense to have them together. My piece of advice: avoid having functions that do
many *unrelated* things. This makes debugging harder.
In conclusion: you should strive for referential transparency, try to avoid side effects unless you
have a good reason to have them, and try to keep your functions short and focused on as few tasks as
possible. This makes testing and debugging easier, as you will see in the next chapter, but also
improves readability and maintainability of your code.
## Functional programming with `{purrr}`
I mentioned it several times already, but R is not a pure functional programming language. It is
possible to write R code using the functional programming paradigm, but some effort is required.
The `{purrr}` package extends R's base functional programming capabilities with some very interesting
functions. We have already seen `map()` and `reduce()`, which we are going to see in more detail now.
Then, we are going to learn about some other functions included in `{purrr}` that make functional
programming easier in R.
### Doing away with loops: the `map*()` family of functions
Instead of using loops, pure functional programming languages use functions that achieve
the same result. These functions are often called `Map` or `Reduce` (also called `Fold`). R comes
with the `*apply()` family of functions (which are implementations of `Map`),
as well as `Reduce()` for functional programming.
Within this family, you can find `lapply()`, `sapply()`, `vapply()`, `tapply()`, `mapply()`, `rapply()`,
`eapply()` and `apply()` (I might have forgotten one or the other, but that's not important).
Each version of an `*apply()` function has a different purpose, but it is not very easy to
remember which does what exactly. To add even more confusion, the arguments are sometimes different between
each of these.
In the `{purrr}` package, these functions are replaced by the `map*()` family of functions. As you will
shortly see, they are very consistent, and thus easier to use.
The names of these functions all start with `map_`, and the suffix tells you what
the function is going to return. For example, if you want `double`s out, you would use `map_dbl()`.
If you are working on data frames and want a data frame back, you would use `map_df()`. Let's start
with the basic `map()` function. The following gif
(source: [Wikipedia](https://en.wikipedia.org/wiki/Map_(higher-order_function))) illustrates
what `map()` does fairly well:
```{r, echo=FALSE}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/0/06/Mapping-steps-loillibe-new.gif")
```
$X$ is a vector composed of the following scalars: $(0, 5, 8, 3, 2, 1)$. The function we want to
map to each element of $X$ is $f(x) = x + 1$. $X'$ is the result of this operation. Using R, we
would do the following:
```{r}
library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)
plus_one <- function(x) (x + 1)
my_results <- map(numbers, plus_one)
my_results
```
Using a loop, you would write:
```{r}
numbers <- c(0, 5, 8, 3, 2, 1)
plus_one <- function(x) (x + 1)
my_results <- vector("list", length(numbers))
for(i in seq_along(numbers)){
  # i is the position; numbers[[i]] is the value at that position
  my_results[[i]] <- plus_one(numbers[[i]])
}
my_results
```
Now I don't know about you, but I prefer the first option. Using functional programming, you don't
need to create an empty list to hold your results, and the code is more concise. Plus,
it is less error prone. I had to try several times to get the loop right
(and I've been using R for almost 10 years now). Why? Well, first of all I used `%in%` instead of `in`.
Then, I forgot about `seq_along()`. After that, I made a typo, `plos_one()` instead of `plus_one()`
(ok, that one is unrelated to the loop). Let's also see how this works using base R:
```{r}
numbers <- c(0, 5, 8, 3, 2, 1)
plus_one <- function(x) (x + 1)
my_results <- lapply(numbers, plus_one)
my_results
```
So what is the added value of using `{purrr}`, you might ask. Well, imagine that instead of a list,
I need an atomic vector of `numeric`s. This is fairly easy with `{purrr}`:
```{r}
library("purrr")
numbers <- c(0, 5, 8, 3, 2, 1)
plus_one <- function(x) (x + 1)
my_results <- map_dbl(numbers, plus_one)
my_results
```
We're going to discuss these functions below, but know that in base R, outputting something else
involves more effort.
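For comparison, here is one way to get the same atomic vector in base R. `vapply()` asks you to
describe the expected type and length of each result through its `FUN.VALUE` argument:

```{r}
numbers <- c(0, 5, 8, 3, 2, 1)
plus_one <- function(x) (x + 1)

# numeric(1) states that each call must return a single double
my_results <- vapply(numbers, plus_one, FUN.VALUE = numeric(1))
my_results
```

This works, but you have to know that `vapply()` is the right member of the `*apply()` family and how
to spell out the return type, whereas `map_dbl()` carries that information in its name.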
Let's go back to our `sqrt_newton()` function. This function has more than one parameter. Often,
we would like to map functions with more than one parameter to a list, while holding constant
some of the function's parameters. This is easily achieved like so:
```{r}
library("purrr")
numbers <- c(7, 8, 19, 64)
map(numbers, sqrt_newton, init = 1)
```
It is also possible to use a formula:
```{r}
library("purrr")
numbers <- c(7, 8, 19, 64)
map(numbers, ~sqrt_newton(., init = 1))
```
Another function that is similar to `map()` is `rerun()`. You guessed it, this one simply
reruns an expression:
```{r}
rerun(10, "hello")
```
`rerun()` simply runs an expression (which can be arbitrarily complex) `n` times, whereas `map()`
maps a function to a list of inputs, so to achieve the same with `map()`, you need to map the `print()`
function to a vector of characters:
```{r}
map(rep("hello", 10), print)
```
`rep()` is a function that creates a vector by repeating something, in this case the string "hello",
as many times as needed, here 10. The output here is a bit different than before though, because first
you will see "hello" printed 10 times and then the list where each element is "hello".
This is because the `print()` function has a side effect, which is, well, printing to the console.
We see this side effect 10 times, followed by the list created with `map()`.
`rerun()` is useful if you want to run simulations. For instance, let's suppose that I perform a simulation
where I throw a die 5 times, and compute the mean of the points obtained, as well as the variance:
```{r}
mean_var_throws <- function(n){
  throws <- sample(1:6, n, replace = TRUE)
  mean_throws <- mean(throws)
  var_throws <- var(throws)
  tibble::tribble(~mean_throws, ~var_throws,
                  mean_throws, var_throws)
}
mean_var_throws(5)
```
`mean_var_throws()` returns a `tibble` object with the mean and the variance of the points. Now suppose
I want to compute the expected value of the distribution of throwing dice. We know from theory that it should
be equal to $3.5 (= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6)$.
Let's rerun the simulation 50 times:
```{r}
simulations <- rerun(50, mean_var_throws(5))
```
Let's see what the `simulations` object is made of:
```{r, eval = FALSE}
str(simulations)
```
```
## List of 50
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables:
## ..$ mean_throws: num 2
## ..$ var_throws : num 3
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables:
## ..$ mean_throws: num 2.8
## ..$ var_throws : num 0.2
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables:
## ..$ mean_throws: num 2.8
## ..$ var_throws : num 0.7
## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 2 variables:
## ..$ mean_throws: num 2.8
## ..$ var_throws : num 1.7
.....
```
`simulations` is a list of 50 data frames. We can easily combine them into a single data frame, and compute the
mean of the means, which should return something close to the expected value of 3.5:
```{r}
library("dplyr") # provides bind_rows(), summarise() and %>%

bind_rows(simulations) %>%
  summarise(expected_value = mean(mean_throws))
```
Pretty close! Now of course, one could have simply done something like this:
```{r}
mean(sample(1:6, 1000, replace = TRUE))
```
but the point was to illustrate that `rerun()` can run any arbitrarily complex expression, and that it is good
practice to put the result in a data frame or list, for easier further manipulation.
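One caveat: in recent versions of `{purrr}` (1.0.0 and later, as far as I know), `rerun()` is
deprecated. A sketch of the same simulation with `map()` could look like the following. The helper
`mean_var_throws_df()` is a hypothetical variant that returns a base `data.frame`, and the seed is
arbitrary:

```{r}
library("purrr")

# Hypothetical variant of mean_var_throws() that returns a base data.frame
mean_var_throws_df <- function(n){
  throws <- sample(1:6, n, replace = TRUE)
  data.frame(mean_throws = mean(throws), var_throws = var(throws))
}

set.seed(1234) # arbitrary seed, only there to make the run reproducible

# The index i is ignored; it only controls how many times the expression runs
simulations_map <- map(seq_len(50), function(i) mean_var_throws_df(5))

# Stack the 50 one-row data frames and average the means
all_simulations <- do.call(rbind, simulations_map)
mean(all_simulations$mean_throws)
```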
You now know the standard `map()` function, and also `rerun()`, both of which return lists, but there are a
number of variants of `map()`. `map_dbl()` returns an atomic vector of doubles, as
we've seen before. A little reminder below:
```{r}
map_dbl(numbers, sqrt_newton, init = 1)
```
In a similar fashion, `map_chr()` returns an atomic vector of strings:
```{r}
map_chr(numbers, sqrt_newton, init = 1)
```
`map_lgl()` returns an atomic vector of `TRUE` or `FALSE`:
```{r}
divisible <- function(x, y){
  # TRUE when x is a multiple of y
  x %% y == 0
}
map_lgl(1:100, divisible, 3)
```
There are also other interesting variants, such as `map_if()`:
```{r}
a <- seq(1,10)
map_if(a, (function(x) divisible(x, 2)), sqrt)
```
I used `map_if()` to take the square root of only those numbers in vector `a` that are divisible by 2,
by using an anonymous function that checks if a number is divisible by 2 (by wrapping `divisible()`).
`map_at()` is similar to `map_if()` but maps the function at a position specified by the user:
```{r}
map_at(numbers, c(1, 3), sqrt)
```
or if you have a named list:
```{r}
recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)
map_at(recipe, "bacon", `*`, 2)
```
I used `map_at()` to double the quantity of bacon in the recipe (by using the `*` function, and specifying
its second argument, `2`. Try the following in the command prompt: `` `*`(3, 4) ``).
`map2()` is the equivalent of `mapply()` and `pmap()` is the generalisation of `map2()` for more
than 2 arguments:
```{r}
print(a)
b <- seq(1, 2, length.out = 10)
print(b)
map2(a, b, `*`)
```
Each element of `a` gets multiplied by the element of `b` that is in the same position.
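Since `map2()` returns a list, the `map2_dbl()` variant is handy when each call yields a single number.
The short check below, using the same `a` and `b`, compares the result with R's vectorised
multiplication:

```{r}
library("purrr")

a <- seq(1, 10)
b <- seq(1, 2, length.out = 10)

# One product per pair of elements, returned as an atomic vector of doubles
products <- map2_dbl(a, b, `*`)
all.equal(products, a * b)
```

For a simple operation like this one, the vectorised `a * b` is of course all you need; `map2()` earns
its keep when the function you map is not vectorised.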
Let's see what `pmap()` does. Can you guess from the code below what is going on? I will print
`a` and `b` again for clarity:
```{r}
a
b
n <- seq(1, 10)
pmap(list(a, b, n), rnorm)
```
Let's take a closer look at what `a`, `b` and `n` look like, when they are placed next to each other:
```{r}
cbind(a, b, n)
```
`rnorm()` first gets called with the parameters from the first row, meaning
`rnorm(a[1], b[1], n[1])`. The second time `rnorm()` gets called, you guessed it,
it is called with the parameters on the second row of the array above,
`rnorm(a[2], b[2], n[2])`, etc.
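Relying on argument positions can be confusing here, because the vector called `n` ends up being used
as the standard deviation. When the list passed to `pmap()` is named, the names are matched to the
arguments of the function instead. A small sketch with made-up values:

```{r}
library("purrr")

params <- list(
  n = 1:3,              # number of draws for each call
  mean = c(0, 10, 100), # mean for each call
  sd = c(1, 1, 1)       # standard deviation for each call
)

# Calls rnorm(n = 1, mean = 0, sd = 1), rnorm(n = 2, mean = 10, sd = 1), and so on
draws <- pmap(params, rnorm)
lengths(draws)
```

The three calls return 1, 2 and 3 draws respectively, which confirms that `n` was matched by name.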
There are other functions in the `map()` family of functions, but we will discover them in the
exercises!
The `map()` family of functions does not have any more secrets for you. Let's now take a look at
the `reduce()` family of functions.
### Reducing with `purrr`
Reducing is another important concept in functional programming. It allows going from a list of
elements, to a single element, by somehow *combining* the elements into one. For instance, using
the base R `Reduce()` function, you can sum the elements of a list like so:
```{r}
Reduce(`+`, seq(1:100))
```
using `purrr::reduce()`, this becomes:
```{r}
reduce(seq(1:100), `+`)
```
If you don't really get what is happening, don't worry. Things should get clearer once I introduce
another version of `reduce()`, called `accumulate()`, which we will see below.
Sometimes, the direction from which we start to reduce is quite important. You can "start from the
end" of the list by using the `.dir` argument:
```{r}
reduce(seq(1:100), `+`, .dir = "backward")
```
Of course, for associative operations such as addition, direction does not matter. But it does matter for
non-associative operations such as subtraction:
```{r}
reduce(seq(1:100), `-`)
reduce(seq(1:100), `-`, .dir = "backward")
```
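A smaller vector makes it possible to check both directions by hand. With the default, `reduce()`
works from the left, `((1 - 2) - 3) - 4`; with `.dir = "backward"` it works from the right,
`1 - (2 - (3 - 4))`:

```{r}
library("purrr")

reduce(1:4, `-`)                    # ((1 - 2) - 3) - 4
reduce(1:4, `-`, .dir = "backward") # 1 - (2 - (3 - 4))
```

The first call gives `-8` and the second `-2`.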
Let's now take a look at `accumulate()`. `accumulate()` is very similar to `reduce()`, but keeps the
intermediary results. Which intermediary results? Let's try and see what happens:
```{r, eval=FALSE}
a <- seq(1, 10)
accumulate(a, `-`)
```
```{r, echo=FALSE}
a <- seq(1, 10)
purrr::accumulate(a, `-`)
```
`accumulate()` illustrates pretty well what is happening; the first element, `1`, is simply the
first element of `seq(1, 10)`. The second element of the result however, is the difference between
`1` and `2`, `-1`. The next element in `a` is `3`. Thus the next result is `-1-3`, `-4`, and so
on until we run out of elements in `a`.
The below illustration shows the algorithm step-by-step:
```
(1-2-3-4-5-6-7-8-9-10)
((1)-2-3-4-5-6-7-8-9-10)
((1-2)-3-4-5-6-7-8-9-10)
((-1-3)-4-5-6-7-8-9-10)
((-4-4)-5-6-7-8-9-10)
((-8-5)-6-7-8-9-10)
((-13-6)-7-8-9-10)
((-19-7)-8-9-10)
((-26-8)-9-10)
((-34-9)-10)
(-43-10)
-53
```
`reduce()` only shows the final result of all these operations. `accumulate()` and `reduce()` also
have an `.init` argument, that makes it possible to start the reducing procedure from an initial
value that is different from the first element of the vector:
```{r, eval=FALSE}
reduce(a, `+`, .init = 1000)
accumulate(a, `-`, .init = 1000, .dir = "backward")
```
```{r, echo=FALSE}
reduce(a, `+`, .init = 1000)
purrr::accumulate(a, `-`, .init = 1000, .dir = "backward")
```
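With small inputs the role of `.init` is easy to follow. In this sketch, the initial value becomes
the first element of the accumulated vector, so the result has one more element than the input:

```{r}
library("purrr")

accumulate(1:3, `+`, .init = 100) # 100, 100 + 1, 101 + 2, 103 + 3
reduce(1:3, `+`, .init = 100)     # only the last of these values
```

`accumulate()` returns 100, 101, 103 and 106, and `reduce()` returns 106.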
`reduce()` generalizes functions that only take two arguments. If you were to write a function that returns
the minimum between two numbers:
```{r}
my_min <- function(a, b){
  if(a < b){
    return(a)
  } else {
    return(b)
  }
}
```
You could use `reduce()` to get the minimum of a list of numbers:
```{r}
numbers2 <- c(3, 1, -8, 9)
reduce(numbers2, my_min)
```
`map()` and `reduce()` are arguably the most useful higher-order functions, and perhaps also the
most famous ones, true ambassadors of functional programming. You might have read about
[MapReduce](https://en.wikipedia.org/wiki/MapReduce), a programming model for processing big
data in parallel. The way MapReduce works is inspired by both these `map()` and `reduce()` functions,
which are always included in functional programming languages. This illustrates that the functional
programming paradigm is very well suited to parallel computing.
Something else that is very important to understand at this point; up until now, we only used these
functions on lists, or atomic vectors, of numbers. However, `map()` and `reduce()`, and other
higher-order functions for that matter, do not care about the contents of the list. What these
functions do is take another function, and make it do something to the elements of the list.
It does not matter if it's a list of numbers, of characters, of data frames, even of models. All that
matters is that the function that will be applied to these elements, can operate on them.
So if you have a list of fitted models, you can map `summary()` on this list to get summaries of
each model. Or if you have a list of data frames, you can map a function that performs several
cleaning steps. This will be explored in a future section, but it is important to keep this in mind.
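As a small illustration with the built-in `mtcars` data, the sketch below fits two linear models and
then maps over the resulting list. The choice of formulas is arbitrary:

```{r}
library("purrr")

formulas <- list(mpg ~ wt, mpg ~ hp)

# A list of fitted models...
models <- map(formulas, function(f) lm(f, data = mtcars))

# ...from which we extract one number per model
r_squared <- map_dbl(models, function(m) summary(m)$r.squared)
r_squared
```

Nothing in `map()` or `map_dbl()` knows what a linear model is; the only requirement is that `lm()`
accepts a formula and that `summary()` accepts a fitted model.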
### Error handling with `safely()` and `possibly()`
`safely()` and `possibly()` are very useful functions. Consider the following situation:
```{r, eval = FALSE}
a <- list("a", 4, 5)
sqrt(a)
```
```
Error in sqrt(a) : non-numeric argument to mathematical function
```
Using `map()` or `Map()` will result in a similar error. `safely()` is a higher-order function that
takes one function as an argument and executes it... *safely*, meaning the execution of the function
will not stop if there is an error. The error message gets captured alongside valid results.
```{r}
a <- list("a", 4, 5)
safe_sqrt <- safely(sqrt)
map(a, safe_sqrt)
```
`possibly()` works similarly, but also allows you to specify a return value in case of an error:
```{r}
possible_sqrt <- possibly(sqrt, otherwise = NA_real_)
map(a, possible_sqrt)
```
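Because `possible_sqrt()` always returns a single double, either the square root or `NA_real_`, it
also works with `map_dbl()`. A minimal sketch:

```{r}
library("purrr")

a <- list("a", 4, 5)
possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

# The failing element becomes NA instead of stopping the whole computation
safe_roots <- map_dbl(a, possible_sqrt)
safe_roots
```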
Of course, in this particular example, the same effect could be obtained way more easily:
```{r}
sqrt(as.numeric(a))
```
However, in some situations, this trick does not work as intended (or at all). `possibly()` and
`safely()` allow the programmer to model errors explicitly, and to then provide a consistent way
of dealing with them. For instance, consider the following example:
```{r, eval=FALSE}
data(mtcars)
write.csv(mtcars, "my_data/mtcars.csv")
```
```
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
cannot open file 'my_data/mtcars.csv': No such file or directory
```
The folder `my_data/` does not exist, and as such this code produces an error. You might
want to catch this error, and create the directory for instance:
```{r, eval=FALSE}
# write.csv() returns NULL when it succeeds, so a non-NULL sentinel is needed to detect failure
possibly_write.csv <- possibly(write.csv, otherwise = "failed")
if(identical(possibly_write.csv(mtcars, "my_data/mtcars.csv"), "failed")) {
  print("Creating folder...")
  dir.create("my_data/")
  print("Saving file...")
  write.csv(mtcars, "my_data/mtcars.csv")
}
```
```
[1] "Creating folder..."
[1] "Saving file..."
Warning message:
In file(file, ifelse(append, "a", "w")) :
cannot open file 'my_data/mtcars.csv': No such file or directory
```
The warning message comes from the first time we try to write the `.csv`, inside the `if`
statement. Because this fails, we create the directory and then actually save the file.
In the exercises, you'll discover `quietly()`, which also captures warnings and messages.
To conclude this section: remember function factories? Turns out that `safely()`, `possibly()` and `quietly()` are
function factories.
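To make this concrete, here is a rough sketch of how such a factory could be written by hand with
`tryCatch()`. The name `my_possibly()` is made up, and this is not how `{purrr}` actually implements
`possibly()`; it only illustrates the idea of a function that takes a function and returns a modified
version of it:

```{r}
my_possibly <- function(.f, otherwise){
  # Return a new function that wraps .f
  function(...){
    # If .f(...) throws an error, return the fallback value instead
    tryCatch(.f(...), error = function(e) otherwise)
  }
}

my_possible_sqrt <- my_possibly(sqrt, otherwise = NA_real_)
my_possible_sqrt("a")
my_possible_sqrt(4)
```

The first call returns `NA` because `sqrt("a")` fails, and the second returns `2`.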
### Partial applications with `partial()`
Consider the following simple function:
```{r}
add <- function(a, b) a+b
```
It is possible to create a new function, where one of the parameters is fixed, for instance, where
`a = 10`:
```{r}
add_to_10 <- partial(add, a = 10)
```
```{r}
add_to_10(12)
```
This is equivalent to the following:
```{r}
add_to_10_2 <- function(b){
add(a = 10, b)
}
```
Using `partial()` is much less verbose however, and allows you to define new functions very quickly:
```{r}
head10 <- partial(head, n = 10)
head10(mtcars)
```
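Partially applied functions combine nicely with the `map*()` family, since the result of `partial()`
is an ordinary function. A short sketch, reusing `add()` from above:

```{r}
library("purrr")

add <- function(a, b) a + b
add_to_10 <- partial(add, a = 10)

# Each element is passed as the remaining argument, b
map_dbl(c(1, 2, 3), add_to_10)
```

This returns 11, 12 and 13.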
### Function composition using `compose`
Function composition is another handy tool, which makes chaining functions much more elegant:
```{r}
compose(sqrt, log10, exp)(10)
```
You can read this expression as *`sqrt()` after `log10()` after `exp()`*; it is equivalent to:
```{r}
sqrt(log10(exp(10)))
```
It is also possible to reverse the order in which the functions get called using the `.dir` argument:
```{r}
compose(sqrt, log10, exp, .dir = "forward")(10)
```
One could also use the `%>%` operator to achieve the same result:
```{r}
10 %>%
sqrt %>%
log10 %>%
exp
```
but strictly speaking, this is not function composition.
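If you are unsure which order applies, comparing the composed function against the nested calls is a
quick check. The names `backward` and `forward` below are only labels I chose for this sketch:

```{r}
library("purrr")

backward <- compose(sqrt, log10, exp)                  # default: applied right to left
forward <- compose(sqrt, log10, exp, .dir = "forward") # applied left to right

all.equal(backward(10), sqrt(log10(exp(10))))
all.equal(forward(10), exp(log10(sqrt(10))))
```

Both comparisons return `TRUE`.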
### «Transposing lists»
Another interesting function is `transpose()`. It is not an alternative to the function `t()` from
`base`, but it has a similar effect. `transpose()` works on lists. Let's take a look at the example
from before:
```{r}
safe_sqrt <- safely(sqrt, otherwise = NA_real_)
map(a, safe_sqrt)
```
The output is a list with the first element being a list with a result and an error message. One
might want to have all the results in a single list, and all the error messages in another list.
This is possible with `transpose()`:
```{r}
purrr::transpose(map(a, safe_sqrt))
```
I explicitly call `purrr::transpose()` because there is also a `data.table::transpose()`, which
is not the same function. You have to be careful about that sort of thing, because it can cause
errors in your programs and debugging this type of error is a nightmare.
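If you only need one of the two components, mapping with a name is an alternative. When `map()`
receives a string instead of a function, it extracts the element with that name from each item of
the list. A minimal sketch:

```{r}
library("purrr")

a <- list("a", 4, 5)
safe_sqrt <- safely(sqrt, otherwise = NA_real_)
safe_results <- map(a, safe_sqrt)

# Keep only the "result" component of each element
only_results <- map_dbl(safe_results, "result")
only_results
```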
Now that we are familiar with functional programming, let's try to apply some of its principles
to data manipulation.
## List-based workflows for efficiency
You can use your own functions in pipe workflows:
```{r}
double_number <- function(x){
x+x
}
```
```{r}
mtcars %>%
head() %>%
mutate(double_mpg = double_number(mpg))
```
It is important to understand that your functions, and functions that are built into R, or that
come from packages, are exactly the same thing. Every function is a first-class object in R, no
matter where it comes from. The consequence of functions being first-class objects is that
functions can take functions as arguments, functions can return functions (the function factories
from the previous chapter) and can be assigned to any variable:
```{r}
plop <- sqrt
plop(4)
```
```{r}
bacon <- function(.f){
message("Bacon is tasty")
.f
}
bacon(sqrt) # `bacon` is a function factory, as it returns a function (alongside an informative message)
# To actually call it:
bacon(sqrt)(4)
```
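Since functions are ordinary objects, they can also be stored in a list and mapped over, just like
data. In the sketch below, the list `summaries` and the vector `x` are made up for the example:

```{r}
library("purrr")

x <- c(1, 2, 10)
summaries <- list(mean = mean, median = median, max = max)

# Each element of the list is a function, which we call on x
summary_values <- map_dbl(summaries, function(f) f(x))
summary_values
```

The result is a named vector with one value per function.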
Now, let's step back for a bit and think about what we learned up until now, and especially
the `map()` family of functions.
Let's read the list of datasets from the previous chapter:
```{r}
paths <- Sys.glob("datasets/unemployment/*.csv")
all_datasets <- import_list(paths)
str(all_datasets)
```
`all_datasets` is a list with `r length(all_datasets)` elements, each of them is a `data.frame`.
The first thing we are going to do is use a function to clean the names of the datasets. These
names are not very easy to work with; there are spaces, and it would be better if the names of the
columns were all lowercase. For this we are going to use the function `clean_names()` from the
`janitor` package. For a single dataset, I would write this:
```{r, include=FALSE}
library(janitor)
```
```{r, eval = FALSE}
library(janitor)
one_dataset <- one_dataset %>%
clean_names()
```
and I would get a dataset with column names in lowercase and spaces replaced by `_` (and other
corrections). How can I apply, or map, this function to each dataset in the list? To do this I need
to use `purrr::map()`, which we've seen in the previous section:
```{r}
library(purrr)
all_datasets <- all_datasets %>%
map(clean_names)
all_datasets %>%
glimpse()
```
Remember that `map(list, function)` simply applies `function` to each element of `list`.
So now, what if I want to know, for each dataset, which *communes* have an unemployment rate that is
less than, say, 3%? For a single dataset I would do something like this:
```{r, eval=FALSE}
one_dataset %>%
filter(unemployment_rate_in_percent < 3)
```
but since we're dealing with a list of data sets, we cannot simply use `filter()` on it. This is because
`filter()` expects a data frame, not a list of data frames. The way around this is to use `map()`.
```{r}
all_datasets %>%
map(~filter(., unemployment_rate_in_percent < 3))
```
`map()` needs a function to map to each element of the list. `all_datasets` is the list to which I
want to map the function. But what function? `filter()` is the function I need, so why doesn't:
```{r, eval = FALSE}
all_datasets %>%
map(filter(unemployment_rate_in_percent < 3))
```
work? This is what happens if we try it:
```
Error in filter(unemployment_rate_in_percent < 3) :
object 'unemployment_rate_in_percent' not found
```
This is because `filter()` needs both the data set, and a so-called predicate (a predicate
is an expression that evaluates to `TRUE` or `FALSE`). But you need to make more explicit
what is the dataset and what is the predicate, because here, `filter()` thinks that the
dataset is `unemployment_rate_in_percent`. The way to do this is to use an anonymous
function (discussed in Chapter 7), which allows you to explicitly state what is the
dataset, and what is the predicate. As we've seen, there are three ways to define
anonymous functions:
- Using a formula (only works within `{tidyverse}` functions):
```{r}
all_datasets %>%
map(~filter(., unemployment_rate_in_percent < 3)) %>%
glimpse()
```
(notice the `.` in the formula, making the position of the dataset as the first argument to `filter()`
explicit) or
- using an anonymous function (using the `function(x)` keyword):
```{r}
all_datasets %>%
map(function(x)filter(x, unemployment_rate_in_percent < 3)) %>%
glimpse()
```
- or, since R 4.1, using the shorthand `\(x)`:
```{r}
all_datasets %>%
map(\(x)filter(x, unemployment_rate_in_percent < 3)) %>%
glimpse()
```
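If you do not have the unemployment files at hand, the same pattern can be tried on a toy list. The
data below are invented and only mimic the column name used in this section:

```{r}
library("purrr")
library("dplyr")

# Hypothetical toy data standing in for the unemployment datasets
toy_datasets <- list(
  "2013" = data.frame(commune = c("A", "B"),
                      unemployment_rate_in_percent = c(2.5, 4.1)),
  "2014" = data.frame(commune = c("A", "B"),
                      unemployment_rate_in_percent = c(2.9, 2.7))
)

# Filter each data frame, then count the rows that remain in each one
rows_kept <- toy_datasets %>%
  map(function(x) filter(x, unemployment_rate_in_percent < 3)) %>%
  map_int(nrow)
rows_kept
```

One commune passes the filter in the first toy dataset and two in the second.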
As you see, everything is starting to come together: lists, to hold complex objects, over which anonymous
functions are mapped using higher-order functions. Let's continue cleaning this dataset.
Before merging these datasets together, we would need them to have a `year` column indicating the
year the data was measured in each data frame. It would also be helpful if we gave names to these datasets, meaning
converting the list to a named list. For this task, we can use `purrr::set_names()`:
```{r, eval=FALSE}
all_datasets <- set_names(all_datasets, as.character(seq(2013, 2016)))
```
Let's take a look at the list now:
```{r, eval=FALSE}
str(all_datasets)
```
As you can see, each `data.frame` object contained in the list has been given a name. You can thus
access them with the `$` operator: