diff --git a/.nojekyll b/.nojekyll
index 0a0eb6e..0d15504 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-bdf9b80b
\ No newline at end of file
+bca47019
\ No newline at end of file
diff --git a/Building-reproducible-analytical-pipelines-with-R.epub b/Building-reproducible-analytical-pipelines-with-R.epub
index 15dde3d..844593f 100644
Binary files a/Building-reproducible-analytical-pipelines-with-R.epub and b/Building-reproducible-analytical-pipelines-with-R.epub differ
diff --git a/Building-reproducible-analytical-pipelines-with-R.pdf b/Building-reproducible-analytical-pipelines-with-R.pdf
index 57f75da..c731a25 100644
Binary files a/Building-reproducible-analytical-pipelines-with-R.pdf and b/Building-reproducible-analytical-pipelines-with-R.pdf differ
diff --git a/fprog.html b/fprog.html
index 14a285c..9372253 100644
--- a/fprog.html
+++ b/fprog.html
@@ -814,8 +814,8 @@
chronicler::read_log(result)
[1] "Complete log:"
-[2] "NOK! sqrt() ran unsuccessfully with following exception: NaNs produced at 2023-10-03 18:59:36"
-[3] "Total running time: 0.00126385688781738 secs"
+[2] "NOK! sqrt() ran unsuccessfully with following exception: NaNs produced at 2023-11-01 08:42:45"
+[3] "Total running time: 0.00116109848022461 secs"
The {purrr} package also comes with function factories that you might find useful ({possibly}, {safely} and {quietly}).
@@ -1678,7 +1678,7 @@
function (x, ...)
UseMethod("print")
-<bytecode: 0x55df20ffc410>
+<bytecode: 0x5574d6500a28>
<environment: namespace:base>
@@ -1722,7 +1722,7 @@
diff --git a/search.json b/search.json
index 72b88bc..a356f18 100644
--- a/search.json
+++ b/search.json
@@ -214,7 +214,7 @@
"href": "fprog.html#writing-good-functions",
"title": "6 Functional programming",
"section": "6.2 Writing good functions",
- "text": "6.2 Writing good functions\n\n6.2.1 Functions are first-class objects\nIn a functional programming language, functions are first-class objects. Contrary to what the name implies, this means that functions, especially the ones you define yourself, are nothing special. A function is an object like any other, and can thus be manipulated as such. Think of anything that you can do with any object in R, and you can do the same thing with a function. For example, let’s consider the +() function. It takes two numeric objects and returns their sum:\n\n1 + 5.3\n\n[1] 6.3\n\n# or alternatively: `+`(1, 5.3)\n\nYou can replace the numbers with functions that return numbers:\n\nsqrt(1) + log(5.3)\n\n[1] 2.667707\n\n\nIt’s also possible to define a function that explicitly takes another function as an input:\n\nh <- function(number, f){\n f(number)\n}\n\nYou can call then use h() as a wrapper for f():\n\nh(4, sqrt)\n\n[1] 2\n\nh(10, log10)\n\n[1] 1\n\n\nBecause h() takes another function as an argument, h() is called a higher-order function.\nIf you don’t know how many arguments f(), the function you’re wrapping, has, you can use the ...:\n\nh <- function(number, f, ...){\n f(number, ...)\n}\n\n... are simply a place-holder for any potential additional argument that f() might have:\n\nh(c(1, 2, NA, 3), mean, na.rm = TRUE)\n\n[1] 2\n\nh(c(1, 2, NA, 3), mean, na.rm = FALSE)\n\n[1] NA\n\n\nna.rm is an argument of mean(). As the developer of h(), I don’t necessarily know what f() might be, but even if I knew what f() would be and knew all its arguments, I might not want to list them all. So I can use ... instead. The following is also possible:\n\nw <- function(...){\n paste0(\"First argument: \", ..1,\n \", second argument: \", ..2,\n \", last argument: \", ..3)\n}\n\nw(1, 2, 3)\n\n[1] \"First argument: 1, second argument: 2, last argument: 3\"\n\n\nIf you want to learn more about ..., type ?dots in an R console.\nBecause functions are nothing special, you can also write functions that return functions. As an illustration, we’ll be writing a function that converts warnings to errors. This can be quite useful if you want your functions to fail early, which often makes debugging easier. For example, try running this:\n\nsqrt(-5)\n\nWarning in sqrt(-5): NaNs produced\n\n\n[1] NaN\n\n\nThis only raises a warning and returns NaN (Not a Number). This can be quite dangerous, especially when working non-interactively, which is what we will be doing a lot later on. It is much better if a pipeline fails early due to an error, than dragging a NaN value. This also happens with log10():\n\nlog10(-10)\n\nWarning: NaNs produced\n\n\n[1] NaN\n\n\nSo it could be useful to redefine these functions to raise an error instead, for example like this:\n\nstrict_sqrt <- function(x){\n\n if(x < 0) stop(\"x is negative\")\n\n sqrt(x)\n\n}\n\nThis function now throws an error for negative x:\n\nstrict_sqrt(-10)\n\nError in strict_sqrt(-10) : x is negative\nHowever, it can be quite tedious to redefine every function that we need in our pipeline, and remember, we don’t want to repeat ourselves. So, because functions are nothing special, we can define a function that takes a function as an argument, converts any warning thrown by that function into an error, and returns a new function. 
For example:\n\nstrictly <- function(f){\n function(...){\n tryCatch({\n f(...)\n },\n warning = function(warning)stop(\"Can't do that chief\"))\n }\n}\n\nThis function makes use of tryCatch() which catches warnings raised by an expression (in this example the expression is f(...)) and then raises an error instead with the stop() function. It is now possible to define new functions like this:\n\ns_sqrt <- strictly(sqrt)\n\n\ns_sqrt(-4)\n\nError in value[[3L]](cond) : Can't do that chief\n\ns_log <- strictly(log)\n\n\ns_log(-4)\n\nError in value[[3L]](cond) : Can't do that chief\nFunctions that return functions are called function factories and they’re incredibly useful. I use this so much that I’ve written a package, available on CRAN, called {chronicler}, that does this:\n\ns_sqrt <- chronicler::record(sqrt)\n\n\nresult <- s_sqrt(-4)\n\nresult\n\nNOK! Value computed unsuccessfully:\n---------------\nNothing\n\n---------------\nThis is an object of type `chronicle`.\nRetrieve the value of this object with pick(.c, \"value\").\nTo read the log of this object, call read_log(.c).\n\n\nBecause the expression above resulted in an error, Nothing is returned. Nothing is a special value defined in the {maybe} package (check it out, a very interesting package!). We can then even read a log to see what went wrong:\n\nchronicler::read_log(result)\n\n[1] \"Complete log:\" \n[2] \"NOK! sqrt() ran unsuccessfully with following exception: NaNs produced at 2023-10-03 18:59:36\"\n[3] \"Total running time: 0.00126385688781738 secs\" \n\n\nThe {purrr} package also comes with function factories that you might find useful ({possibly}, {safely} and {quietly}).\nIn part 2 we will also learn about assertive programming, another way of making our functions safer, as an alternative to using function factories.\n\n\n6.2.2 Optional arguments\nIt is possible to make functions’ arguments optional, by using NULL. For example:\n\ng <- function(x, y = NULL){\n if(is.null(y)){\n print(\"optional argument y is NULL\")\n x\n } else {\n if(y == 5) print(\"y is present\"); x+y\n }\n}\n\nCalling g(10) prints the message “Optional argument y is NULL”, and returns 10. Calling g(10, 5) however, prints “y is present” and returns 15. It is also possible to use missing():\n\ng <- function(x, y){\n if(missing(y)){\n print(\"optional argument y is missing\")\n x\n } else {\n if(y == 5) print(\"y is present\"); x+y\n }\n}\n\nI however prefer the first approach, because it is clearer which arguments are optional, which is not the case with the second approach, where you need to read the body of the function.\n\n\n6.2.3 Safe functions\nIt is important that your functions are safe and predictable. You should avoid writing functions that behave like the nchar() base function. Let’s see why this function is not safe:\n\nnchar(\"10000000\")\n\n[1] 8\n\n\nIt returns the expected result of 8. But what if I remove the quotes?\n\nnchar(10000000)\n\n[1] 5\n\n\nWhat is going on here? I’ll give you a hint: simply type 10000000 in the console:\n\n10000000\n\n[1] 1e+07\n\n\n10000000 gets represented as 1e+07 by R. This number in scientific notation gets then converted into the character “1e+07” by nchar(), and this conversion happens silently. nchar() then counts the number of characters, and correctly returns 5. The problem is that it doesn’t make sense to provide a number to a function that expects a character. This function should have returned an error message, or at the very least raised a warning that the number got converted into a character. 
Here is how you could rewrite nchar() to make it safer:\n\nnchar2 <- function(x, result = 0){\n\n if(!isTRUE(is.character(x))){\n stop(paste0(\"x should be of type 'character', but is of type '\",\n typeof(x), \"' instead.\"))\n } else if(x == \"\"){\n result\n } else {\n result <- result + 1\n split_x <- strsplit(x, split = \"\")[[1]]\n nchar2(paste0(split_x[-1],\n collapse = \"\"), result)\n }\n}\n\nThis function now returns an error message if the input is not a character:\n\nnchar2(10000000)\n\nError in nchar2(10000000) : x should be of type 'character', but is of type 'integer' instead.\nThis section is in a sense an introduction to assertive programming. As mentioned in the section on function factories, we will be learning about assertive programming in greater detail in part 2 of the book.\n\n\n6.2.4 Recursive functions\nYou may have noticed in the last lines of nchar2() (defined above) that nchar2() calls itself. A function that calls itself in its own body is called a recursive function. It is sometimes easier to define a function in its recursive form than in an iterative form. The most common example is the factorial function. However, there is an issue with recursive functions (in the R programming language, other programming languages may not have the same problem, like Haskell): while it is sometimes easier to write a function using a recursive algorithm than an iterative algorithm, like for the factorial function, recursive functions in R are quite slow. Let’s take a look at two definitions of the factorial function, one recursive, the other iterative:\n\nfact_iter <- function(n){\n result = 1\n for(i in 1:n){\n result = result * i\n }\n result\n}\n\nfact_recur <- function(n){\n if(n == 0 || n == 1){\n result = 1\n } else {\n n * fact_recur(n-1)\n }\n}\n\nUsing the {microbenchmark} package we can benchmark the code:\n\nmicrobenchmark::microbenchmark(\n fact_recur(50),\n fact_iter(50)\n)\n\nUnit: microseconds\n expr min lq mean median uq max neval\n fact_recur(50) 21.501 21.701 23.82701 21.901 22.0515 68.902 100\n fact_iter(50) 2.000 2.101 2.74599 2.201 2.3510 21.000 100\nWe see that the recursive factorial function is 10 times slower than the iterative version. In this particular example it doesn’t make much of a difference, because the functions only take microseconds to run. But if you’re working with more complex functions, this is a problem. If you want to keep using the recursive function and not switch to an iterative algorithm, there are ways to make them faster. The first is called trampolining. I won’t go into details, but if you’re interested, there is an R package that allows you to use trampolining with R, aptly called {trampoline}1. Another solution is using the {memoise}2 package. Again, I won’t go into details. So if you want to use and optimize recursive functions, take a look at these packages.\n\n\n6.2.5 Anonymous functions\nIt is possible to define a function and not give it a name. For example:\n\nfunction(x)(x+1)(10)\n\nSince R version 4.1, there is even a shorthand notation for anonymous functions:\n\n(\\(x)(x+1))(10)\n\nBecause we don’t name them, we cannot reuse them. So why is this useful? Anonymous functions are useful when you need to apply a function somewhere inside a pipe once, and don’t want to define a function just for this. This will become clearer once we learn about lists, but before that, let’s philosophize a bit.\n\n\n6.2.6 The Unix philosophy applied to R\n\nThis is the Unix philosophy: Write programs that do one thing and do it well. 
Write programs to work together. Write programs to handle text streams, because that is a universal interface.\n\nDoug McIlroy, in A Quarter Century of Unix3\nWe can take inspiration from the Unix philosophy and rewrite it for our purposes:\nWrite functions that do one thing and do it well. Write functions that work together. Write functions that handle lists, because that is a universal interface.\nStrive for writing simple functions that only perform one task. Don’t hesitate to split a big function into smaller ones. Small functions that only perform one task are easier to maintain, test, document and debug. These smaller functions can then be chained using the |> operator. In other words, it is preferable to have something like:\na |> f() |> g() |> h()\nwhere a is for example a path to a data set, and where f(), g() and h() successively read, clean, and plot the data, than having something like:\nbig_function(a)\nthat does all the steps above in one go.\nThis idea of splitting the problem into smaller chunks, each chunk in turn split into even smaller units that can be handled by functions and then the results of these function combined into a final output is called composition.\nThe advantage of splitting big_function() into f(), g() and h() is that you can eat the elephant one bite at a time, and also reuse these smaller functions in other projects more easily. So what’s important is that you can make small functions work together by sharing a common interface. The list is usually a good candidate for this."
+ "text": "6.2 Writing good functions\n\n6.2.1 Functions are first-class objects\nIn a functional programming language, functions are first-class objects. Contrary to what the name implies, this means that functions, especially the ones you define yourself, are nothing special. A function is an object like any other, and can thus be manipulated as such. Think of anything that you can do with any object in R, and you can do the same thing with a function. For example, let’s consider the +() function. It takes two numeric objects and returns their sum:\n\n1 + 5.3\n\n[1] 6.3\n\n# or alternatively: `+`(1, 5.3)\n\nYou can replace the numbers with functions that return numbers:\n\nsqrt(1) + log(5.3)\n\n[1] 2.667707\n\n\nIt’s also possible to define a function that explicitly takes another function as an input:\n\nh <- function(number, f){\n f(number)\n}\n\nYou can call then use h() as a wrapper for f():\n\nh(4, sqrt)\n\n[1] 2\n\nh(10, log10)\n\n[1] 1\n\n\nBecause h() takes another function as an argument, h() is called a higher-order function.\nIf you don’t know how many arguments f(), the function you’re wrapping, has, you can use the ...:\n\nh <- function(number, f, ...){\n f(number, ...)\n}\n\n... are simply a place-holder for any potential additional argument that f() might have:\n\nh(c(1, 2, NA, 3), mean, na.rm = TRUE)\n\n[1] 2\n\nh(c(1, 2, NA, 3), mean, na.rm = FALSE)\n\n[1] NA\n\n\nna.rm is an argument of mean(). As the developer of h(), I don’t necessarily know what f() might be, but even if I knew what f() would be and knew all its arguments, I might not want to list them all. So I can use ... instead. The following is also possible:\n\nw <- function(...){\n paste0(\"First argument: \", ..1,\n \", second argument: \", ..2,\n \", last argument: \", ..3)\n}\n\nw(1, 2, 3)\n\n[1] \"First argument: 1, second argument: 2, last argument: 3\"\n\n\nIf you want to learn more about ..., type ?dots in an R console.\nBecause functions are nothing special, you can also write functions that return functions. As an illustration, we’ll be writing a function that converts warnings to errors. This can be quite useful if you want your functions to fail early, which often makes debugging easier. For example, try running this:\n\nsqrt(-5)\n\nWarning in sqrt(-5): NaNs produced\n\n\n[1] NaN\n\n\nThis only raises a warning and returns NaN (Not a Number). This can be quite dangerous, especially when working non-interactively, which is what we will be doing a lot later on. It is much better if a pipeline fails early due to an error, than dragging a NaN value. This also happens with log10():\n\nlog10(-10)\n\nWarning: NaNs produced\n\n\n[1] NaN\n\n\nSo it could be useful to redefine these functions to raise an error instead, for example like this:\n\nstrict_sqrt <- function(x){\n\n if(x < 0) stop(\"x is negative\")\n\n sqrt(x)\n\n}\n\nThis function now throws an error for negative x:\n\nstrict_sqrt(-10)\n\nError in strict_sqrt(-10) : x is negative\nHowever, it can be quite tedious to redefine every function that we need in our pipeline, and remember, we don’t want to repeat ourselves. So, because functions are nothing special, we can define a function that takes a function as an argument, converts any warning thrown by that function into an error, and returns a new function. 
For example:\n\nstrictly <- function(f){\n function(...){\n tryCatch({\n f(...)\n },\n warning = function(warning)stop(\"Can't do that chief\"))\n }\n}\n\nThis function makes use of tryCatch() which catches warnings raised by an expression (in this example the expression is f(...)) and then raises an error instead with the stop() function. It is now possible to define new functions like this:\n\ns_sqrt <- strictly(sqrt)\n\n\ns_sqrt(-4)\n\nError in value[[3L]](cond) : Can't do that chief\n\ns_log <- strictly(log)\n\n\ns_log(-4)\n\nError in value[[3L]](cond) : Can't do that chief\nFunctions that return functions are called function factories and they’re incredibly useful. I use this so much that I’ve written a package, available on CRAN, called {chronicler}, that does this:\n\ns_sqrt <- chronicler::record(sqrt)\n\n\nresult <- s_sqrt(-4)\n\nresult\n\nNOK! Value computed unsuccessfully:\n---------------\nNothing\n\n---------------\nThis is an object of type `chronicle`.\nRetrieve the value of this object with pick(.c, \"value\").\nTo read the log of this object, call read_log(.c).\n\n\nBecause the expression above resulted in an error, Nothing is returned. Nothing is a special value defined in the {maybe} package (check it out, a very interesting package!). We can then even read a log to see what went wrong:\n\nchronicler::read_log(result)\n\n[1] \"Complete log:\" \n[2] \"NOK! sqrt() ran unsuccessfully with following exception: NaNs produced at 2023-11-01 08:42:45\"\n[3] \"Total running time: 0.00116109848022461 secs\" \n\n\nThe {purrr} package also comes with function factories that you might find useful ({possibly}, {safely} and {quietly}).\nIn part 2 we will also learn about assertive programming, another way of making our functions safer, as an alternative to using function factories.\n\n\n6.2.2 Optional arguments\nIt is possible to make functions’ arguments optional, by using NULL. For example:\n\ng <- function(x, y = NULL){\n if(is.null(y)){\n print(\"optional argument y is NULL\")\n x\n } else {\n if(y == 5) print(\"y is present\"); x+y\n }\n}\n\nCalling g(10) prints the message “Optional argument y is NULL”, and returns 10. Calling g(10, 5) however, prints “y is present” and returns 15. It is also possible to use missing():\n\ng <- function(x, y){\n if(missing(y)){\n print(\"optional argument y is missing\")\n x\n } else {\n if(y == 5) print(\"y is present\"); x+y\n }\n}\n\nI however prefer the first approach, because it is clearer which arguments are optional, which is not the case with the second approach, where you need to read the body of the function.\n\n\n6.2.3 Safe functions\nIt is important that your functions are safe and predictable. You should avoid writing functions that behave like the nchar() base function. Let’s see why this function is not safe:\n\nnchar(\"10000000\")\n\n[1] 8\n\n\nIt returns the expected result of 8. But what if I remove the quotes?\n\nnchar(10000000)\n\n[1] 5\n\n\nWhat is going on here? I’ll give you a hint: simply type 10000000 in the console:\n\n10000000\n\n[1] 1e+07\n\n\n10000000 gets represented as 1e+07 by R. This number in scientific notation gets then converted into the character “1e+07” by nchar(), and this conversion happens silently. nchar() then counts the number of characters, and correctly returns 5. The problem is that it doesn’t make sense to provide a number to a function that expects a character. This function should have returned an error message, or at the very least raised a warning that the number got converted into a character. 
Here is how you could rewrite nchar() to make it safer:\n\nnchar2 <- function(x, result = 0){\n\n if(!isTRUE(is.character(x))){\n stop(paste0(\"x should be of type 'character', but is of type '\",\n typeof(x), \"' instead.\"))\n } else if(x == \"\"){\n result\n } else {\n result <- result + 1\n split_x <- strsplit(x, split = \"\")[[1]]\n nchar2(paste0(split_x[-1],\n collapse = \"\"), result)\n }\n}\n\nThis function now returns an error message if the input is not a character:\n\nnchar2(10000000)\n\nError in nchar2(10000000) : x should be of type 'character', but is of type 'integer' instead.\nThis section is in a sense an introduction to assertive programming. As mentioned in the section on function factories, we will be learning about assertive programming in greater detail in part 2 of the book.\n\n\n6.2.4 Recursive functions\nYou may have noticed in the last lines of nchar2() (defined above) that nchar2() calls itself. A function that calls itself in its own body is called a recursive function. It is sometimes easier to define a function in its recursive form than in an iterative form. The most common example is the factorial function. However, there is an issue with recursive functions (in the R programming language, other programming languages may not have the same problem, like Haskell): while it is sometimes easier to write a function using a recursive algorithm than an iterative algorithm, like for the factorial function, recursive functions in R are quite slow. Let’s take a look at two definitions of the factorial function, one recursive, the other iterative:\n\nfact_iter <- function(n){\n result = 1\n for(i in 1:n){\n result = result * i\n }\n result\n}\n\nfact_recur <- function(n){\n if(n == 0 || n == 1){\n result = 1\n } else {\n n * fact_recur(n-1)\n }\n}\n\nUsing the {microbenchmark} package we can benchmark the code:\n\nmicrobenchmark::microbenchmark(\n fact_recur(50),\n fact_iter(50)\n)\n\nUnit: microseconds\n expr min lq mean median uq max neval\n fact_recur(50) 21.501 21.701 23.82701 21.901 22.0515 68.902 100\n fact_iter(50) 2.000 2.101 2.74599 2.201 2.3510 21.000 100\nWe see that the recursive factorial function is 10 times slower than the iterative version. In this particular example it doesn’t make much of a difference, because the functions only take microseconds to run. But if you’re working with more complex functions, this is a problem. If you want to keep using the recursive function and not switch to an iterative algorithm, there are ways to make them faster. The first is called trampolining. I won’t go into details, but if you’re interested, there is an R package that allows you to use trampolining with R, aptly called {trampoline}1. Another solution is using the {memoise}2 package. Again, I won’t go into details. So if you want to use and optimize recursive functions, take a look at these packages.\n\n\n6.2.5 Anonymous functions\nIt is possible to define a function and not give it a name. For example:\n\nfunction(x)(x+1)(10)\n\nSince R version 4.1, there is even a shorthand notation for anonymous functions:\n\n(\\(x)(x+1))(10)\n\nBecause we don’t name them, we cannot reuse them. So why is this useful? Anonymous functions are useful when you need to apply a function somewhere inside a pipe once, and don’t want to define a function just for this. This will become clearer once we learn about lists, but before that, let’s philosophize a bit.\n\n\n6.2.6 The Unix philosophy applied to R\n\nThis is the Unix philosophy: Write programs that do one thing and do it well. 
Write programs to work together. Write programs to handle text streams, because that is a universal interface.\n\nDoug McIlroy, in A Quarter Century of Unix3\nWe can take inspiration from the Unix philosophy and rewrite it for our purposes:\nWrite functions that do one thing and do it well. Write functions that work together. Write functions that handle lists, because that is a universal interface.\nStrive for writing simple functions that only perform one task. Don’t hesitate to split a big function into smaller ones. Small functions that only perform one task are easier to maintain, test, document and debug. These smaller functions can then be chained using the |> operator. In other words, it is preferable to have something like:\na |> f() |> g() |> h()\nwhere a is for example a path to a data set, and where f(), g() and h() successively read, clean, and plot the data, than having something like:\nbig_function(a)\nthat does all the steps above in one go.\nThis idea of splitting the problem into smaller chunks, each chunk in turn split into even smaller units that can be handled by functions and then the results of these function combined into a final output is called composition.\nThe advantage of splitting big_function() into f(), g() and h() is that you can eat the elephant one bite at a time, and also reuse these smaller functions in other projects more easily. So what’s important is that you can make small functions work together by sharing a common interface. The list is usually a good candidate for this."
},
{
"objectID": "fprog.html#lists-a-powerful-data-structure",
@@ -228,7 +228,7 @@
"href": "fprog.html#functional-programming-in-r",
"title": "6 Functional programming",
"section": "6.4 Functional programming in R",
- "text": "6.4 Functional programming in R\nUp until now I focused on general concepts rather than on specifics of the R programming language when it comes to functional programming. In this section, we will be focusing entirely on R-specific capabilities and packages for functional programming.\n\n6.4.1 Base capabilities\nR is a functional programming language (but not only), and as such it comes with many functions out of the box to write functional code. We have already discussed lapply() and Reduce(). You should know that depending on what you want to achieve, there are other functions that are similar to lapply(): apply(), sapply(), vapply(), mapply() and tapply(). There’s also Map() which is a wrapper around mapply(). Each function performs the same basic task of applying a function over all the elements of a list or list-like structure, but it can be hard to keep them apart and when you should use one over another. This is why {purrr}, which we will discuss in the next section, is quite an interesting alternative to base R’s offering.\nAnother one of the quintessential functional programming functions (alongside Reduce() and Map()) that ships with R is Filter(). If you know dplyr::filter() you should be familiar with the concept of filtering rows of a data frame where the elements of one particular column satisfy a predicate. Filter() works the same way, but focusing on lists instead of data frame:\n\nFilter(is.character,\n list(\n seq(1, 5),\n \"Hey\")\n )\n\n[[1]]\n[1] \"Hey\"\n\n\nThe call above only returns the elements where is.character() evaluates to TRUE.\nAnother useful function is Negate() which is a function factory that takes a boolean function as an input and returns the opposite boolean function. As an illustration, suppose that in the example above we wanted to get everything but the character:\n\nFilter(Negate(is.character),\n list(\n seq(1, 5),\n \"Hey\")\n )\n\n[[1]]\n[1] 1 2 3 4 5\n\n\nThere are some other functions like this that you might want to check out: type ?Negate in console to read more about them.\nSometimes you may need to run code with side-effects, but want to avoid any interaction between these side-effects and the global environment. For example, you might want to run some code that creates a plot and saves it to disk, or code that creates some data and writes them to disk. local() can be used for this. local() runs code in a temporary environment that gets discarded at the end:\n\nlocal({\n a <- 2\n})\n\nVariable a was created inside this local environment. Checking if it exists now yields FALSE:\n\nexists(\"a\")\n\n[1] FALSE\n\n\nWe will be using this technique later in the book to keep our scripts pure.\nBefore continuing with R packages that extend R’s functional programming capabilities it’s also important to stress that just as R is a functional programming language, it is also an object oriented language. In fact, R is what John Chambers called a functional OOP language (Chambers (2014)). I won’t delve too much into what this means (read Wickham (2019) for this), but as a short discussion, consider the print() function. 
Depending on what type of object the user gives it, it seems as if somehow print() knows what to do with it:\n\nprint(5)\n\n[1] 5\n\nprint(head(mtcars))\n\n mpg cyl disp hp drat wt qsec vs am\nMazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1\nMazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1\nDatsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1\nHornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0\nHornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0\nValiant 18.1 6 225 105 2.76 3.460 20.22 1 0\n gear carb\nMazda RX4 4 4\nMazda RX4 Wag 4 4\nDatsun 710 4 1\nHornet 4 Drive 3 1\nHornet Sportabout 3 2\nValiant 3 1\n\nprint(str(mtcars))\n\n'data.frame': 32 obs. of 11 variables:\n $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...\n $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...\n $ disp: num 160 160 108 258 360 ...\n $ hp : num 110 110 93 110 175 105 245 62 95 123 ...\n $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...\n $ wt : num 2.62 2.88 2.32 3.21 3.44 ...\n $ qsec: num 16.5 17 18.6 19.4 17 ...\n $ vs : num 0 0 1 1 0 1 0 1 1 1 ...\n $ am : num 1 1 1 0 0 0 0 0 0 0 ...\n $ gear: num 4 4 4 3 3 3 3 4 4 4 ...\n $ carb: num 4 4 1 1 2 1 4 2 2 4 ...\nNULL\n\n\nThis works by essentially mixing both functional and object-oriented programming, hence functional OOP. Let’s take a closer look at the source code of print() by simply typing print without brackets, into a console:\n\nprint\n\nfunction (x, ...) \nUseMethod(\"print\")\n<bytecode: 0x55df20ffc410>\n<environment: namespace:base>\n\n\nQuite unexpectedly, the source code of print() is one line long and is just UseMethod(\"print\"). So all print() does is use a generic method called “print”. If your text editor has auto-completion enabled, you might see that there are actually many print() functions. For example, type print.data.frame into a console:\n\nprint.data.frame\n\nfunction (x, ..., digits = NULL, quote = FALSE, right = TRUE, \n row.names = TRUE, max = NULL) \n{\n n <- length(row.names(x))\n if (length(x) == 0L) {\n cat(sprintf(ngettext(n, \"data frame with 0 columns and %d row\", \n \"data frame with 0 columns and %d rows\"), n), \"\\n\", \n sep = \"\")\n }\n else if (n == 0L) {\n print.default(names(x), quote = FALSE)\n cat(gettext(\"<0 rows> (or 0-length row.names)\\n\"))\n }\n else {\n if (is.null(max)) \n max <- getOption(\"max.print\", 99999L)\n if (!is.finite(max)) \n stop(\"invalid 'max' / getOption(\\\"max.print\\\"): \", \n max)\n omit <- (n0 <- max%/%length(x)) < n\n m <- as.matrix(format.data.frame(if (omit) \n x[seq_len(n0), , drop = FALSE]\n else x, digits = digits, na.encode = FALSE))\n if (!isTRUE(row.names)) \n dimnames(m)[[1L]] <- if (isFALSE(row.names)) \n rep.int(\"\", if (omit) \n n0\n else n)\n else row.names\n print(m, ..., quote = quote, right = right, max = max)\n if (omit) \n cat(\" [ reached 'max' / getOption(\\\"max.print\\\") -- omitted\", \n n - n0, \"rows ]\\n\")\n }\n invisible(x)\n}\n<bytecode: 0x55df21978088>\n<environment: namespace:base>\n\n\nThis is the print function for data.frame objects. So what print() does, is look at the class of its argument x, and then look for the right print function to call. In more traditional OOP languages, users would type something like:\n\nmtcars.print()\n\nIn these languages, objects encapsulate methods (the equivalent of our functions), so if mtcars is a data frame, it encapsulates a print() method that then does the printing. R is different, because classes and methods are kept separate. 
If a package developer creates a new object class, then the developer also must implement the required methods. For example in the {chronicler} package, the chronicler class is defined alongside the print.chronicler() function to print these objects.\nAll of this to say that if you want to extend R by writing packages, learning some OOP essentials is also important. But for data analysis, functional programming does the job perfectly well. To learn more about R’s different OOP systems (yes, R can do OOP in different ways and the one I sketched here is the simplest, but probably the most used as well), take a look at Wickham (2019).\n\n\n6.4.2 purrr\nThe {purrr} package, developed by Posit (formerly RStudio), contains many functions to make functional programming with R more smooth. In the previous section, we discussed the apply() family of function; they all do a very similar thing, which is looping over a list and applying a function to the elements of the list, but it is not quite easy to remember which one does what. Also, for some of these functions like apply(), the list argument comes first, and then the function, but in the case of mapply(), the function comes first. This type of inconsistencies can be frustrating. Another issue with these functions is that it is not always easy to know what type the output is going to be. List? Atomic vector? Something else?\n{purrr} solves this issue by offering the map() family of functions, which behave in a very consistent way. The basic function is called map() and we’ve already used it:\n\nmap(seq(1, 5), sqrt)\n\n[[1]]\n[1] 1\n\n[[2]]\n[1] 1.414214\n\n[[3]]\n[1] 1.732051\n\n[[4]]\n[1] 2\n\n[[5]]\n[1] 2.236068\n\n\nBut there are many interesting variants:\n\nmap_dbl(seq(1, 5), sqrt)\n\n[1] 1.000000 1.414214 1.732051 2.000000 2.236068\n\n\nmap_dbl() coerces the output to an atomic vector of doubles instead of a list of doubles. Then there’s:\n\nmap_chr(letters, toupper)\n\n [1] \"A\" \"B\" \"C\" \"D\" \"E\" \"F\" \"G\" \"H\" \"I\" \"J\" \"K\" \"L\" \"M\" \"N\"\n[15] \"O\" \"P\" \"Q\" \"R\" \"S\" \"T\" \"U\" \"V\" \"W\" \"X\" \"Y\" \"Z\"\n\n\nfor when the output needs to be an atomic vector of characters.\nThere are many others, so take a look at the documentation with ?map. There’s also walk() which is used if you’re only interested in the side-effect of the function (for example if the function takes paths as input and saves something to disk).\n{purrr} also has functions to replace Reduce(), simply called reduce() and accumulate(), and there are many, many other useful functions. Read through the documentation of the package4 and take the time to learn about all it has to offer.\n\n\n6.4.3 withr\n{withr} is a powerful package that makes it easy to “purify” functions that behave in a way that can cause problems. Remember the function from the introduction that randomly gave out a dish Bruno liked? Here it is again:\n\nh <- function(name, food_list = list()){\n\n food <- sample(c(\"lasagna\", \"cassoulet\", \"feijoada\"), 1)\n\n food_list <- append(food_list, food)\n\n print(paste0(name, \" likes \", food))\n\n food_list\n}\n\nFor the same input, this function may return different outputs so this function is not referentially transparent. 
So we improved the function by adding calls to set.seed() like this:\n\nh2 <- function(name, food_list = list(), seed = 123){\n\n # We set the seed, making sure that we get the same selection of food for a given seed\n set.seed(seed)\n food <- sample(c(\"lasagna\", \"cassoulet\", \"feijoada\"), 1)\n\n # We now need to unset the seed, because if we don't, guess what, the seed will stay set for the whole session!\n set.seed(NULL)\n\n food_list <- append(food_list, food)\n\n print(paste0(name, \" likes \", food))\n\n food_list\n}\n\nThe problem with this approach is that we need to modify our function. We can instead use withr::with_seed() to achieve the same effect:\n\nwithr::with_seed(seed = 123,\n h(\"Bruno\"))\n\n[1] \"Bruno likes feijoada\"\n\n\n[[1]]\n[1] \"feijoada\"\n\n\nIt is also easier to create a wrapper if needed:\n\nh3 <- function(..., seed){\n withr::with_seed(seed = seed,\n h(...))\n}\n\n\nh3(\"Bruno\", seed = 123)\n\n[1] \"Bruno likes feijoada\"\n\n\n[[1]]\n[1] \"feijoada\"\n\n\nIn a previous example we downloaded a dataset and loaded it into memory; we did so by first creating a temporary file, then downloading it and then loading it. Suppose that instead of loading this data into our session, we simply wanted to test whether the link was still working. We wouldn’t want to keep the loaded data in our session, so to avoid having to delete it again manually, we could use with_tempfile():\n\nwithr::with_tempfile(\"unemp\", {\n download.file(\n \"https://is.gd/l57cNX\",\n destfile = unemp)\n load(unemp)\n nrow(unemp)\n }\n)\n\n[1] 472\n\n\nThe data got downloaded, and then loaded, and then we computed the number of rows of the data, without touching the global environment, or state, of our current session.\nJust like for {purrr}, {withr} has many useful functions which I encourage you to familiarize yourself with5."
+ "text": "6.4 Functional programming in R\nUp until now I focused on general concepts rather than on specifics of the R programming language when it comes to functional programming. In this section, we will be focusing entirely on R-specific capabilities and packages for functional programming.\n\n6.4.1 Base capabilities\nR is a functional programming language (but not only), and as such it comes with many functions out of the box to write functional code. We have already discussed lapply() and Reduce(). You should know that depending on what you want to achieve, there are other functions that are similar to lapply(): apply(), sapply(), vapply(), mapply() and tapply(). There’s also Map() which is a wrapper around mapply(). Each function performs the same basic task of applying a function over all the elements of a list or list-like structure, but it can be hard to keep them apart and when you should use one over another. This is why {purrr}, which we will discuss in the next section, is quite an interesting alternative to base R’s offering.\nAnother one of the quintessential functional programming functions (alongside Reduce() and Map()) that ships with R is Filter(). If you know dplyr::filter() you should be familiar with the concept of filtering rows of a data frame where the elements of one particular column satisfy a predicate. Filter() works the same way, but focusing on lists instead of data frame:\n\nFilter(is.character,\n list(\n seq(1, 5),\n \"Hey\")\n )\n\n[[1]]\n[1] \"Hey\"\n\n\nThe call above only returns the elements where is.character() evaluates to TRUE.\nAnother useful function is Negate() which is a function factory that takes a boolean function as an input and returns the opposite boolean function. As an illustration, suppose that in the example above we wanted to get everything but the character:\n\nFilter(Negate(is.character),\n list(\n seq(1, 5),\n \"Hey\")\n )\n\n[[1]]\n[1] 1 2 3 4 5\n\n\nThere are some other functions like this that you might want to check out: type ?Negate in console to read more about them.\nSometimes you may need to run code with side-effects, but want to avoid any interaction between these side-effects and the global environment. For example, you might want to run some code that creates a plot and saves it to disk, or code that creates some data and writes them to disk. local() can be used for this. local() runs code in a temporary environment that gets discarded at the end:\n\nlocal({\n a <- 2\n})\n\nVariable a was created inside this local environment. Checking if it exists now yields FALSE:\n\nexists(\"a\")\n\n[1] FALSE\n\n\nWe will be using this technique later in the book to keep our scripts pure.\nBefore continuing with R packages that extend R’s functional programming capabilities it’s also important to stress that just as R is a functional programming language, it is also an object oriented language. In fact, R is what John Chambers called a functional OOP language (Chambers (2014)). I won’t delve too much into what this means (read Wickham (2019) for this), but as a short discussion, consider the print() function. 
Depending on what type of object the user gives it, it seems as if somehow print() knows what to do with it:\n\nprint(5)\n\n[1] 5\n\nprint(head(mtcars))\n\n mpg cyl disp hp drat wt qsec vs am\nMazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1\nMazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1\nDatsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1\nHornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0\nHornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0\nValiant 18.1 6 225 105 2.76 3.460 20.22 1 0\n gear carb\nMazda RX4 4 4\nMazda RX4 Wag 4 4\nDatsun 710 4 1\nHornet 4 Drive 3 1\nHornet Sportabout 3 2\nValiant 3 1\n\nprint(str(mtcars))\n\n'data.frame': 32 obs. of 11 variables:\n $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...\n $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...\n $ disp: num 160 160 108 258 360 ...\n $ hp : num 110 110 93 110 175 105 245 62 95 123 ...\n $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...\n $ wt : num 2.62 2.88 2.32 3.21 3.44 ...\n $ qsec: num 16.5 17 18.6 19.4 17 ...\n $ vs : num 0 0 1 1 0 1 0 1 1 1 ...\n $ am : num 1 1 1 0 0 0 0 0 0 0 ...\n $ gear: num 4 4 4 3 3 3 3 4 4 4 ...\n $ carb: num 4 4 1 1 2 1 4 2 2 4 ...\nNULL\n\n\nThis works by essentially mixing both functional and object-oriented programming, hence functional OOP. Let’s take a closer look at the source code of print() by simply typing print without brackets, into a console:\n\nprint\n\nfunction (x, ...) \nUseMethod(\"print\")\n<bytecode: 0x5574d6500a28>\n<environment: namespace:base>\n\n\nQuite unexpectedly, the source code of print() is one line long and is just UseMethod(\"print\"). So all print() does is use a generic method called “print”. If your text editor has auto-completion enabled, you might see that there are actually many print() functions. For example, type print.data.frame into a console:\n\nprint.data.frame\n\nfunction (x, ..., digits = NULL, quote = FALSE, right = TRUE, \n row.names = TRUE, max = NULL) \n{\n n <- length(row.names(x))\n if (length(x) == 0L) {\n cat(sprintf(ngettext(n, \"data frame with 0 columns and %d row\", \n \"data frame with 0 columns and %d rows\"), n), \"\\n\", \n sep = \"\")\n }\n else if (n == 0L) {\n print.default(names(x), quote = FALSE)\n cat(gettext(\"<0 rows> (or 0-length row.names)\\n\"))\n }\n else {\n if (is.null(max)) \n max <- getOption(\"max.print\", 99999L)\n if (!is.finite(max)) \n stop(\"invalid 'max' / getOption(\\\"max.print\\\"): \", \n max)\n omit <- (n0 <- max%/%length(x)) < n\n m <- as.matrix(format.data.frame(if (omit) \n x[seq_len(n0), , drop = FALSE]\n else x, digits = digits, na.encode = FALSE))\n if (!isTRUE(row.names)) \n dimnames(m)[[1L]] <- if (isFALSE(row.names)) \n rep.int(\"\", if (omit) \n n0\n else n)\n else row.names\n print(m, ..., quote = quote, right = right, max = max)\n if (omit) \n cat(\" [ reached 'max' / getOption(\\\"max.print\\\") -- omitted\", \n n - n0, \"rows ]\\n\")\n }\n invisible(x)\n}\n<bytecode: 0x5574d7d38508>\n<environment: namespace:base>\n\n\nThis is the print function for data.frame objects. So what print() does, is look at the class of its argument x, and then look for the right print function to call. In more traditional OOP languages, users would type something like:\n\nmtcars.print()\n\nIn these languages, objects encapsulate methods (the equivalent of our functions), so if mtcars is a data frame, it encapsulates a print() method that then does the printing. R is different, because classes and methods are kept separate. 
If a package developer creates a new object class, then the developer also must implement the required methods. For example in the {chronicler} package, the chronicler class is defined alongside the print.chronicler() function to print these objects.\nAll of this to say that if you want to extend R by writing packages, learning some OOP essentials is also important. But for data analysis, functional programming does the job perfectly well. To learn more about R’s different OOP systems (yes, R can do OOP in different ways and the one I sketched here is the simplest, but probably the most used as well), take a look at Wickham (2019).\n\n\n6.4.2 purrr\nThe {purrr} package, developed by Posit (formerly RStudio), contains many functions to make functional programming with R more smooth. In the previous section, we discussed the apply() family of function; they all do a very similar thing, which is looping over a list and applying a function to the elements of the list, but it is not quite easy to remember which one does what. Also, for some of these functions like apply(), the list argument comes first, and then the function, but in the case of mapply(), the function comes first. This type of inconsistencies can be frustrating. Another issue with these functions is that it is not always easy to know what type the output is going to be. List? Atomic vector? Something else?\n{purrr} solves this issue by offering the map() family of functions, which behave in a very consistent way. The basic function is called map() and we’ve already used it:\n\nmap(seq(1, 5), sqrt)\n\n[[1]]\n[1] 1\n\n[[2]]\n[1] 1.414214\n\n[[3]]\n[1] 1.732051\n\n[[4]]\n[1] 2\n\n[[5]]\n[1] 2.236068\n\n\nBut there are many interesting variants:\n\nmap_dbl(seq(1, 5), sqrt)\n\n[1] 1.000000 1.414214 1.732051 2.000000 2.236068\n\n\nmap_dbl() coerces the output to an atomic vector of doubles instead of a list of doubles. Then there’s:\n\nmap_chr(letters, toupper)\n\n [1] \"A\" \"B\" \"C\" \"D\" \"E\" \"F\" \"G\" \"H\" \"I\" \"J\" \"K\" \"L\" \"M\" \"N\"\n[15] \"O\" \"P\" \"Q\" \"R\" \"S\" \"T\" \"U\" \"V\" \"W\" \"X\" \"Y\" \"Z\"\n\n\nfor when the output needs to be an atomic vector of characters.\nThere are many others, so take a look at the documentation with ?map. There’s also walk() which is used if you’re only interested in the side-effect of the function (for example if the function takes paths as input and saves something to disk).\n{purrr} also has functions to replace Reduce(), simply called reduce() and accumulate(), and there are many, many other useful functions. Read through the documentation of the package4 and take the time to learn about all it has to offer.\n\n\n6.4.3 withr\n{withr} is a powerful package that makes it easy to “purify” functions that behave in a way that can cause problems. Remember the function from the introduction that randomly gave out a dish Bruno liked? Here it is again:\n\nh <- function(name, food_list = list()){\n\n food <- sample(c(\"lasagna\", \"cassoulet\", \"feijoada\"), 1)\n\n food_list <- append(food_list, food)\n\n print(paste0(name, \" likes \", food))\n\n food_list\n}\n\nFor the same input, this function may return different outputs so this function is not referentially transparent. 
So we improved the function by adding calls to set.seed() like this:\n\nh2 <- function(name, food_list = list(), seed = 123){\n\n # We set the seed, making sure that we get the same selection of food for a given seed\n set.seed(seed)\n food <- sample(c(\"lasagna\", \"cassoulet\", \"feijoada\"), 1)\n\n # We now need to unset the seed, because if we don't, guess what, the seed will stay set for the whole session!\n set.seed(NULL)\n\n food_list <- append(food_list, food)\n\n print(paste0(name, \" likes \", food))\n\n food_list\n}\n\nThe problem with this approach is that we need to modify our function. We can instead use withr::with_seed() to achieve the same effect:\n\nwithr::with_seed(seed = 123,\n h(\"Bruno\"))\n\n[1] \"Bruno likes feijoada\"\n\n\n[[1]]\n[1] \"feijoada\"\n\n\nIt is also easier to create a wrapper if needed:\n\nh3 <- function(..., seed){\n withr::with_seed(seed = seed,\n h(...))\n}\n\n\nh3(\"Bruno\", seed = 123)\n\n[1] \"Bruno likes feijoada\"\n\n\n[[1]]\n[1] \"feijoada\"\n\n\nIn a previous example we downloaded a dataset and loaded it into memory; we did so by first creating a temporary file, then downloading it and then loading it. Suppose that instead of loading this data into our session, we simply wanted to test whether the link was still working. We wouldn’t want to keep the loaded data in our session, so to avoid having to delete it again manually, we could use with_tempfile():\n\nwithr::with_tempfile(\"unemp\", {\n download.file(\n \"https://is.gd/l57cNX\",\n destfile = unemp)\n load(unemp)\n nrow(unemp)\n }\n)\n\n[1] 472\n\n\nThe data got downloaded, and then loaded, and then we computed the number of rows of the data, without touching the global environment, or state, of our current session.\nJust like for {purrr}, {withr} has many useful functions which I encourage you to familiarize yourself with5."
},
{
"objectID": "fprog.html#conclusion",
@@ -501,7 +501,7 @@
"href": "targets.html#sec-targets-rewrite",
"title": "13 Build automation with targets",
"section": "13.8 Rewriting our project as a pipeline and {renv} redux",
- "text": "13.8 Rewriting our project as a pipeline and {renv} redux\nIt is now time to return our little project into a full-fledged reproducible pipeline. For this, we’re going back to our project’s folder and specifically to the fusen branch. This is the branch where we used {fusen} to turn our .Rmd into a package. This package contains the functions that we need to update the data. But remember, we wrote the analysis in another .Rmd file that we did not inflate, analyse_data.Rmd. We are now going to write a {targets} pipeline that will make use of the inflated package and compute all the targets required for the analysis. The first step is to create a new branch, but you could also create an entirely new repository if you want. It’s up to you. If you create a new branch, start from the rmd branch, since this will provide a nice starting point.\n#switch to the rmd branch\nowner@localhost ➤ git checkout rmd\n\n#create and switch to the new branch called pipeline\nowner@localhost ➤ git checkout -b pipeline \nIf you start with a fresh repository, you can grab the analyse_data.Rmd from here3.\nFirst order of business, let’s delete save_data.Rmd (unless you started with an empty repo). We don’t need that file anymore, since everything is now available in the package we developed:\nowner@localhost ➤ rm save_data.Rmd\nLet’s now start an R session in that folder and install our {housing} package. Whether you followed along and developed the package, or skipped the previous parts and didn’t develop the package by following along, install it from my Github repository. This ensures that you have exactly the same version as me. Run the following line:\n\nremotes::install_github(\"rap4all/housing@fusen\",\n ref = \"1c86095\")\n\nThis will install the package from my Github repository, and very specifically the version from the fusen branch at commit 1c86095 (you may need to install the {remotes} package first). Now that the package is installed, we can start building the pipeline. In the same R session, call tar_script() which will give us a nice template _targets.R file:\n\ntargets::tar_script()\n\nYou should at most have three files: README.md, _targets.R and analyse_data.Rmd (unless you started with an empty repo, in which case you don’t have the README.md file). We will now change analyse_data.Rmd, to load pre-computed targets, instead of computing them inside the analyse_data.Rmd at compilation time.\nFirst, we need to load the data. The two datasets we use are now part of the package, so we can simply load them using data(commune_level_data) and data(country_level_data). But remember, {targets} only loves pure functions, and data() is not pure! Let’s see what happens when you call data(mtcars). If you’re using RStudio, this is really visible: in a fresh session, calling data(mtcars) shows the following in the Environment pane:\n\n\n\nWhat’s described as ‘mtcars’ is not a ‘data.frame’, yet.\n\n\nAt this stage, mtcars is only a promise. It’s only if you need to interact with mtcars that the promise becomes a data.frame. So data() returns a promise, does this mean that we can save that into a variable? If you try the following:\n\nx <- data(mtcars)\n\nAnd check out x, you will see that x contains the string \"mtcars\" and is of class character! So data() returns a promise by saving it directly to the global environment (this is a side-effect) but it returns a string. 
Because {targets} needs pure functions, if we write:\n\ntar_target(\n target_mtcars,\n data(mtcars)\n)\n\nthe target_mtcars target will be equal to the \"mtcars\" character string. This might be confusing, but just remember: a target must return something, and functions with side-effects don’t always return something, or not the thing we want. Also remember the example on plotting with plot(), which does not return an object. It’s actually the same issue here.\nSo to solve this, we need a pure function that returns the data.frame. This means that it first needs to load the data, which results in a promise (which gets loaded into the environment directly), and then evaluate that promise. The function to achieve this is as follows:\n\nread_data <- function(data_name, package_name){\n\n temp <- new.env(parent = emptyenv())\n\n data(list = data_name,\n package = package_name,\n envir = temp)\n\n get(data_name, envir = temp)\n}\n\nThis function takes data_name and package_name as arguments, both strings.\nThen, I used data() but with two arguments: list = and package =. list = is needed because we pass data_name as a string. If we did something like data(data_name) instead, hoping that data_name would get replaced by its bound value (\"commune_level_data\") it would result in an error. This is because data() would be looking for a data set literally called data_name instead of replacing data_name by its bound value. The second argument, package = simply states that we’re looking for that dataset in the {housing} package and uses the bound value of package_name. Now comes the envir = argument. This argument tells data() where to load the data set. By default, data() loads the data in the global environment. But remember, we want our function to be pure, meaning, it should only return the data object and not load anything into the global environment! So that’s where the temporary environment created in the first line of the body of the function comes into play. What happens is that the function loads the data object into this temporary environment, which is different from the global environment. Once we’re done, we can simply discard this environment, and so our global environment stays clean.\nThe final step is using get(). Remember that once the line data(list = data_name...) has run, all we have is a promise. So if we stop there, the target would simply hold the character \"commune_level_data\". In order to turn that promise into the data frame, we use get(). We’ve already encountered this function in Chapter 7. get() searches an object by name, and returns it. So in the line get(data_name), data_name gets first replaced by its bound value, so get(\"commune_level_data\") and hence we get the dataset back. Also, get() looks for that name in the temporary environment that was set up. This way, there is literally no interaction with the global environment, so that function is pure: it always returns the same output for the same input, and does not pollute, in any way, the global environment. After that function is done running, the temporary environment is discarded.\nThis seems overly complicated, but it’s all a consequence of {targets} needing pure functions that actually return something to work well. Unfortunately some of R’s default functions are not pure, so we need this kind of workaround. However, all of this work is not in vain! By forcing us to use pure functions, {targets} contributes to the general quality and safety of our pipeline. 
Once our pipeline is done running, the global environment will stay clean. Having stuff pollute the global environment can cause interactions with subsequent runs of the pipeline.\nLet’s continue the pipeline: here is what it will ultimately look like:\n\nlibrary(targets)\nlibrary(tarchetypes)\n\ntar_option_set(packages = \"housing\")\n\nsource(\"functions/read_data.R\")\n\nlist(\n tar_target(\n commune_level_data,\n read_data(\"commune_level_data\",\n \"housing\")\n ),\n\n tar_target(\n country_level_data,\n read_data(\"country_level_data\",\n \"housing\")\n ),\n\n tar_target(\n commune_data,\n get_laspeyeres(commune_level_data)\n ),\n\n tar_target(\n country_data,\n get_laspeyeres(country_level_data)\n ),\n\n tar_target(\n communes,\n c(\"Luxembourg\",\n \"Esch-sur-Alzette\",\n \"Mamer\",\n \"Schengen\",\n \"Wincrange\")\n ),\n\n tar_render(\n analyse_data,\n \"analyse_data.Rmd\"\n )\n\n)\n\nAnd here is what analyse_data.Rmd now looks like:\n---\ntitle: \"Nominal house prices data in Luxembourg\"\nauthor: \"Bruno Rodrigues\"\ndate: \"`r Sys.Date()`\"\n---\n\nLet’s load the datasets (the Laspeyeres price index is already computed):\n\n```{r}\ntar_load(commune_data)\ntar_load(country_data)\n```\n\nWe are going to create a plot for 5 communes and compare the\nprice evolution in the communes to the national price evolution. \nLet’s first load the communes:\n\n```{r}\ntar_load(communes)\n```\n\n```{r, results = \"asis\"}\nres <- lapply(communes, function(x){\n\n knitr::knit_child(text = c(\n\n '\\n',\n '## Plot for commune: `r x`',\n '\\n',\n '```{r, echo = F}',\n 'print(make_plot(country_data, commune_data, x))',\n '```'\n\n ),\n envir = environment(),\n quiet = TRUE)\n\n})\n\ncat(unlist(res), sep = \"\\n\")\n\n```\nAs you can see, the data gets loaded by using tar_load() which loads the two pre-computed data sets. The end portion of the document looks fairly similar to what we had before turning our analysis into a package and then a pipeline. We use a child document to generate as many sections as required automatically (remember, Don’t Repeat Yourself!). Try to change something in the pipeline, for example remove some communes from the communes object, and rerun the whole pipeline using tar_make().\nWe are now done with this introduction to {targets}: we have turned our analysis into a pipeline, and now we need to ensure that the outputs it produces are reproducible. So the first step is to use {renv}; but as already discussed, this will not be enough, but it is essential that you do it! So let’s initialize {renv}:\n\nrenv::init()\n\nThis will create an renv.lock file with all the dependencies of the pipeline listed. 
Very importantly, our Github package also gets listed:\n\"housing\": {\n \"Package\": \"housing\",\n \"Version\": \"0.1\",\n \"Source\": \"GitHub\",\n \"RemoteType\": \"github\",\n \"RemoteHost\": \"api.github.com\",\n \"RemoteRepo\": \"housing\",\n \"RemoteUsername\": \"rap4all\",\n \"RemoteRef\": \"fusen\",\n \"RemoteSha\": \"1c860959310b80e67c41f7bbdc3e84cef00df18e\",\n \"Hash\": \"859672476501daeea9b719b9218165f1\",\n \"Requirements\": [\n \"dplyr\",\n \"ggplot2\",\n \"janitor\",\n \"purrr\",\n \"readxl\",\n \"rlang\",\n \"rvest\",\n \"stringr\",\n \"tidyr\"\n ]\n},\nIf you look at the fields titled RemoteSha and RemoteRef you should recognize the commit hash and repository that were used to install the package:\n\"RemoteRef\": \"fusen\",\n\"RemoteSha\": \"1c860959310b80e67c41f7bbdc3e84cef00df18e\",\nThis means that if someone wants to re-run our project, by running renv::restore() the correct version of the package will get installed! To finish things, we should edit the Readme.md file and add instructions on how to re-run the project. This is what the Readme.md file could look like:\n# How to run\n\n- Clone the repository: `git clone git@github.com:rap4all/housing.git`\n- Switch to the `pipeline` branch: `git checkout pipeline`\n- Start an R session in the folder and run `renv::restore()` \n to install the project’s dependencies.\n- Run the pipeline with `targets::tar_make()`.\n- Checkout `analyse_data.html` for the output.\nIf you followed along, don’t forget to change the url of the repository to your own in the first bullet point of th Readme."
+ "text": "13.8 Rewriting our project as a pipeline and {renv} redux\nIt is now time to return our little project into a full-fledged reproducible pipeline. For this, we’re going back to our project’s folder and specifically to the fusen branch. This is the branch where we used {fusen} to turn our .Rmd into a package. This package contains the functions that we need to update the data. But remember, we wrote the analysis in another .Rmd file that we did not inflate, analyse_data.Rmd. We are now going to write a {targets} pipeline that will make use of the inflated package and compute all the targets required for the analysis. The first step is to create a new branch, but you could also create an entirely new repository if you want. It’s up to you. If you create a new branch, start from the rmd branch, since this will provide a nice starting point.\n#switch to the rmd branch\nowner@localhost ➤ git checkout rmd\n\n#create and switch to the new branch called pipeline\nowner@localhost ➤ git checkout -b pipeline \nIf you start with a fresh repository, you can grab the analyse_data.Rmd from here3.\nFirst order of business, let’s delete save_data.Rmd (unless you started with an empty repo). We don’t need that file anymore, since everything is now available in the package we developed:\nowner@localhost ➤ rm save_data.Rmd\nLet’s now start an R session in that folder and install our {housing} package. Whether you followed along and developed the package, or skipped the previous parts and didn’t develop the package by following along, install it from my Github repository. This ensures that you have exactly the same version as me. Run the following line:\n\nremotes::install_github(\"rap4all/housing@fusen\",\n ref = \"1c86095\")\n\nThis will install the package from my Github repository, and very specifically the version from the fusen branch at commit 1c86095 (you may need to install the {remotes} package first). Now that the package is installed, we can start building the pipeline. In the same R session, call tar_script() which will give us a nice template _targets.R file:\n\ntargets::tar_script()\n\nYou should at most have three files: README.md, _targets.R and analyse_data.Rmd (unless you started with an empty repo, in which case you don’t have the README.md file). We will now change analyse_data.Rmd, to load pre-computed targets, instead of computing them inside the analyse_data.Rmd at compilation time.\nFirst, we need to load the data. The two datasets we use are now part of the package, so we can simply load them using data(commune_level_data) and data(country_level_data). But remember, {targets} only loves pure functions, and data() is not pure! Let’s see what happens when you call data(mtcars). If you’re using RStudio, this is really visible: in a fresh session, calling data(mtcars) shows the following in the Environment pane:\n\n\n\nWhat’s described as ‘mtcars’ is not a ‘data.frame’, yet.\n\n\nAt this stage, mtcars is only a promise. It’s only if you need to interact with mtcars that the promise becomes a data.frame. So data() returns a promise, does this mean that we can save that into a variable? If you try the following:\n\nx <- data(mtcars)\n\nAnd check out x, you will see that x contains the string \"mtcars\" and is of class character! So data() returns a promise by saving it directly to the global environment (this is a side-effect) but it returns a string. 
Because {targets} needs pure functions, if we write:\n\ntar_target(\n target_mtcars,\n data(mtcars)\n)\n\nthe target_mtcars target will be equal to the \"mtcars\" character string. This might be confusing, but just remember: a target must return something, and functions with side-effects don’t always return something, or not the thing we want. Also remember the example on plotting with plot(), which does not return an object. It’s actually the same issue here.\nSo to solve this, we need a pure function that returns the data.frame. This means that it first needs to load the data, which results in a promise (which gets loaded into the environment directly), and then evaluate that promise. The function to achieve this is as follows:\n\nread_data <- function(data_name, package_name){\n\n temp <- new.env(parent = emptyenv())\n\n data(list = data_name,\n package = package_name,\n envir = temp)\n\n get(data_name, envir = temp)\n}\n\nThis function takes data_name and package_name as arguments, both strings.\nThen, I used data(), but with two arguments: list = and package =. list = is needed because we pass data_name as a string. If we did something like data(data_name) instead, hoping that data_name would get replaced by its bound value (\"commune_level_data\"), it would result in an error. This is because data() would be looking for a data set literally called data_name instead of replacing data_name by its bound value. The second argument, package =, simply states that we’re looking for that dataset in the {housing} package and uses the bound value of package_name. Now comes the envir = argument. This argument tells data() where to load the data set. By default, data() loads the data in the global environment. But remember, we want our function to be pure, meaning it should only return the data object and not load anything into the global environment! So that’s where the temporary environment created in the first line of the body of the function comes into play. What happens is that the function loads the data object into this temporary environment, which is different from the global environment. Once we’re done, we can simply discard this environment, and so our global environment stays clean.\nThe final step is using get(). Remember that once the line data(list = data_name...) has run, all we have is a promise. So if we stop there, the target would simply hold the character \"commune_level_data\". In order to turn that promise into the data frame, we use get(). We’ve already encountered this function in Chapter 7. get() searches for an object by name and returns it. So in the line get(data_name), data_name first gets replaced by its bound value, so the call becomes get(\"commune_level_data\"), and hence we get the dataset back. Also, get() looks for that name in the temporary environment that was set up. This way, there is literally no interaction with the global environment, so that function is pure: it always returns the same output for the same input, and does not pollute, in any way, the global environment. After that function is done running, the temporary environment is discarded.\nThis seems overly complicated, but it’s all a consequence of {targets} needing pure functions that actually return something to work well. Unfortunately, some of R’s default functions are not pure, so we need this kind of workaround. However, all of this work is not in vain! By forcing us to use pure functions, {targets} contributes to the general quality and safety of our pipeline. 
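As a quick sanity check of this approach (not part of the pipeline itself), you could point read_data() at a base R dataset and verify that nothing leaks into the global environment. This is only a sketch and assumes a fresh session in which read_data() as defined above is available; `mtcars` from the {datasets} package is used purely for illustration:

```r
mtcars_df <- read_data("mtcars", "datasets")

class(mtcars_df)
# [1] "data.frame"  -- we get the actual data back, not a string

# and the global environment stays clean:
exists("mtcars", envir = globalenv(), inherits = FALSE)
# [1] FALSE
```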
Once our pipeline is done running, the global environment will stay clean. Having stuff pollute the global environment can cause interactions with subsequent runs of the pipeline.\nLet’s continue the pipeline: here is what it will ultimately look like:\n\nlibrary(targets)\nlibrary(tarchetypes)\n\ntar_option_set(packages = \"housing\")\n\nsource(\"functions/read_data.R\")\n\nlist(\n tar_target(\n commune_level_data,\n read_data(\"commune_level_data\",\n \"housing\")\n ),\n\n tar_target(\n country_level_data,\n read_data(\"country_level_data\",\n \"housing\")\n ),\n\n tar_target(\n commune_data,\n get_laspeyeres(commune_level_data)\n ),\n\n tar_target(\n country_data,\n get_laspeyeres(country_level_data)\n ),\n\n tar_target(\n communes,\n c(\"Luxembourg\",\n \"Esch-sur-Alzette\",\n \"Mamer\",\n \"Schengen\",\n \"Wincrange\")\n ),\n\n tar_render(\n analyse_data,\n \"analyse_data.Rmd\"\n )\n\n)\n\nAnd here is what analyse_data.Rmd now looks like:\n---\ntitle: \"Nominal house prices data in Luxembourg\"\nauthor: \"Bruno Rodrigues\"\ndate: \"`r Sys.Date()`\"\n---\n\nLet’s load the datasets (the Laspeyeres price index is already computed):\n\n```{r}\ntar_load(commune_data)\ntar_load(country_data)\n```\n\nWe are going to create a plot for 5 communes and compare the\nprice evolution in the communes to the national price evolution. \nLet’s first load the communes:\n\n```{r}\ntar_load(communes)\n```\n\n```{r, results = \"asis\"}\nres <- lapply(communes, function(x){\n\n knitr::knit_child(text = c(\n\n '\\n',\n '## Plot for commune: `r x`',\n '\\n',\n '```{r, echo = F}',\n 'print(make_plot(country_data, commune_data, x))',\n '```'\n\n ),\n envir = environment(),\n quiet = TRUE)\n\n})\n\ncat(unlist(res), sep = \"\\n\")\n\n```\nAs you can see, the data gets loaded by using tar_load() which loads the two pre-computed data sets. The end portion of the document looks fairly similar to what we had before turning our analysis into a package and then a pipeline. We use a child document to generate as many sections as required automatically (remember, Don’t Repeat Yourself!). Try to change something in the pipeline, for example remove some communes from the communes object, and rerun the whole pipeline using tar_make().\nWe are now done with this introduction to {targets}: we have turned our analysis into a pipeline, and now we need to ensure that the outputs it produces are reproducible. So the first step is to use {renv}; but as already discussed, this will not be enough, but it is essential that you do it! So let’s initialize {renv}:\n\nrenv::init()\n\nThis will create an renv.lock file with all the dependencies of the pipeline listed. 
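If you later want to check that the lockfile still matches what the pipeline actually uses, or to record changes after adding a package, the standard {renv} helpers below are useful. This is only a sketch; renv::init() is all that the steps above strictly require:

```r
renv::status()    # compares the packages the project uses with what renv.lock records
renv::snapshot()  # updates renv.lock after you add or upgrade packages
```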
Very importantly, our Github package also gets listed:\n\"housing\": {\n \"Package\": \"housing\",\n \"Version\": \"0.1\",\n \"Source\": \"GitHub\",\n \"RemoteType\": \"github\",\n \"RemoteHost\": \"api.github.com\",\n \"RemoteRepo\": \"housing\",\n \"RemoteUsername\": \"rap4all\",\n \"RemoteRef\": \"fusen\",\n \"RemoteSha\": \"1c860959310b80e67c41f7bbdc3e84cef00df18e\",\n \"Hash\": \"859672476501daeea9b719b9218165f1\",\n \"Requirements\": [\n \"dplyr\",\n \"ggplot2\",\n \"janitor\",\n \"purrr\",\n \"readxl\",\n \"rlang\",\n \"rvest\",\n \"stringr\",\n \"tidyr\"\n ]\n},\nIf you look at the fields titled RemoteSha and RemoteRef you should recognize the commit hash and repository that were used to install the package:\n\"RemoteRef\": \"fusen\",\n\"RemoteSha\": \"1c860959310b80e67c41f7bbdc3e84cef00df18e\",\nThis means that if someone wants to re-run our project, by running renv::restore() the correct version of the package will get installed! To finish things, we should edit the Readme.md file and add instructions on how to re-run the project. This is what the Readme.md file could look like:\n# How to run\n\n- Clone the repository: `git clone git@github.com:rap4all/housing.git`\n- Switch to the `pipeline` branch: `git checkout pipeline`\n- Start an R session in the folder and run `renv::restore()` \n to install the project’s dependencies.\n- Run the pipeline with `targets::tar_make()`.\n- Checkout `analyse_data.html` for the output.\nIf you followed along, don’t forget to change the url of the repository to your own in the first bullet point of the Readme."
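To give a rough idea of what this looks like from a collaborator's point of view, here is a sketch of the R session they would run after cloning the repository and switching to the pipeline branch (exact package versions and console output will of course differ on their machine):

```r
# inside the freshly cloned project:
renv::restore()      # installs the exact versions recorded in renv.lock,
                     # including {housing} from GitHub at the recorded commit
targets::tar_make()  # runs the whole pipeline and renders analyse_data.html
```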
},
{
"objectID": "targets.html#some-little-tips-before-concluding",
diff --git a/targets.html b/targets.html
index c9274b2..9f26426 100644
--- a/targets.html
+++ b/targets.html
@@ -1292,7 +1292,7 @@
to install the project’s dependencies.
- Run the pipeline with `targets::tar_make()`.
- Checkout `analyse_data.html` for the output.
-If you followed along, don’t forget to change the url of the repository to your own in the first bullet point of th Readme.
+If you followed along, don’t forget to change the url of the repository to your own in the first bullet point of the Readme.