diff --git a/.nojekyll b/.nojekyll index 813b9e3..207b79d 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -92170ac9 \ No newline at end of file +29592927 \ No newline at end of file diff --git a/02-intro_R.html b/02-intro_R.html index eae892c..881424b 100644 --- a/02-intro_R.html +++ b/02-intro_R.html @@ -20,40 +20,6 @@ margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ vertical-align: middle; } -/* CSS for syntax highlighting */ -pre > code.sourceCode { white-space: pre; position: relative; } -pre > code.sourceCode > span { line-height: 1.25; } -pre > code.sourceCode > span:empty { height: 1.2em; } -.sourceCode { overflow: visible; } -code.sourceCode > span { color: inherit; text-decoration: inherit; } -div.sourceCode { margin: 1em 0; } -pre.sourceCode { margin: 0; } -@media screen { -div.sourceCode { overflow: auto; } -} -@media print { -pre > code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } -} -pre.numberSource code - { counter-reset: source-line 0; } -pre.numberSource code > span - { position: relative; left: -4em; counter-increment: source-line; } -pre.numberSource code > span > a:first-child::before - { content: counter(source-line); - position: relative; left: -1em; text-align: right; vertical-align: baseline; - border: none; display: inline-block; - -webkit-touch-callout: none; -webkit-user-select: none; - -khtml-user-select: none; -moz-user-select: none; - -ms-user-select: none; user-select: none; - padding: 0 4px; width: 4em; - } -pre.numberSource { margin-left: 3em; padding-left: 4px; } -div.sourceCode - { } -@media screen { -pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } -} @@ -104,6 +70,298 @@ } } + + + + + @@ -279,6 +537,16 @@

Table of contents

+ +
@@ -335,36 +603,568 @@


1.2 A little aside on pipes

Since R version 4.1, a forward pipe |> is included in the standard library of the language. It allows you to write this:

-
-
4 |>
-  sqrt()
-
-
[1] 2
-
+
+ +
+
+
+

+    
+
+
+
+

Before R version 4.1, there was already a forward pipe, introduced with the {magrittr} package (and automatically loaded by many other packages from the tidyverse, like {dplyr}):

-
-
library(dplyr)
-
-

-Attaching package: 'dplyr'
+
+ +
+
+
+

+    
+
+
+
-
-
The following objects are masked from 'package:stats':
+
 

Both expressions above are equivalent to sqrt(4). You will see why this is useful very soon. For now, just know this exists and try to get used to it.
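
As a minimal sketch of that equivalence, the snippet below compares the three spellings directly. It assumes R 4.1 or later for the native pipe and an installed {magrittr} (loaded here directly rather than through {dplyr}); the variable names are only illustrative.

```r
# Assumption: R >= 4.1 and {magrittr} installed; names below are illustrative.
library(magrittr)

direct   <- sqrt(4)        # ordinary function call
native   <- 4 |> sqrt()    # base R forward pipe
magrittr <- 4 %>% sqrt()   # {magrittr} forward pipe

# All three spellings evaluate the same call, so the results are identical.
identical(direct, native)
identical(direct, magrittr)
```

Both pipes pass the left-hand side as the first argument of the call on the right, which is why a chain of such calls reads from top to bottom.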

diff --git a/03-functional-programming.html b/03-functional-programming.html index 5c91a92..043ee27 100644 --- a/03-functional-programming.html +++ b/03-functional-programming.html @@ -328,8 +328,8 @@

rnorm(n = 10)
-
 [1] -1.77883248  0.42853232  0.47230020 -0.09081041 -0.20139649 -0.38217676
- [7] -1.20057531  0.48772363 -1.83718414  0.21290767
+
 [1] -0.50951320  0.32378969 -0.55483440 -0.03788488 -0.16860005  1.49830472
+ [7] -0.81259719  0.51249157  0.21957924 -0.14170387

Each time you run this line, you will get another set of 10 random numbers. This is obviously a good thing in interactive data analysis, but much less so when running a pipeline programmatically. R provides a way to fix the random seed, which will make sure you always get the same random numbers:
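
A small sketch of fixing the seed with base R's set.seed() is shown here. The seed value 1234 is the one this chapter uses; the variable names are hypothetical and only there to make the comparison explicit.

```r
# Fix the seed, draw, then fix the same seed again and redraw.
set.seed(1234)
first_draw <- rnorm(n = 10)

set.seed(1234)
second_draw <- rnorm(n = 10)

# Same seed before each call, so both vectors hold the same 10 numbers.
identical(first_draw, second_draw)
```

Note that the reproducibility comes from resetting the seed before each call, which is itself a change to the session's state.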

diff --git a/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf b/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf index 5022f7a..f955224 100644 Binary files a/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf and b/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf differ diff --git a/Building-Reproducible-Analytical-Pipelines.epub b/Building-Reproducible-Analytical-Pipelines.epub index cb01f1c..7c8ef67 100644 Binary files a/Building-Reproducible-Analytical-Pipelines.epub and b/Building-Reproducible-Analytical-Pipelines.epub differ diff --git a/Building-Reproducible-Analytical-Pipelines.pdf b/Building-Reproducible-Analytical-Pipelines.pdf index 7821cc8..1b00b7b 100644 Binary files a/Building-Reproducible-Analytical-Pipelines.pdf and b/Building-Reproducible-Analytical-Pipelines.pdf differ diff --git a/search.json b/search.json index fa6a18b..8af777e 100644 --- a/search.json +++ b/search.json @@ -124,7 +124,7 @@ "href": "02-intro_R.html#a-little-aside-on-pipes", "title": "1  Introduction to R", "section": "1.2 A little aside on pipes", - "text": "1.2 A little aside on pipes\nSince R version 4.1, a forward pipe |> is included in the standard library of the language. It allows to do this:\n\n4 |>\n sqrt()\n\n[1] 2\n\n\nBefore R version 4.1, there was already a forward pipe, introduced with the {magrittr} package (and automatically loaded by many other packages from the tidyverse, like {dplyr}):\n\nlibrary(dplyr)\n\n\nAttaching package: 'dplyr'\n\n\nThe following objects are masked from 'package:stats':\n\n filter, lag\n\n\nThe following objects are masked from 'package:base':\n\n intersect, setdiff, setequal, union\n\n4 %>%\n sqrt()\n\n[1] 2\n\n\nBoth expressions above are equivalent to sqrt(4). You will see why this is useful very soon. For now, just know this exists and try to get used to it.", + "text": "1.2 A little aside on pipes\nSince R version 4.1, a forward pipe |> is included in the standard library of the language. 
It allows to do this:\n\n 🟡 Loading\n webR...\n \n \n \n \n \n \n \n \n\n\nBefore R version 4.1, there was already a forward pipe, introduced with the {magrittr} package (and automatically loaded by many other packages from the tidyverse, like {dplyr}):\n\n 🟡 Loading\n webR...\n \n \n \n \n \n \n \n \n\n\nBoth expressions above are equivalent to sqrt(4). You will see why this is useful very soon. For now, just know this exists and try to get used to it.", "crumbs": [ "1  Introduction to R" ] @@ -164,7 +164,7 @@ "href": "03-functional-programming.html#introduction", "title": "2  A primer on functional programming", "section": "2.1 Introduction", - "text": "2.1 Introduction\nFunctional programming is a way of writing programs that relies exclusively on the evalutation of functions. Mathematical functions have a very neat property: for any given input, they ALWAYS return exactly the same output. This is what we want to achieve with the functions that we will write. Functions that always return the same result are called pure, and a language that only allows writing pure functions is called a pure functional programming language. R is not a pure functional programming language, so we have to be careful not to write impure functions that manipulate the global state.\nBut what is state? Run the following code in your console:\n\nls()\n\nThis will list every object defined in the global environment. Now run the following line:\n\nx <- 1\n\nand then ls() again. x should now be listed alongside the other objects. You just manipulated the state of your current R session. Now if you run something like:\n\nx + 1\n\nThis will produce 2. We want to avoid pipelines that depend on some definition of some global variable somewhere, which could be subject to change, because this could mean that 2 different runs of the same pipeline could produce 2 different results. Notice that I used the verb avoid in the sentence before. This is sometimes not possible to avoid. 
Such situations have to be carefully documented and controlled.\nAs a more realistic example, imagine that within the pipeline you set up, some random numbers are generated. For example, to generate 10 random draws from a normal distribution:\n\nrnorm(n = 10)\n\n [1] -1.77883248 0.42853232 0.47230020 -0.09081041 -0.20139649 -0.38217676\n [7] -1.20057531 0.48772363 -1.83718414 0.21290767\n\n\nEach time you run this line, you will get another set of 10 random numbers. This is obviously a good thing in interactive data analysis, but much less so when running a pipeline programmatically. R provides a way to fix the random seed, which will make sure you always get the same random numbers:\n\nset.seed(1234)\nrnorm(n = 10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nBut set.seed() only works for one call, so you must call it again if you need the random numbers again:\n\nset.seed(1234)\nrnorm(10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\nrnorm(10)\n\n [1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406 -0.11028549\n [7] -0.51100951 -0.91119542 -0.83717168 2.41583518\n\nset.seed(1234)\nrnorm(10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nThe problem with set.seed() is that you only partially solve the problem of rnorm() not being pure; this is because while rnorm() now does return the same output for the same input, this only works if you manipulate the state of your program to change the seed beforehand. Ideally, we would like to have a pure version of rnorm(), which would be self-contained and not depend on the value of the seed defined in the global environment. 
There is a package developped by Posit (the makers of RStudio and the packages from the tidyverse), called {withr} which allows to rewrite our functions in a pure way. {withr} has several functions, all starting with with_ that allow users to run code with some temporary defined variables, without altering the global environment. For example, it is possible to run a rnorm() with a seed, using withr::with_seed():\n\nlibrary(withr)\n\nwith_seed(seed = 1234, {\n rnorm(10)\n})\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nBut ideally you’d want to go a step further and define a new function that is pure. To turn an impure function into a pure function, you usually only need to add some arguments to it. This is how we would create a pure_rnorm() function:\n\npure_rnorm <- function(..., seed){\n\n with_seed(seed, rnorm(...))\n}\n\npure_rnorm(10, seed = 1234)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\npure_rnorm() is now self-contained, and does not pollute the global environment. We’re going to learn how to write functions in just a bit, so don’t worry if the code above does not make sense yet.\n\n\n\n\n\n\n\nA very practical consequence of using functional programming is that loops are not used, because loops are imperative and imperative programming is all about manipulating state. However, there are situations where loops are more efficient than the alternative (in R at least). So we will still learn and use them, but only when absolutely necessary, and we will always encapsulate a loop inside a function. Just like with the example above, this ensures that we have a pure, self-contained function that we can reason about easily. What I mean by this, is that loops are not always very easy to decipher. The concept of loops is simple enough: take this instruction, and repeat it N times. 
But in practice, if you’re reading code, it is not possible to understand what a loop is doing at first glance. There are only two solutions in this case:\n\nyou’re lucky and there are comments that explain what the loop is doing;\nyou have to let the loop run either in your head or in a console with some examples to really understand whit is going on.\n\nFor example, consider the following code:\n\nsuppressPackageStartupMessages(library(dplyr))\n\ndata(starwars)\n\nsum_humans <- 0\nsum_others <- 0\nn_humans <- 0\nn_others <- 0\n\nfor(i in seq_along(1:nrow(starwars))){\n\n if(!is.na(unlist(starwars[i, \"species\"])) &\n unlist(starwars[i, \"species\"]) == \"Human\"){\n if(!is.na(unlist(starwars[i, \"height\"]))){\n sum_humans <- sum_humans + unlist(starwars[i, \"height\"])\n n_humans <- n_humans + 1\n } else {\n\n 0\n\n }\n\n } else {\n if(!is.na(unlist(starwars[i, \"height\"]))){\n sum_others <- sum_others + unlist(starwars[i, \"height\"])\n n_others <- n_others + 1\n } else {\n 0\n }\n }\n}\n\nmean_height_humans <- sum_humans/n_humans\nmean_height_others <- sum_others/n_others\n\nWhat this does is not immediately obvious. The only hint you get are the two last lines, where you can read that we compute the average height for humans and non-humans in the sample. And this code could look a lot worse, because I am using functions like is.na() to test if a value is NA or not, and I’m using unlist() as well. If you compare this mess to a functional approach, I hope that I can stop my diatribe against imperative style programming here:\n\nstarwars %>%\n group_by(is_human = species == \"Human\") %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\n# A tibble: 3 × 2\n is_human mean_height\n <lgl> <dbl>\n1 FALSE 172.\n2 TRUE 177.\n3 NA 181.\n\n\nNot only is this shorter, it doesn’t even need any comments to explain what’s going on. 
If you’re using functions with explicit names, the code becomes self-explanatory.\nThe other advantage of a functional (also called declarative) programming style is that you get function composition for free. Function composition is an operation that takes two functions g and f and returns a new function h such that \\(h(x) = g(f(x))\\). Formally:\nh = g ∘ f such that h(x) = g(f(x))\n∘ is the composition operator. You can read g ∘ f as g after f. When using functional programming, you can compose functions very easily, simply by using |> or %>%:\n\nh <- f |> g\n\nf |> g can be read as f then g, which is equivalent to g after f. Function composition might not seem like a big deal, but it actually is. If we structure our programs in this way, as a sequence of function calls, we get many benefits. Functions are easy to test, document, maintain, share and can be composed. This allows us to very succintly express complex workflows:\n\nstarwars %>%\n filter(skin_color == \"light\") %>%\n select(species, sex, mass) %>%\n group_by(sex, species) %>%\n summarise(\n total_individuals = n(),\n min_mass = min(mass, na.rm = TRUE),\n mean_mass = mean(mass, na.rm = TRUE),\n sd_mass = sd(mass, na.rm = TRUE),\n max_mass = max(mass, na.rm = TRUE),\n .groups = \"drop\"\n ) %>%\n select(-species) %>%\n tidyr::pivot_longer(-sex, names_to = \"statistic\", values_to = \"value\")\n\n# A tibble: 10 × 3\n sex statistic value\n <chr> <chr> <dbl>\n 1 female total_individuals 6 \n 2 female min_mass 45 \n 3 female mean_mass 56.3\n 4 female sd_mass 16.3\n 5 female max_mass 75 \n 6 male total_individuals 5 \n 7 male min_mass 79 \n 8 male mean_mass 90.5\n 9 male sd_mass 19.8\n10 male max_mass 120 \n\n\nNeedless to say, writing this in an imperative approach would be quite complicated.\nAnother consequence of using functional programming is that our code will live in plain text files, and not in Jupyter (or equivalent) notebooks. 
Not only does imperative code have state, but notebooks themselves have a (hidden) state. You should avoid notebooks at all costs, even for experimenting.", + "text": "2.1 Introduction\nFunctional programming is a way of writing programs that relies exclusively on the evalutation of functions. Mathematical functions have a very neat property: for any given input, they ALWAYS return exactly the same output. This is what we want to achieve with the functions that we will write. Functions that always return the same result are called pure, and a language that only allows writing pure functions is called a pure functional programming language. R is not a pure functional programming language, so we have to be careful not to write impure functions that manipulate the global state.\nBut what is state? Run the following code in your console:\n\nls()\n\nThis will list every object defined in the global environment. Now run the following line:\n\nx <- 1\n\nand then ls() again. x should now be listed alongside the other objects. You just manipulated the state of your current R session. Now if you run something like:\n\nx + 1\n\nThis will produce 2. We want to avoid pipelines that depend on some definition of some global variable somewhere, which could be subject to change, because this could mean that 2 different runs of the same pipeline could produce 2 different results. Notice that I used the verb avoid in the sentence before. This is sometimes not possible to avoid. Such situations have to be carefully documented and controlled.\nAs a more realistic example, imagine that within the pipeline you set up, some random numbers are generated. For example, to generate 10 random draws from a normal distribution:\n\nrnorm(n = 10)\n\n [1] -0.50951320 0.32378969 -0.55483440 -0.03788488 -0.16860005 1.49830472\n [7] -0.81259719 0.51249157 0.21957924 -0.14170387\n\n\nEach time you run this line, you will get another set of 10 random numbers. 
This is obviously a good thing in interactive data analysis, but much less so when running a pipeline programmatically. R provides a way to fix the random seed, which will make sure you always get the same random numbers:\n\nset.seed(1234)\nrnorm(n = 10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nBut set.seed() only works for one call, so you must call it again if you need the random numbers again:\n\nset.seed(1234)\nrnorm(10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\nrnorm(10)\n\n [1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406 -0.11028549\n [7] -0.51100951 -0.91119542 -0.83717168 2.41583518\n\nset.seed(1234)\nrnorm(10)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nThe problem with set.seed() is that you only partially solve the problem of rnorm() not being pure; this is because while rnorm() now does return the same output for the same input, this only works if you manipulate the state of your program to change the seed beforehand. Ideally, we would like to have a pure version of rnorm(), which would be self-contained and not depend on the value of the seed defined in the global environment. There is a package developped by Posit (the makers of RStudio and the packages from the tidyverse), called {withr} which allows to rewrite our functions in a pure way. {withr} has several functions, all starting with with_ that allow users to run code with some temporary defined variables, without altering the global environment. 
For example, it is possible to run a rnorm() with a seed, using withr::with_seed():\n\nlibrary(withr)\n\nwith_seed(seed = 1234, {\n rnorm(10)\n})\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\nBut ideally you’d want to go a step further and define a new function that is pure. To turn an impure function into a pure function, you usually only need to add some arguments to it. This is how we would create a pure_rnorm() function:\n\npure_rnorm <- function(..., seed){\n\n with_seed(seed, rnorm(...))\n}\n\npure_rnorm(10, seed = 1234)\n\n [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559\n [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378\n\n\npure_rnorm() is now self-contained, and does not pollute the global environment. We’re going to learn how to write functions in just a bit, so don’t worry if the code above does not make sense yet.\n\n\n\n\n\n\n\nA very practical consequence of using functional programming is that loops are not used, because loops are imperative and imperative programming is all about manipulating state. However, there are situations where loops are more efficient than the alternative (in R at least). So we will still learn and use them, but only when absolutely necessary, and we will always encapsulate a loop inside a function. Just like with the example above, this ensures that we have a pure, self-contained function that we can reason about easily. What I mean by this, is that loops are not always very easy to decipher. The concept of loops is simple enough: take this instruction, and repeat it N times. But in practice, if you’re reading code, it is not possible to understand what a loop is doing at first glance. 
There are only two solutions in this case:\n\nyou’re lucky and there are comments that explain what the loop is doing;\nyou have to let the loop run either in your head or in a console with some examples to really understand whit is going on.\n\nFor example, consider the following code:\n\nsuppressPackageStartupMessages(library(dplyr))\n\ndata(starwars)\n\nsum_humans <- 0\nsum_others <- 0\nn_humans <- 0\nn_others <- 0\n\nfor(i in seq_along(1:nrow(starwars))){\n\n if(!is.na(unlist(starwars[i, \"species\"])) &\n unlist(starwars[i, \"species\"]) == \"Human\"){\n if(!is.na(unlist(starwars[i, \"height\"]))){\n sum_humans <- sum_humans + unlist(starwars[i, \"height\"])\n n_humans <- n_humans + 1\n } else {\n\n 0\n\n }\n\n } else {\n if(!is.na(unlist(starwars[i, \"height\"]))){\n sum_others <- sum_others + unlist(starwars[i, \"height\"])\n n_others <- n_others + 1\n } else {\n 0\n }\n }\n}\n\nmean_height_humans <- sum_humans/n_humans\nmean_height_others <- sum_others/n_others\n\nWhat this does is not immediately obvious. The only hint you get are the two last lines, where you can read that we compute the average height for humans and non-humans in the sample. And this code could look a lot worse, because I am using functions like is.na() to test if a value is NA or not, and I’m using unlist() as well. If you compare this mess to a functional approach, I hope that I can stop my diatribe against imperative style programming here:\n\nstarwars %>%\n group_by(is_human = species == \"Human\") %>%\n summarise(mean_height = mean(height, na.rm = TRUE))\n\n# A tibble: 3 × 2\n is_human mean_height\n <lgl> <dbl>\n1 FALSE 172.\n2 TRUE 177.\n3 NA 181.\n\n\nNot only is this shorter, it doesn’t even need any comments to explain what’s going on. If you’re using functions with explicit names, the code becomes self-explanatory.\nThe other advantage of a functional (also called declarative) programming style is that you get function composition for free. 
Function composition is an operation that takes two functions g and f and returns a new function h such that \\(h(x) = g(f(x))\\). Formally:\nh = g ∘ f such that h(x) = g(f(x))\n∘ is the composition operator. You can read g ∘ f as g after f. When using functional programming, you can compose functions very easily, simply by using |> or %>%:\n\nh <- f |> g\n\nf |> g can be read as f then g, which is equivalent to g after f. Function composition might not seem like a big deal, but it actually is. If we structure our programs in this way, as a sequence of function calls, we get many benefits. Functions are easy to test, document, maintain, share and can be composed. This allows us to very succintly express complex workflows:\n\nstarwars %>%\n filter(skin_color == \"light\") %>%\n select(species, sex, mass) %>%\n group_by(sex, species) %>%\n summarise(\n total_individuals = n(),\n min_mass = min(mass, na.rm = TRUE),\n mean_mass = mean(mass, na.rm = TRUE),\n sd_mass = sd(mass, na.rm = TRUE),\n max_mass = max(mass, na.rm = TRUE),\n .groups = \"drop\"\n ) %>%\n select(-species) %>%\n tidyr::pivot_longer(-sex, names_to = \"statistic\", values_to = \"value\")\n\n# A tibble: 10 × 3\n sex statistic value\n <chr> <chr> <dbl>\n 1 female total_individuals 6 \n 2 female min_mass 45 \n 3 female mean_mass 56.3\n 4 female sd_mass 16.3\n 5 female max_mass 75 \n 6 male total_individuals 5 \n 7 male min_mass 79 \n 8 male mean_mass 90.5\n 9 male sd_mass 19.8\n10 male max_mass 120 \n\n\nNeedless to say, writing this in an imperative approach would be quite complicated.\nAnother consequence of using functional programming is that our code will live in plain text files, and not in Jupyter (or equivalent) notebooks. Not only does imperative code have state, but notebooks themselves have a (hidden) state. 
You should avoid notebooks at all costs, even for experimenting.", "crumbs": [ "2  A primer on functional programming" ] diff --git a/webr-serviceworker.js b/webr-serviceworker.js new file mode 100644 index 0000000..4022c54 --- /dev/null +++ b/webr-serviceworker.js @@ -0,0 +1 @@ +importScripts('https://webr.r-wasm.org/v0.2.2/webr-serviceworker.js'); diff --git a/webr-worker.js b/webr-worker.js new file mode 100644 index 0000000..8aa663a --- /dev/null +++ b/webr-worker.js @@ -0,0 +1 @@ +importScripts('https://webr.r-wasm.org/v0.2.2/webr-worker.js');