Commit

2022 edition, start
b-rodrigues committed Jun 6, 2022
1 parent 8b45d3f commit 63ea694
Showing 497 changed files with 8,901 additions and 7,954 deletions.
3 changes: 3 additions & 0 deletions 02-data_types.Rmd
@@ -423,6 +423,9 @@ or, for named lists:
list4$c
```

The `$` operator is very useful, because it also allows you to access entire columns
of `data.frame` objects, which we are going to get to know in the next section.
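
As a minimal sketch of this point, the block below uses `$` on a named list and on a small data frame. The objects `named_list` and `df` and their contents are made up for illustration and do not come from the book.

```r
# Hypothetical objects, for illustration only
named_list <- list(a = 1, b = "two")
df <- data.frame(id = c(1, 2, 3),
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# `$` extracts a named element of a list...
list_element <- named_list$b

# ...and an entire column of a data frame, returned as a vector
name_column <- df$name
name_column
```

Both calls use the same syntax because a data frame is, internally, a list of columns of equal length.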

Lists are used extensively because they are so flexible. You can build lists of datasets and apply
functions to all the datasets at once, build lists of models, lists of plots, etc... In the later
chapters we are going to learn all about them. Lists are central objects in a functional programming
9 changes: 6 additions & 3 deletions 03-reading_writing_data.Rmd
@@ -166,16 +166,19 @@ would like to save, say, a list containing any arbitrary object? This is possibl
`saveRDS()` function. Literally anything can be saved with `saveRDS()`:

```{r}
my_list <- list("this is a list", list("which contains a list", 12), c(1, 2, 3, 4), matrix(c(2, 4,
3, 1, 5, 7), nrow = 2))
my_list <- list("this is a list",
                list("which contains a list", 12),
                c(1, 2, 3, 4),
                matrix(c(2, 4, 3, 1, 5, 7),
                       nrow = 2))
str(my_list)
```
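
To check the claim that such an arbitrary object survives being written to disk, here is a small round-trip sketch. It assumes a temporary file as the destination (the variable `path` is introduced here for illustration) and uses `readRDS()` to load the object back.

```r
# Rebuild the list so that this block runs on its own
my_list <- list("this is a list",
                list("which contains a list", 12),
                c(1, 2, 3, 4),
                matrix(c(2, 4, 3, 1, 5, 7),
                       nrow = 2))

# Assumption: a temporary file stands in for a real destination path
path <- tempfile(fileext = ".rds")

saveRDS(my_list, path)      # write the object to disk
restored <- readRDS(path)   # read it back into a new object

# The restored object should match the original exactly
identical(my_list, restored)
```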

`my_list` is a list containing a string, a list which contains a string and a number, a vector and
a matrix... Now suppose that computing this list takes a very long time. For example, imagine that
each element of the list is the result of estimating a very complex model on a simulated
dataset, which takes hours to simulate. Because this takes so long to compute, you'd want to save
dataset, which takes hours to run. Because this takes so long to compute, you'd want to save
it to disk. This is possible with `saveRDS()`:

```{r}
1,009 changes: 537 additions & 472 deletions 04-descriptives.Rmd

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion 05-graphs.Rmd
@@ -711,7 +711,7 @@ bwages <- Bwages %>%
Then, plot a scatter plot of wages on experience, by education level. Add a theme that you like,
and remove the title of the legend.

```{r, eval=FALSE, echo=FALSE}
```{r, eval=FALSE, echo=FALSE}
ggplot(bwages) +
geom_point(aes(exper, wage, colour = educ_level)) +
theme_minimal() +
9 changes: 6 additions & 3 deletions 06-statistical_models.Rmd
@@ -514,6 +514,8 @@ Then, using `dydx()`, I get the marginal effect of variable `lnnlinc` for these

### Explainability of *black-box* models

Just read Christoph Molnar's
[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/).


## Comparing models
@@ -939,9 +941,9 @@ The first value is the number of rows of the first set, the second value of the
was the original amount of values in the training data, before splitting again.

What should we call these two new data sets? The author of `{rsample}`, Max Kuhn, talks about
the *analysis* and the *assessment* sets:
the *analysis* and the *assessment* sets, and I'm going to use this terminology as well.

```{r, echo=FALSE}
```{r, echo=FALSE, include = FALSE}
blogdown::shortcode("tweet", "1066131042615140353")
```
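
To make the terminology concrete, here is a base R sketch of the idea: the training data is split once more into an analysis set (used to fit a model) and an assessment set (used to evaluate it). This is not how `{rsample}` implements it. The data frame `training_data` and the 75/25 proportion are assumptions made for illustration.

```r
set.seed(1234)

# Hypothetical training data, for illustration only
training_data <- data.frame(x = rnorm(100), y = rnorm(100))

n <- nrow(training_data)

# Draw 75% of the row indices for the analysis set (assumed proportion)
analysis_idx <- sample(seq_len(n), size = floor(0.75 * n))

analysis_set <- training_data[analysis_idx, ]     # used to fit the model
assessment_set <- training_data[-analysis_idx, ]  # used to evaluate it

c(nrow(analysis_set), nrow(assessment_set))
```

The two sets do not overlap, and together they contain every row of the training data.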

@@ -1104,7 +1106,8 @@ is simply to look for hyper-parameters in an efficient way, and bayesian optimis
this efficient way. However, you could use another method, for example a grid search. This would not
change anything about the general approach. So I will not spend too much time explaining what is
going on below, as you can read the details in the paper cited above as well as the package's
documentation.
documentation. The focus here is not on this particular method, but rather on showing you how you can
use various packages to solve a data science problem.
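
For comparison, a grid search can be sketched in a few lines of base R. Everything in this block is hypothetical: `objective()` is a made-up stand-in for the cross-validated error that the real function would compute, and the hyper-parameter names `mtry` and `trees` are only placeholders.

```r
# Hypothetical objective: smaller is better. In a real analysis this would
# fit a model and return its cross-validated error.
objective <- function(mtry, trees){
  (mtry - 3)^2 + (trees - 500)^2 / 1e4
}

# Every combination of candidate values
grid <- expand.grid(mtry = 1:5,
                    trees = c(100, 500, 1000))

# Evaluate the objective at each point of the grid
grid$score <- mapply(objective, grid$mtry, grid$trees)

# Keep the combination with the lowest score
best <- grid[which.min(grid$score), ]
best
```

A grid search evaluates every combination, so its cost grows quickly with the number of hyper-parameters. This is why a more efficient search can be worth the extra machinery.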

Let's first load the package and create the function to optimize:

121 changes: 116 additions & 5 deletions 07-defining_your_own_functions.Rmd
@@ -48,11 +48,16 @@ ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no")

The result is a vector. Now, let's see what happens if we use `if...else...` instead of `ifelse()`:

```{r}
```{r, eval = F}
if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no")
```

Only the first element of my atomic vector is used for the comparison. This is very important to keep in mind.
```{r, eval = F}
> Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") :
the condition has length > 1
```

This results in an error (in versions of R before 4.2.0, only the first element of the vector was used, with a warning).
Suppose that you want an expression to be evaluated, only if every element is `TRUE`. In this case, you should
use the `all()` function, as seen previously in Chapter 2:
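
The sketch below is an illustration, not the book's own example. It reduces the vector condition to a single `TRUE` or `FALSE` with `all()` or `any()`, and uses `tryCatch()` to record what your R installation does with the unreduced condition. The variable names are made up.

```r
cond <- c(1, 2, 4) > c(3, 1, 0)   # FALSE TRUE TRUE

# Using the vector directly: an error in R >= 4.2.0, a warning before that
outcome <- tryCatch(
  if (cond) "yes" else "no",
  error = function(e) "error",
  warning = function(w) "warning"
)

# Reduce the condition to a single logical value first
all_true <- if (all(cond)) "yes" else "no"   # every element must be TRUE
any_true <- if (any(cond)) "yes" else "no"   # one TRUE element is enough

c(outcome = outcome, all_true = all_true, any_true = any_true)
```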

@@ -567,7 +572,95 @@ or, now, if you need the `trim` argument:
my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
```

The `...` are very useful when writing wrappers such as `my_func()`.
The `...` are very useful when writing higher-order functions such as `my_func()`, because they allow
you to pass arguments *down* to the underlying functions.
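
The definition of `my_func()` lies outside this excerpt, so the sketch below assumes a minimal version of it, named `my_func_sketch()` to make clear that it is a stand-in. It shows how the `...` are handed down unchanged to the function passed as `func`.

```r
# Assumed minimal version of my_func(): evaluate func on x and
# pass any further arguments down through `...`
my_func_sketch <- function(x, func, ...){
  func(x, ...)
}

# na.rm and trim are not arguments of my_func_sketch();
# they travel through `...` and are received by mean()
res <- my_func_sketch(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
res
```

The wrapper never needs to know which options `func` accepts, so the same wrapper works with `mean()`, `sd()`, `quantile()` and so on.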

## Functions that return functions

The example from before, `my_func()`, took three arguments: some `x`, a function `func`, and `...` (dots). `my_func()`
was a kind of wrapper that evaluated `func` on its arguments `x` and `...`. But sometimes this is not quite what you
need or want. It is sometimes useful to write a function that returns a modified function. This type of function
is called a function factory, as it *builds* functions. For instance, suppose that we want to time how long functions
take to run. An idea would be to proceed like this:

```{r, eval = FALSE}
tic <- Sys.time()
very_slow_function(x)
toc <- Sys.time()
running_time <- toc - tic
```

but if you want to time several functions, this gets very tedious. It would be much easier if functions could
time *themselves*. We could achieve this by writing a wrapper, like this:

```{r, eval = FALSE}
timed_very_slow_function <- function(...){
  tic <- Sys.time()
  # forward the wrapper's arguments; `x` is not defined inside this function
  result <- very_slow_function(...)
  toc <- Sys.time()
  running_time <- toc - tic
  list("result" = result,
       "running_time" = running_time)
}
```

The problem here is that we have to write a new wrapper for each function we need to time. But thanks to the concept of function
factories, we can write a function that does this for us:

```{r}
time_f <- function(.f, ...){
  # the returned function has its own `...`, which is forwarded to .f
  function(...){
    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()
    running_time <- toc - tic
    list("result" = result,
         "running_time" = running_time)
  }
}
```

`time_f()` is a function that returns a function, a function factory. Calling it on a function returns, as expected,
a function:

```{r}
t_mean <- time_f(mean)
t_mean
```

This function can now be used like any other function:

```{r}
output <- t_mean(seq(-500000, 500000))
```

`output` is a list of two elements, the first being simply the result of `mean(seq(-500000, 500000))`, and the other
being the running time.

This approach is super flexible. For instance, imagine that there is an `NA` in the vector. This would result in
the mean of this vector being `NA`:

```{r}
t_mean(c(NA, seq(-500000, 500000)))
```

But because the function returned by `time_f()` accepts `...` and forwards them to `.f`, we can simply pass `mean()`'s `na.rm` option through:

```{r}
t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE)
```
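
The same factory works for any function, not only `mean()`. The sketch below redefines the factory under the name `time_f_sketch()` so that the block runs on its own. The call to `force(.f)` is an addition of this sketch, not part of the definition above: it evaluates `.f` as soon as the factory is called, which guards against surprises from lazy evaluation.

```r
# Stand-in for time_f(), with force() added as a defensive assumption
time_f_sketch <- function(.f){
  force(.f)   # evaluate .f now rather than at the first call
  function(...){
    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()
    list("result" = result,
         "running_time" = toc - tic)
  }
}

# Build timed versions of two unrelated functions
t_sort <- time_f_sketch(sort)
t_paste <- time_f_sketch(paste)

# Extra arguments are forwarded to the wrapped function
sorted <- t_sort(c(3, 1, 2), decreasing = TRUE)
pasted <- t_paste("a", "b", sep = "-")

sorted$result
pasted$result
```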


## Functions that take columns of data as arguments

@@ -942,8 +1035,26 @@ map2(c(1, 2, 3, 4, 5), c(9, 8, 7, 6, 5), ~{(.x**2)/.y})
Because you now have two arguments, a single dot could not work, so instead you use `.x` and `.y` to
avoid confusion.

You now know a lot about writing your own functions. In the next chapter, we are going to learn about
functional programming, the programming paradigm I described in the introduction of this book.
R 4.1 introduced a shorthand for defining anonymous functions:

```{r}
map(c(1,2,3,4), \(x)(1/sqrt(x)))
```

`\(x)` is meant to resemble the notation $\lambda(x)$. This notation comes from lambda calculus, where functions
are defined like this:

$$
\lambda(x).1/\sqrt{x}
$$

which is equivalent to $f(x) = 1/\sqrt{x}$. You can use `\(x)` or `function(x)` interchangeably.
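
As a quick check of this equivalence, the sketch below defines the same function with both notations and applies the shorthand with base R's `sapply()`, so the block does not depend on `{purrr}`. It assumes R 4.1 or later. The names `f_long` and `f_short` are made up for illustration.

```r
# The same function written with both notations (requires R >= 4.1)
f_long <- function(x) 1 / sqrt(x)
f_short <- \(x) 1 / sqrt(x)

# Both return the same value for the same input
same_value <- identical(f_long(4), f_short(4))

# The shorthand is handy for anonymous functions passed to other functions
res <- sapply(c(1, 4, 16), \(x) 1 / sqrt(x))
res
```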


You now know a lot about writing your own functions. In the next chapter, we are going to learn
about functional programming, the programming paradigm I described in the introduction of this
book.

## Exercises
