# Package development
## Why you need to write your own package
One of the reasons you might have tried R in the first place is the abundance of packages. As I'm
writing these lines (in November 2020) 16523 packages are available on CRAN (in August 2019, there
were 14762, and in August 2016, when I first wrote the number of packages down for my first ebook,
it was 8922 packages).
This is a staggering number of packages, and to help you look for the right ones, you can check
out [CRAN Task Views](https://cran.r-project.org/).
You might wonder why the heck you should write your own packages. After all, with so many packages
you're sure to find something that suits your needs, right? Well, it depends. Of course, you will
not need to write your own function to perform non-linear regression, or to train a neural network.
But as time goes by, you will start writing your own functions, functions that fit your needs, and
that you use daily. It may be functions that prepare and shape data that you use at work for
analysis. Or maybe you want to deliver an analysis to a client, with data and source code, so
you decide to deliver a package that contains everything (something I've already done in the
past). Maybe you want to develop a Shiny application using the `{golem}` framework, which allows
you to build apps as packages.
Ok, but is it necessary to write a package? Why not just write functions inside some scripts and
then simply run or share these scripts (and in the case of Shiny, you don't have to use `{golem}`)?
This seems like a valid solution at first. However, it quickly becomes tedious, especially if you
have multiple scripts scattered around your computer or inside different subfolders. You'll also
have to write the documentation on separate files and these can easily get lost or become outdated.
Relying on scripts does not scale well; even if you are not sharing your code outside of your
computer (maybe you're working on super secret projects at NASA), you always have to think about
future you. And in general, future you thinks that past you is an asshole, exactly because you put
0 effort in documenting, testing and making your code easy to use. Having everything inside a
package takes care of these headaches for you, and will make future you proud of past you. And if
you have to share your code, or deliver to a client, believe me, it will make things a thousand
times easier.
Code that is inside packages is very easy to document and test, especially if you're using Rstudio.
It also makes it possible to use the wonderful `{covr}` package, which tells you which lines in
which functions are called by your tests. If some lines are missing, write tests that invoke them and
increase the coverage of your tests! Documenting and testing your code is very important; it gives
you assurance that the code you're writing works, but most importantly, it gives *others* assurance
that what you wrote works. And I include future you in these *others* too.
In order to share this package with these *others* we are going to use Git. If you're familiar with
Git, great, you'll be able to skip some sections. If not, then buckle up, you're in for a wild ride.
As I mentioned in the introduction, if you want to learn much more about packages than I'll show here,
read @wickham2015. I will only show you the basics, but it should be enough to get you productive.
## Starting easy: creating a package to share data
We will start a package from scratch, in order to share data with the world. For this, we are first
going to scrape a table off Wikipedia, prepare the data and then include it in a package. To make
distributing this package easy, we're going to put it up on Github, so you'll need a Github account.
Let's start by creating a Github account.
### Setting up a Github account
Setting up a Github account is very easy; just go over to [https://github.com/](https://github.com/)
and simply sign up!
Then you will need to generate an ssh key on your computer. This is a way for you to securely
interact with your Github account, and push your code to the repository without having to always
type your password. I will assume you never created any ssh
keys before, because if you already did, you could skip these steps. I will also assume that you are
on a GNU+Linux or macOS system; if you're using Windows, the instructions are very similar, but
you'll first need to install Git, available [here](https://git-scm.com/downloads). Git is available
by default on any GNU+Linux system, and as far as I know also on macOS, but I might be wrong and
you might also need to install git on macOS (but then the instructions are the same whether
you're using GNU+Linux or macOS). If you have trouble installing git, read the following section
from the [Pro Git book](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
Then, open a terminal (or the git command line on Windows) and type the following:
```
ssh-keygen
```
This command will generate a private key and a public key in the `.ssh` directory inside your `HOME`
directory. Look for the file that ends with the `.pub` extension (the public key), and copy its
contents. You will need to paste these contents on Github.
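If you are not sure which files `ssh-keygen` created, you can also look for them from R. This is only a small sketch: it assumes the keys were saved in the default `.ssh` directory inside your `HOME` directory, and `ssh_dir` and `pub_keys` are just illustrative variable names.

```{r, eval=FALSE}
# Assumption: the keys live in the default location, ~/.ssh
ssh_dir <- path.expand("~/.ssh")

# List every file ending in .pub (an empty vector if there is none)
pub_keys <- list.files(ssh_dir, pattern = "\\.pub$", full.names = TRUE)

# Print the content of each public key so that it can be copied
for (key in pub_keys) {
  cat(readLines(key, warn = FALSE), sep = "\n")
}
```

Only ever share the content of the `.pub` file; the other file is the private key and should stay on your computer.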
So now sign in to Github; once you are signed in, go to settings and then `SSH and GPG keys`:
```{r, echo=FALSE}
knitr::include_graphics("pics/github_ssh.png")
```
In the screenshot above, you see my ssh key associated with my account; this will be empty for you.
Click on the top right, *New SSH key*:
```{r, echo=FALSE}
knitr::include_graphics("pics/github_add_ssh.png")
```
Give your key a name, and paste the key you generated before. You're done! You can now configure
git a bit more by telling it who you are. Open a terminal, adapt and type the following commands:
```
git config --global user.name "Harold Zurcher"
git config --global user.email [email protected]
```
You're ready to go!^[If you know who Harold Zurcher is, you da real MVP]
You can now push code to github to share it with the world. Or if you do not want
to share your package (for confidentiality reasons for instance), you can still benefit from using
git, as it is possible to have an internal git server that could be managed by your company's IT team.
There is also the possibility to set up corporate, and thus private, git servers by buying the service
from github, or other providers such as gitlab.
### Starting your package
```{r, echo=FALSE}
knitr::include_graphics("pics/new_package.gif")
```
To start writing a package, the easiest way is to load up Rstudio and start a new project, under the
*File* menu. If you're starting from scratch, just choose the first option, *New Directory* and then
*R package*. Give a name to your package, for example `arcade` (you'll see why in a bit) and you can
also choose to use git for version control. Now if you check the folder where you chose to save
your package, you will see a folder with the same name as your package, and inside this folder a
lot of new files and other folders. The most important folder for now is the `R` folder. This is
the folder that will hold your `.R` source code files. You can also see these files and folders
inside the *Files* panel from within Rstudio. Rstudio will also have `hello.R` opened, which is a
single demo source file inside the `R` folder. You can get rid of this file, or keep it and edit it.
I would advise you keep it and even distribute it inside your package. You can save this file
in a special directory called `data-raw`. You don't need to manually create this folder now, we will
do so in a bit. For now, just follow along.
Now, to start working on your package, the best is to use a package called `{usethis}`. `{usethis}`
is a package that makes writing packages very easy; it includes functions that create the required
subfolders and necessary template files so that you do not need to constantly check how file so-and-so
should be placed or named.
Let's start by adding a readme file. This is easily achieved by using the following function from
`{usethis}`:
```{r, eval=FALSE}
usethis::use_readme_md()
```
This creates a template README.md file in the root directory of your package. You can now edit this
file accordingly, and that's it.
The next step could be setting up your package to work with `{roxygen2}`, which will help write
the documentation of your package:
```{r, eval=FALSE}
usethis::use_roxygen_md()
```
The output tells you to run `devtools::document()`; we will do this later.
Since you have learned about the tidyverse by reading this book, I am willing to bet that you will
want to use the `%>%` operator inside the functions contained in your package. To do this without issues,
which will become apparent later, use the following command:
```{r, eval=FALSE}
usethis::use_pipe()
```
This will make the `%>%` operator available internally to your package's functions, but also to the
user that will load the package.
We are almost done setting up the package. If you plan on distributing data with your package,
you might want to also share the code that prepared the data. For instance, if you receive the
data from your finance department, but this data needs some cleaning before being useful, you could
write a script to do so and then distribute this script also with the package, for reproducibility
purposes. These scripts, while not central to the package, could still be of interest to the users.
The directory to place them is called `data-raw`:
```{r, eval=FALSE}
usethis::use_data_raw()
```
One final folder is `inst`. You can add files to this folder, and they will be available to the users
that install the package. Users can find the files in the folder where packages get installed. On
GNU+Linux systems, that would be somewhere like: `/home/user/R/amd64-linux-gnu-library/3.6`. There,
you will find the installation folders of all the packages. If the package you make is called `{spam}`,
you will find the files you put inside the `inst` folder on the root of the installation folder of
`spam`. You can simply create the `inst` folder yourself, or use the following command:
```{r, eval=FALSE}
usethis::use_directory("inst")
```
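Rather than browsing the library folder by hand, you can ask R where an installed package lives with `system.file()` from base R. The sketch below uses `{stats}`, which ships with R, only because it is guaranteed to be installed; `extdata/raw.csv` is a hypothetical file name used to show what happens when a file does not exist.

```{r, eval=FALSE}
# Installation folder of an installed package
pkg_dir <- system.file(package = "stats")

# Files and folders found at the root of that installation folder
list.files(pkg_dir)

# Full path to a file at the root of the installed package
desc_path <- system.file("DESCRIPTION", package = "stats")

# A file that does not exist yields an empty string instead of an error
# ("extdata/raw.csv" is a made-up name, for illustration only)
missing_path <- system.file("extdata", "raw.csv", package = "stats")
```

Files that you put inside the `inst` folder of your own package can be located in the same way once the package is installed, by replacing `"stats"` with the name of your package.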
Finally, the last step is to give your package a license; this again is only useful if you plan on
distributing it to the world. If you are writing your own package for yourself, or for purposes
internal to your company, this is probably superfluous. I won't discuss the particularities of
licenses, so let's just say that for the sake of this example package we are writing, we are going
to use the MIT license:
```{r, eval=FALSE}
usethis::use_mit_license()
```
This again creates the right file at the right spot. There are other interesting functions inside
the `{usethis}` package, and we will come back to it later.
## Including data inside the package
Many packages include data and we are going to learn how to do it. I'll assume that we already
have a dataset on hand that we have to share. This is quite simple to do; first, let's
load the data:
```{r, eval=FALSE}
arcade <- readr::read_csv("~/path/to/data/arcade.csv")
```
and then, once again, `{usethis}` comes to our rescue:
```{r, eval=FALSE}
usethis::use_data(arcade, compress = "xz")
```
That's it! Well, almost. We still need to write a little script that documents the data and declares
the name under which users of your package will load it. This script is simply called `data.R` and contains the following lines:
```{r, eval=FALSE}
#' List of highest-grossing games
#'
#' Source: https://en.wikipedia.org/wiki/Arcade_game#List_of_highest-grossing_games
#'
#' @format A data frame with 6 variables: \code{game}, \code{release_year},
#' \code{hardware_units_sold}, \code{comment_hardware}, \code{estimated_gross_revenue},
#' \code{comment_revenue}
#' \describe{
#' \item{game}{The name of the game}
#' \item{release_year}{The year the game was released}
#' \item{hardware_units_sold}{The amount of hardware units sold}
#' \item{comment_hardware}{Comment accompanying the amount of hardware units sold}
#' \item{estimated_gross_revenue}{Estimated gross revenue in US$ with 2019 inflation}
#' \item{comment_revenue}{Comment accompanying the estimated gross revenue}
#' }
"arcade"
```
Basically this is a description of the data, and the name with which the user will invoke the data. To
conclude this part, remember the `data-raw` folder? If you used a script to scrape/get the data
from somewhere, or if you had to write code to prepare the data to make it fit for sharing, this
is where you can put that script. I have written such a script; I will discuss it in the next
chapter, where I'll show you how to scrape data from the internet. You can also save the file
where you wrote all your calls to `{usethis}` functions if you want.
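To get a feeling for what `usethis::use_data()` does, here is a rough base R sketch of the same idea: the object is saved as a compressed `.rda` file inside a `data` folder. This is an approximation under my own assumptions, not the actual implementation of `{usethis}`; `toy_arcade` and its values are made up, and the folder is created inside a temporary directory so that nothing gets written to your project.

```{r, eval=FALSE}
# A tiny stand-in for the real data (made-up values, for illustration only)
toy_arcade <- data.frame(game = c("Game A", "Game B"),
                         release_year = c(1980L, 1991L))

# Mimic the layout of a package: a `data` folder holding one .rda file
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_file <- file.path(data_dir, "toy_arcade.rda")

# Save the object with the same compression as in the call to use_data()
save(toy_arcade, file = rda_file, compress = "xz")

# Loading the file into a fresh environment restores the object under its original name
restored <- new.env()
load(rda_file, envir = restored)
all.equal(restored$toy_arcade, toy_arcade)
```

The name of the saved object is what matters: it is the name users will type to get the data, and the string at the bottom of `data.R` has to match it.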
## Adding functions to your package
Functions will be added inside the `R` folder. In there, you will find the `hello.R` file. You can
edit this file if you kept it or you can create a new script. This script can hold one function, or
several functions.
Let's start with the simplest case; one function inside one script.
### One function inside one script
Create a new R script, or edit the `hello.R` file, and add in the following code:
```{r, eval=FALSE}
#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame; if you are only interested in certain columns,
#' you can list them here.
#' @return A data frame with descriptive statistics.
#' @import dplyr
#' @importFrom tidyr gather
#' @export
#' @examples
#' \dontrun{
#' describe_numeric(dataset)
#' describe_numeric(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){
  if (nargs() > 1) df <- select(df, ...)
  df %>%
    select_if(is.numeric) %>%
    gather(variable, value) %>%
    group_by(variable) %>%
    summarise_all(list(mean = ~mean(., na.rm = TRUE),
                       sd = ~sd(., na.rm = TRUE),
                       nobs = ~length(.),
                       min = ~min(., na.rm = TRUE),
                       max = ~max(., na.rm = TRUE),
                       q05 = ~quantile(., 0.05, na.rm = TRUE),
                       q25 = ~quantile(., 0.25, na.rm = TRUE),
                       mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
                       median = ~quantile(., 0.5, na.rm = TRUE),
                       q75 = ~quantile(., 0.75, na.rm = TRUE),
                       q95 = ~quantile(., 0.95, na.rm = TRUE),
                       n_missing = ~sum(is.na(.)))) %>%
    mutate(type = "Numeric")
}
```
Save the script under the name `describe.R`.
This function shows you pretty much everything you need to know when writing functions for packages. First,
there are the comment lines, which start with `#'` and not with `#`. These lines will be converted
into the function's documentation, which you and your package's users will be able to read in
Rstudio's *Help* pane. Notice the keywords that start with `@`. These are quite important:
- `@param`: used to define the function's parameters;
- `@return`: used to define the object returned by the function;
- `@import`: if the function needs functions from another package, in the present case `{dplyr}`,
then this is where you would declare that package. Separate several packages with a space;
- `@importFrom`: if the function only needs one function from a package, define it here. Read it as
*from tidyr import gather*, very similar to how it is done in Python;
- `@export`: makes the function available to the users. If you omit this, this function will not
be available to the users and only available internally to the other functions of the package. Not
making functions available to users can be useful if you need to write functions that are used by
other functions but should never be used by anyone directly. It is still possible to access these internal,
private, functions by using `:::`, as in, `package:::private_function()`;
- `@examples`: lists examples in the documentation. The `\dontrun{}` tag is used for when you do
not want these examples to run when building the package.
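To see the difference between exported and internal functions in practice, you can inspect any installed package. The sketch below looks at `{stats}`, which ships with R; the variable names are mine, and the only point is that a package contains more objects than the ones it exports.

```{r, eval=FALSE}
# Names that the package makes available to users (what `@export` produces)
exported <- getNamespaceExports("stats")

# Every object defined inside the package, exported or not
all_objects <- ls(asNamespace("stats"), all.names = TRUE)

# Objects that are only available internally
internal <- setdiff(all_objects, exported)

"sd" %in% exported   # sd() is exported, so stats::sd() works
length(internal)     # number of internal objects, reachable only with `:::`

# Exported functions are called with `::`
stats::sd(c(1, 2, 3))
```

In your own package, a function without the `@export` tag ends up in the second group: the other functions of the package can call it, but users would have to reach for `:::` to get it.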
As explained before, if the function depends on functions from other packages, then `@import` or
`@importFrom` must be used. But it is also possible to use the `package::function()` syntax like
I did on the following line:
```{r, eval=FALSE}
mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
```
This function uses the `sample_mode()` function from my `{brotools}` package. Since it is the only
function that I am using, I don't import the whole package with `@import`. I could have done the
same for `gather()` from `{tidyr}` instead of using `@importFrom`, but I wanted to showcase
`@importFrom`, which can also be used to import several functions:
```
@importFrom package function_1 function_2 function_3
```
The way I'm doing this however is not optimal. If your package depends on many functions from
other packages that are not available on CRAN, but rather on Github, you might want to do that
in a cleaner way. The cleaner way is to add a "Remotes" field to the package's `DESCRIPTION` file (not to
be confused with `NAMESPACE`, a very important file that gets generated automatically by
`devtools::document()`). I won't
cover this here, but you can read more about it [here](https://cran.r-project.org/web/packages/devtools/vignettes/dependencies.html).
What I will cover is how to declare dependencies to other CRAN packages. These dependencies also
get declared inside the `DESCRIPTION` file, which we will cover in the next section.
Because I'm doing that in this hacky way, my `{brotools}` package should be installed:
```{r, eval=FALSE}
devtools::install_github("b-rodrigues/brotools")
```
Again, I want to emphasize that this is not the best way of doing it. Using the "Remotes"
field as described in the document I linked above is not complicated.
Now comes the function itself. The function is written in pretty much the same way as usual, but
there are some particularities. First of all, the second argument of the function is the `...`, which
was already covered in Chapter 7. I want to give my users the option to specify columns and
summarise only these columns, instead of all of them, which is the default behaviour. But because
I cannot know beforehand how many columns the user wants to summarize, and also because I do not
want to limit the user to 2 or 3 columns, I use the `...`.
But what if the user wants to summarize all the columns? This is taken care of in this line:
```{r, eval=FALSE}
if (nargs() > 1) df <- select(df, ...)
```
`nargs()` counts the number of arguments supplied in the call to the function. If the user calls the function like so:
```{r, eval=FALSE}
describe_numeric(mtcars)
```
`nargs()` will return 1. If, instead, the user calls the function with one or more columns:
```{r, eval=FALSE}
describe_numeric(mtcars, hp, mpg)
```
then `nargs()` will return 3 (in this case: the data frame plus two columns). And thus, this piece of code will be executed:
```{r, eval=FALSE}
df <- select(df, ...)
```
which selects the columns `hp` and `mpg` from the `mtcars` dataset. This reduced data set is then
the one that is being summarized.
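If you want to convince yourself of how `nargs()` counts, you can try it on a toy function. `count_args()` below is a hypothetical helper written only for this illustration; it takes the same arguments as `describe_numeric()` but does nothing other than report how many arguments it received.

```{r, eval=FALSE}
# Hypothetical helper: same signature as describe_numeric(), returns only the argument count
count_args <- function(df, ...) nargs()

# Only the data frame is supplied
count_args(mtcars)

# The data frame plus two columns; the column names are never evaluated here,
# so it does not matter that `hp` and `mpg` are not objects in the global environment
count_args(mtcars, hp, mpg)
```

The first call returns 1 and the second returns 3, so the condition `nargs() > 1` is true as soon as at least one column is supplied after the data frame.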
### Many functions inside a script
If you need to add more functions, you can add more in the same
script, or create one script per function. The advantage of writing more than one function per
script is that you can keep functions that are conceptually similar in the same place. For instance,
if you want to add a function called `describe_character()` to your package, adding it to the same
script where `describe_numeric()` is might be a good idea, so let's do just that:
```{r, eval=FALSE}
#' Compute descriptive statistics for the numeric columns of a data frame.
#' @param df The data frame to summarise.
#' @param ... Optional. Columns in the data frame; if you are only interested in certain columns,
#' you can list them here.
#' @return A data frame with descriptive statistics.
#' @import dplyr
#' @importFrom tidyr pivot_longer
#' @export
#' @examples
#' \dontrun{
#' describe_numeric(dataset)
#' describe_numeric(dataset, col1, col2)
#' }
describe_numeric <- function(df, ...){
  if (nargs() > 1) df <- select(df, ...)
  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(across(everything(),
                     tibble::lst(mean = ~mean(., na.rm = TRUE),
                                 sd = ~sd(., na.rm = TRUE),
                                 nobs = ~length(.),
                                 min = ~min(., na.rm = TRUE),
                                 max = ~max(., na.rm = TRUE),
                                 q05 = ~quantile(., 0.05, na.rm = TRUE),
                                 q25 = ~quantile(., 0.25, na.rm = TRUE),
                                 mode = ~as.character(brotools::sample_mode(.), na.rm = TRUE),
                                 median = ~quantile(., 0.5, na.rm = TRUE),
                                 q75 = ~quantile(., 0.75, na.rm = TRUE),
                                 q95 = ~quantile(., 0.95, na.rm = TRUE),
                                 n_missing = ~sum(is.na(.))))) %>%
    mutate(type = "Numeric")
}
#' Compute descriptive statistics for the character or factor columns of a data frame.
#' @param df The data frame to summarise.
#' @param type A string used to label the type of the summarised columns.
#' @return A data frame with a description of the character or factor columns.
#' @import dplyr
#' @importFrom tidyr pivot_longer
describe_character_or_factors <- function(df, type){
  # No @export tag: this helper is only available internally to the package
  df %>%
    pivot_longer(cols = everything(),
                 names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    # Plain summarise() call; the deprecated funs() helper is not needed here
    summarise(mode = brotools::sample_mode(value, na.rm = TRUE),
              nobs = length(value),
              n_missing = sum(is.na(value)),
              n_unique = length(unique(value))) %>%
    mutate(type = type)
}
#' Compute descriptive statistics for the character columns of a data frame.
#' @param df The data frame to summarise.
#' @return A data frame with a description of the character columns.
#' @import dplyr
#' @export
#' @examples
#' \dontrun{
#' describe_character(dataset)
#' }
describe_character <- function(df){
  df %>%
    select(where(is.character)) %>%
    describe_character_or_factors(type = "Character")
}
```
Let's now continue on to the next section, where we will learn to document the package.
## Documenting your package
There are several files that you must edit to fully document the package; for now, only the functions
are documented. The first of these files is the `DESCRIPTION` file.
### Description
By default, the `DESCRIPTION` file, which you can find in the root of your package project, contains
the following lines:
```
Package: arcade
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Author: Who wrote it
Maintainer: The package maintainer <yourself@somewhere.net>
Description: More about what it does (maybe more than one line)
    Use four spaces when indenting paragraphs within the Description.
License: What license is it under?
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
```
Each section is quite self-explanatory. This is what it could look like once you're done editing it:
```
Package: arcade
Type: Package
Title: List of highest-grossing Arcade Games
Version: 0.1.0
Authors@R: person("Harold", "Zurcher", email = "[email protected]", role = c("aut", "cre"))
Description: This package contains data about the highest-grossing arcade games from the 70's until
    2010's. Also contains some functions to summarize data.
License: CC0
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
```
The `Author` and `Maintainer` fields need some further explanation; a single `person()` entry declares
Harold Zurcher as the author and creator, with the `role = c("aut", "cre")` bit. `"cre"` also designates
the maintainer, so I removed the `Maintainer` line.
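The `person()` call is ordinary R code from the `{utils}` package, and it is the kind of entry that the `Authors@R` field of a `DESCRIPTION` file expects, so you can try it in the console before pasting it in. In the sketch below the email address is a placeholder that I made up; replace it with your own.

```{r, eval=FALSE}
# Build the author entry; the email address is a placeholder (assumption)
author <- utils::person(given = "Harold", family = "Zurcher",
                        email = "harold@example.com",
                        role = c("aut", "cre"))

# The roles attached to this person: "aut" for author, "cre" for creator/maintainer
author$role

# How the entry gets printed
print(author)
```

If a package has several authors, combine several `person()` calls with `c()`, and give the `"cre"` role to exactly one of them.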
## Unit testing your package