-
Notifications
You must be signed in to change notification settings - Fork 1
/
r_best-practices.org
executable file
·164 lines (108 loc) · 8.03 KB
/
r_best-practices.org
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
#+OPTIONS: title:t date:t author:t email:t
#+OPTIONS: toc:t h:6 num:nil |:t todo:nil
#+OPTIONS: *:t -:t ::t <:t \n:t e:t creator:nil
#+OPTIONS: f:t inline:t tasks:t tex:t timestamp:t
#+OPTIONS: html-preamble:t html-postamble:nil
#+PROPERTY: header-args:R :session R:best-prac :results output :exports code :tangle yes :comments link :eval no
#+TITLE: Tips on best practices in R
#+DATE: {{{time(%B %d\, %Y)}}}
#+AUTHOR: Marie-Hélène Burle
#+EMAIL: [email protected]
* Projects
#+BEGIN_VERBATIM
Work in self-contained projects
#+END_VERBATIM
Advantages:
- allows collaboration
- makes projects portable
- facilitates version control
- reduces the risk to end up with broken links or missing files necessary for the projects
#+BEGIN_EMPHASIS
Create a directory that contains all the necessary subdirectories and files for your project
#+END_EMPHASIS
This includes data, scripts, as well as the project outputs (graphs, results, etc.)
If you use RStudio, create RStudio projects.
** Example possible project structure
This would be a rather standard project structure:
#+BEGIN_EXAMPLE
/project_root /data /raw
/clean
/results /graphs
/tables
/docs
/bin
/src
/ms
#+END_EXAMPLE
You are free to organize your projects in a way that works for you. But following some rather standard way of organizing files will help collaborators find their way around your project and some consistency between projects will also help you adapt scripts and snippets from one project to the next.
* Paths
** The problem with absolute paths
If this is how your script starts:
#+BEGIN_SRC R
setwd("C:\Users\charlie\some_very_personal_directory_structure\our_project")
#+END_SRC
what do you think the odds are that the person to whom you are giving that script
- is called Charlie and
- is using Windows and
- is having the exact same directory structure in their computer?
In order to run your script, your friend Lucy, who uses linux, will have to change all links to:
#+BEGIN_SRC R
"/home/lucy/totally_different_directory_structure/our_project"
#+END_SRC
This makes for challenging collaborations.
If you move your project to another machine, the links will equally get broken. And it is likely that they will get broken over time as you reorganize documents in your computer.
** Solution: use paths relative to your project root
*Once your entire project lives within one directory* (the project root), as we saw in the previous section, make sure that:
#+BEGIN_VERBATIM
all the paths in your scripts are relative to the project root
#+END_VERBATIM
This makes your project /portable/, /much more robust/ (lower risk to get broken over time), and straightforward to /share/.
The way a lot of people go around this is by setting the working directory manually in RStudio (by clicking on "Set As Working Directory") to the project root. But for this to run in another machine, the user will have to do the same before running the script (unless the script is in the project root, but this is rarely the case and you certainly cannot assume that it is).
A portable script should run as is, without this manual tinkering. Why?
1. It is easy to forget to do the tinkering and wonder why nothing works
2. This manual tinkering prevents automation and defeats one of the main advantages of programming:
once you realize the advantages of writing code over the use of a GUI software, you should realize that you have to be careful in how you use RStudio. Clicking around is ok for some tasks, but it is not if this does something necessary for the script to run successfully (such as setting the working directory).
If you use RStudio projects (and if you use RStudio, you definitely should create RStudio projects), RStudio will automatically, upon opening a project, move the current working directory to the project root. This is great. But you cannot assume that everybody running your script uses RStudio: R scripts can be run directly in R, in the command line, from a shell script, or using other tools or IDEs. So the script itself should ideally contain a way to refer to the project root, independent of RStudio.
How?
** Package src_R[:eval no]{here}
There are various methods, but a wonderfully easy one is:
#+BEGIN_VERBATIM
the package [[https://github.com/r-lib/here][here]] from [[https://github.com/krlmlr][Kirill Müller]]
#+END_VERBATIM
The function src_R[:eval no]{here::here()} starts from the current working directory (the directory in which the script is running if you don't set it manually in RStudio or with src_R[:eval no]{setwd}) and goes up the directory chain until it finds a src_R[:eval no]{.Rproj} file (if you use RStudio projects), a src_R[:eval no]{.git} or src_R[:eval no]{.svn} file (if you version control your projects), a src_R[:eval no]{.projectile} file (if you use emacs projectile), or other sensible files which signify a project root. If none of these apply to you (which is unlikely), you can create a file src_R[:eval no]{.here} in your project root by running the function src_R[:eval no]{set_here("path/to/project/root")} in the console (don't add it to your script since the path to your project root is specific to your machine and you only need to do this once). This src_R[:eval no]{.here} file is now the marker of the project root for the function src_R[:eval no]{here()}.
From there on, you can refer to any file in your project with src_R[:eval no]{here("file/path/from/project/root")}.
/Example usage:/
#+BEGIN_SRC R
library(tidyverse)
library(here)
my_data <- read_excel(here("data/raw/my_data.xlsx"))
my_plot <- ggplot(data = my_data) + geom_point()
ggsave(here("results/graphs/my_plot.png"))
#+END_SRC
* Clean session
#+BEGIN_VERBATIM
Never set anything that might change how your code runs
#+END_VERBATIM
In particular:
- never save your workspace upon closing a session (beware of RStudio default settings! [[https://twitter.com/hadleywickham/status/1032665959734108160][go edit them now]]),
- restart your R session frequently to make sure that you are not running bits of code from past sessions,
- do not add anything in your src_R[:eval no]{.Rprofile}, src_R[:eval no]{.Renviron}, or any other setting file that would affect the output of your code in any way, such as setting options, creating functions, loading packages, etc. This is tempting if you always use the same options or packages. But this makes your scripts non-reproducible by others who do not have those settings. It is much better to create snippets to add those lines of code very easily (even automatically) at the beginning of your scripts.
* Formatting
There is no official R formatting. [[http://hadley.nz/][Hadley Wickham]] wrote a [[http://style.tidyverse.org/][short book]] on R formatting and this can be a great template to follow. A growing number of people are following his guidelines and it would be a good idea to familiarize yourself with them.
The package src_R[:eval no]{lintr} by [[https://github.com/jimhester][Jim Hester]], which runs in emacs ESS, Sublime, Vim, and Atom, as well as RStudio functionalities highlight where your code does not follow these formatting recommendations and can be a great way to get used to applying them to your code until they become automatic.
But the most important pieces of advice, when it comes to formatting code are:
#+BEGIN_VERBATIM
*be consistent*
follow the style used by your collaborators, particularly if you edit their scripts
#+END_VERBATIM
* Things you do not want in a script
#+BEGIN_VERBATIM
Avoid anything that will make changes to a computer
#+END_VERBATIM
If someone runs your script, this should not install packages or make any other change to their machine. So, for instance, avoid
#+BEGIN_SRC R
install.packages()
#+END_SRC
#+BEGIN_RED
I owe these better coding habits to [[https://github.com/jennybc][Jenny Bryan]] and [[http://hadley.nz/][Hadley Wickham]]. Do not hesitate to look for their books, workshops, and other material that are very useful and open source.
#+END_RED