---
title: "Data cleaning and raw data management"
output:
  html_document:
    toc: yes
    toc_float: yes
  html_notebook:
    toc: yes
    toc_float: yes
---
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)
# The reproducible workflow
My unofficial mantra (again):
!["Adam Savage via Pintrest"](https://s-media-cache-ak0.pinimg.com/564x/e9/f2/19/e9f219dce30f13670158832310b0e42c.jpg)
In [Data organization with spreadsheets](https://remi-daigle.github.io/2017-CHONe-Data/organization.nb.html) we learned how to write down your data, avoid common errors, and save your data so that you and others can read and understand it later.
Now let's learn how to write down **ALL** the steps needed to take you from your raw data to publication quality figures. That's the key to reproducibility: if every step is written down, then others can reproduce your findings! Below, you will learn how to clean your data in a reproducible way using OpenRefine and R.
> **Pro-tip**
>
> After spending all that time entering your raw data, **NEVER** change it again.
>
> - Do not edit the data
> - Do not edit the column headers
> - Do not remove 'outliers'
> - Do not do calculations directly on the raw data
>
> Store your data in a `raw_data` directory (folder) in your project directory, and never save/write over it!
I advise you to archive your data immediately upon collection to reduce the risk of data loss. [Zenodo](https://zenodo.org/) and [Figshare](https://figshare.com/) both offer different **free** options for archiving your raw data **permanently** and have some awesome options for embargoes, restricted access, or even private storage to suit your privacy needs. More detail can be found in the [Data Archiving & version controls](https://remi-daigle.github.io/2017-CHONe-Data/versioncontrol.nb.html) lesson.
[![](pictures/Branch.png)](https://twitter.com/TrevorABranch/status/466780827985002496)
By now I've probably said the words reproducible and reproducibility so often that they're starting to lose meaning. Trust me, this is important! (Or don't trust me and read ["A manifesto for reproducible science"](https://www.nature.com/articles/s41562-016-0021) and ["Reproducible Data Science with R"](https://www.r-bloggers.com/reproducible-data-science-with-r/).) By not 'hiding' your workflow, you help:
- **yourself** in the future since:
    - you can easily re-run your analysis after 'reviewer 3' makes you add one data point
    - you will be able to easily adapt your methods
    - you will receive more citations since someone may cite you for your methods/data
    - you will appear to be an honest/better scientist
- **other scientists** since:
    - they will be able to reuse/adapt your methods/data
    - they will be better able to judge your methods as appropriate (for your data during review, or their data in the future)
- **'Science'** since:
    - discoveries are published more efficiently
    - findings are 'correct' more often
    - there is increased trust in 'Science', which we badly need right now
# OpenRefine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it and transforming it from one format into another. It's effectively a reproducible way to work in a spreadsheet: no coding is required on your part, since it generates a script that details what you did to the data, step by step.
First, let's open OpenRefine. You'll notice it opens in the browser, but it's running locally (it does not require an internet connection).
![](pictures/OR.png)
The first step is to load your data and create a project. If you haven't done so already, please download the [zip file](https://github.com/remi-daigle/2017-CHONe-Data/archive/gh-pages.zip) of the entire GitHub [project repository](https://github.com/remi-daigle/2017-CHONe-Data/) (it contains the data files and all the R scripts used to make this very website!) and extract it somewhere convenient.
In OpenRefine, click `Choose Files` and find `larval abundance.csv`, which is in the `rawdata` directory of the project folder. Then click on the `Create Project` button (you may want to rename the project).
## Data Cleaning
You are now working on a copy of the raw data, and changes you make in OpenRefine will not 'break' your original raw data. In here you can do all the regular 'spreadsheet-y' things: you can edit specific cells, sort, undo/redo, view subsets of your data (facet), etc. But the more powerful functions of OpenRefine are:
- **Cluster** (click on column header arrow, then `Edit cells > Cluster and Edit...`) which means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings “New York” and “new york” are very likely to refer to the same concept and just have capitalization differences.
- **Whitespace management** (click on the column header arrow, then `Edit cells > Common transforms > Trim leading and trailing whitespace.` and `Edit cells > Common transforms > Collapse consecutive whitespace.`) Strings with spaces at the beginning or end are particularly hard for us humans to tell apart from strings without, but the blank characters will make a difference to the computer. We usually want to remove these (a quick R preview follows the screenshot below).
![](pictures/ORwhitespace.png)
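For comparison, the same whitespace cleanup can be done in R, which we cover below. This is just a minimal sketch on a made-up vector, using the base R functions `trimws()` and `gsub()`; it is not part of the OpenRefine workflow itself.
```{r}
# a made-up character vector with messy whitespace (illustration only)
sites <- c("  St. Margarets Bay", "St. Margarets  Bay ")
# trim leading and trailing whitespace
sites <- trimws(sites)
# collapse consecutive whitespace into a single space
sites <- gsub(pattern = " +", replacement = " ", x = sites)
sites
```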
## Reproducibility
OpenRefine saves every change, every edit you make to the dataset, in a file you can save on your machine. If you had 20 files to clean, and they all had the same type of errors and the same columns, you could save the script, open a new file to clean, paste in the script, and run it. Voilà, clean data.
- In the `Undo / Redo` section, click `Extract`, then select the steps you want to keep using the check boxes.
- Copy the code and paste it into a text editor. Save it as a `.txt` file.
- To run these steps on a new dataset, import it into OpenRefine, open the `Extract / Apply` section, paste in the contents of the `.txt` file, and click `Apply`.
For more information and tutorials on OpenRefine, please see [Data Carpentry](http://www.datacarpentry.org/OpenRefine-ecology-lesson/).
# R and RStudio
You can do everything mentioned above in [R](http://cran.utstat.utoronto.ca/). It may at first appear more difficult to do in R, but in my opinion you will save time by streamlining your workflow using just one tool. I'm not at all discouraging the use of OpenRefine; it is open source and reproducible, so much kudos is due.
For this and future lessons, we will focus on achieving reproducibility by using [R](http://cran.utstat.utoronto.ca/), which is an open source language and environment for statistical computing and graphics. There are many other such languages (e.g. [Python](https://www.python.org/), [Julia](https://julialang.org/), [MATLAB](https://www.mathworks.com/products/matlab.html), etc.) used by conservationists, biologists, and oceanographers; however, we believe R is currently the most widely adopted among our colleagues and also has the most convenient set of statistical tools developed for our field.
## Intro to R
In "the olden days" we ~~had to walk to school uphill both ways~~ used R in the terminal or using the built in graphical use interface (GUI). Yes, before 2011, RStudio did not exist and yes, R and RStudio are not the same thing!
R is accessible in the terminal (that thing that looks like DOS, and in case my 'old' is showing, the thing that is usually a black screen with a blinking cursor where you can only type in commands) by typing `R.exe` on Windows, or just `R` on Mac or Linux. In this way, you can type commands in one by one, or, similar to what we just saw with OpenRefine, you can save your steps/instructions/commands in a plain text file (with a `.R` extension instead of `.txt`) and run those in the terminal by typing `Rscript.exe scriptname.R` on Windows, or just `Rscript scriptname.R` on Mac or Linux. I don't often work in this way anymore, but it is the only option when using [Compute Canada](https://www.computecanada.ca/research-portal/national-services/compute/)'s awesome resources. Using the commands above and a little server-specific magic, you can run your scripts on hundreds of processors instead of the one lonely processor on your computer! I've used hundreds of years of computer time in a matter of weeks, all for free! If you are affiliated with any Canadian university, you can do this too!
!["This is what 'plain vanilla' R looks like"](pictures/Rterm.png)
R also comes with its own GUI, in which you have a script editor (essentially a plain text editor) to write and develop your script, and an interactive R console where you can actually execute commands. The advantage of the GUI is that you can execute the entire script ('source') or run it line by line, all while recording your commands in the script file.
![](pictures/Rgui.png)
## Intro to RStudio
[RStudio](https://www.rstudio.com/products/RStudio/#Desktop) takes this GUI concept a bit further and provides you with several extra support windows. If the idea of having windows for your environment, your files, your plots, as well as packages and a help tab all at hand does not excite you, hold on tight, you'll get there.
![](pictures/RStudio.png)
There's also a lot more information about RStudio on their [cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf). On the subject of cheatsheets, RStudio has developed several **super useful** [cheatsheets](https://www.rstudio.com/resources/cheatsheets/); seriously, you probably will want to print most of these and put them on the wall in your office.
## Enough talk! Let's get coding! The Fundamentals
You read my mind! However, before we get to cleaning the data, we need to cover a few R fundamentals so that what we do in later steps makes sense.
Go back to the project folder you downloaded during the [OpenRefine lesson](https://remi-daigle.github.io/2017-CHONe-Data/cleaning.html#openrefine) and open the `2017-CHONe-Data.Rproj` file. This is an R project file that allows you to set a number of options for the project (see [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects)), but for our purposes just know that the project file sets the 'working directory'; it tells R where all your files are. We'll get back to that later.
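If you want to confirm where the project file has set the working directory, you can ask R directly; the path printed will of course be different on your machine.
```{r, eval=FALSE}
# print the current working directory; with the .Rproj file open,
# this should be the project folder that contains the 'rawdata' directory
getwd()
# list the files and folders R can see in the working directory
list.files()
```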
In R you can do math; type the command below into your R console and hit Enter:
```{r}
1+1
```
You can also assign values to variables, or in R parlance 'objects', using the `<-` symbol (keyboard shortcut `Alt` + `-`).
```{r}
a <- 2
b <- 1+2
```
You'll notice that there was no output this time. That's because the value on the right side of the `<-` symbol is assigned to an object; it goes to your environment (top right window) instead of being output to the console. In the `1+1` example above, there was no object to assign to, so it defaulted to printing in the console.
You can see the contents of an object in the environment window, or by typing the object's name into the console. You can also use these objects like in algebra:
```{r}
a
b
a/b
```
Up to now we've been dealing with numbers, but R can also deal with character strings if they are surrounded by single or double quotation marks. According to the R help (I learned this today!): "Single and double quotes delimit character constants. They can be used interchangeably but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes."
```{r}
f <- "This is a character string, you can tell because of the quotation marks"
```
An object can also contain multiple values; this is called a vector. The `:` symbol essentially means 'to'.
```{r}
x <- 1:3
```
Another way to do that, with more flexibility, is to use `c()`; the `c` is short for concatenate, and the round brackets indicate that it's a function. This concatenate function will combine all the 'arguments' (the things inside the round brackets), which are separated by commas. You can also combine these strategies:
```{r}
x <- c(1,3,5)
y <- c(1:4,6,8)
```
There are many functions, but they all follow the format `functionName(argument1,argument2,argument3,...)` where the 'arguments' are the input to the function. Some are fairly straightforward:
```{r}
mean(y)
```
But even then there are some surprises; let's look at the help file for `mean()`. To do that you can:
- if you are on the active line in the console or anywhere in a script, put your cursor on the function and press `F1`
- in the console, type `?mean` (or `??mean` if you're not so sure `mean` is the name of the function)
- find the help window (one of the tabs for the bottom right window) and use the search bar
- also, when all else fails, Google is your friend!
Any method should get you to something like this:
![](pictures/help.png)
In most cases in R, you could use `=` instead of the `<-` symbol with no problems when assigning a value to an object. However, it is best practice to use `<-` when assigning environment objects and `=` when defining function arguments. Oh, and `NA` in R means 'Not Available' / missing value. Like so:
```{r}
x <- c(1,2,5,7,88,3,4,2,4,6,7,NA)
mean(x)
mean(x, na.rm = TRUE)
```
I also snuck a `TRUE` in there; `TRUE` and `FALSE` are called logicals and are distinct from numerics or character strings. They are sometimes used as argument values, but they can also be used to test things. The `==` asks if both sides are equal (since the single `=` is already used for other things), and the `!=` asks if both sides are not equal.
```{r}
2==1
2==2
2!=1
```
## Creating your own scripts
Up until now, we've been playing in the console, which means the 'instructions' we need to save to reproduce our science are lost (well, not really; they can be retrieved from the History tab in the top right, or from the console if it hasn't rolled off the screen). It is a good idea to develop your analysis using a script file (those simple text files with the `.R` extension I was talking about earlier) because you can save your code easily.
To create a new script, you can click on the little paper with the plus symbol (see below), or you can hit `Ctrl`+`Shift`+`N` (Windows) or `Command`+`Shift`+`N` (Mac), and if that's not enough options, you can click `File > New File > R Script`.
![](pictures/newscript.png)
These scripts are designed to be read by R from top to bottom when you hit the ![](pictures/source.png) button, or `Ctrl`+`Shift`+`S` (Windows), or `Command`+`Shift`+`S` (Mac). Alternatively, you can run portions of your code with `Ctrl`+`Enter` (Windows), or `Command`+`Enter` (Mac), by either putting your cursor on a line to run the entire line, or highlighting a subsection of code to run just that portion. This will allow us to build multi-step data processing and analysis scripts.
>**Pro-tip**
>If you don't want R to read something, use the `#`. Anything that is preceded by a `#` is regarded as a 'comment' by R, and it does not try to execute those lines (i.e. R ignores anything after a `#`).
> This is also useful if you want to avoid running a few lines of code when you are developing your script. Instead of typing a `#` in front of each line of code, you can highlight the lines you want commented out and hit `Ctrl`+`Shift`+`C` (Windows), or `Command`+`Shift`+`C` (Mac). Magic!
Commenting is super useful for including human-readable instructions/documentation in your code. Let's give this a try: write this chunk of code into your script, then run it line by line.
```{r}
x <- 1
x <- 2
# x <- 3
```
What is the value of `x` after running all the lines and why?
## Indexing and dimensions
We already mentioned that we can have multiple values in an object, but so far only in one dimension. Using `matrix()` (2D) and `array()` (>2D) we can store numbers in multiple dimensions.
```{r}
# make a matrix
x <- matrix(data = c(1,1,2,5,3,4), nrow = 2, ncol = 3)
x
# make an array
y <- array(data = c(1:8), dim = c(2,2,2))
y
```
>**Exercise**
>
> Try storing numeric and character strings in a matrix (or an array). What happens to the numerics?
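Once you have given the exercise a try, here is a minimal sketch you can run to check your answer: because a matrix can only hold a single type, the numerics are coerced to character strings.
```{r}
# mixing numerics and character strings in a matrix
m <- matrix(data = c(1, 2, "three", "four"), nrow = 2, ncol = 2)
m
# note the quotation marks: the numbers are now character strings
str(m)
```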
It's great that we can store this data in multiple dimensions, but how do we get it back? That's what `[]` is for!
```{r}
# in 1D
x <- c(1,2,3,4,6,34,2,1,5,6,7)
# if we want the 7th value
x[7]
# if we want the 2nd and 7th value
x[c(2,7)]
# if we want only values greater than 10
x[x>10]
# whoa, that blew my mind! How did that work?
# well indexing works by giving the numeric index, or a logical vector the length of the vector we're working with
# so x>10 produces a logical vector the length of x
x>10
# in 2D
x <- matrix(data = c(1,1,2,5,3,4), nrow = 2, ncol = 3)
# this works similarly, but you need to provide 2 numbers (or 2 vectors of numbers): one for the row number(s) and one for the column number(s)
x[1,3]
x[c(1,2),1]
# if you leave one dimension blank, the whole row or column will be returned
x[,1]
x[2,]
# in 3D, add another dimension!
x <- array(data = c(1:24), dim = c(2,3,4))
x[1,2,3]
x[,,3]
```
Data frames (`data.frame()`) are a special type of matrix that can hold numerics and character strings (and logicals, factors, geometries, etc.) without having to convert them all to a single type. A data frame is effectively a collection of vectors of the same length.
```{r}
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
logis = c(TRUE, FALSE, TRUE))
# print the whole thing to the console
x
# or for a very big data frame, you may want to use head to see the first 6 rows
head(x)
# use the str function to see its structure
str(x)
```
>**Pro-tip**
>
> Did you notice that the structure of chars was `Factor` and not `chr`? Factors are a special type of character string vector that conserves information about the vector's 'levels' (and you can also order those levels). Many functions, such as `data.frame()`, have an argument called `stringsAsFactors`, which is usually set to `TRUE` by default; you may want to set this to `FALSE`. Often these are interchangeable, but I have had **many frustrating errors** because I mistakenly had a character string where I needed a factor, and vice versa. Be aware that they are different, and that not knowing which one you have can lead to errors. You can easily convert back and forth with `as.factor()` (or `as.ordered()`) and `as.character()`.
```{r}
# let's try having both strings and factors in the data frame
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
facts = as.factor(c("one","two","three")),
logis = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE)
# print the whole thing to the console
x
# or for a very big data frame, you may want to use head to see the first 6 rows
head(x)
# use the str function to see its structure
str(x)
```
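As mentioned in the pro-tip above, you can convert back and forth between factors and character strings; here is a minimal sketch using `as.character()` and `as.factor()` on the data frame `x` we just made.
```{r}
# convert the factor column to a character string vector
as.character(x$facts)
# convert the character string column to a factor (note the 'Levels' in the output)
as.factor(x$chars)
```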
Indexing in data frames can be done the same way as for a 2D matrix, or you can use `$` to access columns by name:
```{r}
# make a data frame
x <- data.frame(nums = c(1,2,3),
chars = c("one","two","three"),
facts = as.factor(c("one","two","three")),
logis = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE)
x[1,2]
x[,2]
x$chars
# and you can also treat these columns as 1D vectors
x$chars[1]
```
## Reading in data
That's all great, but our data is stored in a `.csv` file; how do we get it into R? The simplest way is to use the `read.csv()` function. But we need to know where on the computer the file is stored.
If you haven't done so already, download the [zip file](https://github.com/remi-daigle/2017-CHONe-Data/archive/gh-pages.zip) of the entire GitHub [project repository](https://github.com/remi-daigle/2017-CHONe-Data/), extract it somewhere convenient, and open the `2017-CHONe-Data.Rproj` file (as in the [OpenRefine lesson](https://remi-daigle.github.io/2017-CHONe-Data/cleaning.html#openrefine)). Remember, the project file sets the 'working directory', which tells R where all your files are.
```{r}
# you can hard code the whole file path, but try to never do that!
# your computer may not have a C drive, and your name almost certainly is not Remi, so this will not work for you
# larvalAbundance <- read.csv("C:/Users/Remi-Work/Desktop/2017-CHONe-Data/rawdata/larval abundance.csv")
# The above is unsurprisingly not reproducible! Always use relative paths!
# Relative paths are a shortened version of the above, you only need to type what comes after the project directory
# REMINDER: The project directory is where the .Rproj file is stored
larvalAbundance <- read.csv("rawdata/larval abundance.csv", stringsAsFactors = FALSE)
# writing it this way (relative path) means that your project is reproducible since you can move this whole directory to another location on your computer OR ANY OTHER COMPUTER!!!
```
This data comes from one of Remi's PhD thesis chapters[^1] that was part of the first CHONe. I modified it from the original version archived on [Dryad](http://datadryad.org/handle/10255/dryad.59482) so we would have cleaning to do!
[^1]: Daigle RM, Metaxas A, deYoung B (2014) Bay-scale patterns in the distribution, aggregation and spatial variability of larvae of benthic invertebrates. Marine Ecology Progress Series 503:139-156. http://dx.doi.org/10.3354/meps10734
## Cleaning data in R
Common errors in data are:
- trailing or leading white spaces in character strings
- data entered in different formats (e.g. numbers and character strings in one column)
- data entered with wrong units
The first step I always take is to make sure that the data loaded in as expected. Head over to the Environment window and click on `larvalAbundance`. You can do (temporary) sorting and filtering of the data in the data viewer. I also use the `str()` function to make sure all the variables (columns) are of the correct type.
```{r}
str(larvalAbundance)
```
If we wanted to make a correction manually, we can use indexing:
```{r}
# say for example, I knew that the second observation was in fact taken at 12 m depth
larvalAbundance$depth[2]
# the value in the data frame is indeed 3, which is WRONG! Let's correct it
larvalAbundance$depth[2] <- 12
head(larvalAbundance)
# The other issue you may notice is that my month values are mostly numbers, but there's at least one "August" in there
# We could correct just the one we see in row 4
larvalAbundance$month[4] <- 8
# Or we could get rid of all the "August"'s in one pass
larvalAbundance$month[larvalAbundance$month=="August"] <- 8
# but that column is still a character string, let's convert it to numeric
larvalAbundance$month <- as.numeric(larvalAbundance$month)
str(larvalAbundance)
```
If you were wondering how many years I had sampled, you could do:
```{r}
unique(larvalAbundance$year)
# or for time
unique(larvalAbundance$time)
# as you can see, there are a few trailing white spaces; to do text substitutions, let's use gsub()
# but first let's see how this works; it matches a pattern in x and replaces it.
gsub(pattern = "ABC",replacement = "XYZ",x = "TUVWABC")
gsub(pattern = "doesn't work",replacement = "works",x = "If this sentence no longer contains the pattern, then gsub doesn't work")
# So now, let's use gsub to remove those spaces. The pattern we are matching is just a space and the replacement is nothing, so that will remove the white spaces
larvalAbundance$time <- gsub(pattern = " ",replacement = "",x = larvalAbundance$time)
```
Feel the power! The `gsub()` function is very powerful, and the pattern matching works based on 'regular expressions' (a nearly universal pattern matching language/protocol). For example, if you had spaces you wanted to keep and only wanted to remove white spaces at the end, you could use `pattern = " +$"`, since the dollar sign in regex means 'ends with' and the plus means 'one or more', so gsub would match one or more spaces at the end of a character string. You can practice your 'regex' with [regexpal](http://www.regexpal.com/) and this [cheatsheet](http://www.rexegg.com/regex-quickstart.html).
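To illustrate, here is a minimal sketch of that trailing-whitespace pattern on a made-up string; only the spaces at the end are removed, and the spaces between words are kept.
```{r}
# " +$" matches one or more spaces at the end of the string
gsub(pattern = " +$", replacement = "", x = "keep the spaces between words, drop the ones at the end   ")
```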
Other useful things you should check are the minimum and maximum values for each column, to make sure things were all entered in the same order of magnitude, or even a quick histogram:
```{r}
min(larvalAbundance$Margarites.spp.)
max(larvalAbundance$Margarites.spp.)
hist(larvalAbundance$Margarites.spp.)
```
Lastly, there are some column names that are not 'up to code', so to avoid a stern talking to from CHONe's data manager, let's fix that now! (Also, you may have noticed that latitude and longitude were swapped!)
```{r}
names(larvalAbundance)
names(larvalAbundance)[names(larvalAbundance)=="long"] <- "decimalLatitude"
names(larvalAbundance)[names(larvalAbundance)=="lat"] <- "decimalLongitude"
names(larvalAbundance)[names(larvalAbundance)=="site"] <- "locationID"
```
Everything now seems reasonable to me, but it is your responsibility to check that each column of your data 'makes sense'. For the sake of time, let's move on!
Now we have a choice: we can either run a data cleaning script every time we load the raw data, or we can save a 'clean' data product. Let's do the latter.
```{r}
# let's create a new folder for intermediate data products
dir.create("data")
# Then let's save the cleaned data in that folder
write.csv(larvalAbundance, file = "data/larvalAbundanceClean.csv", row.names = FALSE)
```
>**Pro-tip**
>
> Notice that:
> - we are not writing over the raw data
> - we are not writing in the same folder as the raw data
> - we are naming our new data file informatively
## Making your own functions and packages
Part of R's awesomeness is that it already comes with a lot of functions that are very useful for everyday science. Additionally, there are many `packages`, which are essentially collections of new functions and help files generated by users like you and me that add to the already broad functionality of R.
*Warning: shameless self promotion below!*
Here is a package I created called [BESTMPA](https://github.com/remi-daigle/BESTMPA) and the peer-reviewed paper describing it: [An adaptable toolkit to assess commercial fishery costs and benefits related to marine protected area network design](https://f1000research.com/articles/4-1234/v2)
Making your own function follows the format below; let's make one called `custommean()`:
```{r}
custommean <- function(x){
m <- sum(x)/length(x)
return(m)
}
# so we defined custommean as a function with one argument, 'x', and it does what is inside the curly brackets
# it will return m, which is the mean of x
x <- c(1,2,3,6)
# does it work?
custommean(x = x)
```
If you make a few of those and write some help files to go along with them, you can make your own package. If you're interested, see the "[R packages](http://r-pkgs.had.co.nz/)" book by Hadley Wickham (Chief Scientist at RStudio, and not the last time I will mention him), but making packages is beyond the scope of what I can cover in this workshop.
Anyway, there is a central organization called the 'Comprehensive R Archive Network' or CRAN which houses all the official packages, but there are also other packages (like mine), as well as the development versions of many of the official packages, on GitHub that are worth taking a look at.
## Install packages for tomorrow
To install packages, you can either use the Packages window at the bottom right, or you can do it with written commands (my preference). Here are a few we will use tomorrow; try installing them now and let us know if you get any errors.
```{r, eval=FALSE}
install.packages('tidyverse') # The tidyverse is a collection of R packages that share common philosophies and are designed to work together. (e.g. ggplot2, dplyr, tidyr)
install.packages('marmap') # Import xyz data from the NOAA (National Oceanic and Atmospheric Administration, <http://www.noaa.gov>), GEBCO (General Bathymetric Chart of the Oceans, <http://www.gebco.net>) and other sources, plot xyz data to prepare publication-ready figures
install.packages('raster') # Reading, writing, manipulating, analyzing and modeling of gridded spatial data (and also getting access to GADM basemaps)
install.packages('devtools') # Allows you to install packages from github
# the 'robis' package on CRAN does not work with the latest version of R (yet), so we need to get the latest version from github
devtools::install_github("iobis/robis")
# you already have the CRAN version of ggplot2, but that version is not compatible with the sf package yet
# So, we need the github version of ggplot2 as well!
devtools::install_github("tidyverse/ggplot2")
install.packages('gridExtra') # to be able to arrange multiple ggplot plots
install.packages('taxize') # To extract and validate species taxonomy
install.packages('rfishbase') # To access resources available on Fishbase and SeaLifeBase
install.packages('rglobi') # To access interactions data
devtools::install_github("ropensci/rnoaa") # To access environmental data from the NOAA databases
install.packages('knitr')
install.packages('biomod2') # to perform species distribution models
install.packages('igraph') # to produce network plots
install.packages('networkD3') # to produce html (a.k.a. interactive) network plots!
devtools::install_github("guiblanchet/HMSC") # package to perform hierarchical modeling of species communities
install.packages('coda') # package to summarize and plot outputs from Markov Chains
install.packages('corrplot') # visualization of correlation matrices
install.packages('circlize') # circular visualization of data
install.packages('ModelMetrics') # collection of metrics coded for efficiency in C++ using Rcpp
install.packages("pdftools") # to extract pdf content
install.packages('stringr') # Simple, Consistent Wrappers for Common String Operations
install.packages('tidytext') # text analysis package
install.packages('viridis') # Port of the new 'matplotlib' color maps
install.packages('tibble') # Simple Data Frames
devtools::install_github("dgrtwo/widyr") # Widen, process, and re-tidy a dataset
install.packages('ggraph') # An Implementation of Grammar of Graphics for Graphs and Networks
install.packages('wordcloud2') # wordle generator
install.packages('leaflet') # Create and customize interactive maps
install.packages('mapview') # Interactive Viewing of Spatial Objects in R
install.packages('scales') # Graphical scales map data to aesthetics
# to gain access to the functions in a package, you can use library(), eg:
library(tidyverse)
```
## Keeping R up to date
Every so often, R releases a new update. Many of you had install issues last night that were resolved by installing the newest version of R. To keep things running smoothly in the future, here is a great trick:
```{r, eval=FALSE}
# install the package called installr
install.packages("installr")
# load the library
library(installr)
# run the updater function (this is best done outside RStudio, in the R GUI)
updater()
```
This will prompt you to install the latest version of R and copy over all of your packages if you choose to do so (the alternative is installing them all by hand again), and it can also update your packages for you. It's really a great time saver!
[<<BACK](https://remi-daigle.github.io/2017-CHONe-Data/)