-
Notifications
You must be signed in to change notification settings - Fork 0
/
03_data_manipulation_with_dplyr.Rmd
735 lines (568 loc) · 21.4 KB
/
03_data_manipulation_with_dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
---
editor_options:
chunk_output_type: console
---
# Data manipulation with `dplyr`
```{r load_lib}
# load the tidyverse
library(tidyverse)
```
## Introduction
### Foreword on `dplyr`
`dplyr` is tasked with performing all sorts of transformations on a dataset.
The structure of `dplyr` revolves around a set of functions, the so-called
**verbs**, that share a common syntax and logic, and are meant to work with one
another in chained operations. Chained operations are performed with the pipe
operator (`%>%`), that will be introduced in section 3.2.2.
The basic syntax is `verb(data, variable)`, where `data` is a data frame and
`variable` is the name of one or more columns containing a set of values for
each observation.
There are 5 main verbs, which names already hint at what they do: `rename()`,
`select()`, `filter()`, `mutate()`, and `summarise()`.
I'm going to introduce each of them (and a couple more) through the following sections.
### Example data
Through this tutorial, we will be using mammal trait data from the [Phylacine](https://megapast2future.github.io/PHYLACINE_1.2/) database.
Let's have a peek at what it contains.
```{r read_data, message=FALSE}
phylacine <- read_csv("data/phylacine_traits.csv")
phylacine
```
`readr` automatically loads the data in a `tibble`, as we have seen in chapter
1 and 2. Calling the tibble gives a nice preview of what it contains. We have
data for 5,831 mammal species, and the variables contain information on taxonomy,
(broad) habitat, mass, IUCN status, and diet.
If you remember Section 1.2 on tidy data, you may see that this data isn't
exactly tidy. In fact, some columns are in wide (and messy) format, like the
"habitat" (terrestrial, marine, etc.) and diet columns.
`dplyr` actually does not require your data to be strictly tidy. If you feel that your
data satisfies the definition "one observation per row, one variable per column",
that's probably good enough.
I use a `tibble` here, but `dplyr` works equally well on base data frames. In
fact, `dplyr` is built for `data.frame` objects, and tibbles are data frames.
Therefore, tibbles are mortal.
## Working with existing variables
### Renaming variables with `rename()`
The variable names in the phylacine dataset are descriptive, but quite unpractical. Typing
`Binomial.1.2.` is cumbersome and subject to typos (in fact, I just made one).
`binomial` would be much simpler to use.
Changing names is straightforward with `rename()`.
```{r rename}
rename(.data = phylacine, "binomial" = Binomial.1.2)
```
The first argument is always `.data`, the data table you want to apply change to.
Note how columns are referred to. Once the data table as been passed as an
argument, there is no need to refer to it directly anymore, `dplyr` understands
that you're dealing with variables inside that data frame. So drop that
`data$var`, `data[, "var"]`, and forget the very existence of `attach()` /
`detach()`.
You can refer to variables names either with strings or directly as objects,
whether you're reading or creating them:
```{r rename2, eval=FALSE}
rename(
phylacine,
# this works
binomial = Binomial.1.2
)
rename(
phylacine,
# this works too!
binomial = "Binomial.1.2"
)
rename(
phylacine,
# guess what
"binomial" = "Binomial.1.2"
)
```
I have applied similar changes to all variables in the dataset. Here is what the
new names look like:
```{r rename3, echo=FALSE}
phylacine <- rename(
.data = phylacine,
"binomial" = Binomial.1.2,
"order" = Order.1.2,
"family" = Family.1.2,
"genus" = Genus.1.2,
"species" = Species.1.2,
"terrestrial" = Terrestrial,
"marine" = Marine,
"freshwater" = Freshwater,
"aerial" = Aerial,
"life_habit_method" = Life.Habit.Method,
"life_habit_source" = Life.Habit.Source,
"mass_g" = Mass.g,
"mass_method" = Mass.Method,
"mass_source" = Mass.Source,
"mass_comparison" = Mass.Comparison,
"mass_comparison_source" = Mass.Comparison.Source,
"island_endemicity" = Island.Endemicity,
"iucn_status" = IUCN.Status.1.2, # not even for acronyms
"added_iucn_status" = Added.IUCN.Status.1.2,
"diet_plant" = Diet.Plant,
"diet_vertebrate" = Diet.Vertebrate,
"diet_invertebrate" = Diet.Invertebrate,
"diet_method" = Diet.Method,
"diet_source" = Diet.Source
)
phylacine
```
### The pipe operator `%>%`
If you have already come across pieces of code using the tidyverse, chances are
that you have seen this odd symbol. While the pipe is not strictly-speaking a
part of the tidyverse (it comes from its own package, `magrittr`), it is
imported along with each package and widely used in conjunction with its
functions.
What does it do? Consider the following example with `rename()`:
```{r the_pipe}
phylacine2 <- readr::read_csv("data/phylacine_traits.csv")
# regular syntax
rename(phylacine2, "binomial" = "Binomial.1.2")
# alternative syntax with the pipe operator
phylacine2 %>% rename("binomial" = "Binomial.1.2")
```
Got it? The pipe takes the object on its left-side and silently feeds it to the
*first* argument of the function on its right-side. It could be read as "take x,
then do...".
The reason for using the pipe is because it makes code syntax closer to
the syntax of a sentence, and therefore, easier and faster for your brain to
process (and write!) the code. In particular, the pipe enables easy chains of
operations, where you apply something to an object, then apply something else to
the outcome, and so on...
Through the later sections, you will see some examples of chained operations
with `dplyr` functions, but
for that I first need to introduce a couple more verbs.
Using the pipe can be quite unsettling at first, because you are not used to
think in this way. But if you push a bit for it, I promise it will make things
a lot easier (and it's quite addictive!). To avoid typing the tedious symbols,
`magrittr` installs a shortcut for you in RStudio. Use `Ctrl + Shift + M` on
Windows, and `Cmd + Shift + M` on MacOS.
Finally I should emphasize that the use of the pipe isn't limited to the
tidyverse, but extends to almost all R functions. Remember that by default
the piped value is always matched to the first argument of the following
function
```{r pipe_examples}
5 %>% rep(3)
"meow" %>% cat()
```
If you need to pass the left-hand side to an argument other than the first,
you can use the dot place-holder `.`.
```{r dot}
"meow" %>% cat("cats", "go")
```
Because of its syntax, most base R operators are not compatible with the pipe
(but this is very rarely needed).
If needed, `magrittr` introduces alternative functions for operators.
Subsetting operators can be piped, with the dot place-holder.
```{r pipe_limits}
# 5 %>% * 3 # no, won't work
# 5 %>% .* 3 # neither
5 %>% magrittr::multiply_by(3) # yes
# subsetting
list("monkey see", "monkey_do") %>% .[[2]]
phylacine %>% .$binomial %>% head()
```
Because subsetting in this way is particularly hideous, `dplyr`
delivers a function to extract values from a single variable. In only works on tables, though.
```{r pull}
phylacine %>% pull(binomial) %>% head()
```
### Select variables with `select()`
To extract a set of variables (i.e. columns), use the conveniently-named
`select()`. The basic syntax is the same as `rename()`: pass your data as the
first argument, then call the variables to select, quoted or not.
```{r select}
# Single variable
phylacine %>% select(binomial)
# A set of variables
phylacine %>% select(genus, "species", mass_g)
# A range of contiguous variables
phylacine %>% select(family:terrestrial)
```
You can select by variable numbers. This is not recommended, as prone to
errors, especially if you change the variable order.
```{r select_nb}
phylacine %>% select(2)
```
`select()` can also be used to *exclude* variables:
```{r exclude}
phylacine %>% select(-binomial)
phylacine %>% select(-(binomial:species))
```
`select()` and `rename()` are pretty similar, and in fact, `select()` can also
rename variables along the way:
```{r select_rename}
phylacine %>% select("linnaeus" = binomial)
```
And you can mix all of that at once:
```{r select_mix}
phylacine %>% select(
"fam" = family,
genus:freshwater,
-terrestrial
)
```
### Select variables with helpers
The Rstudio team just released `dplyr 1.0.0`, and along with it, some nice
helper functions to ease the selection of a set of variables. I give three
examples here, and encourage you to look at the documentation (`?select()`) to
find out more.
```{r select_help}
phylacine %>% select(where(is.numeric))
phylacine %>% select(contains("mass") | contains("diet"))
habitats <- c("terrestrial", "marine", "arboreal", "fossorial", "freshwater")
phylacine %>% select(any_of(habitats))
```
### Rearranging variable order with `relocate()`
The order of variables seldom matters in `dplyr`, but due to popular demand,
`dplyr` now has a dedicated verb to rearrange the order of variables. The syntax
is identical to `rename()`, `select()`.
```{r relocate}
phylacine %>% relocate(mass_g, .before = binomial)
phylacine %>% relocate(starts_with("diet"), .after = species)
```
## Working with observations
### Ordering rows by value - `arrange()`
`arrange()` sorts rows in the data by **ascending** value for a given variable.
Use the wrapper `desc()` to sort by descending values instead.
```{r}
# Smallest mammals
phylacine %>%
arrange(mass_g) %>%
select(binomial, mass_g)
# Largest mammals
phylacine %>%
arrange(desc(mass_g)) %>%
select(binomial, mass_g)
# Extra variables are used to sort ties in the first variable
phylacine %>%
arrange(mass_g, desc(binomial)) %>%
select(binomial, mass_g)
```
*Important*: `NA` values, if present, are always ordered at the end!
### Subset rows by position - `slice()`
Use `slice()` and its variants to extract particular rows.
```{r slice}
phylacine %>% slice(3) # third row
phylacine %>% slice(5, 1, 2) # fifth, first and second row
phylacine %>% slice(rep(3, 2)) # duplicate the third row
phylacine %>% slice(-c(2:5830)) # exclude all but first and last row
phylacine %>% slice_tail(n = 3) # last three rows
phylacine %>% slice_max(mass_g) # largest mammal
```
You can also sample random rows in the data:
```{r slice_sample}
phylacine %>% slice_sample() # a random row
# bootstrap
phylacine %>% slice_sample(n = 5831, replace = TRUE)
```
### Subsetting rows by value with `filter()`
`filter()` does a similar job as `slice()`, but extract rows that satisfy a set
of conditions. The conditions are supplied much the same way as you would do
for an `if` statement.
Along with `mutate()` (next section), this is probably the function you are
going to use the most.
For example, I might want to extract mammals above a given mass:
```{r megafauna}
# megafauna
phylacine %>%
filter(mass_g > 1e5) %>% # 100 kg
select(binomial, mass_g)
# non-extinct megafauna
phylacine %>%
filter(mass_g > 1e5, iucn_status != "EP") %>%
select(binomial, mass_g, iucn_status)
```
Are there any flying mammals that aren't bats?
```{r no_bats}
phylacine %>%
filter(aerial == 1, order != "Chiroptera")
# no :(
```
Are humans included in the table?
```{r human_after_all}
phylacine %>% filter(binomial == "Homo_sapiens")
```
`filter()` can be used to deal with NAs:
```{r filter_na}
phylacine %>%
filter(!is.na(mass_comparison))
```
Tip: `dplyr` introduces the useful function `between()` that does exactly what the name implies
```{r between}
between(1:5, 2, 4)
# Mesofauna
phylacine %>%
filter(mass_g > 1e3, mass_g < 1e5) %>%
select(binomial, mass_g)
# same thing
phylacine %>%
filter(mass_g %>% between(1e3, 1e5)) %>%
select(binomial, mass_g)
```
Note that you can pipe operations inside function arguments as in the last line
above (arguments are expressions, after all!).
## Making new variables
### Create new variables with `mutate()`
Very often in data analysis, you will want to create new variables, or edit
existing ones. This is done easily through `mutate()`. For example, consider
the diet data:
```{r}
diet <- phylacine %>%
select(
binomial,
contains("diet") & !contains(c("method", "source"))
)
diet
```
These three variables show the percentage of each category of food that make the
diet of that species. They should sum to 100, unless the authors made a typo or
other entry error. To assert this, I'm going to create a new variable,
`total_diet`.
```{r total_diet}
diet <- diet %>% mutate(
"total_diet" = diet_vertebrate + diet_invertebrate + diet_plant
)
diet
all(diet$total_diet == 100)
# cool and good
```
`mutate()` adds a variable to the table, and keeps all other variables.
Sometimes you may want to just keep the new variable, and drop the other ones.
That's the job of `mutate()`'s twin sibling, `transmute()`. For example,
I want to combine `diet_invertebrate` and `diet_vertebrate` together:
```{r diet_animal}
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate
)
```
You may want to keep some variables and drop others. You could pipe `mutate()`
and `select()` to do so, or you could just pass the variables to keep to
`transmute()`.
```{r both_diets}
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate,
diet_plant
)
```
You can also refer to variables you're creating to derive new variables from
them as part of the same operation, this is not an issue.
```{r total_diet2}
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate,
diet_plant,
"total_diet" = diet_animal + diet_plant
)
```
Sometimes, you may need to perform an operation based on the row number
(I don't have a good example in mind). `tibble` has a built-in function to do
just that:
```{r row_index}
phylacine %>%
select(binomial) %>%
tibble::rownames_to_column(var = "row_nb")
```
### Summarise observations with `summarise()`
`mutate()` applies operations to all observations in a table. By contrast,
`summarise()` applies operations to *groups* of observations, and returns, er,
summaries. The default grouping unit is the entire table:
```{r summarise}
phylacine %>%
summarise(
"nb_species" = n(), # counts observations
"nb_terrestrial" = sum(terrestrial),
"nb_marine" = sum(marine),
"nb_freshwater" = sum(freshwater),
"nb_aerial" = sum(aerial),
"mean_mass_g" = mean(mass_g)
)
```
Above you can see that bats account for a large portion of mammal species
diversity (`nb_aerial`). How much exactly? Just as with `mutate()`, you can
perform operations on the variables you just created, in the same statement:
```{r prop_bats}
phylacine %>%
summarise(
"nb_species" = n(),
"nb_aerial" = sum(aerial), # bats
"prop_aerial" = nb_aerial / nb_species
)
```
One fifth!
If the british spelling bothers you, `summarize()` exists and is strictly
equivalent.
Here's a simple trick with logical (TRUE / FALSE) variables. Their sum is the
count of observations that evaluate to `TRUE` (because `TRUE` is taken as 1 and
`FALSE` as 0) and their mean is the proportion of `TRUE` observations. This can
be exploited to count the number of observations that satisfy a condition:
```{r sum_logi}
phylacine %>%
summarise(
"nb_species" = n(),
"nb_megafauna" = sum(mass_g > 100000),
"p_megafauna" = mean(mass_g > 100000)
)
```
There are more summaries that just means and counts
(see `?summarise()` for some helpful functions). In fact, summarise can
use any function or expression that evaluates to a single value or a *vector*
of values. This includes base R `max()`, `quantiles`, etc.
`mutate()` and `transmute()` can compute summaries as well, but they will
return the summary once for each observation, in a new column.
```{r mutate_summary}
phylacine %>%
mutate("nb_species" = n()) %>%
select(binomial, nb_species)
```
### Grouping observations by variables
In most cases you don't want to run summary operations on the entire set of observations, but instead on observations that share a common value, i.e. groups.
For example, I want to run the summary displayed above, but for each Order of mammals.
`distinct()` extracts all the unique values of a variable
```{r distinct_orders}
phylacine %>% distinct(order)
```
I could work my way with what we have already seen, filtering observations (`filter(order == "Rodentia")`) and then pipeing the output to `summarise()`, and do it again for each Order. But that would be tedious.
Instead, I can use `group_by()` to pool observations by `order`.
```{r group_by_order}
phylacine %>%
group_by(order)
```
At first glance, nothing has changed, apart from an extra line of information in the output that tells me the observations have been grouped. But now here's what happen if I run the same `summarise()` statement on an ungrouped and a grouped table
```{r summarise_order}
phylacine %>%
summarise(
"n_species" = n(),
"mean_mass_g" = mean(mass_g)
)
phylacine %>%
group_by(order) %>%
summarise(
"n_species" = n(),
"mean_mass_g" = mean(mass_g)
)
```
I get one value for each group.
Observations can be grouped by multiple variables, which will output a summary for every unique combination of groups.
```{r multiple_groups}
phylacine %>%
group_by(order, iucn_status) %>%
summarise(
"n_species" = n()
)
```
Whenever you call `summarise()`, the last level of grouping is dropped. Note how in the output table above, observations are still grouped by order, and no longer by IUCN status. If I summarise observations again:
```{r drop_grouping}
phylacine %>%
group_by(order, iucn_status) %>%
summarise(
"n_species" = n()
) %>%
summarise(
"n_species_2" = n()
)
```
I get the summary across orders, and the table is no longer grouped at all. This is useful to consider if you need to work on summaries across different levels of the data.
For example, I would like to know how the species in each order are distributed between the different levels of threat in the IUCN classification. To get these proportions, I need to first get the count of each number of species in a level of threat inside an order, and divide that by the number of species in that order.
```{r iucn_prop}
phylacine %>%
group_by(order, iucn_status) %>%
summarise("n_order_iucn" = n()) %>%
# grouping by iucn_status silently dropped
mutate(
"n_order" = sum(n_order_iucn),
"p_iucn" = n_order_iucn / n_order
)
```
10.2% of Carnivores are Endangered ("EN").
### Grouped data and other `dplyr` verbs
Grouping does not only affect the behaviour of `summarise`, but under circumstances, other verbs can (and will!) perform operations by groups.
```{r grouped_filter}
# Species with a higher mass than the mammal mean
phylacine %>%
select("binomial", "mass_g") %>%
filter(mass_g > mean(mass_g, na.rm = TRUE))
# Species with a higher mass than the mean in their order
phylacine %>%
group_by(order) %>%
select("binomial", "mass_g") %>%
filter(mass_g > mean(mass_g, na.rm = TRUE))
```
```{r grouped_slice}
# Largest mammal
phylacine %>%
select(binomial, mass_g) %>%
slice_max(mass_g)
# Largest species in each order
phylacine %>%
group_by(order) %>%
select(binomial, mass_g) %>%
slice_max(mass_g)
```
To avoid grouped operations, you can simply drop grouping with `ungroup()`.
## Working with multiple tables
### Binding tables
`dplyr` introduces `bind_rows()` and `bind_cols()`, which are equivalent to base
R `rbind()` and `cbind()`, with a few extra feature. They are faster, and can
bind many tables at once, and bind data frames with vectors or lists.
`bind_rows()` has an option to pass a variable specifying which dataset each
observation originates from.
```{r id_bind}
porpoises <- phylacine %>%
filter(family == "Phocoenidae") %>%
select(binomial, iucn_status)
echidnas <- phylacine %>%
filter(family == "Tachyglossidae") %>%
select(binomial, iucn_status)
bind_rows(
"porpoise" = porpoises,
"echidna" = echidnas,
.id = "kind"
)
```
### Combining variables of two tables with mutating joins
Mutating joins are tailored to combine tables that share a set of observations
but have different variables.
As an example, let's split the `phylacine` dataset in two smaller datasets,
one containing information on diet and one on the dominant habitat.
```{r split_phylacine}
diet <- phylacine %>%
select(binomial, diet_plant:diet_invertebrate) %>%
slice(1:5)
diet
life_habit <- phylacine %>% select(binomial, terrestrial:aerial) %>%
slice(1:3, 6:7)
life_habit
```
The two datasets each contain 5 species, the first three are shared, and the
two last differ between the two.
```{r intersect}
intersect(diet$binomial, life_habit$binomial)
setdiff(diet$binomial, life_habit$binomial)
```
To use mutate-joins, both tables need to have a **key**, a variable that
identifies each observation. Here, that would be `binomial`, the sepcies names.
If your table doesn't have such a key and the rows between the tables match
one another, remember you can create a row number variable easily with
`tibble::column_to_rownames()`.
```{r inner_join}
inner_join(diet, life_habit, by = "binomial")
```
`inner_join` combined the variables, and dropped the observations that weren't
matched between the two tables. There are three other variations of mutating
joins, differing in what they do with unmatching variables.
```{r outer_joins}
left_join(diet, life_habit, by = "binomial")
right_join(diet, life_habit, by = "binomial")
full_join(diet, life_habit, by = "binomial")
semi_join(diet, life_habit, by = "binomial")
anti_join(diet, life_habit, by = "binomial")
```
### Filtering matching observations between two tables wiht filtering joins
So-called filtering joins return row from the first table that are matched
(or not, for `anti_join()`) in the second.
```{r filter_joins}
semi_join(diet, life_habit, by = "binomial")
anti_join(diet, life_habit, by = "binomial")
```