# Descriptive statistics and data manipulation
Now that we are familiar with some R objects and know how to import data, it is time to write some
code. In this chapter, we are going to compute descriptive statistics for a single dataset, but
also for a list of datasets later in the chapter. However, I will not give a list of functions to
compute descriptive statistics; if you need a specific function, you can easily find it in the *Help*
pane in RStudio or using any modern internet search engine. What I will do is show you a workflow
that allows you to compute the descriptive statistics you need quickly.

R has a lot of built-in functions for descriptive statistics; however, if you want to compute
statistics for different sub-groups, some more complex manipulations are needed. At least this was
true in the past. Nowadays, thanks to the packages from the `{tidyverse}`, it is very easy and fast
to compute descriptive statistics by any stratifying variable(s). The package we are going to use
for this is called `{dplyr}`. `{dplyr}` contains a lot of functions that make manipulating data and
computing descriptive statistics very easy. To make things easier for now, we are going to use
example data included with `{dplyr}`, so there is no need to import an external dataset; this
changes nothing about the example we are going to study here, as the source of the data does not
matter.

Using `{dplyr}` is only possible if the data you are working with is already in a useful shape.
When data is messier, you will first need to manipulate it to bring it into a *tidy* format. For
this, we will use `{tidyr}`, which is a very useful package for reshaping data and doing advanced
cleaning of your data. All these tidyverse functions are also called *verbs*. However, before
getting to know these verbs, let's do an analysis using standard, or *base*, R functions. This will
be the benchmark against which we are going to measure a `{tidyverse}` workflow.
## A data exploration exercise using *base* R
Let's first load the `starwars` data set, included in the `{dplyr}` package:
```{r}
library(dplyr)
data(starwars)
```
Let's first take a look at the data:
```{r}
head(starwars)
```
This data contains information on Star Wars characters. The first question we want to answer is
simple: what is the average height of the characters?
```{r}
mean(starwars$height)
```
As discussed in Chapter 2, `$` allows you to access columns of a `data.frame` object.
Because there are `NA` values in the data, the result is also `NA`. To get the result, you need to
add an option to `mean()`:
```{r}
mean(starwars$height, na.rm = TRUE)
```
Let's also take a look at the standard deviation:
```{r}
sd(starwars$height, na.rm = TRUE)
```
It might be more informative to compute these two statistics by sex, so for this, we are going
to use `aggregate()`:
```{r}
aggregate(starwars$height,
          by = list(sex = starwars$sex),
          mean)
```
Oh, shoot! Most groups have missing values in them, so we get `NA` back. We need to use `na.rm = TRUE`
just like before. Thankfully, it is possible to pass this option to `mean()` inside `aggregate()` as well:
```{r}
aggregate(starwars$height,
          by = list(sex = starwars$sex),
          mean, na.rm = TRUE)
```
Later in the book, we are also going to see how to define our own functions (with the default options that
are useful to us), and this will also help in this sort of situation.
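As a quick preview, and purely as an illustrative sketch (the name `mean_narm` is mine, not a function defined by R or used later in the book), such a function could look like this:
```{r}
# A wrapper around mean() with na.rm = TRUE set by default,
# so we don't have to repeat the option every time
mean_narm <- function(x, ...) {
  mean(x, na.rm = TRUE, ...)
}

mean_narm(starwars$height)
```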
Even though we can use `na.rm = TRUE`, let's also use `subset()` to filter out the `NA` values beforehand:
```{r}
starwars_no_nas <- subset(starwars,
                          !is.na(height))

aggregate(starwars_no_nas$height,
          by = list(sex = starwars_no_nas$sex),
          mean)
```
(`aggregate()` also has a `subset = ` option, but I prefer to explicitly subset the data set with `subset()`.)
Even if you are not familiar with `aggregate()`, I believe the above lines are quite
self-explanatory. You need to provide `aggregate()` with three things: the variable you want to
summarize (or the whole data frame, if you want to summarize all variables), a list of grouping
variables, and the function that will be applied to each subgroup. And by the way, to test for
`NA`, one uses the function `is.na()`, not something like `species == "NA"` or anything like that.
`!is.na()` does the opposite (`!` reverses booleans, so `!TRUE` becomes `FALSE` and vice versa).
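A quick illustration, using a small hypothetical vector:
```{r}
test_vec <- c(1, NA, 3)

is.na(test_vec)   # TRUE only for the missing element
!is.na(test_vec)  # the reverse: TRUE for the non-missing elements
test_vec == "NA"  # not what you want: this compares against the string "NA"
```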
You can easily add another grouping variable:
```{r}
aggregate(starwars_no_nas$height,
          by = list(Sex = starwars_no_nas$sex,
                    `Hair color` = starwars_no_nas$hair_color),
          mean)
```
or use another function:
```{r}
aggregate(starwars_no_nas$height,
          by = list(Sex = starwars_no_nas$sex),
          sd)
```
(let's ignore the `NA`s). It is important to note that `aggregate()` returns a `data.frame` object.
You can only give one function to `aggregate()`, so if you need the mean and the standard deviation of `height`,
you must do it in two steps.
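To make that concrete, here is a minimal sketch of those two steps, reusing `starwars_no_nas` from above and combining the two results with `merge()`:
```{r}
mean_height_by_sex <- aggregate(starwars_no_nas$height,
                                by = list(sex = starwars_no_nas$sex),
                                mean)

sd_height_by_sex <- aggregate(starwars_no_nas$height,
                              by = list(sex = starwars_no_nas$sex),
                              sd)

# merge() combines the two data.frames by their common "sex" column;
# the two "x" columns get suffixes to keep them apart
merge(mean_height_by_sex, sd_height_by_sex, by = "sex")
```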
Since R 4.1, a new infix operator `|>` has been introduced, which is really handy for writing the kind of
code we've been looking at in this chapter. `|>` is also called a pipe, or the *base* pipe to distinguish
it from *another* pipe that we'll discuss in the next section. For now, let's learn about `|>`.
Consider the following:
```{r}
10 |> sqrt()
```
This computes `sqrt(10)`; so what `|>` does is pass the left-hand side (`10`, in the example above) to the
right-hand side (`sqrt()`). Using `|>` might seem more complicated and verbose than not using it, but you
will see in a bit why it can be useful. The next function I would like to introduce at this point is `with()`.
`with()` makes it possible to apply functions on `data.frame` columns without having to write `$` all the time.
For example, consider this:
```{r}
mean(starwars$height, na.rm = TRUE)
with(starwars,
     mean(height, na.rm = TRUE))
```
The advantage of using `with()` is that we can directly reference `height` without using `$`. Here again, this
is more verbose than simply using `$`... so why bother with it? It turns out that by combining `|>` and `with()`,
we can write very clean and concise code. Let's go back to a previous example to illustrate this idea:
```{r}
starwars_no_nas <- subset(starwars,
                          !is.na(height))

aggregate(starwars_no_nas$height,
          by = list(sex = starwars_no_nas$sex),
          mean)
```
First, we created a new dataset where we filtered out the rows where `height` is `NA`. This dataset is not
useful in itself, but we needed it for the next step, where we actually do what we want (computing the average
`height` by `sex`). Using `|>` and `with()`, we can write all of this in one go (this time also filtering out
missing `sex` values and additionally grouping by `species`):
```{r}
starwars |>
  subset(!is.na(sex)) |>
  with(aggregate(height,
                 by = list(Species = species,
                           Sex = sex),
                 mean))
```
So let's unpack this. In the first two rows, using `|>`, we pass the `starwars` `data.frame` to `subset()`:
```{r}
starwars |>
  subset(!is.na(sex))
```
as I explained before, this is exactly the same as `subset(starwars, !is.na(sex))`. Then, we pass the result of
`subset()` to the next function, `with()`. The first argument of `with()` must be a `data.frame`, and this is exactly
what `subset()` returns! So now the output of `subset()` is passed down to `with()`, which makes it now possible
to reference the columns of the `data.frame` in `aggregate()` directly. If you have a hard time understanding what
is going on, you can use `quote()` to inspect the code; `quote()` returns an expression without evaluating it:
```{r}
quote(log(10))
```
Why am I bringing this up? Well, since `a |> f()` is exactly equal to `f(a)`, quoting piped code returns
the equivalent expression, with the pipes resolved into nested calls. For instance:
```{r}
quote(10 |> log())
```
So let's quote the big block of code from above:
```{r}
quote(
  starwars |>
    subset(!is.na(sex)) |>
    with(aggregate(height,
                   by = list(Species = species,
                             Sex = sex),
                   mean))
)
```
I think now you see why using `|>` makes code much clearer; the nested expression you would need to write otherwise
is much less readable, unless you define intermediate objects. And without `with()`, this is what you
would need to write:
```{r, eval = F}
b <- subset(starwars, !is.na(height))
aggregate(b$height, by = list(Species = b$species, Sex = b$sex), mean)
```
To finish this section, let's say that you wanted to have the average `height` and `mass` by sex. In this case
you need to specify the columns in `aggregate()` with `cbind()` (let's use `na.rm = TRUE` again instead of
`subset()`ing the data beforehand):
```{r}
starwars |>
  with(aggregate(cbind(height, mass),
                 by = list(Sex = sex),
                 FUN = mean, na.rm = TRUE))
```
Let's now continue with some more advanced operations using this fake dataset:
```{r}
survey_data_base <- as.data.frame(
  tibble::tribble(
    ~id, ~var1, ~var2, ~var3,
      1,     1,   0.2,   0.3,
      2,   1.4,   1.9,   4.1,
      3,   0.1,   2.8,   8.9,
      4,   1.7,   1.9,   7.6
  )
)
```
```{r}
survey_data_base
```
Depending on what you want to do with this data, it may not be in the right shape. For example, it
would not be possible to simply compute the average of `var1`, `var2` and `var3` for each `id`,
because this would require running `mean()` row by row, and R is not really suited to row-based
workflows. Well, I'm lying a little bit here: it turns out that R comes with a `rowMeans()`
function. So this would work:
```{r}
survey_data_base |>
transform(mean_id = rowMeans(cbind(var1, var2, var3))) #transform adds a column to a data.frame
```
But there is no `rowSD()` or `rowMax()`, etc., so it is much better to reshape the data and put it in a
format that gives us maximum flexibility. To reshape the data, we'll be using the aptly-named `reshape()` function:
```{r}
survey_data_long <- reshape(survey_data_base,
                            varying = list(2:4),
                            v.names = "variable",
                            direction = "long")
```
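Before computing anything, it is worth taking a quick look at the reshaped data; each `id` now has one row per variable:
```{r}
# Ordering by id makes the long format easier to read
survey_data_long[order(survey_data_long$id), ]
```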
We can now easily compute the average of `variable` for each `id`:
```{r}
aggregate(survey_data_long$variable,
          by = list(Id = survey_data_long$id),
          mean)
```
or any other variable:
```{r}
aggregate(survey_data_long$variable,
          by = list(Id = survey_data_long$id),
          max)
```
As you can see, R comes with very powerful functions right out of the box, ready to use. When I was
studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do all of
this using loops (not the reshaping, thankfully), which was not so easy.
Now that we have seen how *base* R works, let's redo the analysis using `{tidyverse}` verbs.
The `{tidyverse}` provides many more functions, each of them doing only one single thing. You will
shortly see why this is quite important; by focusing on just one task, and by focusing on the data frame
as the central object, it becomes possible to build really complex workflows, piece by piece,
very easily.
But before deep diving into the `{tidyverse}`, let's take a moment to discuss another infix
operator, `%>%`.
## Smoking is bad for you, but pipes are your friend
The title of this section might sound weird at first, but by the end of it, you'll get this
(terrible) pun.
You probably know the following painting by René Magritte, *La trahison des images*:
```{r, echo=FALSE}
knitr::include_graphics("assets/pas_une_pipe.png")
```
It turns out there's an R package from the `tidyverse` that is called `magrittr`. What does this
package do? This package introduced *pipes* to R, way before `|>` in R 4.1. Pipes are a concept
from the Unix operating system; if you're using a GNU+Linux distribution or macOS, you're basically
using a *modern* unix (that's an oversimplification, but I'm an economist by training, and
outrageously oversimplifying things is what we do, deal with it). The *magrittr* pipe is written as
`%>%`. Just like `|>`, `%>%` takes the left hand side to feed it as the first argument of the
function in the right hand side. Try the following:
```{r, include = FALSE}
library(magrittr)
```
```{r, eval = FALSE}
library(magrittr)
```
```{r}
16 %>% sqrt
```
You can chain multiple functions, as you can with `|>`:
```{r}
16 %>%
sqrt %>%
log
```
But unlike with `|>`, you can omit the `()`. `%>%` also has other features; for example, you can
pipe things to other infix operators, such as `+`. You can use `+` as usual:
```{r}
2 + 12
```
Or as a prefix operator:
```{r}
`+`(2, 12)
```
You can use this notation with `%>%`:
```{r}
16 %>% sqrt %>% `+`(18)
```
This also works using `|>` since R version 4.2, but only if you use the `_` pipe placeholder:
```{r}
16 |> sqrt() |> `+`(x = _, 18)
```
`16` got fed to `sqrt()`, and the output of `sqrt(16)` (4) got fed to `+(18)`
(so we got `+(4, 18)` = 22). Without `%>%` you'd write the line just above like this:
```{r}
sqrt(16) + 18
```
Just like before, with `|>`, this might seem overly complicated, but using these pipes will
make our code much more readable. I'm sure you'll be convinced by the end of this chapter.
`%>%` is not the only pipe operator in `{magrittr}`. There's also `%T>%`, `%<>%` and `%$%`. All have their
uses, but they are basically shortcuts for common tasks combining `%>%` with another function. This
means that you can live without them, and because of this, I will not discuss them in detail.
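For the curious, here is a very quick, optional glimpse at two of them: `%$%` exposes the columns of a data frame to the right-hand side (much like `with()`), and `%T>%` pipes a value into a function for its side effect (printing, plotting, ...) and then passes the value itself along:
```{r}
library(magrittr)

# %$% exposes columns, so no $ is needed
starwars %$% mean(height, na.rm = TRUE)

# %T>% prints 16 as a side effect; 16 itself then continues down the pipe
16 %T>% print() %>% sqrt()
```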
## The `{tidyverse}`'s *enfant prodige*: `{dplyr}`
The best way to get started with the tidyverse packages is to get to know `{dplyr}`. `{dplyr}`
provides a lot of very useful functions that make it very easy to get descriptive statistics or
add new columns to your data.
### A first taste of data manipulation with `{dplyr}`
This section will walk you through a typical analysis using `{dplyr}` functions. Just go with it; I
will give more details in the next sections.
First, let's load `{dplyr}` and the included `starwars` dataset. Let's also take a look at the
first few lines of the dataset:
```{r}
library(dplyr)
data(starwars)
head(starwars)
```
`data(starwars)` loads the example dataset called `starwars` that is included in the package
`{dplyr}`. As I said earlier, this is just an example; you could have loaded an external dataset,
from a `.csv` file for instance. This does not matter for what comes next.
Like we saw earlier, R includes a lot of functions for descriptive statistics, such as `mean()`,
`sd()`, `cov()`, and many more. What `{dplyr}` brings to the table is a grammar of data
manipulation that makes it very easy to apply descriptive statistics functions, or any other
function, by subgroup.
Just like before, we are going to compute the average height by `sex`:
```{r}
starwars %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
```
The very nice thing about using `%>%` and `{dplyr}` verbs/functions, is that this is really
readable. The above three lines can be translated like so in English:
*Take the starwars dataset, then group by sex, then compute the mean height (for each subgroup) by
omitting missing values.*
`%>%` can be translated by "then". Without `%>%` you would need to change the code to:
```{r}
summarise(group_by(starwars, sex), mean(height, na.rm = TRUE))
```
Unlike the *base* approach, each function does only one thing. With base R, `aggregate()` was
also used to define the subgroups. This is not the case with `{dplyr}`: one function creates the
groups (`group_by()`) and another computes the summaries (`summarise()`). Also, `group_by()`
creates a specific subgroup for individuals where `sex` is missing. This is the last line in the
data frame, where `sex` is `NA`. Another nice thing is that you can name the column containing the
average height; I chose to name it `mean_height`.
Now, let's suppose that we want to filter some data first:
```{r}
starwars %>%
filter(gender == "masculine") %>%
group_by(sex) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
```
Again, the `%>%` makes the above lines of code very easy to read. Without it, one would need to
write:
```{r}
summarise(group_by(filter(starwars, gender == "masculine"), sex), mean(height, na.rm = TRUE))
```
I think you agree with me that this is not very readable. One way to make it more readable would
be to save intermediary variables:
```{r}
filtered_data <- filter(starwars, gender == "masculine")
grouped_data <- group_by(filtered_data, sex)
summarise(grouped_data, mean(height, na.rm = TRUE))
```
But this can get very tedious. Once you're used to `%>%`, you won't go back to not using it.
Before continuing, and to make things clearer: `filter()`, `group_by()` and `summarise()` are
functions that are included in `{dplyr}`. `%>%` is actually a function from `{magrittr}`, but this
package gets loaded on the fly when you load `{dplyr}`, so you do not need to worry about it.
The result of all these operations that use `{dplyr}` functions is another dataset, or
`tibble`. This means that you can save it in a variable, or write it to disk, and then work with
it like any other dataset.
```{r}
mean_height <- starwars %>%
group_by(sex) %>%
summarise(mean(height))
class(mean_height)
head(mean_height)
```
You could then write this data to disk using `rio::export()` for instance. If you need more than
the mean of the height, you can keep adding as many functions as needed (another advantage over
`aggregate()`):
```{r}
summary_table <- starwars %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n())
summary_table
```
I've added more functions, namely `var()`, to get the variance of height, and `n()`, which
is a function from `{dplyr}`, not base R, that returns the number of observations. This is quite useful,
because we see that there is a group with only one individual. Let's focus on the
sexes for which we have more than one individual. Since we saved all the previous operations (which
produce a `tibble`) in a variable, we can keep going from there:
```{r}
summary_table2 <- summary_table %>%
filter(n_obs > 1)
summary_table2
```
As mentioned before, there are a lot of `NA`s; this is because, by default, `mean()` and `var()`
return `NA` if even a single observation is `NA`. This is good, because it forces you to look at
the data to see what is going on. If you got a number back even though there were `NA`s, you could
very easily miss these missing values. It is better for functions to fail early and often than the
opposite. This is why we keep using `na.rm = TRUE` for `mean()` and `var()`.
Now let's actually take a look at the rows where `sex` is `NA`:
```{r}
starwars %>%
filter(is.na(sex))
```
There are only 4 rows where `sex` is `NA`. Let's ignore them:
```{r}
starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
```
And why not compute the same table, but first add another stratifying variable?
```{r}
starwars %>%
  filter(!is.na(sex)) %>%
  group_by(sex, eye_color) %>%
  summarise(ave_height = mean(height, na.rm = TRUE),
            var_height = var(height, na.rm = TRUE),
            n_obs = n()) %>%
  filter(n_obs > 1)
```
Ok, that's it for a first taste. We have already discovered some very useful `{dplyr}` functions:
`filter()`, `group_by()` and `summarise()`.
Now, we are going to learn more about these functions in more detail.
### Filter the rows of a dataset with `filter()`
We're going to use the `Gasoline` dataset from the `plm` package, so install that first:
```{r, eval = FALSE}
install.packages("plm")
```
Then load the required data:
```{r}
data(Gasoline, package = "plm")
```
and load dplyr:
```{r}
library(dplyr)
```
This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load
the data like this, it is a standard `data.frame`. `{dplyr}` functions can be used on standard
`data.frame` objects, but also on `tibble`s. `tibble`s are just like data frames, but with a better
print method (and other niceties). I'll discuss the `{tibble}` package later, but for now, let's
convert the data to a `tibble`, change its name, and also transform the `country` column to
lower case:
```{r}
gasoline <- as_tibble(Gasoline)
gasoline <- gasoline %>%
mutate(country = tolower(country))
```
`filter()` is pretty straightforward. What if you would like to subset the data to focus on the
year 1969? Simple:
```{r}
filter(gasoline, year == 1969)
```
Let's use `%>%`, since we're familiar with it now:
```{r}
gasoline %>%
filter(year == 1969)
```
You can also filter more than just one year, by using the `%in%` operator:
```{r}
gasoline %>%
filter(year %in% seq(1969, 1973))
```
It is also possible to use `between()`, a helper function:
```{r}
gasoline %>%
filter(between(year, 1969, 1973))
```
To select non-consecutive years:
```{r}
gasoline %>%
filter(year %in% c(1969, 1973, 1977))
```
`%in%` tests if an object is part of a set.
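A quick standalone illustration:
```{r}
# TRUE for each element on the left that appears in the set on the right
c(1969, 1977, 2023) %in% c(1969, 1973, 1977)
```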
### Select columns with `select()`
While `filter()` allows you to keep or discard rows of data, `select()` allows you to keep or
discard entire columns. To keep columns:
```{r}
gasoline %>%
select(country, year, lrpmg)
```
To discard them:
```{r}
gasoline %>%
select(-country, -year, -lrpmg)
```
To rename them:
```{r}
gasoline %>%
select(country, date = year, lrpmg)
```
There's also `rename()`:
```{r}
gasoline %>%
rename(date = year)
```
`rename()` does not do any kind of selection, but just renames.
You can also use `select()` to re-order columns:
```{r}
gasoline %>%
select(year, country, lrpmg, everything())
```
`everything()` is a helper function, and there's also `starts_with()` and `ends_with()`. For
example, what if we are only interested in columns whose names start with "l"?
```{r}
gasoline %>%
select(starts_with("l"))
```
`ends_with()` works in a similar fashion. There is also `contains()`:
```{r}
gasoline %>%
select(country, year, contains("car"))
```
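And here is a quick sketch with `ends_with()`, keeping the columns whose names end with "p":
```{r}
gasoline %>%
  select(ends_with("p"))
```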
You can read more about these helper functions [here](https://tidyselect.r-lib.org/reference/language.html), but we're going to look more into
them in a coming section.
Another verb, similar to `select()`, is `pull()`. Let's compare the two:
```{r}
gasoline %>%
select(lrpmg)
```
```{r}
gasoline %>%
pull(lrpmg) %>%
head() # using head() because there are 342 elements in total
```
`pull()`, unlike `select()`, does not return a `tibble`, but only the column you want, as a
vector.
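To make the difference explicit, compare the classes of the two results:
```{r}
# select() keeps the tibble structure
gasoline %>%
  select(lrpmg) %>%
  class()

# pull() extracts the column as a plain vector
gasoline %>%
  pull(lrpmg) %>%
  class()
```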
### Group the observations of your dataset with `group_by()`
`group_by()` is a very useful verb; as the name implies, it allows you to create groups and then,
for example, compute descriptive statistics by groups. For example, let's group our data by
country:
```{r}
gasoline %>%
group_by(country)
```
It looks like nothing much happened, but if you look at the second line of the output you can read
the following:
```{r}
## # Groups: country [18]
```
this means that the data is grouped, and every computation you will do now will take these groups
into account. It is also possible to group by more than one variable:
```{r}
gasoline %>%
group_by(country, year)
```
and so on. You can then also ungroup:
```{r}
gasoline %>%
group_by(country, year) %>%
ungroup()
```
Once your data is grouped, the operations that will follow will be executed inside each group.
### Get summary statistics with `summarise()`
Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For
example, one might want to compute the average gasoline consumption in each country, for
the whole period:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean(lgaspcar))
```
`mean()` was given as an argument to `summarise()`, which is a `{dplyr}` verb. What we get is
another `tibble`, that contains the variable we used to group, as well as the average per country.
We can also rename this column:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar))
```
and because the output is a `tibble`, we can continue to use `{dplyr}` verbs on it:
```{r}
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar)) %>%
filter(country == "france")
```
`summarise()` is a very useful verb. For example, we can compute several descriptive statistics at once:
```{r}
gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))
```
Because the output is a `tibble`, you can save it in a variable of course:
```{r}
desc_gasoline <- gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))
```
And then you can answer questions such as, *which country has the maximum average gasoline
consumption?*:
```{r}
desc_gasoline %>%
filter(max(mean_gaspcar) == mean_gaspcar)
```
Turns out it's Turkey. What about the minimum consumption?
```{r}
desc_gasoline %>%
filter(min(mean_gaspcar) == mean_gaspcar)
```
Because the output of `{dplyr}` verbs is a tibble, it is possible to continue working with it. This
is one shortcoming of using the base `summary()` function. The object returned by that function is
not very easy to manipulate.
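To see what is meant here, consider what base `summary()` returns for a single column; the result is a `summaryDefault` object, not a data frame or `tibble`, so you cannot simply `filter()` it or pipe it into further `{dplyr}` verbs:
```{r}
base_summary <- summary(gasoline$lgaspcar)

class(base_summary)
base_summary
```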
### Adding columns with `mutate()` and `transmute()`
`mutate()` adds a column to the `tibble`, which can contain any transformation of any other
variable:
```{r}
gasoline %>%
group_by(country) %>%
mutate(n())
```
Using `mutate()`, I've added a column that counts how many times each country appears in the `tibble`,
using `n()`, another `{dplyr}` function. There are also `count()` and `tally()`, which we are going to
see further down. It is also possible to name the column on the fly:
```{r}
gasoline %>%
group_by(country) %>%
mutate(count = n())
```
It is possible to do any arbitrary operation:
```{r}
gasoline %>%
group_by(country) %>%
mutate(spam = exp(lgaspcar + lincomep))
```
`transmute()` is the same as `mutate()`, but only returns the created variable:
```{r}
gasoline %>%
group_by(country) %>%
transmute(spam = exp(lgaspcar + lincomep))
```
### Joining `tibble`s with `full_join()`, `left_join()`, `right_join()` and all the others
I will end this section on `{dplyr}` with some very useful verbs: the `*_join()` verbs. Let's
start by loading another dataset from the `plm` package, `SumHes`, and let's convert it to a `tibble`
and rename it:
```{r, include = FALSE}
gasoline <- as_tibble(Gasoline) %>%
mutate(country = tolower(country))
```
```{r}
data(SumHes, package = "plm")
pwt <- SumHes %>%
as_tibble() %>%
mutate(country = tolower(country))
```
Let's take a quick look at the data:
```{r}
glimpse(pwt)
```
We can merge both `gasoline` and `pwt` by country and year, as these two variables are common to
both datasets. There are more countries and years in the `pwt` dataset, so when merging the two,
depending on which function you use, you will either have `NA`s for the variables where there is
no match, or rows that will be dropped. Let's start with `full_join()`:
```{r}
gas_pwt_full <- gasoline %>%
full_join(pwt, by = c("country", "year"))
```
Let's see which countries and years are included:
```{r}
gas_pwt_full %>%
count(country, year)
```
As you see, every country and year was included, but what happened for, say, the U.S.S.R? This country
is in `pwt` but not in `gasoline` at all:
```{r}
gas_pwt_full %>%
filter(country == "u.s.s.r.")
```
As you probably guessed, the variables from `gasoline` that are not included in `pwt` are filled with
`NA`s. One could remove all these lines and only keep countries for which these variables are not
`NA` everywhere with `filter()`, but there is a simpler solution:
```{r}
gas_pwt_inner <- gasoline %>%
inner_join(pwt, by = c("country", "year"))
```
Let's use `tabyl()` from the `{janitor}` package, which is a very nice alternative to the `table()`
function from base R:
```{r}
library(janitor)
gas_pwt_inner %>%
tabyl(country)
```
Only countries with values in both datasets were returned. It's almost every country from `gasoline`,
apart from Germany (called "germany west" in `pwt` and "germany" in `gasoline`; I left it as is to
provide an example of a country whose name does not match across datasets). Let's also look at the variables:
```{r}
glimpse(gas_pwt_inner)
```
The variables from both datasets are in the joined data.
Contrast this to `semi_join()`:
```{r}
gas_pwt_semi <- gasoline %>%
semi_join(pwt, by = c("country", "year"))
glimpse(gas_pwt_semi)
gas_pwt_semi %>%
tabyl(country)
```
Only columns of `gasoline` are returned, and only rows of `gasoline` that were matched with rows
from `pwt`. `semi_join()` is not a commutative operation:
```{r}
pwt_gas_semi <- pwt %>%
semi_join(gasoline, by = c("country", "year"))
glimpse(pwt_gas_semi)
pwt_gas_semi %>%
tabyl(country)
```
The rows are the same, but not the columns.
`left_join()` and `right_join()` return all the rows from either the dataset that is on the
"left" (the first argument of the function) or on the "right" (the second argument of the
function), but all columns from both datasets. So depending on which countries you're interested in,
you're going to use either one of these functions:
```{r}
gas_pwt_left <- gasoline %>%
left_join(pwt, by = c("country", "year"))
gas_pwt_left %>%
tabyl(country)
```
```{r}
gas_pwt_right <- gasoline %>%
right_join(pwt, by = c("country", "year"))
gas_pwt_right %>%
tabyl(country) %>%
head()
```
The last merge function is `anti_join()`:
```{r}
gas_pwt_anti <- gasoline %>%
anti_join(pwt, by = c("country", "year"))
glimpse(gas_pwt_anti)
gas_pwt_anti %>%
tabyl(country)
```
`gas_pwt_anti` has the columns of the `gasoline` dataset, and contains only the country from `gasoline`
that is not in `pwt`: "germany".
That was it for the basic `{dplyr}` verbs. Next, we're going to learn about `{tidyr}`.
## Reshaping and sprucing up data with `{tidyr}`
Note: this section is going to be a lot harder than anything you've seen until now. Reshaping
data is tricky, and to really grok it, you need time, and you need to run each line, and see what
happens. Take your time, and don't be discouraged.
Another important package from the `{tidyverse}` that goes hand in hand with `{dplyr}` is `{tidyr}`.
`{tidyr}` is the package you need when it's time to reshape data.
I will start by presenting `pivot_wider()` and `pivot_longer()`.
### `pivot_wider()` and `pivot_longer()`