# Introduction to R {#sec-startr}
**Learning Objectives**
- Define the following terms as they relate to R: object, assign,
call, function, arguments, options.
- Assign values to objects in R.
- Learn how to _name_ objects.
- Use comments to document scripts.
- Solve simple arithmetic operations in R.
- Call functions and use arguments to change their default options.
- Inspect the content of vectors and manipulate their content.
- Subset and extract values from vectors.
- Analyse vectors with missing data.
## Creating objects in R
You can get output from R simply by typing math in the console:
```{r, purl=FALSE}
3 + 5
12 / 7
```
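As a small complement to the two calculations above, the sketch below shows a few more arithmetic operators and how parentheses change the order of evaluation. The numbers are arbitrary and chosen only for illustration.

```{r, purl=FALSE}
2 * 3 + 4     # multiplication is done before addition: 10
2 * (3 + 4)   # parentheses change the order of evaluation: 14
2^3           # exponentiation: 8
7 %/% 2       # integer division: 3
7 %% 2        # remainder of the division (modulo): 1
```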
However, to do useful and interesting things, we need to assign _values_ to
_objects_. To create an object, we need to give it a name followed by the
assignment operator `<-`, and the value we want to give it:
```{r, purl=FALSE}
weight_kg <- 55
```
`<-` is the assignment operator. It assigns values on the right to
objects on the left. So, after executing `x <- 3`, the value of `x` is
`3`. The arrow can be read as 3 **goes into** `x`. For historical
reasons, you can also use `=` for assignments, but not in every
context. Because of the
[slight](http://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html)
[differences](http://r.789695.n4.nabble.com/Is-there-any-difference-between-and-tp878594p878598.html)
in syntax, it is good practice to always use `<-` for assignments.
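One of the contexts where `=` and `<-` behave differently is inside a function call. The sketch below illustrates this with `sqrt()`; the object names `first_value`, `second_value` and `third_value` are hypothetical and used only for this example.

```{r, purl=FALSE}
## `first_value`, `second_value` and `third_value` are hypothetical
## names used only for this illustration
first_value = 10       # `=` assigns at the top level of a script
second_value <- 10     # `<-` does the same and is the recommended form
first_value == second_value

## Inside a function call, `=` names an argument and creates no object,
## whereas `<-` still performs an assignment
sqrt(x = 16)             # 16 is passed to the argument `x` of sqrt()
sqrt(third_value <- 16)  # creates `third_value`, then takes its square root
third_value
```

Writing an assignment inside a function call, as in the second `sqrt()` line, is shown here only to highlight the difference; it is usually clearer to assign first and call the function afterwards.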
In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd>
at the same time as the <kbd>-</kbd> key) will write ` <- ` in a
single keystroke on a PC, while typing <kbd>Option</kbd> +
<kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the
<kbd>-</kbd> key) does the same on a Mac.
### Naming variables {-}
Objects can be given any name such as `x`, `current_temperature`, or
`subject_id`. You want your object names to be explicit and not too
long. They cannot start with a number (`2x` is not valid, but `x2`
is). R is case sensitive (e.g., `weight_kg` is different from
`Weight_kg`). There are some names that cannot be used because they
are reserved words in R (e.g., `if`, `else`,
`for`, see
[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
for a complete list). In general, even if it's allowed, it's best to
not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`,
`weights`). If in doubt, check the help to see if the name is already
in use. It's also best to avoid dots (`.`) within an object name as in
`my.dataset`. There are many functions in R with dots in their names
for historical reasons, but because dots have a special meaning in R
(for methods) and other programming languages, it's best to avoid
them. It is also recommended to use nouns for object names, and verbs
for function names. It's important to be consistent in the styling of
your code (where you put spaces, how you name objects, etc.). Using a
consistent coding style makes your code clearer to read for your
future self and your collaborators. In R, some popular style guides
are [Google's](https://google.github.io/styleguide/Rguide.xml), the
[tidyverse's](http://style.tidyverse.org/) style and the [Bioconductor
style
guide](https://bioconductor.org/developers/how-to/coding-style/). The
tidyverse's is very comprehensive and may seem overwhelming at
first. You can install the
[**`lintr`**](https://github.com/jimhester/lintr) package to
automatically check for issues in the styling of your code.
> **Objects vs. variables** What are known as `objects` in `R` are
> known as `variables` in many other programming languages. Depending
> on the context, `object` and `variable` can have drastically
> different meanings. However, in this lesson, the two words are used
> synonymously. For more information see:
> https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects
When assigning a value to an object, R does not print anything. You
can force R to print the value by using parentheses or by typing the
object name:
```{r, purl=FALSE}
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg`
weight_kg # and so does typing the name of the object
```
Now that R has `weight_kg` in memory, we can do arithmetic with it. For
instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
```{r, purl=FALSE}
2.2 * weight_kg
```
We can also change an object's value by assigning it a new one:
```{r, purl=FALSE}
weight_kg <- 57.5
2.2 * weight_kg
```
This means that assigning a value to one object does not change the values of
other objects. For example, let's store the animal's weight in pounds in a new
object, `weight_lb`:
```{r, purl=FALSE}
weight_lb <- 2.2 * weight_kg
```
and then change `weight_kg` to 100.
```{r}
weight_kg <- 100
```
`r msmbstyle::question_begin()`
What do you think is the current content of the object `weight_lb`?
126.5 or 220?
`r msmbstyle::question_end()`
## Comments
The comment character in R is `#`: anything to the right of a `#` in a
script will be ignored by R. Comments are useful for leaving notes and
explanations in your scripts.
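A minimal sketch of both ways to place a comment, on its own line or at the end of a line of code. The object names `length_in` and `length_cm` are made up for this example.

```{r, purl=FALSE}
## A comment on its own line: convert a length from inches to centimetres
length_in <- 10                 # a comment after some code: length in inches
length_cm <- length_in * 2.54   # 1 inch is 2.54 cm
length_cm
```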
RStudio makes it easy to comment or uncomment a paragraph: after
selecting the lines you want to comment, press
<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> at the same time. If
you only want to comment out one line, you can put the cursor anywhere
on that line (i.e. no need to select the whole line), then
press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>.
`r msmbstyle::question_begin()`
What are the values after each statement in the following?
```{r, purl=FALSE}
mass <- 47.5 # mass?
age <- 122 # age?
mass <- mass * 2.0 # mass?
age <- age - 20 # age?
mass_index <- mass/age # mass_index?
```
`r msmbstyle::question_end()`
## Functions and their arguments
Functions are "canned scripts" that automate more complicated sets of commands
including operations, assignments, etc. Many functions are predefined, or can be
made available by importing R *packages* (more on that later). A function
usually gets one or more inputs called *arguments*. Functions often (but not
always) return a *value*. A typical example would be the function `sqrt()`. The
input (the argument) must be a number, and the return value (in fact, the
output) is the square root of that number. Executing a function ('running it')
is called *calling* the function. An example of a function call is:
```{r, eval=FALSE, purl=FALSE}
b <- sqrt(a)
```
Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function
calculates the square root, and returns the value which is then assigned to
the object `b`. This function is very simple, because it takes just one argument.
The return 'value' of a function need not be numerical (like that of `sqrt()`),
and it also does not need to be a single item: it can be a set of things, or
even a dataset. We'll see that when we read data files into R.
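To make this concrete, here is a short sketch with three base R functions that were picked only as illustrations: one returns text, one returns two values, and one takes no argument at all.

```{r, purl=FALSE}
toupper("dna")    # returns a character value, "DNA"
range(5, 1, 3)    # returns two values, the minimum and the maximum: 1 and 5
Sys.Date()        # takes no argument and returns today's date
```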
Arguments can be anything, not only numbers or filenames, but also other
objects. Exactly what each argument means differs per function, and must be
looked up in the documentation (see below). Some functions take arguments which
may either be specified by the user, or, if left out, take on a *default* value:
these are called *options*. Options are typically used to alter the way the
function operates, such as whether it ignores 'bad values', or what symbol to
use in a plot. However, if you want something specific, you can specify a value
of your choice which will be used instead of the default.
Let's try a function that can take multiple arguments: `round()`.
```{r, results='show', purl=FALSE}
round(3.14159)
```
Here, we've called `round()` with just one argument, `3.14159`, and it has
returned the value `3`. That's because the default is to round to the nearest
whole number. If we want more digits we can see how to do that by getting
information about the `round` function. We can use `args(round)` or look at the
help for this function using `?round`.
```{r, results='show', purl=FALSE}
args(round)
```
```{r, eval=FALSE, purl=FALSE}
?round
```
We see that if we want a different number of digits, we can
type `digits=2` or however many we want.
```{r, results='show', purl=FALSE}
round(3.14159, digits = 2)
```
If you provide the arguments in the exact same order as they are defined you
don't have to name them:
```{r, results='show', purl=FALSE}
round(3.14159, 2)
```
And if you do name the arguments, you can switch their order:
```{r, results='show', purl=FALSE}
round(digits = 2, x = 3.14159)
```
It's good practice to put the non-optional arguments (like the number you're
rounding) first in your function call, and to specify the names of all optional
arguments. If you don't, someone reading your code might have to look up the
definition of a function with unfamiliar arguments to understand what you're
doing.
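The sketch below illustrates this advice with `signif()`, a relative of `round()` that keeps a given number of significant digits. It is used here only as an example of a function whose second argument is easy to misread when it is not named.

```{r, purl=FALSE}
## Harder to read: the reader must know what the second argument means
signif(123.456, 2)

## Easier to read: the optional argument is named
signif(123.456, digits = 2)
```

Both calls return the same value, `120`; only the readability differs.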
## Vectors and data types
A vector is the most common and basic data type in R, and is pretty much
the workhorse of R. A vector is composed of a series of values, which can be
either numbers or characters. We can assign a series of values to a vector using
the `c()` function. For example we can create a vector of animal weights and assign
it to a new object `weight_g`:
```{r, purl=FALSE}
weight_g <- c(50, 60, 65, 82)
weight_g
```
A vector can also contain characters:
```{r, purl=FALSE}
molecules <- c("dna", "rna", "protein")
molecules
```
The quotes around "dna", "rna", etc. are essential here. Without the
quotes R will assume there are objects called `dna`, `rna` and
`protein`. As these objects don't exist in R's memory, there will be
an error message.
There are many functions that allow you to inspect the content of a
vector. `length()` tells you how many elements are in a particular vector:
```{r, purl=FALSE}
length(weight_g)
length(molecules)
```
An important feature of a vector is that all of the elements are the
same type of data. The function `class()` indicates the class (the
type of element) of an object:
```{r, purl=FALSE}
class(weight_g)
class(molecules)
```
The function `str()` provides an overview of the structure of an
object and its elements. It is a useful function when working with
large and complex objects:
```{r, purl=FALSE}
str(weight_g)
str(molecules)
```
You can use the `c()` function to add other elements to your vector:
```{r}
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
```
In the first line, we take the original vector `weight_g`, add the
value `90` to the end of it, and save the result back into
`weight_g`. Then we add the value `30` to the beginning, again saving
the result back into `weight_g`.
We can do this over and over again to grow a vector, or assemble a
dataset. As we program, this may be useful to add results that we are
collecting or calculating.
A **vector** is the simplest R **data type** and is a linear sequence of
values of a single type. Above, we saw 2 of the 6 main **vector** types that R
uses: `"character"` and `"numeric"` (or `"double"`). These are the
basic building blocks that all R objects are built from. The other 4
**vector** types are:
- `"logical"` for `TRUE` and `FALSE` (the boolean data type)
- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R
that it's an integer)
- `"complex"` to represent complex numbers with real and imaginary
parts (e.g., `1 + 4i`) and that's all we're going to say about them
- `"raw"` for bitstreams that we won't discuss further
You can check the type of your vector using the `typeof()` function
and inputting your vector as the argument.
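As a quick illustration, the sketch below calls `typeof()` on small vectors of the types listed above, and compares it with `class()` for a vector of numbers. The values are arbitrary.

```{r, purl=FALSE}
typeof(c(1.5, 2))        # "double"
typeof(c(1L, 2L))        # "integer"
typeof(c("a", "b"))      # "character"
typeof(c(TRUE, FALSE))   # "logical"
typeof(1 + 4i)           # "complex"

## class() reports "numeric" where typeof() reports "double"
class(c(1.5, 2))
```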
Vectors are one of the many **data structures** that R uses. Other
important ones are lists (`list`), matrices (`matrix`), data frames
(`data.frame`), factors (`factor`) and arrays (`array`).
`r msmbstyle::question(text = "We've seen that vectors can be of type character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?")`
`r msmbstyle::solution(text = "R implicitly converts them to all be the same type")`
`r msmbstyle::question_begin()`
What will happen in each of these examples? (hint: use `class()` to
check the data type of your objects):
```{r, eval=TRUE}
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
class(num_char)
class(num_logical)
class(char_logical)
class(tricky)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question(text = "Why do you think it happens?")`
`r msmbstyle::solution(text = "Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a *common denominator* that doesn't lose any information.")`
`r msmbstyle::question_begin()`
How many values in `combined_logical` are `"TRUE"` (as a character) in
the following example:
```{r, eval=TRUE}
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
combined_logical <- c(num_logical, char_logical)
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Only one. There is no memory of past data types, and the coercion
happens the first time the vector is evaluated. Therefore, the `TRUE`
in `num_logical` gets converted into a `1` before it gets converted
into `"1"` in `combined_logical`.
```{r}
combined_logical
```
`r msmbstyle::solution_end()`
`r msmbstyle::question(text = " In R, we call converting objects from one class into another class _coercion_. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?")`
`r msmbstyle::solution(text = "logical → numeric → character ← logical")`
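The same conversions can also be requested explicitly with the `as.*()` family of functions. The short sketch below is an addition to the exercise, not part of its solution, and shows one step of the hierarchy per line.

```{r, purl=FALSE}
as.numeric(TRUE)      # logical to numeric: 1
as.character(TRUE)    # logical to character: "TRUE"
as.character(1)       # numeric to character: "1"
as.numeric("4")       # a character that looks like a number can be converted back: 4
```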
```{r, echo=FALSE, eval=FALSE, purl=TRUE}
## We’ve seen that vectors can be of type character, numeric, integer,
## and logical. But what happens if we try to mix these types in a
## single vector?
## What will happen in each of these examples? (hint: use `class()` to
## check the data type of your object)
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
## Why do you think it happens?
## You've probably noticed that objects of different types get
## converted into a single, shared type within a vector. In R, we call
## converting objects from one class into another class
## _coercion_. These conversions happen according to a hierarchy,
## whereby some types get preferentially coerced into other types. Can
## you draw a diagram that represents the hierarchy of how these data
## types are coerced?
```
## Subsetting vectors
If we want to extract one or several values from a vector, we must
provide one or several indices in square brackets. For instance:
```{r, results='show', purl=FALSE}
molecules <- c("dna", "rna", "peptide", "protein")
molecules[2]
molecules[c(3, 2)]
```
We can also repeat the indices to create an object with more elements
than the original one:
```{r, results='show', purl=FALSE}
more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules
```
Note: R indices start at 1. Programming languages like Fortran, MATLAB,
Julia, and R start counting at 1, because that's what human beings
typically do. Languages in the C family (including C++, Java, Perl,
and Python) count from 0 because that's simpler for computers to do.
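A short sketch of what counting from 1 means in practice, using a made-up vector `bases_demo`. Note that asking for index 0 is not an error in R, it simply returns an empty vector, and asking for an index beyond the end returns `NA`.

```{r, purl=FALSE}
## `bases_demo` is a hypothetical vector used only for this illustration
bases_demo <- c("A", "C", "G", "T")
bases_demo[1]                    # the first element: "A"
bases_demo[length(bases_demo)]   # the last element: "T"
bases_demo[0]                    # an empty character vector
bases_demo[10]                   # beyond the end of the vector: NA
```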
Finally, it is also possible to get all the elements of a vector
except some specified elements using negative indices:
```{r}
molecules ## all molecules
molecules[-1] ## all but the first one
molecules[-c(1, 3)] ## all but 1st/3rd ones
molecules[c(-1, -3)] ## all but 1st/3rd ones
```
`r msmbstyle::question_begin()`
Here is another example of a character vector called `fruits`:
```{r}
fruits <- c("apple", "orange", "grape")
```
* add the elements *melon* and *pineapple* to this vector
* sort them in alphabetic order
+ manually by using their index position,
+ and by using `sort()` (see `?sort`).
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
# add the elements *melon* and *pineapple*
fruits <- c(fruits, "melon", "pineapple")
# sorting based on the index position
fruits[c(1, 3, 4, 2, 5)]
# sorting based on sort()
sort(fruits)
```
`r msmbstyle::solution_end()`
## Conditional subsetting
Another common way of subsetting is by using a logical vector. `TRUE` will
select the element with the same index, while `FALSE` will not:
```{r, purl = FALSE}
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
```
Typically, these logical vectors are not typed by hand, but are the
output of other functions or logical tests. For instance, if you
wanted to select only the values above 50:
```{r, purl = FALSE}
## will return logicals with TRUE for the indices that meet
## the condition
weight_g > 50
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
```
You can combine multiple tests using `&` (both conditions are true,
AND) or `|` (at least one of the conditions is true, OR):
```{r, results='show', purl=FALSE}
weight_g[weight_g < 30 | weight_g > 50]
weight_g[weight_g >= 30 & weight_g == 21]
```
Here, `<` stands for "less than", `>` for "greater than", `>=` for
"greater than or equal to", and `==` for "equal to". The double equal
sign `==` is a test for equality between the left and right
hand sides, and should not be confused with the single `=` sign, which
performs variable assignment (similar to `<-`).
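The sketch below applies some of these operators, plus `!=` ("not equal to") and `<=` ("less than or equal to"), to a hypothetical vector `sizes_demo`, and shows that `==` also works on character strings.

```{r, purl=FALSE}
## `sizes_demo` is a hypothetical vector used only for this illustration
sizes_demo <- c(21, 34, 39, 54, 55)
sizes_demo == 34     # TRUE only for the second element
sizes_demo != 34     # TRUE for every element except the second
sizes_demo[sizes_demo >= 34 & sizes_demo <= 54]   # 34 39 54
"rna" == "rna"       # `==` also compares character strings: TRUE
```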
A common task is to search for certain strings in a vector. One could
use the "or" operator `|` to test for equality to multiple values, but
this can quickly become tedious. The function `%in%` allows you to
test if any of the elements of a search vector are found:
```{r, purl = FALSE}
molecules <- c("dna", "rna", "protein", "peptide")
molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna
molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")
molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]
```
`r msmbstyle::question_begin()`
Based on the `height` vector below, select heights that are above 190 or below or equal to 170
```{r}
height <- c(163, 189, 210, 177, 168, 192, 170)
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
height[height > 190 | height <= 170]
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Based on the `fruits` vector below:
* subset the vector to only have melon and apple
* test that orange is included in this vector and mango is not
```{r}
fruits <- c("apple", "orange", "grape", "melon", "pineapple",
"banana", "grape", "orange", "melon")
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
# subset the vector to only have melon and apple
fruits[fruits == "melon" | fruits == "apple"]
# test that orange is included in this vector and mango is not
"orange" %in% fruits
"mango" %in% fruits
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Can you figure out why `"four" > "five"` returns `TRUE`?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
"four" > "five"
```
When using `>` or `<` on strings, R compares their alphabetical order.
Here `"four"` comes after `"five"`, and therefore is *greater than*
it.
`r msmbstyle::solution_end()`
## Names
It is possible to name each element of a vector. The code chunk below
shows an initial vector without any names, how names are set, and
how they are retrieved.
```{r}
x <- c(1, 5, 3, 5, 10)
names(x) ## no names
names(x) <- c("A", "B", "C", "D", "E")
names(x) ## now we have names
```
When a vector has names, it is possible to access elements by their
name, in addition to their index.
```{r}
x[c(1, 3)]
x[c("A", "C")]
```
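Names can also be set directly when a vector is created with `c()`. The sketch below uses a hypothetical vector `counts_demo`; `unname()` drops the name when only the value is needed.

```{r, purl=FALSE}
## `counts_demo` is a hypothetical vector used only for this illustration
counts_demo <- c(control = 12, treated = 30)
names(counts_demo)
counts_demo["treated"]           # the element is returned with its name
unname(counts_demo["treated"])   # only the value: 30
```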
## Missing data
As R was designed to analyze datasets, it includes the concept of
missing data (which is uncommon in other programming
languages). Missing data are represented in vectors as `NA`.
When doing operations on numbers, most functions will return `NA` if
the data you are working with include missing values. This feature
makes it harder to overlook the cases where you are dealing with
missing data. You can add the argument `na.rm = TRUE` to calculate
the result while ignoring the missing values.
```{r}
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
```
If your data include missing values, you may want to become familiar
with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See
below for examples.
```{r}
## Extract those elements which are not missing values.
heights[!is.na(heights)]
## Returns the object with incomplete cases removed. The returned
## object is a vector of type `"numeric"` (or `"double"`).
na.omit(heights)
## Extract those elements which are complete cases. The returned
## object is a vector of type `"numeric"` (or `"double"`).
heights[complete.cases(heights)]
```
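It can also be useful to know how many values are missing, and to remember that `NA` propagates through calculations. A minimal sketch, using a hypothetical vector `lengths_demo`:

```{r, purl=FALSE}
## `lengths_demo` is a hypothetical vector used only for this illustration
lengths_demo <- c(2, NA, 4, NA, 6)
is.na(lengths_demo)        # TRUE where a value is missing
sum(is.na(lengths_demo))   # TRUE counts as 1, so this is the number of NAs: 2
NA + 1                     # an operation on a missing value is itself missing
```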
`r msmbstyle::question_begin()`
1. Using this vector of heights in inches, create a new vector with the NAs removed.
```{r}
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63,
63, NA, 72, 65, 64, 70, 63, 65)
```
2. Use the function `median()` to calculate the median of the `heights` vector.
3. Use R to figure out how many people in the set are taller than 67 inches.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
heights_no_na <- heights[!is.na(heights)]
## or
heights_no_na <- na.omit(heights)
```
```{r}
median(heights, na.rm = TRUE)
```
```{r}
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
```
`r msmbstyle::solution_end()`
## Generating vectors {#sec-genvec}
```{r, echo = FALSE}
set.seed(1)
```
### Constructors {-}
There exist functions to generate vectors of different types. To
generate a vector of numerics, one can use the `numeric()`
constructor, providing the length of the output vector as its
argument. The values will be initialised with 0.
```{r}
numeric(3)
numeric(10)
```
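One possible use of such a constructor, sketched below under the assumption that the number of results is known in advance, is to create an empty vector first and fill it in later. The name `results_demo` is hypothetical.

```{r, purl=FALSE}
## `results_demo` is a hypothetical vector used only for this illustration
results_demo <- numeric(3)   # three values, all initialised with 0
results_demo[2] <- 7.5       # replace the second value
results_demo                 # 0.0 7.5 0.0
```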
Note that if we ask for a vector of numerics of length 0, we obtain
exactly that:
```{r}
numeric(0)
```
There are similar constructors for characters and logicals, named
`character()` and `logical()` respectively.
`r msmbstyle::question_begin()`
What are the defaults for character and logical vectors?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
character(2) ## the empty character
logical(2) ## FALSE
```
`r msmbstyle::solution_end()`
### Replicate elements {-}
The `rep` function repeats a value a certain number of
times. If we want to initialise a vector of numerics of length 5 with
the value -1, for example, we could do the following:
```{r}
rep(-1, 5)
```
Similarly, we can generate a vector populated with missing values, which
is often a good way to start without making assumptions about the data
to be collected:
```{r}
rep(NA, 5)
```
`rep` can take vectors of any length as input (above, we used vectors
of length 1) and any type. For example, if we want to repeat the
values 1, 2 and 3 five times, we would do the following:
```{r}
rep(c(1, 2, 3), 5)
```
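`rep` works the same way on character vectors, and it has further arguments documented in `?rep`. The sketch below shows two of them, `times` given as a vector and `length.out`; the labels are made up for the example.

```{r, purl=FALSE}
rep(c("ctrl", "treated"), 2)                 # the whole vector, twice
rep(c("ctrl", "treated"), times = c(2, 3))   # a different count for each element
rep(c(1, 2, 3), length.out = 5)              # repeat until the output has length 5
```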
`r msmbstyle::question_begin()`
What if we wanted to repeat the values 1, 2 and 3 five times, but
obtain five 1s, five 2s and five 3s in that order? There are two
possibilities - see `?rep` or `?sort` for help.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
rep(c(1, 2, 3), each = 5)
sort(rep(c(1, 2, 3), 5))
```
`r msmbstyle::solution_end()`
### Sequence generation {-}
Another very useful function is `seq`, to generate a sequence of
numbers. For example, to generate a sequence of integers from 1 to 20
by steps of 2, one would use:
```{r}
seq(from = 1, to = 20, by = 2)
```
The default value of `by` is 1 and, given that generating a
sequence from one value to another with steps of 1 is so frequent,
there's a shortcut:
```{r}
seq(1, 5, 1)
seq(1, 5) ## default by
1:5
```
To generate a sequence of numbers from 1 to 20 with a final length of 3,
one would use:
```{r}
seq(from = 1, to = 20, length.out = 3)
```
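Steps do not have to be whole or positive numbers. The sketch below, with arbitrary values, shows a fractional step, a negative step, and the equivalence between `by` and `length.out` for the same range.

```{r, purl=FALSE}
seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
seq(10, 2, by = -4)         # a decreasing sequence: 10 6 2
seq(0, 1, length.out = 5)   # same values as the first call
```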
### Random samples and permutations {-}
A last group of useful functions are those that generate random
data. The first one, `sample`, generates a random permutation of
another vector. For example, to draw a random order for the oral exam
of 10 students, I first assign each student a number from 1 to 10 (for
instance based on the alphabetic order of their name) and then:
```{r}
sample(1:10)
```
Without further arguments, `sample` will return a permutation of all
elements of the vector. If I want a random sample of a certain size, I
would set this value as second argument. Below, I sample 5 random
letters from the alphabet contained in the pre-defined `letters` vector:
```{r}
sample(letters, 5)
```
If I wanted an output larger than the input vector, or wanted to be able to
draw some elements multiple times, I would need to set the `replace`
argument to `TRUE`:
```{r}
sample(1:5, 10, replace = TRUE)
```
`r msmbstyle::question_begin()`
When trying the functions above out, you will have realised that the
samples are indeed random and that one doesn't get the same
permutation twice. To be able to reproduce these random draws, one can
set the random number generation seed manually with `set.seed()`
before drawing the random sample.
- Test this feature with your neighbour. First draw two random
permutations of `1:10` independently and observe that you get
different results.
- Now set the seed with, for example, `set.seed(123)` and repeat the
random draw. Observe that you now get the same random draws.
- Repeat by setting a different seed.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Different permutations
```{r}
sample(1:10)
sample(1:10)
```
Same permutations with seed 123
```{r}
set.seed(123)
sample(1:10)
set.seed(123)
sample(1:10)
```
A different seed
```{r}
set.seed(1)
sample(1:10)
set.seed(1)
sample(1:10)
```
`r msmbstyle::solution_end()`
### Drawing samples from a normal distribution {-}
The last function we are going to see is `rnorm`, which draws a random
sample from a normal distribution. Two normal distributions of means 0
and 100 and standard deviations 1 and 5, denoted *N(0, 1)* and
*N(100, 5)*, are shown below.
```{r echo=FALSE, fig.width = 12, fig.height = 6, fig.cap = "Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."}
par(mfrow = c(1, 2))
plot(density(rnorm(1000)), main = "", sub = "N(0, 1)")
plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)")
```
The three arguments, `n`, `mean` and `sd`, define the size of the
sample, and the parameters of the normal distribution, i.e. the mean
and the standard deviation. The defaults of the latter two are 0 and 1.
```{r}
rnorm(5)
rnorm(5, 2, 2)
rnorm(5, 100, 5)
```
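A simple way to check that the arguments do what we expect is to draw a large sample and compute its mean and standard deviation with `mean()` and `sd()`. The sketch below is only an illustration: the seed value `42`, the sample size and the name `draws_demo` are arbitrary choices.

```{r, purl=FALSE}
## `draws_demo` is a hypothetical vector; the seed is arbitrary and only
## makes the example reproducible
set.seed(42)
draws_demo <- rnorm(10000, mean = 100, sd = 5)
mean(draws_demo)   # close to 100
sd(draws_demo)     # close to 5
```

The larger the sample, the closer these two estimates are expected to be to the values passed to `rnorm`.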
Now that we have learned how to write scripts, and the basics of R's
data structures, we are ready to start working with larger data, and
learn about data frames.
## Additional exercises
`r msmbstyle::question_begin()`
- Create two vectors `x` and `y` containing the numbers 1 to 10 and 10
to 1 respectively. You can use the `seq` or `:` functions rather
than constructing them by hand.
- Check their type. Depending on how they were created, they can be
integers or doubles.
- Take the sum (see the `sum()` function) of each vector and verify
they are identical.
- Sum vectors element-wise, and verify that all results are identical.
- Swap the values of `x` and `y`.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
- Create a vector named `x` containing the numbers 20 to 2. Retrieve
elements that are strictly larger than 5 and smaller than or equal to 15.
- Remove the first 8 elements from `x` and store the result in `x2`.
`r msmbstyle::question_end()`
```{r, echo=FALSE, include=FALSE}
x <- 20:2
x
x[x > 5 & x <= 15]
x2 <- x[-(1:8)]
```
`r msmbstyle::question_begin()`
You're doing a colony counting experiment, counting every day from
Monday to Friday how many molds you see in your cell cultures.
- Create a vector named `molds` containing the results of your counts:
1, 2, 5, 8 and 10.
- Set the names of `molds` using week days and extract the number of
molds identified on Wednesday.
`r msmbstyle::question_end()`
```{r, echo=FALSE, include=FALSE}
molds <- c(1, 2, 5, 8, 10)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(molds) <- days
molds["Wednesday"]
```
`r msmbstyle::question_begin()`
- Calculate the mean of a random distribution *N(15, 1)* of size 100
and store it in variable `m1`.
- Calculate the mean of a random distribution *N(0, 1)* of size 100
and store it in variable `m2`.
- Calculate the mean of another random distribution *N(15, 1)* of size
1000 and store it in variable `m3`.
- Can you guess which one of `m1` and `m2` will be larger? Verify in R.
- Can you guess which one of `m1` and `m3` will be larger? Verify in R.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
- Using the `sample` function, simulate a set of 100 students voting
(randomly) for 1, 2 or 3 breaks during the WSBIM1207 course.
- Display the values as a table of votes.
- Compute the number of students that wanted more than 1 break.
- Bonus: as above, but setting the probability for votes to 1/5, 2/5
and 2/5 respectively. Read `?sample` to find out how to do that.
`r msmbstyle::question_end()`
```{r, echo=FALSE, include=FALSE}
m1 <- mean(rnorm(100, 15, 1))
m2 <- mean(rnorm(100, 0, 1))
m3 <- mean(rnorm(1000, 15, 1))
## From the nature of the distributions, I expect m1 > m2
m1 > m2
## I cannot predict which one of m1 and m3 will be larger, only that
## they will be very close to each other, varying around 15
m1
m3
```
`r msmbstyle::question_begin()`
Given vectors `v1`, `v2` and `v3` below
```{r}
v1 <- c(1, 2, 3, "4")
v2 <- c(45, 23, TRUE, 21, 12, 34)
v3 <- c(v1, v2)
```
- What is the class of `v3`?
- What is the length of `v3`?
- Assign names `"a"`, `"b"`, ... to `v3`.
- What is the value of `v3["e"]`?
- Re-using `v1`, create a vector `v4` containing
```{r, echo = FALSE, comment = NA}
(v4 <- c(v1[2:1], "NEW", v1[3:4]))
```
- What is the command to round 3.1234 to two decimal digits?
- If you execute `round(3.1234)`, you get `3`. Why?
The WSBIM1207 students were asked how many breaks they wanted during