# Some Useful Bits of R
## You Gotta Have Style {#style-guide}
Good programming style is incredibly important. It makes your code easier to read
and edit. That leads to fewer errors.
Here is a brief style guide you are expected to follow for all code in this course.
For more detail, see <http://adv-r.had.co.nz/Style.html>, on which this guide is based.
1. **No long lines.**
Lines of code should have at most 80 characters.
Lines of code should not wrap; choose the line breaks yourself.
Choose them in natural places.
In R Markdown, long code lines don't wrap; they flow off the page,
so the end isn't visible. (And it makes it obvious that you didn't
look at your own printout.)
```{r ch03-no-long-lines}
# Don't ever use really long lines. Not even in comments. They spill off the page and make people wonder what they are missing.
```
2. **Use your space bar.**
There is a reason it is the largest key on the keyboard. Use it often.
Spaces **after commas** (always). Spaces **around operators** (always).
Spaces after the comment symbol `#` (always). Use other spaces
judiciously to align similar code chunks to make things easier to
read or compare.
```{r ch03-spacy, eval = FALSE}
x<-c(1,2,4)+5 # BAD BAD BAD
x <- c(1, 2, 4) + 5 # Ah :^)
```
3. **But don't go crazy with the space bar.**
There are a few places you should not use spaces:
* after opening parentheses or before closing parentheses
* between function names and the parentheses that follow
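As a small illustration (the vector `x` is made up for this sketch), both calls below compute the same thing, but only the second follows the guide:

```{r ch03-sketch-not-too-spacy}
x <- c(1, 2, 4)
mean ( x )  # BAD: space before "(" and just inside the parentheses
mean(x)     # good: no space in either place
```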
4. **Indent** to show the structure of your code.
Use 2 spaces to indent (to keep things from drifting right too quickly).
Fortunately, this is really easy in RStudio. Highlight your code and hit
`<CTRL>`-i (PC) or `<command>`-i (Mac). If the indentation looks odd to you,
you have most likely messed up commas, quotes, parentheses, or
curly braces.
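Here is a minimal sketch of what 2-space indentation looks like. The function `describe_sign()` is a hypothetical example, not something used elsewhere in these notes. Each level of nesting moves two spaces to the right, so the structure is visible at a glance:

```{r ch03-sketch-indent}
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}
describe_sign(-3)
```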
5. **Choose names wisely and consistently**.
Naming things is hard, but take a moment to choose good names, and go back
and change them if you come up with a better name later. Here are some
helpful hints:
a. Very **short names** should only be used for a very **short time**
(a couple lines of code). Otherwise we tend to forget what they mean.
Avoid names like `x`, `f`, etc. unless the use is brief and mimics
some common mathematical formula.
b. **Break long names visually**. Common ways to do this are with
a dot (`.`), an underscore (`_`), or `camelCase`. Each of the three has its fans
among R coders, but don't mix and match for similar kinds of things;
that just makes it harder to remember what to do the next time.
```{r ch03-long-names}
good_name <- 10
good.name <- 10
goodName  <- 10 # note alignment via extra space in this line
really_terrible.ideaToDo <- -5
```
**The trend in R is toward using underscore (`_`)** and I recommend it.
Older code often used dot (`.`). CamelCase is the least common in R.
c. Recommendation: capitalize data frame names; use lower case for
variables inside data frames.
This is not a common convention in R, but it can really help to
keep things straight. I'll do this in the data sets I create, but
when we use other data sets, they may not follow this convention.
d. Avoid using names that are already in use by R.
This can be hard to avoid when you are starting out because you don't
yet know everything that is already defined. Here are a few things to avoid.
```{r ch03-avoid, eval = FALSE }
T # abbreviation for TRUE
F # abbreviation for FALSE
c # used to concatenate vectors and lists
df # density function for F distributions
dt # density function for t distributions
```
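To see why this matters, here is a small sketch of what can go wrong. `T` and `F` are ordinary variables rather than reserved words, so they can be overwritten; `TRUE` and `FALSE` cannot.

```{r ch03-sketch-masking}
T <- 0        # legal, but now T no longer means TRUE
isTRUE(T)     # FALSE
isTRUE(TRUE)  # TRUE: the full name is always safe
rm(T)         # remove our T; the built-in value is visible again
isTRUE(T)     # TRUE
```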
6. **Use comments (`#`)**, but use them for the right thing.
Comments can be used to clarify names, point out subtleties in code, etc.
They should not be used for your analysis or discussion of results.
Don't comment things that are obvious without comment. Comments should add
value.
```{r ch03-comments}
x <- 4 # set x to 4 <----------------------- no need for this comment
x <- 4 # b/c there are four grade levels in the study <------- useful
```
7. **Exceptions** should be exceptional.
No style guide works perfectly in all situations. Occasionally you may
need to violate the style guide. But these instances should be rare and
should have a good reason. They should not arise from your sloppiness
or laziness.
### An additional note about homework
When you do homework, I want to see your code and the results (and your
discussion of those results). Writing in R Markdown makes this all easy
to do.
But make sure that I can see everything needed to evaluate what you
are doing. You have access to your code and can investigate variables, etc.,
but I can only see what is in the document. This often
means displaying intermediate results. One common way to do this is
with a semicolon:
```{r ch03-show-intermediate}
x <- 57 * 23; x
```
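Another option, shown here as a sketch, is to wrap the assignment in parentheses, which makes R print the value that was assigned:

```{r ch03-sketch-show-parens}
(x <- 57 * 23)  # assigns and prints in one step
```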
## Vectors, Lists, and Data Frames
### Vectors
In R, a vector is a homogeneous ordered collection (indexing starts at 1).
By homogeneous, we mean that each element is the same kind of thing.
Short vectors can be created using `c()`:
```{r ch03-vectors}
x <- c(1, 3, 5)
x
```
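One consequence of homogeneity, sketched below with a made-up vector `mixed`: if you combine different kinds of things with `c()`, R converts them all to a common type rather than keeping them distinct.

```{r ch03-sketch-coercion}
mixed <- c(1, "two", TRUE)  # a number, a string, and a logical
mixed                       # everything has become a character string
class(mixed)
class(c(1, TRUE))           # logicals become numbers: TRUE is 1
```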
Evenly spaced sequences can be created using `seq()`:
```{r ch03-seq}
x <- seq(0, 100, by = 5); x
y <- seq(0, 1, length.out = 11); y
0:10 # shortcut for consecutive integers
```
Repeated values can be created with `rep()`:
```{r ch03-rep}
rep(5, 3)
rep(c(1, 2, 3), each = 2)
rep(c(1, 2, 3), times = 2)
rep(c(1, 2, 3), times = c(1, 2, 3))
rep(c(1, 2, 3), each = c(1, 2, 3)) # Ack! see warning message.
```
When a function acts on a vector, there are several things that could happen.
* One result can be computed from the entire vector.
```{r ch03-vectors-and-functions-01}
x <- c(1, 3, 6, 10)
length(x)
mean(x)
```
* The function may be applied to each element of the vector and a vector of
results returned. (Such functions are called **vectorized**.)
```{r ch03-vectors-and-functions-02}
log(x)
2 * x
x^2
```
* The first element of the vector may be used and the others ignored.
(Less common but dangerous -- be on the lookout. The `rep()` call with `each = c(1, 2, 3)` above is an example.)
Items in a vector can be accessed using `[]`:
```{r ch03-square-bracket}
x <- seq(10, 20, by = 2)
x[2]
x[10] # NA indicates a missing value
x[10] <- 4
x # missing values filled in to make room!
is.na(x)
```
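Integer indices can also be vectors, and negative indices remove items rather than select them. A brief sketch (the vector `z` is made up for illustration):

```{r ch03-sketch-more-indexing}
z <- seq(10, 20, by = 2)  # 10 12 14 16 18 20
z[c(1, 3)]                # first and third items
z[2:4]                    # items 2 through 4
z[-1]                     # everything except the first item
```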
In addition to using integer indices, there are two other ways to access
elements of a vector: names and logicals.
If the items in a vector are named, names are displayed when
a vector is displayed, and names can be used to access elements.
```{r ch03-named-vector}
x <- c(a = 5, b = 3, c = 12, 17, 1)
x
names(x)
x["b"]
```
Logicals (`TRUE` and `FALSE`) are very interesting in R. In indexing, they
tell us which items to keep and which to discard.
```{r ch03-logical-indexing}
x <- (1:10)^2
x < 50
x[x < 50]
x[c(TRUE, FALSE)] # T/F recycled to length 10
which(x < 50)
```
### Lists
Lists are a lot like vectors, with a few differences:
* Create a list with `list()`.
* The elements can be different kinds of things (including other lists).
* Use `[[ ]]` to access elements. You can also use `$` to access named elements.
* If you use `[ ]`, you will get a list back, not an element.
```{r ch03-lists}
# a messy list
L <- list(5, list(1, 2, 3), x = TRUE, y = list(5, a = 3, 7))
L
L[[1]]
L[[2]]
L[1]
L$y
L[["x"]]
L[1:3]
glimpse(L)
```
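The difference between `[ ]` and `[[ ]]` is easiest to see by checking what kind of object comes back. A small sketch using a made-up list `M`:

```{r ch03-sketch-list-brackets}
M <- list(a = 1:3, b = "hello")
class(M["a"])     # "list": single brackets return a shorter list
class(M[["a"]])   # "integer": double brackets return the element itself
M$b               # same as M[["b"]]
length(M["a"])    # 1: a list containing one element
length(M[["a"]])  # 3: the vector 1:3
```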
### Data frames for rectangular data
Rectangular data is organized in rows and columns (much like an Excel
spreadsheet). These rows and columns have a particular meaning:
* Each **row** represents one **observational unit**. Observational units go
by many other names depending on whether they are people, inanimate objects,
events, etc. Examples include case, subject, item, etc. Regardless, the
observational units are the things about which we collect information, and each
one gets its own row in rectangular data.
* Each **column** represents a **variable** -- one thing that is "measured" and
recorded (at least in principle, some measurements might be missing) for each
observational unit.
**Example.** In a study of nutritional habits of college students, our observational
units are the college students in the study. Each student gets their own row in the data
frame. The variables might include things like an ID number (or name), sex, height,
weight, whether the student lives on campus or off, what type of meal plan they have
at the dining hall, etc. Each of these is recorded in a separate column.
Data frames are the standard way to store **rectangular data** in R.
Usually the variables (the columns) are vectors, but this isn't required; sometimes you will see list variables in data frames.
Each variable must have the same length (to keep things rectangular).
Here, for example, are the first few rows of a data set called `KidsFeet`:
```{r ch03-KidsFeet-00}
library(mosaicData) # Load package to make KidsFeet data available
head(KidsFeet) # first few rows
```
#### Accessing via [ ]
We can access rows, columns, or individual elements of a data frame using `[ ]`.
This is the more usual way to do things.
```{r ch03-KidsFeet-01}
KidsFeet[, "length"]
KidsFeet[, 4]
KidsFeet[3, ]
KidsFeet[3, 4]
KidsFeet[3, 4, drop = FALSE] # keep it a data frame
KidsFeet[1:3, 2:3]
```
By default,
* Accessing a row returns a 1-row data frame.
* Accessing a column returns a vector (at least for vector columns).
* Accessing an element returns that element (technically a vector with
one element in it).
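These defaults can be checked directly. The sketch below uses a small made-up data frame, `Pets`, rather than `KidsFeet`, so the results are easy to verify by eye:

```{r ch03-sketch-df-access}
Pets <- data.frame(
  name   = c("Rex", "Tom", "Polly"),
  legs   = c(4, 4, 2),
  weight = c(30, 4.5, 0.4)
)
class(Pets[2, ])                     # a row is still a data frame
class(Pets[, "legs"])                # a column is a vector
Pets[3, "weight"]                    # a single element
class(Pets[, "legs", drop = FALSE])  # drop = FALSE keeps a data frame
```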
#### Accessing columns via $
We can also access individual variables using the `$` operator:
```{r ch03-KidsFeet-02}
KidsFeet$length
KidsFeet$length[2]
KidsFeet[2, "length"]
KidsFeet[2, "length", drop = FALSE] # keep it a data frame
```
As we will see, there are other tools that will help us avoid
needing to use `$` or `[ ]` to access columns in a data frame. This is
especially nice when we are working with several variables all coming
from the same data frame.
#### Accessing by number is dangerous
Generally speaking, it is safer to access things by name than by number when
that is an option.
It is easy to miscalculate the row or column number you need, and if rows
or columns are added to or deleted from a data frame, the numbering can change.
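A sketch of how this can go wrong, using a made-up data frame `Scores`: after one column is removed, the same column number silently refers to a different variable, while the name still works.

```{r ch03-sketch-by-number}
Scores <- data.frame(id = 1:3, quiz = c(8, 9, 7), exam = c(85, 92, 78))
Scores[, 2]        # the quiz scores
Scores$id <- NULL  # delete the id column
Scores[, 2]        # now the exam scores!
Scores[, "quiz"]   # the name still finds the quiz scores
```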
#### Implementation
Data frames are implemented in R as a special type (technically, class) of list.
The elements of the list are the *columns* in the data frame. Each column must
have the same length (so that our data frame has coherent *rows*). Most often the
columns are vectors, but this isn't required.
This explains why `$` works the way it does -- we are just accessing
one item in a list. It also means that we can use `[[ ]]` to access
a column:
```{r ch03-KidsFeet-03}
KidsFeet[["length"]]
KidsFeet[[4]]
KidsFeet[4]
```
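You can confirm the list structure yourself. A short sketch with a made-up data frame `D`:

```{r ch03-sketch-df-is-list}
D <- data.frame(a = 1:3, b = c("x", "y", "z"))
is.list(D)     # TRUE: a data frame is a list
class(D)       # "data.frame": the class is what makes it special
length(D)      # 2: the number of columns (list elements), not rows
names(D)       # the column names are the names of the list elements
class(D["a"])  # single brackets return a list, here a 1-column data frame
```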
### Other types of data
Some types of data do not work well in a rectangular arrangement of a data frame,
and there are many other ways to store data. In R, other types of data commonly
get stored in a list of some sort.
## Plotting with ggformula
R has several plotting systems. Base graphics is the oldest. `lattice` and
`ggplot2` are both built on a system called grid graphics. `ggformula`
is built on `ggplot2` to make it easier to use and to bring in some of
the advantages of `lattice`.
You can find out more about `ggformula` at <https://projectmosaic.github.io/ggformula/news/index.html>.
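As a first taste, here is a minimal sketch of the formula interface, assuming the `ggformula` and `mosaicData` packages are installed. Plots are described with a formula of the form `y ~ x`; the choice of variables here is just for illustration.

```{r ch03-sketch-ggformula}
library(ggformula)
library(mosaicData)  # for the KidsFeet data
# scatter plot of foot length (y) versus foot width (x)
p <- gf_point(length ~ width, data = KidsFeet)
p
# map color to a variable using a one-sided formula
gf_point(length ~ width, data = KidsFeet, color = ~ sex)
```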
## Creating data with expand.grid()
We will frequently have need of synthetic data that includes all combinations
of some variable values. `expand.grid()` does this for us:
```{r ch03-expand-grid}
expand.grid(
a = 1:3,
b = c("A", "B", "C", "D"))
```
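A sketch of a typical use: the result has one row for every combination (here $3 \cdot 4 = 12$ rows), and it is an ordinary data frame, so new columns can be computed from the grid. The names `Grid` and `label` are made up for this example.

```{r ch03-sketch-expand-grid}
Grid <- expand.grid(
  a = 1:3,
  b = c("A", "B", "C", "D"))
nrow(Grid)                            # 12 rows: 3 values of a times 4 of b
Grid$label <- paste0(Grid$b, Grid$a)  # combine the two columns
head(Grid, 4)                         # a varies fastest
```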
## Transforming and summarizing data with dplyr and tidyr
See the tutorial at
<http://rsconnect.calvin.edu/wrangling-jmm2019> or
<https://rpruim.shinyapps.io/wrangling-jmm2019>
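For a quick preview of the kind of thing the tutorial covers, here is a minimal sketch, assuming the `dplyr` package is installed. The data frame `Trees` and its variables are made up for illustration.

```{r ch03-sketch-dplyr}
library(dplyr)
Trees <- data.frame(
  species = c("oak", "oak", "pine", "pine"),
  height  = c(20, 24, 30, 26),
  girth   = c(2.0, 3.0, 2.5, 2.0)
)
Tall <-
  Trees %>%
  mutate(ratio = height / girth) %>%  # add a new variable
  filter(height > 22) %>%             # keep only some rows
  arrange(desc(ratio))                # sort, largest ratio first
Tall
Trees %>%
  group_by(species) %>%
  summarise(mean_height = mean(height))  # one row per species
```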
## Writing Functions
### Why write functions?
There are two main reasons for writing functions.
1. You may want to use a tool that requires a function as input.
To use `integrate()`, for example, you must provide the integrand as a function.
2. To make your own work easier.
Functions make it easier to reuse code or to break larger tasks into
smaller parts.
### Function parts
Functions consist of several parts. Most importantly:
1. An argument list.
A list of named inputs to the function. These may have default values (or not).
There is also a special argument `...` which gathers up any other arguments
provided by the user. Many R functions make use of `...`.
```{r ch03-args}
args(ilogit) # one argument, called x, no default value
```
2. The body.
This is the code that tells R what to do when the function is executed.
```{r ch03-body}
body(ilogit)
```
3. An environment where code is executed.
Each function has its own "scratch pad" where it can do work without
interfering with global computations. But environments in R are
nested, so it is possible to reach outside of this narrow environment
to access other things (and possible to change them). For the most
part we won't worry about this, but if you use a variable not defined
in your function but defined elsewhere, you may see unexpected results.
If you type the name of a function without parentheses, you will see all three
parts listed:
```{r ch03-ilogit-look}
ilogit
```
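Of the three parts, the environment is the easiest to overlook. Here is a sketch of the kind of surprise mentioned above (the names `rate`, `add_tax()`, and `add_tax2()` are hypothetical): the first function uses a variable it never defines, so its result depends on whatever `rate` happens to be elsewhere.

```{r ch03-sketch-environment}
rate <- 0.10
add_tax <- function(price) {
  price * (1 + rate)  # rate is not an argument; R looks outside the function
}
add_tax(100)          # uses rate = 0.10
rate <- 0.25
add_tax(100)          # same call, different answer
# safer: make rate an argument (with a default) so the dependence is visible
add_tax2 <- function(price, rate = 0.10) {
  price * (1 + rate)
}
add_tax2(100)
```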
### The function() function has its function
To write a function we use the `function()` function to specify the
arguments and the body. (R will assign an environment for us.)
The general outline is
```{r ch03-function-outline, eval = FALSE}
my_function_name <-
function(arg1 = default1, arg2 = default2, arg3, arg4, ...) {
# stuff for my function to do
}
```
We may include as many named arguments as we like, and some or all or none
of them may have default values. The value of the last expression evaluated
in the function is returned. If we like, we can also use the `return()` function
to make it clear what is being returned when.
Let's write a function that adds. (Redundant, but a useful illustration.)
```{r ch03-write-function}
foo <- function(x, y = 5) {
x + y # or return(x + y)
}
foo(3, 5)
foo(3)
foo(2, x = 3) # Note: this makes y = 2
foo(x = 1:3, y = 100)
foo(x = 1:3, y = c(100, 200, 300)) # vectorized!
foo(x = 1:3, y = c(100, 200)) # You have been warned!
```
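The general outline also includes `...`. As a sketch of how it is typically used (`center()` is a hypothetical function written for this example), any extra arguments are collected by `...` and passed along to another function, here `mean()`:

```{r ch03-sketch-dots}
center <- function(x, ...) {
  x - mean(x, ...)  # extra arguments go straight to mean()
}
center(c(1, 2, 6))
center(c(1, 2, NA, 6))                # NA in, all NA out
center(c(1, 2, NA, 6), na.rm = TRUE)  # na.rm is handed to mean() via ...
```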
Here is a more useful example.
Suppose we want to integrate $f(x) = x^2 (2-x)$ on the interval from 0 to 2.
Since this is such a simple function, if we are not going to reuse it,
we don't need to bother naming it; we can just create the function inside
our call to `integrate()`.
```{r ch03-integrate}
integrate(function(x) { x^2 * (2-x) }, 0, 2)
```
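As a check on the numerical result, the integral can be worked out by hand:
$\int_0^2 x^2 (2 - x) \, dx = \left[ \tfrac{2}{3} x^3 - \tfrac{1}{4} x^4 \right]_0^2 = \tfrac{16}{3} - 4 = \tfrac{4}{3}$.
The sketch below names the function (a short name like `f` is fine here because it mimics the mathematical notation) and extracts the numerical value for comparison:

```{r ch03-sketch-integrate-check}
f <- function(x) { x^2 * (2 - x) }
result <- integrate(f, 0, 2)
result$value                    # numerical approximation
all.equal(result$value, 4 / 3)  # agrees with the exact answer
```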
## Some common error messages
### object not found
If R claims some object is not found, the two most likely causes are
* a typo -- if you spell the name of the object slightly differently, R can't figure out
what you mean
```{r ch03-object-not-found, error = TRUE}
blah <- 17
bla
```
* forgetting to load a package -- if the object is in a package, that package must be
loaded
```{r ch03-package-load-avoids-error, error = TRUE}
detach("package:mosaic", unload = TRUE)
detach("package:ggformula", unload = TRUE) # without packages, no gf_line()
gf_line()
library(ggformula) # reload package; now things work
gf_line()
```
### Package inputenc Error: Unicode char not set up for use with LaTeX.
Sometimes if you copy and paste text from a web page or PDF document you will get
symbols knitr doesn't know how to handle. Smart quotes, ligatures, and other special
characters are the most likely cause.
`tools::showNonASCIIfile()` can help you locate non-ASCII characters in a file.
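A sketch of how this might be used. The file here is a temporary one created just for the example; in practice you would give the path to your own R Markdown file.

```{r ch03-sketch-nonascii}
tmp <- tempfile(fileext = ".Rmd")
# the second line contains curly ("smart") quotes, written as Unicode escapes
writeLines(c("plain text", "\u201csmart quotes\u201d"), tmp, useBytes = TRUE)
tools::showNonASCIIfile(tmp)  # reports the lines containing non-ASCII characters
```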
### Any message mentioning yaml
YAML originally stood for "Yet Another Markup Language" (officially it now
stands for "YAML Ain't Markup Language"). The first part of an R Markdown
file is called the YAML header. If you get a yaml error when you knit a
document, most likely you have messed up the YAML header somehow. If you don't
see the problem, you can start a new document and copy-and-paste the contents
(without the YAML header) into the new document (after its YAML header).
## Exercises {#ch03-exercises}
1. Create a function in R that converts Fahrenheit temperatures to
Celsius temperatures. [Hint: $C = (F-32) \cdot 5/9$.]
What you turn in should show
a. the code that defines your function.
b. some test cases that show that your function is working. (Show that
-40, 32, 98.6, and 212 convert to -40, 0, 37, and 100.)
Note: you should be able to test all these cases by calling the function
only once. Use `c(-40, 32, 98.6, 212)` as the input.
2. See if you can predict the output of each line below. Then run each line in R to see if
you are correct. If you are not correct, see if you can figure out why R does what
it does (and make a note so you are not surprised the next time).
```{r ch03-guess-and-check, eval = FALSE}
odds <- 1 + 2 * (0:4); odds
primes <- c(2, 3, 5, 7, 11, 13)
length(odds)
length(primes)
odds + 1
odds + primes
odds * primes
odds > 5
sum(odds > 5)
sum(primes < 5 | primes > 9)
odds[3]
odds[10]
odds[-3]
primes[odds]
primes[primes >= 7]
sum(primes[primes > 5])
sum(odds[odds > 5])
odds[10] <- 1 + 2 * 9
odds
y <- 1:10
y <- 1:10; y
(x <- 1:5)
```
3. This problem uses the `KidsFeet` data set from the `mosaicData` package.
The hints are suggested functions that might be of use.
a. How many kids are represented in the data set? [Hint: `nrow()` or `dim()`]
b. Which of the variables are factors? [Hint: `glimpse()`]
c. Add a new variable called `foot_ratio` that is equal to
`length` divided by `width`. [Hint: `mutate()`]
d. Add a new variable called `biggerfoot2` that has values
`"dom"` (if `domhand` and `biggerfoot` are the same) and
`"nondom"` (if `domhand` and `biggerfoot` are different).
[Hint: `mutate()`, `==`, `ifelse()`]
e. Create a new data set called `Boys` that contains only the boys.
[Hint: `filter()`, `==`]
f. What is the name of the boy with the largest `foot_ratio`?
Show how to find this programmatically; don't just scan through
the whole data set yourself. [Hint: `max()` or `arrange()`]
## Footnotes