```{r include = FALSE}
if (!knitr::is_html_output()) {
  options("width" = 56)
  knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 56, indent = 2), tidy = TRUE)
  knitr::opts_chunk$set(fig.pos = 'H')
}
```
# Summary
Overall, I hope that this book has helped you grasp the basics of the R programming language. If you're a student, then I hope this was pitched at the correct level to challenge you. If you're a teacher, then I hope this book helped round out your knowledge of R such that you would feel comfortable imparting your knowledge to those who want to hear it.
If you feel that there are improvements to be made to the content or how it was taught, then please do let me know via a [GitHub issue](https://github.com/ARawles/teacheR/issues) on the teacheR repository.
## Answers
### For Students
#### Operators {#operators-answers}
1. Why do `1 != 2` and `!(1 == 2)` both equal true?
+ The `!` operator negates the condition that follows it. So our first statement reads as `1 not equal to 2`, whereas our second reads as `not (1 equal to 2)`. Both of these evaluate to `TRUE`.
2. Reading the [R documentation](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html) on logical operators, what is the difference between `|` and `||` (and `&` and `&&`)?
+ `|` and `&` perform the comparison element by element for vectors. So `c(1,2) == 1 & c(2,3) == 2` returns `TRUE` first (`1 == 1 & 2 == 2`) and then `FALSE` (`2 == 1 & 3 == 2`), because it evaluates the criteria for the first and second values in the vectors separately. `||` and `&&` are intended for single values: they evaluate from left to right and stop as soon as the result is known. In older versions of R, they only looked at the first value in each vector, so `c(1,2) == 1 && c(2,3) == 2` returned only `TRUE` and nothing else (i.e. `1 == 1 & 2 == 2`). Since R 4.3.0, giving `||` or `&&` a vector with more than one value is an error.
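As a quick sketch of the difference (the values are purely illustrative, and only single values are given to `&&` so that the example behaves the same across R versions):

```{r}
# `&` compares element by element and returns one result per position
elementwise <- c(1, 2) == 1 & c(2, 3) == 2
elementwise

# `&&` works on single values and stops as soon as the answer is known,
# so the call to stop() on the right-hand side is never evaluated
short_circuit <- FALSE && stop("this is never reached")
short_circuit
```

The second call finishing without an error shows the left-to-right, stop-early behaviour that makes `&&` a good fit for `if` conditions.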
#### Variable Assignment {#variable-assignments-answers}
1. Is `.2nd` a valid name? Why/why not?
+ It begins with a `.` which is valid, but the first character after the `.` is a digit which isn't allowed.
2. Why are names like `if`, `function`, and `TRUE` not allowed?
+ These are reserved words in R. They are reserved because they are used for specific functions, and so creating variables with these names could very easily confuse things!
3. Why might it be a bad idea to assign a value to a name like `mean` or `sum`?
+ `mean()` and `sum()` are functions in base R. Therefore, giving variables the same names can make reading a script very difficult, and might lead you to inadvertently use the incorrect value.
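One way to check these naming rules yourself is sketched below. `is_syntactic()` is a hypothetical helper written for this example (it isn't part of base R). It relies on `make.names()`, which rewrites any name that isn't valid and leaves valid names alone.

```{r}
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_syntactic <- function(x) make.names(x) == x

candidates <- c(".2nd", "if", "valid_name", ".second")
checked <- is_syntactic(candidates)
names(checked) <- candidates
checked
```

Names that come back as `FALSE` can only be used if you wrap them in backticks, which is usually a sign that a different name would be a better choice.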
#### Data Types {#data-types-answers}
1. Why are `2` and `2L` different?
+ `2` is stored as a double, whereas because we've appended `L` to `2L`, we're telling R to store it as an integer. Both are numeric, though.
2. What is an ordered factor and how is it different to a character string?
+ An ordered factor stores a set of groups in a set order. A character string does not necessarily represent any type of grouping and has no inherent ordering.
3. Why does `as.Date("19/01/2019", format = "%d/%m/%y")` return the date 19th Jan 2020 and not 19th Jan 2019?
+ The `%y` part of the `format` parameter tells R that the year value is 2 digits long. R therefore parses the date as `19/01/20` and ignores everything after it, resulting in 19th Jan 2020.
4. Why does `as.numeric(TRUE)` return 1? What will `as.logical(2)` return?
+ `TRUE` and `FALSE` have historically been stored as `1` and `0` respectively, so `as.logical(1)` will also return `TRUE`. R treats any non-zero numeric value as `TRUE` (including negative values), so `as.logical(2)` will return `TRUE`.
5. Look at the `identical()` function. How is that different from `==`?
+ The `identical()` function does not use implicit conversion whereas `==` does. You can test this by comparing `identical(1, 1L)` and `1 == 1L`.
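The answers above can be checked directly in the console. This is a minimal sketch. The variable names `two_digit_year` and `four_digit_year` are just labels chosen for this example.

```{r}
# Doubles vs integers
typeof(2)
typeof(2L)

# `%y` reads a two-digit year, `%Y` reads a four-digit year
two_digit_year <- as.Date("19/01/2019", format = "%d/%m/%y")
four_digit_year <- as.Date("19/01/2019", format = "%d/%m/%Y")
two_digit_year
four_digit_year

# Any non-zero number converts to TRUE, including negative numbers
as.logical(c(-1, 0, 2))

# `==` converts types before comparing, identical() does not
1 == 1L
identical(1, 1L)
```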
#### Data Structures {#data-structures-answers}
1. If I want to store a set of integers, what data structure should I use and why?
+ A vector would be most appropriate. All values are of the same type, and so a list would be unnecessary (but still valid).
2. Reading in Excel and .csv files into R will convert them into data.frames. Why do you think this is?
+ Excel and .csv files have a tabular structure (they are similar to tables), and often have columns of different types. As data frames fulfill both of these requirements, data frames are the default data structures for this type of data.
3. What does `is(matrix())` return? What does this tell us about the underlying difference between matrices and dataframes?
+ A matrix isn't stored as a list of columns but as a single vector with dimensions (and that's why `is(matrix())` returns `vector` but not `list`). As a vector stores atomic values that must all be of the same type, matrices must have values of the same type, whilst dataframes (which store their columns in a list) can have a different type in each column.
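To see the list/vector distinction in practice, here is a small sketch using made-up columns (`id` and `label` are illustrative names only):

```{r}
# A data frame is a list of columns, so each column keeps its own type
df <- data.frame(id = 1:2, label = c("a", "b"))
is.list(df)
sapply(df, class)

# A matrix is a single vector with dimensions, so mixing types forces coercion
m <- cbind(id = 1:2, label = c("a", "b"))
is.list(m)
typeof(m)
```

In the matrix, the integer ids are silently converted to character strings so that every value shares one type.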
#### Subsetting {#subsetting-answers}
1. Why can dataframes and lists be subsetted in a similar way?
+ Dataframes are essentially just lists of equal-length vectors (one per column), so they can be subsetted like lists because they are lists!
2. What happens if you miss the last character off when subsetting a dataframe column with `$` (e.g. `df$co` instead of `df$col`)? Does the same thing happen when subsetting using `[[]]`?
+ `$` uses partial matching, so `df$co` will return `col` as long as `co` matches the start of only one column name. `[[]]` requires an exact match by default and returns `NULL` instead. `$` is therefore good for interactive use (when you're trying lots of things quickly), but not if you're programming, because you might accidentally subset the wrong column without noticing.
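A short sketch of the difference, using a one-column data frame created just for this example:

```{r}
df <- data.frame(col = 1:3)

# `$` partially matches: `co` is a unique prefix of `col`
dollar_result <- df$co
dollar_result

# `[[` needs the exact name by default, so this returns NULL
bracket_result <- df[["co"]]
bracket_result
```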
#### Functions (students) {#functions-answers}
1. Why does `` `<-`(test, 2) `` work? What does this tell us about `<-`?
+ `` `<-`(test, 2) `` works because `<-` is just another function. It assigns the second value to the first, and so we can use it like any other function (with brackets).
2. Why does `mean(1,2)` not return the output you'd expect but `sum(1,2)` does?
+ `mean()` is expecting a vector of values provided as the `x` argument. So the correct use of `mean` would be `mean(c(1,2))`. `sum()`, however, uses `...` to capture all arguments that are provided within the brackets and sum them together, so `sum(1,2)` is appropriate.
3. Other than `Sys.Date()`, can you think of another example of a function that can be executed without any explicit input parameters?
+ `Sys.time()` is another in the same style. `c()` and `matrix()` also do not require explicit input parameters to operate.
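The first two answers can be tried out as follows. This is only a sketch, and `test` is the same throwaway variable name used in the question.

```{r}
# `<-` can be called like any other function
`<-`(test, 2)
test

# mean() only averages its first argument; the 2 is matched to another argument
mean(1, 2)
mean(c(1, 2))

# sum() adds up everything passed through `...`
sum(1, 2)
```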
### For Teachers
#### Functions (teachers) {#functions-teachers-answers}
1. How are `mean()` and `sum()` different in their implementation of `...`?
+ `sum()` uses `...` to accept an indeterminate number of arguments to sum. `mean()` requires a vector of values to operate on, and uses `...` to pass on to its subsequent methods (i.e. how exactly it calculates the mean for different data structures).
2. If functions are objects, how would you construct a function that returns a function? What might be a use for this?
+ Example function:
```{r}
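# A function factory: returns a new function that remembers `power`
function_factory <- function(power = 2) {
  ret_function <- function(x) {
    x ^ power
  }
  ret_function
}
# Create a squaring function and use it
square_function <- function_factory(power = 2)
square_function(3)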
```
3. Look at the `ellipsis` package. What issue does this package look to solve?
+ If you provide incorrect or misspelled arguments to a function via `...`, then these will be silently ignored, making debugging difficult. For instance, `mean(1,2,3)` looks like perfectly valid code, but it returns `1`. This is because `mean()` expects a vector of values to calculate the mean for as the first argument (`x`), with all other arguments being passed via `...`. The `ellipsis` package can be used to check that the parameters provided are actually used.
4. What issues might arise when passing `...` to multiple functions inside your function?
+ If an argument provided in the ellipsis matches an argument name in more than one function, then every function with the matching argument name will use that value. For instance, if you pass the `...` to two functions, both of which accept an `x` parameter, then the value of `x` provided to your function will be used for both.
5. How could you solve these issues using the `do.call()` function?
+ The `do.call()` function allows you to call a function and provide a list of input parameters. To define a function that passes arguments to more than one function, you could have input parameters such as `fun1_options` and `fun2_options` that are lists of input parameters. Then, by using `do.call(fun1, args = fun1_options)` and `do.call(fun2, fun2_options)`, you can specify arguments to each function separately without worrying about name conflicts with `...`.
6. Why might one use the `missing()` approach instead of assigning a default value? What are the drawbacks of this?
+ `missing()` allows you to handle errors in more detail and lets you set more complex or conditional default values. However, using `missing()` in this way means that someone who uses your function might not be able to tell what the default value is going to be without seeing the body of the function!
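The `do.call()` approach from answer 5 might look something like the sketch below. `summarise_values()`, `mean_options` and `quantile_options` are hypothetical names invented for this example. They play the role of `fun1_options` and `fun2_options` above.

```{r}
# Hypothetical wrapper: each inner function gets its own list of options
summarise_values <- function(x, mean_options = list(), quantile_options = list()) {
  list(
    mean = do.call(mean, c(list(x), mean_options)),
    quantiles = do.call(quantile, c(list(x), quantile_options))
  )
}

res <- summarise_values(
  c(1, 2, 3, NA),
  mean_options = list(na.rm = TRUE),
  quantile_options = list(probs = 0.5, na.rm = TRUE)
)
res
```

Both `mean()` and `quantile()` accept an `na.rm` argument, but because each function receives its own list, there is no ambiguity about which value goes where.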
#### Environments {#environments-answers}
1. In what situations can we have two environments with the same name? Why is this?
+ As long as those two environments don't share the same parent environment, then they can have the same name. This is because they will still both have unique identifiers by virtue of having different inherited environments.
2. Search 'namespacing'. How does that concept relate to environments?
+ Namespacing is the concept of naming objects to allow them to be uniquely identifiable. Environments allow for namespacing by encapsulating names within a scope. Then, even if the name isn't unique, the pattern of inheritance and the name will be together, so the object will be uniquely identifiable.
3. What might be an issue with creating a function that uses superassignment on an object with the name `x`?
+ `x` is quite a common variable name. If someone is using your function and doesn't realise that you're using super assignment with an object named `x`, then you might overwrite their `x` variable.
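The superassignment problem can be sketched like this, where `set_total()` is a hypothetical function written only to show the effect:

```{r}
x <- 10

# Hypothetical function that uses superassignment on a common name
set_total <- function(values) {
  x <<- sum(values)
  invisible(x)
}

set_total(1:3)
x  # no longer 10: the function overwrote it
```

Nothing in the call `set_total(1:3)` hints that `x` will change, which is what makes this kind of side effect hard to spot.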
#### Objects and classes {#objects-and-classes-answers}
1. Why does `is(1L)` return integer, then double, then numeric in that order?
+ Integer inherits from the double class, which in turn inherits from numeric. Integer inherits from double because all integers are doubles but not all doubles are integers. Double inherits from numeric because all doubles are numeric but not all numerics are doubles.
2. Now we understand inheritance, why can data frames be subsetted with a `$`?
+ Data frames inherit from the `list` class, meaning that methods that operate on lists (should) also work on data frames, like `$`.
3. If we create a dataframe and then call `print.default()` on it, why do we get the output that we do? Why does it look more like a list than a dataframe?
+ When we print a data frame normally, `print.data.frame()` is called. When we call `print.default()`, we're using a different method that prints the object differently (more akin to a list).
4. Why are constructors not a fool-proof way of making sure that all our objects of our class will have the same structure? How else might one create an object of class `address`?
+ Because we can change the class of an object with assignment, someone could create an object that didn't match your constructor function and then assign the class like this: `class(not_an_address) <- "address"`
5. `methods("mean")` returns methods for dates and datetimes (and `.default` which operates on numeric objects), but no other object. Why is this? What does this mean when constructing methods for a new generic?
+ These are the only data types for which a `mean` function would make sense. For instance, finding the mean for a bunch of character strings wouldn't really make much sense (although you could build one if you wanted). This means that you only really need to create methods for your generic that *make sense* (i.e. that are appropriate for what you want your generic to do).
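To illustrate answer 4, here is a minimal sketch of a constructor being bypassed. `new_address()` and its `street` and `city` fields are assumptions made for this example, as the structure of the `address` class isn't defined here.

```{r}
# Hypothetical constructor: checks its inputs and builds a list-based object
new_address <- function(street, city) {
  stopifnot(is.character(street), is.character(city))
  structure(list(street = street, city = city), class = "address")
}

home <- new_address("1 Example Road", "Exampleton")
is.list(home)

# Nothing stops someone from assigning the class to an unrelated object
not_an_address <- 42
class(not_an_address) <- "address"
inherits(not_an_address, "address")
is.list(not_an_address)
```

Both objects claim to be an `address`, but only the one built by the constructor has the structure that methods for the class would expect.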
#### Expressions {#expressions-answers}
1. Why might an expression like this `fun(x)` be useful? Particularly, when paired with `substitute()`.
+ By using `substitute()`, we can replace `fun` with any function we like:
```{r}
quoted_fun <- substitute(fun(x), env = list(fun = sum, x = 1))
quoted_fun
eval(quoted_fun)
```
2. What's the difference between
```{r, eval = FALSE}
quote({
x <- 1
x + 10
})
```
and
```{r, eval = FALSE}
list(
quote(x <- 1),
quote(x + 10)
)
```
and why do they evaluate to different things?
+ The first is a single expression. When we evaluate it, `x <- 1` and `x + 10` are evaluated one after the other as part of the same call, and only the value of the last line is returned:
```{r}
eval(
quote({
x <- 1
x + 10
})
)
```
+ The second is a list of separate expressions. If we evaluate them using:
```{r}
lapply(
list(
quote(x <- 1),
quote(x + 10)
),
eval
)
```
Then each expression is evaluated separately, resulting in two values instead of one.
#### If / Else {#if-else-answers}
1. Rather than writing `if (x == 1 | x == 2 | x == 3)`, how could you use the `%in%` operator to make it shorter?
+ `if (x %in% c(1,2,3))`
2. How does the `switch()` function relate to if / else statements?
+ `switch` takes an expression that returns a character string or number and then evaluates the appropriate argument depending on the value. The `switch` function therefore operates like a chain of `else if (x == value)` statements.
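As a rough sketch of that equivalence, the two hypothetical helpers below (`label_with_switch()` and `label_with_if()`, both invented for this example) map the same codes to the same labels:

```{r}
# Hypothetical helpers that map a code to a label in two equivalent ways
label_with_switch <- function(code) {
  switch(code,
    a = "first",
    b = "second",
    "other"  # the unnamed final argument is the fallback
  )
}

label_with_if <- function(code) {
  if (code == "a") {
    "first"
  } else if (code == "b") {
    "second"
  } else {
    "other"
  }
}

codes <- c("a", "b", "z")
switch_labels <- vapply(codes, label_with_switch, character(1))
if_labels <- vapply(codes, label_with_if, character(1))
identical(switch_labels, if_labels)
```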
#### Iteration {#iteration-answers}
1. In what situation would you use `i in x` over `i in seq_along(x)`?
+ `i in x` is used to access the value at each position in `x`, whereas `i in seq_along(x)` will go along the indices. So if you only want to access the value (and you're not doing something like incrementing another variable), `i in x` is the simpler choice.
2. Why is `seq_along(x)` preferable to `1:length(x)`?
+ If you have something that has a length of 0, then `1:length(x)` will return `1 0`, which isn't what you want: the loop will visit index 1 (which doesn't exist) and then index 0. `seq_along(x)` will return an empty integer vector instead, so the loop body never runs.
3. What would be the for loop code that would be required to replicate `lapply(number_list, FUN = max)`? Which would be easier to debug?
+ Loop code:
```{r, eval = FALSE}
# Pre-allocate an empty list with one slot per element of number_list
max_list <- vector("list", length(number_list))
for (i in seq_along(number_list)) {
  max_list[[i]] <- max(number_list[[i]])
}
```
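Returning to the `seq_along(x)` answer above, the zero-length case can be sketched like this (`empty` and `iterations` are names chosen for this example):

```{r}
empty <- character(0)

# 1:length(x) counts down from 1 to 0 when x is empty
1:length(empty)

# seq_along(x) gives an empty sequence, so the loop body never runs
seq_along(empty)

iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
iterations
```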
## Next Steps
### opeRate
To help you learn how to apply the skills and understanding you've acquired through this book, I've developed a second book, [opeRate](https://operate.arawles.co.uk). opeRate was developed in exactly the same way as teacheR, but focuses on applied data analysis and practical learning. If you're interested in applying your knowledge, then I would recommend giving it a read.