diff --git a/examples/00-setup/00-setup.html b/examples/00-setup/00-setup.html
index d364608da..17676b55c 100644
--- a/examples/00-setup/00-setup.html
+++ b/examples/00-setup/00-setup.html
@@ -61,24 +61,21 @@
Do you need to work through the tutorial? Take the quiz below to find out.
-RStudio is an Integrated Development Environment for R. What does that mean? Well, if you think of R as a language, which it is, you can think of RStudio as a program that helps you write and work in the language. RStudio makes programming in R much easier and I suggest that you use it!
-A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). An example is the mpg
data frame found in the ggplot2 package (aka ggplot2::mpg
). The mpg
data frame contains observations collected by the US Environmental Protection Agency on 38 models of cars. To see the mpg
data frame, type mpg
in the code chunk below and then click “Submit Answer.”
mpg <- as.data.frame(mpg)
Hint: Type mpg
and then click the blue button.
# checking code
You can learn more about mpg
by opening its help page. The help page will explain where the mpg
dataset comes from and what each variable in mpg
describes. To open the help page, type ?mpg
in the code chunk below and then click “Submit Answer”.
Hint: Type ?mpg
and then click the blue button.
# checking code
Now let’s look at a special type of data frame that you will encounter in R: the tibble.
The flights
data frame in the nycflights13 package is an example of a tibble. flights
describes every flight that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights
.
Use the code chunk below to print the contents of flights
.
Hint: Type the name of the data frame that you want to print and then click the blue button. I’ve already loaded the nycflights13 package for you.
# checking code
flights
-# A tibble: 336,776 × 19
+# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
-1 2013 1 1 517 515 2 830
-2 2013 1 1 533 529 4 850
-3 2013 1 1 542 540 2 923
-4 2013 1 1 544 545 -1 1004
-5 2013 1 1 554 600 -6 812
-6 2013 1 1 554 558 -4 740
-7 2013 1 1 555 600 -5 913
-8 2013 1 1 557 600 -3 709
-9 2013 1 1 557 600 -3 838
+ 1 2013 1 1 517 515 2 830
+ 2 2013 1 1 533 529 4 850
+ 3 2013 1 1 542 540 2 923
+ 4 2013 1 1 544 545 -1 1004
+ 5 2013 1 1 554 600 -6 812
+ 6 2013 1 1 554 558 -4 740
+ 7 2013 1 1 555 600 -5 913
+ 8 2013 1 1 557 600 -3 709
+ 9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
@@ -229,14 +229,14 @@ Type codes
Test your knowledge
-
Congratulations
@@ -289,7 +289,7 @@ Congratulations
diff --git a/examples/03a-data-manip-filter/03a-data-manip-filter.html b/examples/03a-data-manip-filter/03a-data-manip-filter.html
index 2a9c39182..df99e652d 100644
--- a/examples/03a-data-manip-filter/03a-data-manip-filter.html
+++ b/examples/03a-data-manip-filter/03a-data-manip-filter.html
@@ -79,34 +79,34 @@ Filter rows with filter()
filter()
filter()
allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:
-
+
filter(flights, month == 1, day == 1)
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
.
Rerun the command in the code chunk below, but first arrange to save the output to an object named jan1
.
-
+
filter(flights, month == 1, day == 1)
-
+
jan1 <- filter(flights, month == 1, day == 1)
-
+
"Good job! You can now see the results by running the name jan1 by itself. Or you can pass `jan1` to a function that takes data frames as input."
()
R either prints out the results of a command, or saves the results to a variable. If you want to do both, you can wrap the assignment in parentheses. Wrap the following command in parentheses, i.e. add a (
to the start of the line and a )
to the end. Then run the code. What happens?
-
+
dec25 <- filter(flights, month == 12, day == 25)
-
+
(dec25 <- filter(flights, month == 12, day == 25))
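The effect of the parentheses can also be checked without a data frame. The sketch below uses base R's withVisible(), which reports whether an expression's value would be printed at the console; the vectors x and y are illustrative names, not part of the tutorial.

```r
# A plain assignment returns its value invisibly, so nothing is printed.
plain <- withVisible(x <- c(1, 2, 3))

# Wrapping the assignment in parentheses makes the value visible,
# so R would print it at the console as well as assign it.
wrapped <- withVisible((y <- c(1, 2, 3)))

plain$visible    # FALSE
wrapped$visible  # TRUE
identical(x, y)  # TRUE: both assignments still happened
```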
-
+
"Very Nice! When you surround an assignment in parentheses, R both assigns the result to the object and prints the result to the screen. You can go ahead and check: an object named `dec25` now exists and it contains the data set that you see printed."
@@ -118,12 +118,12 @@ Comparison operators
To use filtering effectively, you have to know how to select the observations that you want with R’s comparison operators. R provides the standard suite of comparisons: >
, >=
, <
, <=
, !=
(not equal), and ==
(equal).
When you’re starting out with R, the easiest mistake to make is to test for equality with =
instead of ==
. When this happens you’ll get an informative error:
filter(flights, month = 1)
-## Error: filter() takes unnamed arguments. Do you need `==`?
+## Error: `month` (`month = 1`) must not be named, do you need `==`?
Floating point arithmetic
There’s another common problem you might encounter when using ==
: floating point numbers. These results might surprise you! To get a feel for floating point numbers, predict what the code below should return, then click “Run Code.” Does everything work as you predict?
-
+
sqrt(2) ^ 2 == 2
1/49 * 49 == 1
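Computers store these numbers with finite precision, so exact equality tests can fail on values that are mathematically equal. A minimal base R sketch of a tolerance-based comparison follows; dplyr also provides near() for the same purpose, but the code below assumes only base R.

```r
# Exact comparison fails because of rounding in the last bits.
exact_sqrt <- sqrt(2)^2 == 2
exact_frac <- 1/49 * 49 == 1

# Comparing within a small tolerance gives the answer you probably want.
close_sqrt <- isTRUE(all.equal(sqrt(2)^2, 2))
close_frac <- isTRUE(all.equal(1/49 * 49, 1))

c(exact_sqrt, exact_frac)  # FALSE FALSE
c(close_sqrt, close_frac)  # TRUE TRUE
```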
@@ -147,14 +147,14 @@ &, |, and !
-
Common mistakes
@@ -176,14 +176,14 @@ Missing values
NA
Missing values can make comparisons tricky in R. R uses NA
to represent missing or unknown values. NA
s are “contagious” because almost any operation involving an unknown value (NA
) will also be unknown (NA
). For example, can you determine what value these expressions that use missing values should evaluate to? Make a prediction and then click “Submit Answer”.
-
+
NA > 5
10 == NA
NA + 10
NA / 2
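Because comparisons with NA return NA, a test such as x == NA never finds missing values. The sketch below, using a made-up vector x, shows is.na() as the usual way to detect them.

```r
x <- c(1, NA, 3)

# Comparing against NA is itself unknown for every element.
cmp <- x == NA

# is.na() asks directly whether each element is missing.
missing <- is.na(x)

cmp      # NA NA NA
missing  # FALSE TRUE FALSE

# Many summary functions can skip missing values on request.
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2
```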
-
+
"In every case, R does not have enough information to compute a result. Hence, each result is an unknown value, `NA`."
@@ -212,8 +212,6 @@ filter() and NAs
filter()
only includes rows where the condition is TRUE
; it excludes both FALSE
and NA
values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
-## Warning in filter_impl(.data, dots): '.Random.seed' is not an integer
-## vector but of type 'NULL', so ignored
-
+
filter(flights, arr_delay >= 2)
Flew to Houston (IAH
or HOU
)
-
+
-
+
filter(flights, dest %in% c("IAH", "HOU"))
-Hint: This is a good case for the %in%
operator.
+Hint: This is a good case for the %in%
operator.
Were operated by United (UA
), American (AA
), or Delta (DL
)
-
+
-
+
filter(flights, carrier %in% c("UA", "AA", "DL"))
-Hint: The carrier
variable lists the airline that operated each flight. This is another good case for the %in%
operator.
+Hint: The carrier
variable lists the airline that operated each flight. This is another good case for the %in%
operator.
Departed in summer (July, August, and September)
-
+
-
+
filter(flights, 6 < month, month < 10)
-Hint: When converted to numbers, July, August, and September become 7, 8, and 9.
+Hint: When converted to numbers, July, August, and September become 7, 8, and 9.
Arrived more than two hours late, but didn’t leave late
-
+
-
+
filter(flights, arr_delay > 120, dep_delay <= 0)
-Hint: Remember that departure and arrival delays are recorded in minutes.
+Hint: Remember that departure and arrival delays are recorded in minutes.
Were delayed by at least an hour, but made up over 30 minutes in flight
-
+
-
+
filter(flights, dep_delay >= 60, (dep_delay - arr_delay) > 30)
-Hint: The time a plane makes up is dep_delay - arr_delay
.
+Hint: The time a plane makes up is dep_delay - arr_delay
.
Departed between midnight and 6am (inclusive)
-
+
-
+
filter(flights, dep_time <= 600 | dep_time == 2400)
-Hint: Don’t forget flights that left at exactly midnight (2400
). This is a good case for an “or” operator.
+Hint: Don’t forget flights that left at exactly midnight (2400
). This is a good case for an “or” operator.
Exercise 2
Another useful dplyr filtering helper is between()
. What does it do? Can you use between()
to simplify the code needed to answer the previous challenges?
-
+
?between
@@ -313,23 +311,23 @@ Exercise 2
Exercise 3
How many flights have a missing dep_time
? What other variables are missing? What might these rows represent?
-
+
-
+
filter(flights, is.na(dep_time))
Hint: This is a good case for is.na()
.
-
+
"Good Job! These look like they might be cancelled flights."
Exercise 4
Why is NA ^ 0
not missing? Why is NA | TRUE
not missing? Why is FALSE & NA
not missing? Can you figure out the general rule? (NA * 0
is a tricky counterexample!)
-
+
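One way to explore this exercise is to evaluate each expression and ask whether the answer could change if NA were replaced by some actual value. The sketch below uses only base R; the comments give one possible reading of the results, not the tutorial's official answer.

```r
# The result is known whenever it would be the same for every possible value.
pow  <- NA^0        # anything to the power 0 is 1
or   <- NA | TRUE   # TRUE regardless of the left-hand side
and  <- FALSE & NA  # FALSE regardless of the right-hand side

# NA * 0 stays missing: the unknown value could be Inf, and Inf * 0 is NaN,
# so the result is not the same for every possible value.
prod <- NA * 0

pow   # 1
or    # TRUE
and   # FALSE
prod  # NA
```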
@@ -453,7 +451,7 @@ Exercise 4
diff --git a/examples/03b-data-manip-mutate/03b-data-manip-mutate.html b/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
index d0cf9d2b7..c286ce433 100644
--- a/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
+++ b/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
@@ -67,7 +67,7 @@ Add new variables with mutate()
select()
You can select a subset of variables by name with the select()
function in dplyr. Run the code below to see the narrow data set that select()
creates.
-
+
flights_sml <- select(flights,
arr_delay,
dep_delay,
@@ -80,7 +80,7 @@ select()
mutate()
The code below creates two new variables with dplyr’s mutate()
function. mutate()
returns a new data frame that contains the new variables appended to a copy of the original data set. Take a moment to imagine what this will look like, and then click “Run Code” to find out.
-
+
flights_sml <- select(flights,
arr_delay,
dep_delay,
@@ -88,7 +88,7 @@ mutate()
air_time
)
-
+
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
@@ -110,7 +110,7 @@ mutate()
transmute()
mutate()
will always return the new variables appended to a copy of the original data. If you want to return only the new variables, use transmute()
. In the code below, replace mutate()
with transmute()
and then spot the difference in the results.
-
+
mutate(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
@@ -118,14 +118,14 @@ transmute()
)
-
+
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
-
+
"Excellent job! `transmute()` and `mutate()` do the same thing, but `transmute()` only returns the new variables. `mutate()` returns a copy of the original data set with the new variables appended."
@@ -186,52 +186,52 @@ Exercises
Exercise 1
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
-
+
-
+
mutate(flights, dep_time = dep_time %/% 100 * 60 + dep_time %% 100,
sched_dep_time = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100)
Hint: 423 %% 100
returns 23
, 423 %/% 100
returns 4
.
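The arithmetic in the hint can be tried on a few clock times before applying it to the whole data frame. This is a small base R sketch; the vector clock and its values are made up for illustration.

```r
# Times in HHMM format, as stored in dep_time (e.g. 517 means 5:17am).
clock <- c(517, 2400, 5)

hours   <- clock %/% 100  # integer division keeps the hour part
minutes <- clock %% 100   # the remainder keeps the minute part

minutes_since_midnight <- hours * 60 + minutes
minutes_since_midnight  # 317 1440 5
```

Note that midnight recorded as 2400 becomes 1440 under this formula, not 0, which may matter depending on what you compute next.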
-
+
"Good Job!"
Exercise 2
Compare air_time
with arr_time - dep_time
. What do you expect to see? What do you see? How do you explain this?
-
+
# flights <- mutate(flights, total_time = _____________)
# flight_times <- select(flights, air_time, total_time)
# filter(flight_times, air_time != total_time)
-
+
flights <- mutate(flights, total_time = arr_time - dep_time)
flight_times <- select(flights, air_time, total_time)
filter(flight_times, air_time != total_time)
-
+
"Good Job! It doesn't make sense to do math with `arr_time` and `dep_time` until you convert the values to minutes past midnight (as you did with `dep_time` and `sched_dep_time` in the previous exercise)."
Exercise 3
Compare dep_time
, sched_dep_time
, and dep_delay
. How would you expect those three numbers to be related?
-
+
Exercise 4
Find the 10 most delayed flights (dep_delay
) using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank()
.
-
+
-
+
?min_rank
flights <- mutate(flights, delay_rank = min_rank(desc(dep_delay)))
filter(flights, delay_rank <= 10)
@@ -239,30 +239,30 @@ Exercise 4
Hint: Once you compute a rank, you can filter the data set based on the ranks.
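To see how ties affect a "top n" filter, here is a base R sketch. The dplyr documentation describes min_rank() as equivalent to rank(ties.method = "min"), so base rank() is used as a stand-in; the delays vector is made up.

```r
delays <- c(10, 30, 30, 20)

# Negate the values so the largest delay receives rank 1.
# With ties.method = "min", tied values share the smallest rank.
delay_rank <- rank(-delays, ties.method = "min")
delay_rank  # 4 1 1 3

# Asking for the "top 1" returns two rows, because of the tie.
top <- delays[delay_rank <= 1]
top  # 30 30
```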
-
+
"Excellent! It's not possible to choose exactly 10 flights unless you pick an arbitrary method to choose between ties."
Exercise 5
What does 1:3 + 1:10
return? Why?
-
+
-
+
1:3 + 1:10
Hint: Remember R’s recycling rules.
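Recycling can be made explicit with rep_len(), which repeats a vector until it reaches a requested length. The sketch below shows that the implicit and explicit versions give the same values; only the implicit version warns.

```r
# Implicit recycling: R reuses 1:3 and warns that 10 is not a multiple of 3.
implicit <- suppressWarnings(1:3 + 1:10)

# Explicit recycling: stretch 1:3 to length 10 first, then add.
recycled <- rep_len(1:3, 10)
explicit <- recycled + 1:10

recycled  # 1 2 3 1 2 3 1 2 3 1
explicit  # 2 4 6 5 7 9 8 10 12 11
identical(implicit, explicit)  # TRUE
```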
-
+
"Nice! R recycles 1:3 (three full repetitions plus its first element) to create a vector long enough to add to 1:10. Since the length of 1:10 is not a multiple of the length of 1:3, R also returns a warning message."
Exercise 6
What trigonometric functions does R provide? Hint: look up the help page for Trig
.
-
+
@@ -344,7 +344,7 @@ Exercise 6
diff --git a/examples/03c-data-manip-summarise/03c-data-manip-summarise.html b/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
index 3f63f41b2..ea74aefcb 100644
--- a/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
+++ b/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
@@ -88,23 +88,23 @@ summarise()
group_by()
summarise()
is not terribly useful unless you pair it with group_by()
. group_by()
changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by()
assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by()
.
For example, the summarise()
code above computes the average delay for the entire data set. If we apply exactly the same code to a data set that has been grouped by date (i.e. the unique combinations of year
, month
, and day
), we get the average delay per date. Click “Run Code” to see what I mean:
-
+
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE),
total = sum(dep_delay, na.rm = TRUE))
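The difference between an ungrouped and a grouped summary is easier to see on a tiny table. This sketch assumes the dplyr package is installed; the toy data frame and its values are invented for illustration.

```r
library(dplyr)

# Four made-up flights on two days; one delay is missing.
toy <- tibble(
  day       = c(1, 1, 2, 2),
  dep_delay = c(10, NA, -5, 15)
)

# Ungrouped: one row summarising the whole table.
overall <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

# Grouped: the same code now returns one row per day.
per_day <- summarise(group_by(toy, day), delay = mean(dep_delay, na.rm = TRUE))

nrow(overall)  # 1
per_day$delay  # 10 5
```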
-
+
"Good job!"
Exercise 1
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n())
)
-
+
-
+
flights %>%
group_by(carrier) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
@@ -114,17 +114,17 @@ Exercise 1
Hint: Use min_rank(desc(avg_delay))
to rank avg_delay
(for example) such that the largest delay receives rank one.
-
+
"Great work! Frontier airlines (`F9`) had the highest average departure delay."
Exercise 2
For each plane, count the number of flights before the first delay of greater than 1 hour.
-
+
-
+
flights %>%
filter(!is.na(dep_delay)) %>%
group_by(tailnum) %>%
@@ -135,14 +135,14 @@ Exercise 2
Hint: One strategy would be to: * filter out all rows where dep_delay
is NA
. * Then group by plane, * create a variable that tests whether each flight was delayed longer than an hour * create a variable that identifies flights that occur before the first big delay with !cumany()
* sum up the number of trues
-
+
"Great work! That was tough. Be sure you understand each of the steps and functions involved."
Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. Run the code below and inspect each result to see how its grouping criteria have changed (the grouping criteria are displayed at the top of the tibble).
-
+
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, total = sum(dep_delay, na.rm = TRUE)))
(per_month <- summarise(per_day, total = sum(total, na.rm = TRUE)))
@@ -253,7 +253,7 @@ Aggregating functions
)
Measures of position: first(x)
, nth(x, 2)
, last(x)
. These work similarly to x[1]
, x[2]
, and x[length(x)]
but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
@@ -321,7 +321,7 @@ Exercise 3
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
-
+
@@ -382,7 +382,7 @@ n()
Wow, there are some planes that have an average delay of 5 hours (300 minutes)!
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay. Fill in the blank code below to compute and then plot the number of flights by the mean arrival delay (arr_delay
).
-
+
# delays <- not_cancelled %>%
# group_by(tailnum) %>%
# summarise(
@@ -394,7 +394,7 @@ n()
# geom_point(alpha = 1/10)
-
+
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
@@ -443,6 +443,10 @@ Sample size, average performance, and rank
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam'
+## Warning in seq.default(0, 1, length = nk): partial argument match of
+## 'length' to 'length.out'
+## Warning in model.matrix.default(Terms[[i]], mf, contrasts = object
+## $contrasts): partial argument match of 'contrasts' to 'contrasts.arg'
This also has important implications for ranking. If you look closely, the people with the best batting averages are clearly lucky, not skilled.
You can find a good explanation of this problem at http://varianceexplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-average-rating.html.
@@ -469,10 +473,10 @@ count()
Exercise 5
Come up with another approach that will give you the same output as not_cancelled %>% count(dest)
and not_cancelled %>% count(tailnum, wt = distance)
(without using count()
).
-
+
-
+
not_cancelled %>%
group_by(dest) %>%
summarise(n = n())
@@ -484,14 +488,14 @@ Exercise 5
Hint: Consider the tools at your disposal: group_by()
, summarise()
, n()
, sum()
, and ?count
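The equivalence can be checked on a small table before trying it on the flight data. This is a sketch that assumes dplyr is installed; toy and its columns are made-up stand-ins for not_cancelled.

```r
library(dplyr)

toy <- tibble(
  dest     = c("IAH", "HOU", "IAH"),
  tailnum  = c("N1", "N1", "N2"),
  distance = c(1400, 1411, 1400)
)

# Counting rows per group, with and without count().
n_count   <- count(toy, dest)
n_summary <- toy %>% group_by(dest) %>% summarise(n = n())

# A weighted count sums the weight variable instead of counting rows.
wt_count   <- count(toy, tailnum, wt = distance)
wt_summary <- toy %>% group_by(tailnum) %>% summarise(n = sum(distance))

identical(n_count$n, n_summary$n)    # TRUE
identical(wt_count$n, wt_summary$n)  # TRUE
```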
-
+
"Excellent Job! This was a tricky one, but you can now see that `count()` is a handy short cut for `group_by()` + `summarise()` + `n()` (or `sum()`)."
Exercise 6
What does the sort
argument to count()
do? When might you use it?
-
+
?count
@@ -499,7 +503,7 @@ Exercise 6
Exercise 7
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
-
+
# Task 1
# begin with a variable that shows the day of the year
# flights %>%
@@ -516,7 +520,7 @@ Exercise 7
# plot one against the other
-
+
flights %>%
mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>%
@@ -537,7 +541,7 @@ Exercise 7
Hint: Don’t forget to use na.rm = TRUE
where appropriate.
-
+
"Wow! You did awesome."
@@ -619,7 +623,7 @@ Exercise 7
diff --git a/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png b/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png
index f6eab6735..80ed33bd9 100644
Binary files a/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png and b/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png differ