diff --git a/examples/00-setup/00-setup.html b/examples/00-setup/00-setup.html
index d364608da..17676b55c 100644
--- a/examples/00-setup/00-setup.html
+++ b/examples/00-setup/00-setup.html
@@ -61,24 +61,21 @@

Outline

Is this tutorial for you?

Do you need to work through the tutorial? Take the quiz below to find out.

-
+
- +

Install R

How to install R

-
- - -
+

Test your knowledge

@@ -88,40 +85,40 @@

Test your knowledge

-
+
- + -
+
- + -
+
- + @@ -133,10 +130,7 @@

Install RStudio

How to install RStudio

RStudio is an Integrated Development Environment for R. What does that mean? Well, if you think of R as a language, which it is, you can think of RStudio as a program that helps you write and work in the language. RStudio makes programming in R much easier and I suggest that you use it!

-
- - -
+

Test your knowledge

@@ -146,53 +140,53 @@

Test your knowledge

-
+
- + -
+
- + -
+
- + -
+
- + @@ -203,10 +197,7 @@

Test your knowledge

Install Packages

How to install R packages

-
- - -
+

Test your knowledge

@@ -216,40 +207,40 @@

Test your knowledge

-
+
- + -
+
- + -
+
- + @@ -264,7 +255,7 @@

Test your knowledge

diff --git a/examples/01-data-basics/01-data-basics.html b/examples/01-data-basics/01-data-basics.html
index 54ff26412..02dc5f47d 100644
--- a/examples/01-data-basics/01-data-basics.html
+++ b/examples/01-data-basics/01-data-basics.html
@@ -66,16 +66,16 @@

What is a data frame?

  • As tibbles, which are a special type of data frame
  • A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). An example is the mpg data frame found in the ggplot2 package (aka ggplot2::mpg). The mpg data frame contains observations collected by the US Environmental Protection Agency on 38 models of cars. To see the mpg data frame, type mpg in the code chunk below and then click “Submit Answer.”

    -
    +
    mpg <- as.data.frame(mpg)
    -
    +

    Hint: Type mpg and then click the blue button.

    -
    +
    # checking code
    @@ -98,13 +98,13 @@

    Help pages

    How to open a help page

    You can learn more about mpg by opening its help page. The help page will explain where the mpg dataset comes from and what each variable in mpg describes. To open the help page, type ?mpg in the code chunk below and then click “Submit Answer”.

    -
    +

    Hint: Type ?mpg and then click the blue button.

    -
    +
    # checking code
    @@ -115,7 +115,7 @@

    ? syntax

    Exercises

    -
    +
    @@ -124,40 +124,40 @@

    Exercises

    -
    +
    - + -
    +
    - + -
    +
    - + @@ -171,13 +171,13 @@

    What is a tibble?

    Now let’s look at a special type of data frame that you will encounter in R: the tibble.

    The flights data frame in the nycflights13 package is an example of a tibble. flights describes every flight that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.

    Use the code chunk below to print the contents of flights.

    -
    +

    Hint: Type the name of the data frame that you want to print and then click the blue button. I’ve already loaded the nycflights13 package for you.

    -
    +
    # checking code
    @@ -190,20 +190,20 @@

    The tibble display

    Variable types

    Type codes

    -
    +
    flights
    -
    # A tibble: 336,776 × 19
    +
    # A tibble: 336,776 x 19
         year month   day dep_time sched_dep_time dep_delay arr_time
        <int> <int> <int>    <int>          <int>     <dbl>    <int>
    -1   2013     1     1      517            515         2      830
    -2   2013     1     1      533            529         4      850
    -3   2013     1     1      542            540         2      923
    -4   2013     1     1      544            545        -1     1004
    -5   2013     1     1      554            600        -6      812
    -6   2013     1     1      554            558        -4      740
    -7   2013     1     1      555            600        -5      913
    -8   2013     1     1      557            600        -3      709
    -9   2013     1     1      557            600        -3      838
    + 1  2013     1     1      517            515         2      830
    + 2  2013     1     1      533            529         4      850
    + 3  2013     1     1      542            540         2      923
    + 4  2013     1     1      544            545        -1     1004
    + 5  2013     1     1      554            600        -6      812
    + 6  2013     1     1      554            558        -4      740
    + 7  2013     1     1      555            600        -5      913
    + 8  2013     1     1      557            600        -3      709
    + 9  2013     1     1      557            600        -3      838
     10  2013     1     1      558            600        -2      753
     # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
     #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    @@ -229,14 +229,14 @@ 

    Type codes

    Test your knowledge

    -
    +
    - +

    Congratulations

    @@ -289,7 +289,7 @@

    Congratulations

    diff --git a/examples/03a-data-manip-filter/03a-data-manip-filter.html b/examples/03a-data-manip-filter/03a-data-manip-filter.html
    index 2a9c39182..df99e652d 100644
    --- a/examples/03a-data-manip-filter/03a-data-manip-filter.html
    +++ b/examples/03a-data-manip-filter/03a-data-manip-filter.html
    @@ -79,34 +79,34 @@

    Filter rows with filter()

    filter()

    filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

    -
    +
    filter(flights, month == 1, day == 1)

    When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-.

    Rerun the command in the code chunk below, but first arrange to save the output to an object named jan1.

    -
    +
    filter(flights, month == 1, day == 1)
    -
    +
    jan1 <- filter(flights, month == 1, day == 1)
    -
    +
    "Good job! You can now see the results by running the name jan1 by itself. Or you can pass `jan1` to a function that takes data frames as input."

    ()

    R either prints out the results of a command, or saves the results to a variable. If you want to do both, you can wrap the assignment in parentheses. Wrap the following command in parentheses, i.e. add a ( to the start of the line and a ) to the end. Then run the code. What happens?

    -
    +
    dec25 <- filter(flights, month == 12, day == 25)
    -
    +
    (dec25 <- filter(flights, month == 12, day == 25))
    -
    +
    "Very Nice! When you surround an assignment in parentheses, R both assigns the result to the object and prints the result to the screen. You can go ahead and check: an object named `dec25` now exists and it contains the data set that you see printed."
    @@ -118,12 +118,12 @@

    Comparison operators

    To use filtering effectively, you have to know how to select the observations that you want with R’s comparison operators. R provides the standard suite of comparisons: >, >=, <, <=, != (not equal), and == (equal).

    When you’re starting out with R, the easiest mistake to make is to test for equality with = instead of ==. When this happens you’ll get an informative error:

    filter(flights, month = 1)
    -
    ## Error: filter() takes unnamed arguments. Do you need `==`?
    +
    ## Error: `month` (`month = 1`) must not be named, do you need `==`?

    Floating point arithmetic

    There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you! To get a feel for floating point numbers, predict what the code below should return, then click “Run Code.” Does everything work as you predict?

    -
    +
    sqrt(2) ^ 2 == 2
     1/49 * 49 == 1
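If you need to compare computed numbers, a tolerance-based comparison is usually safer than ==. The sketch below uses base R's all.equal() wrapped in isTRUE(); the variable names exact_match and close_match are made up for this illustration and are not part of the tutorial.

```r
# Exact comparison: sqrt(2) is stored with a tiny rounding error,
# so squaring it does not give back exactly 2.
exact_match <- sqrt(2)^2 == 2

# Tolerance-based comparison: all.equal() accepts a very small numeric difference.
# isTRUE() is needed because all.equal() returns a text description, not FALSE,
# when the values differ.
close_match <- isTRUE(all.equal(sqrt(2)^2, 2))

print(exact_match)
print(close_match)
```

dplyr also offers near() for the same purpose, which is convenient inside filter().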
    @@ -147,14 +147,14 @@

    &, |, and !

    -
    +
    - +

    Common mistakes

    @@ -176,14 +176,14 @@

    Missing values

    NA

    Missing values can make comparisons tricky in R. R uses NA to represent missing or unknown values. NAs are “contagious” because almost any operation involving an unknown value (NA) will also be unknown (NA). For example, can you determine what value these expressions that use missing values should evaluate to? Make a prediction and then click “Submit Answer”.

    -
    +
    NA > 5
     10 == NA
     NA + 10
     NA / 2
    -
    +
    "In every case, R does not have enough information to compute a result. Hence, each result is an unknown value, `NA`."
    @@ -212,8 +212,6 @@
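Because comparisons with NA are themselves NA, testing for missing values with == does not work; is.na() is the tool for that. A minimal base R sketch, using a small made-up vector x rather than the flights data:

```r
# A toy vector with one missing value (hypothetical data for illustration).
x <- c(1, NA, 3)

# Comparing against NA yields NA for every element, even the missing one.
compared <- x == NA

# is.na() returns TRUE only where the value is actually missing.
missing <- is.na(x)

print(compared)
print(missing)
```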

    filter() and NAs

    filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

    df <- tibble(x = c(1, NA, 3))
     filter(df, x > 1)
    -
    ## Warning in filter_impl(.data, dots): '.Random.seed' is not an integer
    -## vector but of type 'NULL', so ignored
    -
    +
    filter(flights, arr_delay >= 2)
  • Flew to Houston (IAH or HOU)

    -
    +
    -
    +
    filter(flights, dest %in% c("IAH", "HOU"))
    -

    Hint: This is a good case for the %in% operator.

    +Hint: This is a good case for the %in% operator.
  • Were operated by United (UA), American (AA), or Delta (DL)

    -
    +
    -
    +
    filter(flights, carrier %in% c("UA", "AA", "DL"))
    -

    Hint: The carrier variable lists the airline that operated each flight. This is another good case for the %in% operator.

    +Hint: The carrier variable lists the airline that operated each flight. This is another good case for the %in% operator.
  • Departed in summer (July, August, and September)

    -
    +
    -
    +
    filter(flights, 6 < month, month < 10)
    -

    Hint: When converted to numbers, July, August, and September become 7, 8, and 9.

    +Hint: When converted to numbers, July, August, and September become 7, 8, and 9.
  • Arrived more than two hours late, but didn’t leave late

    -
    +
    -
    +
    filter(flights, arr_delay > 120, dep_delay <= 0)
    -

    Hint: Remember that departure and arrival delays are recorded in minutes.

    +Hint: Remember that departure and arrival delays are recorded in minutes.
  • Were delayed by at least an hour, but made up over 30 minutes in flight

    -
    +
    -
    +
    filter(flights, dep_delay >= 60, (dep_delay - arr_delay) > 30)
    -

    Hint: The time a plane makes up is dep_delay - arr_delay.

    +Hint: The time a plane makes up is dep_delay - arr_delay.
  • Departed between midnight and 6am (inclusive)

    -
    +
    -
    +
    filter(flights, dep_time <= 600 | dep_time == 2400)
    -

    Hint: Don’t forget flights that left at exactly midnight (2400). This is a good case for an “or” operator.

    +Hint: Don’t forget flights that left at exactly midnight (2400). This is a good case for an “or” operator.
  • Exercise 2

    Another useful dplyr filtering helper is between(). What does it do? Can you use between() to simplify the code needed to answer the previous challenges?

    -
    +
    ?between
    @@ -313,23 +311,23 @@

    Exercise 2

    Exercise 3

    How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

    -
    +
    -
    +
    filter(flights, is.na(dep_time))

    Hint: This is a good case for is.na().

    -
    +
    "Good Job! These look like they might be cancelled flights."

    Exercise 4

    Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

    -
    +
    @@ -453,7 +451,7 @@

    Exercise 4

    diff --git a/examples/03b-data-manip-mutate/03b-data-manip-mutate.html b/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
    index d0cf9d2b7..c286ce433 100644
    --- a/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
    +++ b/examples/03b-data-manip-mutate/03b-data-manip-mutate.html
    @@ -67,7 +67,7 @@

    Add new variables with mutate()

    select()

    You can select a subset of variables by name with the select() function in dplyr. Run the code below to see the narrow data set that select() creates.

    -
    +
    flights_sml <- select(flights, 
       arr_delay, 
       dep_delay,
    @@ -80,7 +80,7 @@ 

    select()

    mutate()

    The code below creates two new variables with dplyr’s mutate() function. mutate() returns a new data frame that contains the new variables appended to a copy of the original data set. Take a moment to imagine what this will look like, and then click “Run Code” to find out.

    -
    +
    flights_sml <- select(flights, 
       arr_delay, 
       dep_delay,
    @@ -88,7 +88,7 @@ 

    mutate()

    air_time )
    -
    +
    mutate(flights_sml,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60
    @@ -110,7 +110,7 @@ 

    mutate()

    transmute()

    mutate() will always return the new variables appended to a copy of the original data. If you want to return only the new variables, use transmute(). In the code below, replace mutate() with transmute() and then spot the difference in the results.

    -
    +
    mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
    @@ -118,14 +118,14 @@ 

    transmute()

    )
    -
    +
    transmute(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
     )
    -
    +
    "Excellent job! `transmute()` and `mutate()` do the same thing, but `transmute()` only returns the new variables. `mutate()` returns a copy of the original data set with the new variables appended."
    @@ -186,52 +186,52 @@

    Exercises

    Exercise 1

    Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

    -
    +
    -
    +
    mutate(flights, dep_time = dep_time %/% 100 * 60 + dep_time %% 100,
            sched_dep_time = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100)

    Hint: 423 %% 100 returns 23, 423 %/% 100 returns 4.
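To see how the two operators in the hint combine, here is a small base R sketch. The helper name to_minutes() is hypothetical and used only for this illustration; it applies the same arithmetic as the solution to plain numbers.

```r
# Convert a clock time written as HHMM (e.g. 423 for 4:23) to minutes since midnight.
# %/% is integer division (the hours part), %% is the remainder (the minutes part).
to_minutes <- function(hhmm) {
  hhmm %/% 100 * 60 + hhmm %% 100
}

single <- to_minutes(423)            # 4 * 60 + 23
several <- to_minutes(c(517, 2400))  # works element-wise on vectors

print(single)
print(several)
```

Note that midnight recorded as 2400 becomes 1440 rather than 0 with this arithmetic; taking the result %% 1440 is one way to map it back to 0 if that matters for your analysis.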

    -
    +
    "Good Job!"

    Exercise 2

    Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? How do you explain this?

    -
    +
    # flights <- mutate(flights, total_time = _____________)
     # flight_times <- select(flights, air_time, total_time)
     # filter(flight_times, air_time != total_time)
    -
    +
    flights <- mutate(flights, total_time = arr_time - dep_time)
     flight_times <- select(flights, air_time, total_time)
     filter(flight_times, air_time != total_time)
    -
    +
    "Good Job! It doesn't make sense to do math with `arr_time` and `dep_time` until you convert the values to minutes past midnight (as you did with `dep_time` and `sched_dep_time` in the previous exercise)."

    Exercise 3

    Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

    -
    +

    Exercise 4

    Find the 10 most delayed flights (dep_delay) using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

    -
    +
    -
    +
    ?min_rank
     flights <- mutate(flights, delay_rank = min_rank(desc(dep_delay)))
     filter(flights, delay_rank <= 10)
    @@ -239,30 +239,30 @@

    Exercise 4

    Hint: Once you compute a rank, you can filter the data set based on the ranks.
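To get a feel for how ties behave, the sketch below uses base R's rank() with ties.method = "min", which the dplyr documentation describes as the equivalent of min_rank(). The delays vector is made-up data for illustration, and negating it plays the role of desc().

```r
# Toy departure delays in minutes (hypothetical data); two flights tie at 120.
delays <- c(30, 120, 120, 15)

# Rank from largest to smallest delay; tied values share the lowest rank.
delay_rank <- rank(-delays, ties.method = "min")

print(delay_rank)

# Keeping ranks <= 1 returns both tied flights, not just one.
top <- delays[delay_rank <= 1]
print(top)
```

Because tied flights share a rank, filtering on a rank cutoff can return more rows than the cutoff suggests.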

    -
    +
    "Excellent! It's not possible to choose exactly 10 flights unless you pick an arbitrary method to choose between ties."

    Exercise 5

    What does 1:3 + 1:10 return? Why?

    -
    +
    -
    +
    1:3 + 1:10

    Hint: Remember R’s recycling rules.
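The recycling can be made explicit with base R's rep_len(), which repeats a vector until it reaches a given length. This is only a sketch of what happens conceptually; suppressWarnings() is used here so the example runs quietly.

```r
# What R computes when the lengths differ (the warning is silenced for the demo).
recycled_sum <- suppressWarnings(1:3 + 1:10)

# The same result with the recycling written out by hand:
# rep_len() repeats 1:3 until it has 10 elements (1 2 3 1 2 3 1 2 3 1).
manual_sum <- rep_len(1:3, 10) + 1:10

print(recycled_sum)
print(identical(recycled_sum, manual_sum))
```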

    -
    +
    "Nice! R recycles the values of 1:3 until the vector is as long as 1:10. Since 10 is not a multiple of 3, the last repetition is cut short and R also returns a warning message."

    Exercise 6

    What trigonometric functions does R provide? Hint: look up the help page for Trig.

    -
    +
    @@ -344,7 +344,7 @@

    Exercise 6

    diff --git a/examples/03c-data-manip-summarise/03c-data-manip-summarise.html b/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
    index 3f63f41b2..ea74aefcb 100644
    --- a/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
    +++ b/examples/03c-data-manip-summarise/03c-data-manip-summarise.html
    @@ -88,23 +88,23 @@

    summarise()

    group_by()

    summarise() is not terribly useful unless you pair it with group_by(). group_by() changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by() assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by().

    For example, the summarise() code above computes the average delay for the entire data set. If we apply exactly the same code to a data set that has been grouped by date (i.e. the unique combinations of year, month, and day), we get the average delay per date. Click “Run Code” to see what I mean:

    -
    +
    by_day <- group_by(flights, year, month, day)
     summarise(by_day, delay = mean(dep_delay, na.rm = TRUE),
                       total = sum(dep_delay, na.rm = TRUE))
    -
    +
    "Good job!"

    Exercise 1

    Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

    -
    +
    -
    +
    flights %>% 
       group_by(carrier) %>% 
       summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
    @@ -114,17 +114,17 @@ 

    Exercise 1

    Hint: Use min_rank(desc(avg_delay)) to rank avg_delay (for example) such that the largest delay receives rank one.

    -
    +
    "Great work! Frontier airlines (`F9`) had the highest average departure delay."

    Exercise 2

    For each plane, count the number of flights before the first delay of greater than 1 hour.

    -
    +
    -
    +
    flights %>% 
       filter(!is.na(dep_delay)) %>% 
       group_by(tailnum) %>% 
    @@ -135,14 +135,14 @@ 

    Exercise 2

    Hint: One strategy would be to:
  • filter out all rows where dep_delay is NA
  • then group by plane
  • create a variable that tests whether each flight was delayed longer than an hour
  • create a variable that identifies flights that occur before the first big delay with !cumany()
  • sum up the number of trues

    -
    +
    "Great work! That was tough. Be sure you understand each of the steps and functions involved."

    Grouping by multiple variables

    When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. Run the code below and inspect each result to see how its grouping criteria have changed (the grouping criteria are displayed at the top of the tibble).

    -
    +
    daily <- group_by(flights, year, month, day)
     (per_day <- summarise(daily, total = sum(dep_delay, na.rm = TRUE)))
     (per_month <- summarise(per_day, total = sum(total, na.rm = TRUE)))
    @@ -253,7 +253,7 @@ 

    Aggregating functions

    )
  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

    @@ -321,7 +321,7 @@

    Exercise 3

  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

  • Which is more important: arrival delay or departure delay?

    -
    +
    @@ -382,7 +382,7 @@

    n()

    Wow, there are some planes that have an average delay of 5 hours (300 minutes)!

    The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay. Fill in the blank code below to compute and then plot the number of flights by the mean arrival delay (arr_delay).

    -
    +
    # delays <- not_cancelled %>% 
     #   group_by(tailnum) %>% 
     #   summarise(
    @@ -394,7 +394,7 @@ 

    n()

    # geom_point(alpha = 1/10)
    -
    +
    delays <- not_cancelled %>% 
       group_by(tailnum) %>% 
       summarise(
    @@ -443,6 +443,10 @@ 

    Sample size, average performance, and rank

    geom_point() + geom_smooth(se = FALSE)
    ## `geom_smooth()` using method = 'gam'
    +
    ## Warning in seq.default(0, 1, length = nk): partial argument match of
    +## 'length' to 'length.out'
    +
    ## Warning in model.matrix.default(Terms[[i]], mf, contrasts = object
    +## $contrasts): partial argument match of 'contrasts' to 'contrasts.arg'

    This also has important implications for ranking. If you look closely, the people with the best batting averages are clearly lucky, not skilled.

    You can find a good explanation of this problem at http://varianceexplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-average-rating.html.

    @@ -469,10 +473,10 @@

    count()

    Exercise 5

    Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

    -
    +
    -
    +
    not_cancelled %>% 
       group_by(dest) %>% 
       summarise(n = n())
    @@ -484,14 +488,14 @@ 

    Exercise 5

    Hint: Consider the tools at your disposal: group_by(), summarise(), n(), sum(), and ?count

    -
    +
    "Excellent Job! This was a tricky one, but you can now see that `count()` is a handy short cut for `group_by()` + `summarise()` + `n()` (or `sum()`)."

    Exercise 6

    What does the sort argument to count() do? When might you use it?

    -
    +
    ?count    
    @@ -499,7 +503,7 @@

    Exercise 6

    Exercise 7

    Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

    -
    +
    # Task 1
     # begin with a variable that shows the day of the year
     # flights %>% 
    @@ -516,7 +520,7 @@ 

    Exercise 7

    # plot one against the other
    -
    +
    flights %>% 
       mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%  
       mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>% 
    @@ -537,7 +541,7 @@ 

    Exercise 7

    Hint: Don’t forget to use na.rm = TRUE where appropriate.

    -
    +
    "Wow! You did awesome."
    @@ -619,7 +623,7 @@

    Exercise 7

    diff --git a/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png b/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png
    index f6eab6735..80ed33bd9 100644
    Binary files a/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png and b/examples/03c-data-manip-summarise/03c-data-manip-summarise_files/figure-html/unnamed-chunk-20-1.png differ