R Programming Coursera Class

Connor Van Meter edited this page Jan 29, 2020 · 1 revision

Notes on R Programming Coursera Course

[Coursera](https://www.coursera.org/learn/r-programming)

Week 1: Nuts and Bolts

  • Working directory

    • getwd() and setwd()
  • Assignment operator

    • <-
  • Comments

    • #
  • Sequence

    • : for integer sequences
  • Object basic classes (check with class())

    • Character, numeric, integer, complex, logical
    • Vectors: contain only objects of the same class; the exception is a list
      • vector() can create vectors
    • Numbers are numeric by default; add the suffix L (e.g. 1L) to get an integer
      • NaN is 0/0 (undefined), Inf is 1/0
    • Objects have attributes, accessed with attributes()
      • Names, dimensions, class, length, metadata
    • c() concatenate - used to create vectors
      • x <- c(0.5, 0.6)
      • coercion reduces mixed-object vectors to the same class
      • explicit coercion
        • as.*() numeric, logical, character
      • Nonsensical coercion gives NA (with a warning), e.g. as.numeric("a")
      • Lists: can contain elements of different classes
        • x <- list(1, "a", TRUE, 1 + 4i)
  • Matrices are vectors with a dimension

    • Set the dimensions with nrow and ncol: m <- matrix(nrow = 2, ncol = 3) gives dim(m) of 2 3
    • Filled column-wise, starting in the upper-left corner and running down each column
    • Can create directly from a vector by setting dim
      • m <- 1:10; dim(m) <- c(2, 5)
    • Can create by column binding and row binding

    x <- 1:3
    y <- 10:12
    cbind(x, y)  # 3 x 2 matrix with an x column and a y column
    rbind(x, y)  # 2 x 3 matrix with an x row and a y row
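The coercion and matrix rules in the Week 1 notes can be checked with a few lines. This is a minimal sketch; the variable names are only illustrative.

```r
# Implicit coercion: mixed classes collapse to the most general class
class(c(1.7, "a"))   # "character"
class(c(TRUE, 2))    # "numeric"
class(c(1L, 2L))     # "integer" (the L suffix)

# Explicit coercion with as.*()
x <- 0:3
as.logical(x)        # FALSE TRUE TRUE TRUE

# Nonsensical coercion yields NA (the warning is suppressed here)
bad <- suppressWarnings(as.numeric(c("1", "b")))
bad                  # 1 NA

# A matrix is a vector with a dim attribute, filled column-wise
m <- 1:10
dim(m) <- c(2, 5)
m[2, 1]              # 2
m[1, 2]              # 3
```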


* Factors
* For categorical data, ordered or unordered
* An integer vector that has labels, treated specially by modeling
 * `x <- factor(c("yes", "yes", "no", "yes", "no"))`
   * Can summarize with `table(x)`
 * The `levels =` argument sets the level order, which matters because the first level is the baseline level in modeling
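
A short sketch of how level order works, reusing the yes/no vector from the notes; `x2` is just an illustrative name.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x)    # "no" "yes": alphabetical by default, so "no" is the baseline
table(x)     # counts per level: no 2, yes 3
unclass(x)   # the underlying integer codes 2 2 1 2 1, plus a levels attribute

# Set the order explicitly so that "yes" becomes the baseline level
x2 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
levels(x2)   # "yes" "no"
```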

* Missing values
* `is.na()` and `is.nan()` return logical vectors
* `NA` values have a class (integer `NA`, character `NA`, ...); a `NaN` value is also `NA`, but the converse is not true
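
The one-way relationship between `NA` and `NaN` is easy to confirm on a small vector (a minimal sketch):

```r
x <- c(1, 2, NA, NaN)
is.na(x)             # FALSE FALSE TRUE TRUE   (NaN counts as NA)
is.nan(x)            # FALSE FALSE FALSE TRUE  (NA is not NaN)

# NA has a class of its own
class(NA)            # "logical"
class(NA_integer_)   # "integer"
class(NA_character_) # "character"
```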

* Data frames
* Tabular data
* A special type of list; each column can store a different class of object
* `read.table()` or `read.csv()`
* Create a data frame (will show up as 4 rows, 2 columns)
 * `x <- data.frame(foo = 1:4, bar = c(T, T, F, F))`

* Names
* `names(x) <- c("foo", "bar", "norf")`
* Lists can have names too; matrices have `dimnames`

    m <- matrix(1:4, nrow = 2, ncol = 2)
    dimnames(m) <- list(c("a", "b"), c("c", "d"))
    m


* Reading Tabular Data
* `read.table` and `read.csv` for tabular data, `readLines` for text file, `source` and `dget` for R code, `load` and `unserialize` for binary objects
* Writing data: `write.table`, `writeLines`, `dump`, `dput`, `save`, `serialize`
* `data <- read.table()`
  * `file` for name, `header` logical for header, `sep` for how columns are separated, `colClasses` for class of columns, `nrows` for number of rows, `comment.char` for comment character, `skip` for number of lines to skip, `stringsAsFactors` for coding characters as factors
  * `read.csv` is identical to `read.table` except that the default separator is a comma and `header = TRUE`
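* For large datasets
  * Set `comment.char = ""` if the file has no comment lines
  * Specifying `colClasses` and `nrows` saves R from guessing column classes and helps with memory usage
  * `tabAll <- read.table("x.txt", colClasses = "numeric", nrows = 100, ...)`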
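
One common way to get `colClasses` for a large file is to read a small slice first and reuse the classes R detects. The sketch below writes a hypothetical 200-row table to a temporary file so that it is self-contained; the file, column names and `tab_all` are assumptions for illustration.

```r
# Hypothetical data written to a temporary file
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:200, score = seq(0.5, 100, by = 0.5)),
            path, row.names = FALSE)

# Read only the first 100 rows and let R work out the column classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Reuse those classes for the full read; no comment lines in this file
tab_all <- read.table(path, header = TRUE, colClasses = classes,
                      comment.char = "")
dim(tab_all)   # 200 2
```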

* Textual Data 
* `dump` and `dput` write objects as editable text that keeps metadata such as class
* Textual formats are good for storing data: they are longer lived and work with version control, but they are not space-efficient
* `dput` deparses an R object into R code that reconstructs it; read it back in with `dget`
* `dump` can be used for multiple R objects; read them back in with `source`
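
A round trip through both textual formats, as a sketch. The temporary files and object names are illustrative, and `local = TRUE` is used so that `source` restores the objects in the current environment.

```r
y <- data.frame(a = 1, b = "a")

# dput() writes R code for one object; dget() reads it back
f1 <- tempfile(fileext = ".R")
dput(y, file = f1)
new_y <- dget(f1)

# dump() takes object names and can write several objects at once
x <- "foo"
f2 <- tempfile(fileext = ".R")
dump(c("x", "y"), file = f2)
rm(x, y)
source(f2, local = TRUE)   # recreates x and y
```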
  
* Interfaces and connections
* `file` `url` `gzfile` `bzfile`
* `file()`
  * `"r"` is read only, `"w"` is write, `"a"` is append; `"rb"`, `"wb"`, `"ab"` do the same in binary mode
* Useful to read lines of a text file (`writeLines` can write)

    con <- gzfile("words.gz")
    x <- readLines(con, 10)
    x
    con2 <- url("http://www.jhsph.edu", "r")
    x <- readLines(con2)
    head(x)
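
The same connection pattern can be tried without a network or an existing `words.gz`. This sketch writes to temporary files (hypothetical paths and contents) and reads them back.

```r
# Write three lines through a file connection opened in write mode
path <- tempfile(fileext = ".txt")
con <- file(path, "w")
writeLines(c("alpha", "beta", "gamma"), con)
close(con)

# Reopen read-only and read just the first two lines
con <- file(path, "r")
first_two <- readLines(con, 2)
close(con)

# The same calls work on a gzip-compressed connection
gz_path <- tempfile(fileext = ".gz")
gz <- gzfile(gz_path, "w")
writeLines(c("alpha", "beta", "gamma"), gz)
close(gz)

gz_in <- gzfile(gz_path)
all_lines <- readLines(gz_in)
close(gz_in)
```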


* Subsetting
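* `[` returns an object of the same class as the original and can select more than one element
  * Negative integers drop elements: `x[c(-2, -10)]` selects everything except elements 2 and 10
* `[[` extracts a single element of a list or a data frame
* `$` extracts an element of a list or a data frame by name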

    x <- c("a", "b", "c", "c", "d", "a")
    x[1]
    x[2]
    x[1:4]
    x[x > "a"]
    u <- x > "a"
    u
    x[u]

    x <- list(foo = 1:4, bar = 0.6)
    x[1]       # a list containing foo: 1 2 3 4
    x[[1]]     # the vector itself: 1 2 3 4
    x$bar      # 0.6
    x$"bar"    # 0.6
    x["bar"]   # a list containing bar: 0.6 (nice, can use the name)

* `[[` operator can be used with *computed* indices, `$` can only be used with literal names

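    x <- list(foo = 1:4, bar = 0.6, baz = "hello")
    x[c(1, 3)]   # a list with foo (1 2 3 4) and baz ("hello")
    name <- "foo"
    x[[name]]    # computed index for "foo": 1 2 3 4
    x$name       # NULL: there is no element literally called "name"
    x$foo        # 1 2 3 4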


* `[[` can take an integer sequence (nested elements)

    x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
    x[[c(1, 3)]]   # 14
    x[[1]][[3]]    # 14
    x[[c(2, 1)]]   # 3.14


* Subsetting matrices
* `x <- matrix(1:6, 2, 3)` `x[1, 2]` `#3` `x[2, 1]` `#2`
* Can have missing indices `x[1, ]` `# 1 3 5`
* If a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 x 1 matrix
  * Use `x[1, 2, drop = FALSE]` to get a 1 x 1 matrix instead
* Subsetting a single column or row gives a vector, not a matrix, unless you use drop=FALSE
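
The effect of `drop` shows up in the `dim` attribute of the result (a minimal sketch):

```r
x <- matrix(1:6, 2, 3)

x[1, 2]                      # 3, a plain vector of length 1
dim(x[1, 2])                 # NULL
dim(x[1, 2, drop = FALSE])   # 1 1

x[1, ]                       # 1 3 5, a vector
dim(x[1, , drop = FALSE])    # 1 3, still a matrix
```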

* Subsetting and Partial matching
* `$` does partial matching: `x$a` looks for a name in the list that starts with the letter a, so `aardvark` is matched
* `[[` expects an exact match unless you pass `exact = FALSE`

    x <- list(aardvark = 1:5)
    x$a                       # 1 2 3 4 5
    x[["a"]]                  # NULL
    x[["a", exact = FALSE]]   # 1 2 3 4 5


* Subsetting and removing missing values
* Create a logical vector marking the missing values, then subset with its negation
* `x[!is.na(x) & x > 0]`
* `complete.cases()` also works across several objects; `!` negates a logical vector

    x <- c(1, 2, NA, 4, NA, 5)
    y <- c("a", "b", NA, "d", NA, "f")
    bad <- is.na(x)
    x[!bad]    # 1 2 4 5
    good <- complete.cases(x, y)
    x[good]    # 1 2 4 5
    y[good]    # "a" "b" "d" "f"

* Can use complete cases to remove missing values - very handy for large datasets

    airquality[1:6, ]
    good <- complete.cases(airquality)
    airquality[good, ][1:6, ]


* Vectorized operations
* `x + y` `x * y` `x / y`
* R performs these element by element with vector recycling
* Comparisons are vectorized too: `x >= 2`, and `y == 8` to test equality

    x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
    x * y     ## element-wise multiplication: 10 20 30 40 (down columns)
    x / y     ## element-wise division: 0.1 0.2 0.3 0.4 (down columns)
    x %*% y   ## true matrix multiplication: 40 60 40 60 (down columns)


* Swirl
* `ls()` and `dir()`
* `list.files()`
* `dir.create("")`
* `file.create("")`, `file.exists("")`, `file.info("")`, `file.rename("")`, `file.remove("")`, `file.copy("")`, `file.path("")`
* `seq(0, 10, by = 0.5)`
* Get the length of a vector
  * `length(my_seq)`
* `rep(c(0, 1, 2), times=40)` `rep(c(0, 1, 2), each = 10)`
* Logicals: `< > >= <= == !=` 
  * `|` means at least one of the pieces is true
* Collapse elements of character vectors 
  * `paste(vector, collapse = " ")`
* Join elements of multiple character vectors
  * `paste("Hello", "world!", sep = " ")`
  * `paste(LETTERS, 1:4, sep = "-")`
* Sampling
  * `my_data <- sample(c(y,z), 100)`
  * `x <- rnorm(1000)`
* Data frame allows you to mix classes, but matrix is all one class


## Week 2
* Control structures
* `if, else`: testing a condition
* `for`: executing a loop a fixed number of times
* `while`: execute a loop *while* a condition is true
* `repeat`: execute an infinite loop
* `break`: break the execution of a loop
* `next`: skip an iteration of a loop
* `return`: exit a function

* If-else

    if(<condition>) {
            ## do something
    } else {
            ## do something else
    }

    if(<condition1>) {
            ## do something
    } else if(<condition2>) {
            ## do something different
    } else {
            ## do something else
    }

    if(x > 3) {
            y <- 10
    } else {
            y <- 0
    }

    ## the same assignment, using if/else as an expression
    y <- if(x > 3) {
            10
    } else {
            0
    }


* For loops
* iterating over the elements of an object

    for(i in 1:10) {
            print(i)
    }
    ## i takes the values 1, 2, 3, ..., 10 in successive iterations

    x <- c("a", "b", "c", "d")
    for(i in seq_along(x)) {
            print(x[i])
    }

* Can be nested

    x <- matrix(1:6, 2, 3)
    for(i in seq_len(nrow(x))) {
            for(j in seq_len(ncol(x))) {
                    print(x[i, j])
            }
    }


* While loops (use with care!)
* Begin by testing a condition; if it is true, execute the loop body, then test the condition again
  * The first loop below stops once `count` reaches 10
  ```
count <- 0
while(count < 10) {
      print(count)
      count <- count + 1
}
z <- 5
while(z >= 3 && z<=10) {
  print(z)
  coin <- rbinom(1, 1, 0.5)

  if(coin == 1) { ## random walk
        z <- z + 1 
      } else {
        z <- z -1
      }
}
  ```
* Repeat, Next, Break
* Repeat initiates an infinite loop, call break to exit

    x0 <- 1
    tol <- 1e-8

    repeat {
            x1 <- computeEstimate()  ## placeholder, not a real function

            if(abs(x1 - x0) < tol) {
                    break
            } else {
                    x0 <- x1
            }
    }

* There is no guarantee that a `repeat` loop will stop; it is safer to use a **for loop** with a hard limit on the number of iterations
* Can skip iterations

    for(i in 1:100) {
            if(i <= 20) {
                    ## Skip the first 20 iterations
                    next
            }
            ## Do something here
    }

* `return` signals that a function should exit and return a given value
  * `invisible(x)` stops auto-printing
* `apply` functions can be more useful for command-line work

* Function writing

    add2 <- function(x, y) {
            x + y
    }

    above10 <- function(x) {
            use <- x > 10
            x[use]
    }

    ## n = 10 can be specified as a default value
    above <- function(x, n = 10) {
            use <- x > n
            x[use]
    }

    columnmean <- function(y, removeNA = TRUE) {
            nc <- ncol(y)
            means <- numeric(nc)
            for(i in 1:nc) {
                    means[i] <- mean(y[, i], na.rm = removeNA)
            }
            means  ## returned
    }
    columnmean(airquality)

* Functions
* `function()`, arguments are evaluated lazily (only as needed)
* Can be passed as arguments to other functions, can be nested, treated like R objects
* Arguments can be missing, have default values, or not all be used
* Arguments can be matched by position or by name (e.g. `data =`); an argument with a default such as `na.rm = FALSE` does not have to be supplied
* Partial matching of argument names can work
* The `...` argument stands for a variable number of arguments, usually passed on to other functions
  * Used when extending another function (to preserve its arguments), by generic functions, or when the number of arguments isn't known in advance
    * Such as `paste`, which concatenates any number of strings
  * Arguments that come after `...` must be named explicitly and in full; no positional or partial matching
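
A small sketch of lazy evaluation and of the rule for arguments after `...`. The functions `f` and `shout` are hypothetical helpers written for this example.

```r
# Lazy evaluation: b is never used, so leaving it out is not an error
f <- function(a, b) {
  a^2
}
f(2)   # 4

# Hypothetical wrapper around paste(); sep comes after ...
shout <- function(..., sep = " ") {
  toupper(paste(..., sep = sep))
}
shout("hello", "world")              # "HELLO WORLD"
shout("hello", "world", sep = "-")   # "HELLO-WORLD"

# A partial name is not matched after ..., so "se" is absorbed into ...
shout("hello", "world", se = "-")    # "HELLO WORLD -"
```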

* Scoping rules - symbol binding
* Binding a value to a symbol, searches through environments
  * Environment is a collection of (symbol, value) pairs
* Global environment is always first element of search list
* Lexical scoping for free variables
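* Matters most when functions are defined inside other functions
* A free variable such as `y` is looked up in the environment where the function was defined (lexical scoping), not in the environment it was called from (dynamic scoping)
* Optimization
  * Pass the objective function to `optim`, `nlm`, or `optimize`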
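
The scoping rules can be seen in a few lines. This is a sketch with made-up function names (`make_power`, `f`, `g`), not code from the notes.

```r
# A function defined inside another function remembers the defining environment
make_power <- function(n) {
  function(x) x^n      # n is a free variable, found where the function was defined
}
cube <- make_power(3)
cube(2)                # 8

# Lexical scoping: g looks up y where g was defined, not where it is called
y <- 10
f <- function(x) {
  y <- 2
  y^2 + g(x)
}
g <- function(x) {
  x * y                # y is 10 here, not the y = 2 inside f
}
f(3)                   # 2^2 + 3 * 10 = 34

# Passing a function to an optimizer
best <- optimize(function(x) (x - 2)^2, interval = c(0, 5))
best$minimum           # close to 2
```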

* Dates and Times
* `Date` class
* `POSIXct` or `POSIXlt` classes for times
  * `POSIXct` stores a time as one large number (seconds since 1970-01-01), which is good for storing times in a data frame
  * `POSIXlt` stores a list with other info, such as day of the week and day of the month
  * `weekdays`, `months`, `quarters`
  * Have `as.POSIXct` and `as.POSIXlt` coercion
* `x <- as.Date("1970-01-01")`
* `strptime` function to convert character vectors into POSIXlt time formats

    datestring <- c("January 10, 2012 10:40", "December 9, 2011 9:10")
    x <- strptime(datestring, "%B %d, %Y %H:%M")
    x

* Can add and subtract dates, as well as do `==` and `<=` comparisons

    d1 <- Sys.Date()
    class(d1)
    unclass(d1)    # days since 1970-01-01
    d1
    d2 <- as.Date("1969-01-01")
    unclass(d2)    # negative: before 1970-01-01
    t1 <- Sys.time()
    class(t1)
    unclass(t1)    # seconds since 1970-01-01
    t2 <- as.POSIXlt(Sys.time())
    class(t2)
    t2
    unclass(t2)
    str(unclass(t2))
    t2$min
    weekdays(d1)
    months(t1)
    quarters(t2)
    ## strptime()
    t3 <- "October 17, 1986 08:24"
    t4 <- strptime(t3, "%B %d, %Y %H:%M")
    t4
    class(t4)
    Sys.time() > t1
    Sys.time() - t1
    difftime(Sys.time(), t1, units = 'days')

* Can keep track of leap years, leap seconds, daylight savings, and time zones
* `unclass()` shows the underlying number: days since 1970-01-01 for `Date`, seconds for `POSIXct`
* `difftime` for control over units
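
A sketch with fixed dates, so the results do not depend on when it is run. The time zone is set to UTC and a numeric date format is used to avoid locale differences; the specific dates are only examples.

```r
# Dates are stored as days since 1970-01-01
unclass(as.Date("1970-01-02"))                          # 1
span <- as.Date("2012-03-01") - as.Date("2012-02-28")   # 2012 is a leap year
as.numeric(span)                                        # 2

# POSIXct is one number of seconds; POSIXlt is a list of components
ct <- as.POSIXct("2012-01-10 10:40:00", tz = "UTC")
lt <- as.POSIXlt(ct)
lt$hour   # 10
lt$min    # 40

# strptime() parses text into POSIXlt; difftime() controls the units
p <- strptime("2012-01-10 09:10", "%Y-%m-%d %H:%M", tz = "UTC")
gap <- difftime(ct, p, units = "mins")
as.numeric(gap)   # 90
ct > p            # TRUE
```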

* Swirl
* Logic
  * `&` and `&&` operators: Both the left and right operands must be true for the expression to be true
  ```
TRUE & c(TRUE, FALSE, FALSE) # TRUE is recycled across each element of the right operand
# TRUE FALSE FALSE
TRUE && c(TRUE, FALSE, FALSE) # older versions of R evaluated only the first member of the right operand
# TRUE (in R 4.3.0 and later, a length > 1 operand to && is an error)
   ```
 * `|` and `||` (OR) operators: only one operand needs to be true

    TRUE | c(TRUE, FALSE, FALSE)    # TRUE is recycled across the right operand: TRUE TRUE TRUE
    TRUE || c(TRUE, FALSE, FALSE)   # TRUE: the left operand already decides the result, so the right is never evaluated

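* Can chain: `5 > 8 || 6 != 8 && 4 > 3.9` (`&&` is evaluated before `||`, so this is TRUE)
* `isTRUE()`: TRUE only if the argument is a single TRUE
* `identical()`: TRUE if two objects are exactly equal
* `xor` is exclusive OR: TRUE when exactly one of the two arguments is TRUE
* `which()`: returns the indices of the TRUE elements of a logical vector
* `any()`: TRUE if one or more elements in a logical vector are TRUE
* `all()`: TRUE if all elements in a logical vector are TRUE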

    telegram <- function(...){
      paste("START", ..., "STOP")
    }

    # evaluate() is written earlier in the same swirl lesson; it calls its first argument on its second
    evaluate(function(x){x[length(x)]}, c(8, 4, 0))

    mad_libs <- function(...){
      args <- list(...)
      place <- args[["place"]]
      adjective <- args[["adjective"]]
      noun <- args[["noun"]]
      paste("News from", place, "today where", adjective, "students took to the streets in protest of the new", noun, "being installed on campus.")
    }

    "%p%" <- function(left, right){
      paste(left, right)
    }
    "I" %p% "love" %p% "R!"
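
The logic helpers listed above, tried on a small vector (a minimal sketch; `ints` is an illustrative name):

```r
isTRUE(6 > 4)                     # TRUE
identical(c(1, 2), c(1, 2))       # TRUE
xor(TRUE, FALSE)                  # TRUE
xor(TRUE, TRUE)                   # FALSE

ints <- c(3, 12, 7, 15)
which(ints > 10)                  # 2 4 (positions, not values)
any(ints < 0)                     # FALSE
all(ints > 0)                     # TRUE

# && binds more tightly than ||, so this is FALSE || (TRUE && TRUE)
5 > 8 || 6 != 8 && 4 > 3.9        # TRUE
```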


## Week 3: Loop Functions and Debugging
* Execute a loop over an object or set of objects
* `lapply`: Loop over a list and evaluate a function on each element
 * Applies a function to each element of a list; other objects are coerced with `as.list()`
 * `x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))`
 * `lapply(x, mean)`
 * Make use of *anonymous* functions
   * `lapply(x, function(elt) elt[, 1])` to extract the first column of each matrix in a list
* `sapply`: Same as `lapply` but try to simplify the result
 * Variant of `lapply`
 * Returns a vector if the result is a list where every element has length 1
 * Returns a matrix if the result is a list where every element is a vector of the same length (>1)
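
A sketch of how the return types differ, using small fixed inputs so the results are predictable; the list contents are made up.

```r
x <- list(a = 1:4, b = c(2, 4, 6))

means_list <- lapply(x, mean)   # always a list: a = 2.5, b = 4
means_vec  <- sapply(x, mean)   # simplified to a named numeric vector
ranges     <- sapply(x, range)  # each result has length 2, so a 2 x 2 matrix

# Anonymous function: first column of each matrix in a list
mats <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
first_cols <- lapply(mats, function(elt) elt[, 1])
first_cols$b                    # 1 2 3
```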
* `apply`: Apply a function over the margins of an array (matrices)
 * Evaluates a function over the rows or columns of a matrix
 * Less typing than writing a loop
 * `x <- matrix(rnorm(200), 20, 10)`
   * `apply(x, 2, mean)` : the mean of each of the 10 columns (margin 2), a vector of length 10
   * `apply(x, 1, sum)` : the sum of each of the 20 rows (margin 1), a vector of length 20
 * Some shortcuts
   * `rowSums` = `apply(x, 1, sum)`
   * `rowMeans` = `apply(x, 1, mean)`
   * `colSums` = `apply(x, 2, sum)`
   * `colMeans` = `apply(x, 2, mean)`
 * `apply(x, 1, quantile, probs = c(0.25, 0.75))`
 * Averaging the matrices in an array
   * `a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))`
   * `apply(a, c(1, 2), mean)` : keep the 1st and 2nd dimensions, collapse the 3rd
   * `rowMeans(a, dims = 2)` : gives the same result as the `apply` call
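
The margins are easier to follow on a small matrix with known values (a sketch; the inputs are chosen only for illustration):

```r
x <- matrix(1:6, 2, 3)

apply(x, 2, sum)   # column sums: 3 7 11
apply(x, 1, sum)   # row sums: 9 12

# Extra arguments are passed on to the function; one column per row of x
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))
dim(q)             # 2 2

# Collapse the third dimension of an array, keeping dimensions 1 and 2
a <- array(1:8, c(2, 2, 2))
avg <- apply(a, c(1, 2), mean)
avg                # 3 4 5 6 (down columns)
```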
* `tapply`: Apply a function over subsets of a vector
 * Takes a factor variable and computes a summary (e.g. the mean or range) for each group
 ``` 
x<- c(rnorm(10), runif(10), rnorm(10,1))
f<- gl(3, 10) # 3 levels of 10
tapply(x, f, mean)
tapply(x, f, range)
 ```
 * `tapply` is useful because it splits a vector into pieces, applies a summary statistic or function to each piece, and then brings the results back together
* `mapply`: Multivariate version of `lapply`
 * Applies a function in parallel over several sets of arguments, whereas the other apply functions iterate over a single object
 * Tedious: `list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))`
   * Better: `mapply(rep, 1:4, 4:1)`
   ```
noise <- function(n, mean, sd) {
rnorm(n, mean, sd)
}
noise(1:5, 1:5, 2) # not vectorized over n: returns only 5 numbers, one per mean
mapply(noise, 1:5, 1:5, 2)
# same as list(noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2), noise(4, 4, 2), noise(5, 5, 2))
   ```
* `split` is an auxiliary function that is useful with `lapply` or `sapply`
 * Takes a factor variable and returns a list
   * List can then be used by `lapply` or `sapply`
   ``` 
x<- c(rnorm(10), runif(10), rnorm(10,1))
f<- gl(3, 10) # 3 levels of 10
split(x, f)
lapply(split(x, f), mean)
   ```
   * Column means of 3 variables for each month
   ``` 
s <- split(airquality, airquality$Month) # Use Month as a factor
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
   ```
 * Split on more than one level
 ```
x <- rnorm(10)
f1 <- gl(2, 5) # 2 levels
f2 <- gl(5, 2) # 5 levels
interaction(f1, f2)
# 10 total levels; passing list(f1, f2) to split does the same thing as interaction
str(split(x, list(f1, f2), drop = TRUE))
# drop = TRUE drops the empty levels
 ```
* Debugging
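* Indications that something is wrong: `message` (a notification; execution continues), `warning` (something unexpected; execution continues), `error` (a fatal problem; execution stops), `condition` (the generic concept behind all three)
*  Questions to ask: What was the input and how was the function called? What output or results did you expect? What did you get, and how does it differ? Can you reproduce the problem?
* Interactive tools
 * `traceback`: prints the function call stack after an error; call it immediately after the error; does nothing if there was no error
 * `debug`: most handy; flags a function so you can step through its execution line by line
 * `browser`: suspends execution wherever it is called
 * `trace`: allows you to insert debugging code into a function at a specific place
 * `recover`: modifies the error behavior so you can choose a frame from a list: `options(error = recover)`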
* Insert print/cat statements in the function
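
The three kinds of condition can be seen in one small function. This is a sketch: `check_value` is a hypothetical function, and `tryCatch()` is a base R feature that these notes do not cover, used here only to capture the conditions without stopping the script.

```r
# Hypothetical function that signals each kind of condition
check_value <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")   # error: execution stops
  if (x < 0) warning("x is negative")             # warning: execution continues
  message("checked value ", x)                    # message: just a notification
  invisible(x)
}

# Capture the condition text instead of letting it interrupt the script
err <- tryCatch(check_value("a"), error = function(e) conditionMessage(e))
wrn <- tryCatch(check_value(-1), warning = function(w) conditionMessage(w))
err   # "x must be numeric"
wrn   # "x is negative"

# Interactively, debug(check_value) would step through the next call line by
# line, and traceback() right after an error would print the call stack.
```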

* Swirl
* `lapply` and `sapply`

    # flags, flag_shapes and unique_vals are objects created in the swirl lesson
    cls_list <- lapply(flags, class)               # class of each column in flags
    cls_vect <- sapply(flags, class)
    flag_colors <- flags[, 11:17]
    lapply(flag_colors, sum)                       # number of flags containing each color
    sapply(flag_colors, mean)                      # proportion of flags containing each color
    shape_mat <- sapply(flag_shapes, range)        # range of times a shape appears
    sapply(unique_vals, length)                    # number of unique values in each column
    lapply(unique_vals, function(elem) elem[2])    # own anonymous function taking the second element of each

* `vapply` and `tapply`
  * `vapply` lets you specify the expected format of the output and throws an error if the result does not match
  ```
vapply(flags, class, character(1))
tapply(flags$animate, flags$landmass, mean) # proportion of flags with animate objects by landmass group

  ```
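
Since `flags` only exists inside the swirl lesson, here is a sketch of the same calls on a small made-up data frame; `df` and `group` are assumptions for illustration.

```r
# Hypothetical stand-in for the flags data
df <- data.frame(n = 1:3, label = c("x", "y", "z"), value = c(0.5, 1.5, 2.5),
                 stringsAsFactors = FALSE)

# Each result must be a character vector of length 1
classes <- vapply(df, class, character(1))
classes   # n "integer", label "character", value "numeric"

# Asking for the wrong format is an error rather than a silent surprise
mismatch <- tryCatch(vapply(df, class, numeric(1)),
                     error = function(e) "vapply raised an error")

# tapply: mean of value within each group
group <- c("north", "south", "north")
by_group <- tapply(df$value, group, mean)
by_group  # north 1.5, south 1.5
```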

## Week 4: Simulation and Profiling
* `str` is a diagnostic summary function
* `summary()` works well too
* Simulation
* Generating random numbers for normal, poisson, binomial, exponential, gamma, etc...
  * `set.seed()` for reproducibility
  * r for random number generation
    * `rnorm` - random normal variates with a given mean and SD: `rnorm(n, mean = 0, sd = 1)`
    * `rpois` - random Poisson variates with a given rate
  * d for density
    * `dnorm` - Normal prob. density at a point with given mean and SD
  * p for cumulative distribution
    * `pnorm` - evaluate the CDF: `pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)`
  * q for quantile function
* Linear model

    set.seed(20)
    x <- rnorm(100)
    x2 <- rbinom(100, 1, 0.5)
    e <- rnorm(100, 0, 2)
    y <- 0.5 + 2 * x + e
    summary(y)
    plot(x, y)
    plot(x2, y)

    # Poisson
    log.mu <- 0.5 + 0.3 * x
    y2 <- rpois(100, exp(log.mu))
    summary(y2)
    plot(x, y2)

    sample(1:6, 4, replace = TRUE)                 # rolling four six-sided dice
    coinflips <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))
    rbinom(1, size = 100, prob = 0.7)              # one draw: number of successes in 100 trials
    flips2 <- rbinom(100, size = 1, prob = 0.7)    # 100 observations
    my_pois <- replicate(100, rpois(5, 10))
    cm <- colMeans(my_pois)
    hist(cm)
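
How the d/p/q/r prefixes relate for the normal distribution, and what `set.seed()` guarantees (a minimal sketch; the seed value is arbitrary):

```r
# Same seed, same random numbers
set.seed(1)
a <- rnorm(5)
set.seed(1)
b <- rnorm(5)
identical(a, b)      # TRUE

# p and q are inverses of each other
pnorm(0)             # 0.5
qnorm(0.5)           # 0
qnorm(pnorm(1.3))    # 1.3, up to floating-point error

# d gives the density at a point
dnorm(0)             # 1 / sqrt(2 * pi)
```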

* Random sampling
  * `sample(1:10, 4)` : vector and number chosen
    * `sample(x, replace = TRUE)` samples with replacement; `sample(x)` with no size just returns a random permutation of `x`

* R Profiler
* See how much time is spent in different parts of the program, but don't optimize early
* `system.time` returns time taken to evaluate (User time and elapsed time)
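* `Rprof()` starts the profiler and `summaryRprof()` summarizes its output; useful if you don't know where to start
  * Normalize `by.total` divides the time spent in each function by the total run time
  * Normalize `by.self` does the same but first subtracts the time spent in the functions it calls, so only the function's own time is counted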
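
A sketch of timing one expression with `system.time`. The function `slow_sum` is a hypothetical example, and the actual timings will vary by machine.

```r
# Deliberately loop-based sum so there is something to time
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i
  }
  total
}

timing <- system.time(result <- slow_sum(1e5))
result                  # 5000050000
timing[["user.self"]]   # CPU time charged to this R process
timing[["elapsed"]]     # wall-clock time
```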