03-R-Intro.Rmd


# Some Useful Bits of R

## You Gotta Have Style {#style-guide}

Good programming style is incredibly important. It makes your code easier to read
and edit. That leads to fewer errors.

Here is a brief style guide you are expected to follow for all code in this course:
For more detailed see <http://adv-r.had.co.nz/Style.html>, on which this is based.

1. **No long lines.** 

    Lines of code should have at most 80 characters.

    Programming lines should not wrap, you should choose the line breaks yourself.
    Choose them in natural places.
    In R markdown, long codes lines don't wrap, they flow off the page, 
    so the end isn't visible.  (And it makes it obvious that you didn't 
    look at your own print out.)
    
    ```{r ch03-no-long-lines}
    # Don't ever use really long lines.  Not even in comments.  They spill off the page and make people wonder what they are missing. 
    ```
    
2. **Use your space bar.** 

    There is a reason it is the largest key on the keyboard.  Use it often.
    Spaces **after commas** (always). Spaces **around operators** (always).
    Spaces after the comment symbol `#` (always). Use other spaces 
    judiciously to align similar code chunks to make things easier to
    read or compare.

    ```{r ch03-spacy, eval = FALSE}
    x<-c(1,2,4)+5          # BAD BAD BAD
    x <- c(1, 2, 4) + 5    # Ah :^)
    ```
   
3. But don't go crazy with the space bar.

    There are a few places you should not use spaces:
    
    * after open parentheses or before closed parentheses
    * between function names and the parentheses that follow
    
4. **Indent** to show the structure of your code.

    Use 2 spaces to indent (to keep things from drifting right too quickly).
    
    Fortunately, this is really easy. Highlight your code and hit 
    `<CTRL>`-i (PC) or `<command>`-i (Mac).  If the indention looks odd to you,
    you have most likely messed up commas, quotes, parentheses, or 
    curly braces.

5. **Choose names wisely and consistently**.

    Naming things is hard, but take a moment to choose good names, and go back
    and change them if you come up with a better name later.  Here are some
    helpful hints:
    
    a. Very **short names** should only be used for a very **short time** 
    (a couple lines of code). Else we tend to forget what they meant.
    Avoid names like `x`, `f`, etc. unless the use is brief and mimics
    some common mathematical formula.
    
    b. **Break long names visually**.  Common ways to do this are with
    a dot (`.`), an underscore `_`, or `camelCase`. There are R coders who
    prefer all three, but don't mix and match for similar kinds of things,
    that just makes it harder to remember what to do the next time.
    
    ```{r ch03-long-names}
    good_name <- 10
    good.name <- 10
    goodName  <- 10    # note alignment via extra space in this line
    really_terrible.ideaToDo <- -5
    ```
    
    **The trend in R is toward using underscore (`_`)** and I recommend it. 
    Older code often used dot (`.`).  CamelCase is the least common in R.
    
    c. Recommendation: capitalize data frame names; use lower case for 
    variables inside data frames.
    
        This is not a common convention in R, but it can really help to
        keep things straight.  I'll do this in the data sets I create, but 
        when we use other data sets, they may not follow this convention.
        
    d. Avoid using names that are already in use by R.
    
        This can be hard to avoid when you are starting out because you don't 
        know what all is defined.  Here are a few things to avoid.
        
        ```{r ch03-avoid, eval = FALSE }
        T           # abbreviation for TRUE
        F           # abbreviation for FALSE
        c           # used to concetenate vectors and lists
        df          # density function for f distributions
        dt          # density function for t distributions
        ```
6. **Use comments (`#`)**, but use them for the right thing.

    Comments can be used to clarify names, point out subtlties in code, etc.
    They should not be used for your analysis or discussion of results.
    Don't comment things that are obvious without comment.  Comments should add
    value.
    
    ```{r ch03-comments}
    x <- 4   # set x to 4  <----------------------- no need for this comment
    x <- 4   # b/c there are four grade levels in the study  <------- useful
    ```

7. **Exceptions** should be exceptional.

    No style guide works perfectly in all situations. Ocassionally you may
    need to violate the style guide.  But these instances should be rare and 
    should have a good reason.  They should not arise form your sloppiness 
    or laziness.

### An additional note about homework

When you do homework, I want to see your code and the results (and your 
discussion of those results).  Writing in R Markdown makes this all easy
to do.

But make sure that I can see all the necessary things to evaluate what you
are doing. You have access to your code and can investigate variables, etc.
But make sure I can see what's going one in the document.  This often
means displaying intermediate results.  Once common way to do this is
with a semi-colon:

```{r ch03-show-intermediate}
x <- 57 * 23; x
```

## Vectors, Lists, and Data Frames

### Vectors

In R, a vector is a homogeneous ordered collection (indexing starts at 1).
By homogeneous, we mean that each element is he same kind of thing.

Short vectors can be created using `c()`:

```{r ch03-vectors}
x <- c(1, 3, 5)
x
```

Evenly spaced sequences can be created using `seq()`:

```{r ch03-seq}
x <- seq(0, 100, by = 5); x
y <- seq(0, 1, length.out = 11); y
0:10      # short cut for consecutive integers
```

Repeated values can be created with `rep()`:

```{r ch03-rep}
rep(5, 3)
rep(c(1, 2, 3), each  = 2)
rep(c(1, 2, 3), times = 2)
rep(c(1, 2, 3), times = c(1, 2, 3))
rep(c(1, 2, 3), each  = c(1, 2, 3))    # Ack! see warning message.
```


When a function acts on a vector, there are several things that could happen.

* One result can be computed from the entire vector.

    ```{r ch03-vectors-and-functions-01}
    x <- c(1, 3, 6, 10)
    length(x)
    mean(x)
    ```

* The function may be applied to each element of the vector and a vector of 
results returned. (Such functions are called **vectorized**.)

    ```{r ch03-vectors-and-functions-02}
    log(x)
    2 * x
    x^2
    ```

* The first element of the vector may be used and the others ignored.
(Less common but dangerous -- be on the lookout. See example above.)

Items in a vector can be accessed using `[]`:

```{r ch03-square-bracket}
x <- seq(10, 20, by = 2)
x[2]
x[10]       # NA indicates a missing value
x[10] <- 4          
x           # missing values filled in to make room!
is.na(x)    
```

In addition to using integer indices, there are two other ways to access
elements of a vector: names and logicals.
If the items in a vector are named, names are displayed when
a vector is displayed, and names can be used to access elements.

```{r ch03-named-vector}
x <- c(a = 5, b = 3, c = 12, 17, 1)
x
names(x)
x["b"]
```

Logicals (`TRUE` and `FALSE`) are very interesting in R. In indexing, they
tell us which items to keep and which to discard.

```{r ch03-logical-indexing}
x <- (1:10)^2
x < 50
x[x < 50]
x[c(TRUE, FALSE)]   # T/F recycled to length 10
which(x < 50)
```

### Lists

Lists are a lot like vectors, but

* Create a list with `list()`.
* The elements can be different kinds of things (including other lists)
* Use `[[ ]]` to access elements.  You can also use `$` to access named elements
* If you use `[]` you will get a list back not an element.

```{r ch03-lists}
# a messy list
L <- list(5, list(1, 2, 3), x = TRUE, y = list(5, a = 3, 7))  
L
L[[1]]
L[[2]]
L[1]
L$y
L[["x"]]
L[1:3]
glimpse(L)
```

### Data frames for rectangular data

Rectangular data is organized in rows and columns (much like an excel
spreadsheet).  These rows and columns have a particular meaning:

  * Each **row** represents one **observational unit**.  Observational units go
  by many others names depending whether they are people, or inanimate objects,
  our events, etc.  Examples include case, subject, item, etc. Regardless, the 
  observational units are the things about which we collect information, and each
  one gets its own row in rectangular data.
  
  * Each **column** represents a **variable** -- one thing that is "measured" and 
  recorded (at least in principle, some measurements might be missing) for each
  observational unit. 
  
**Example** In a study of nutritional habits of college students, our observational
units are the college students in the study.  Each student gets her own row in the data
frame.  The variables might include things like an ID number (or name), sex, height, 
weight, whether the student lives on campus or off, what type of meal plan they have
at the dining hall, etc., etc.  Each of these is recorded in a separate column.

Data frames are the standard way to store **rectangular data** in R.
Usually variables (elements of the list) are vectors, but this isn't required, sometimes you will see list variables in data frames.
Each element (ie, variable) must have the same length (to keep things rectangular).

Here, for example, are the first few rows of a data set called `KidsFeet`:

```{r ch03-KidsFeet-00}
library(mosaicData)   # Load package to make KidsFeet data available
head(KidsFeet)        # first few rows
```

#### Accessing via [ ]

We can access rows, columns, or individual elements of a data frame using `[ ]`.
This is the more usual way to do things.

```{r ch03-KidsFeet-01}
KidsFeet[, "length"]
KidsFeet[, 4]
KidsFeet[3, ]
KidsFeet[3, 4]
KidsFeet[3, 4, drop = FALSE]  # keep it a data frame
KidsFeet[1:3, 2:3]
```

By default,

* Accessing a row returns a 1-row data frame.
* Accessing a column returns a vector (at least for vector columns)
* Accessing a element returns that element (technically a vector with
one element in it).

#### Accessing columns via $

We can also access individual variables using the `$` operator:

```{r ch03-KidsFeet-02}
KidsFeet$length
KidsFeet$length[2]
KidsFeet[2, "length"]
KidsFeet[2, "length", drop = FALSE]    # keep it a data frame
```


As we will see, there are are other tools that will help us avoid 
needing to us `$` or `[ ]` for access columns in a data frame.  This is 
especially nice when we are working with several variables all coming
from the same data frame.

#### Accessing by number is dangerous

Generally speaking, it is safer to access things by name than by number when
that is an option. 
It is easy to miscalculate the row or column number you need, and if rows 
or columns are added to or deleted from a data frame, the numbering can change.

#### Implementation

Data frames are implemented in R as a special type (technically, class) of list.
The elements of the list are the *columns* in the data frame.  Each column must
have the same length (so that our data frame has coherent *rows*).  Most often the 
columns are vectors, but this isn't required.

This explains why `$` works the way it does -- we are just accessing
one item in a list.  It also means that we can use `[[ ]]` to access
a column:

```{r ch03-KidsFeet-03}
KidsFeet[["length"]]
KidsFeet[[4]]
KidsFeet[4]
```

### Other types of data

Some types of data do not work well in a rectangular arrangement of a data frame,
and there are many other ways to store data.  In R, other types of data commonly
get stored in a list of some sort.

## Plotting with ggformula

R has several plotting systems.  Base graphics is the oldest. `lattice` and 
`ggplot2` are both built on a system called grid graphics.  `ggformula`
is built on `ggplot2` to make it easier to use and to bring in some of 
the advantages of `lattice`.

You can find out about more about `ggformula` at <https://projectmosaic.github.io/ggformula/news/index.html>.


## Creating data with expand.grid()

We will frequently have need of synthetic data that includes all combinations
of some variable values.  `expand.grid()` does this for us:

```{r ch03-expand-grid}
expand.grid(
  a = 1:3, 
  b = c("A", "B", "C", "D"))
```

## Transforming and summarizing data dplyr and tidyr

See the tutorial at 
<http://rsconnect.calvin.edu/wrangling-jmm2019> or 
<https://rpruim.shinyapps.io/wrangling-jmm2019>

## Writing Functions

### Why write functions?

There are two main reasons for writing functions.

1. You may want to use a tool that requires a function as input.
  
    To use `integrate()`, for example, you must provide the integrand as a function.
    
2. To make your own work easier.

    Functions make it easier to reuse code or to break larger tasks into 
    smaller parts.
    
### Function parts

Functions consist of several parts. Most importantly

1. An argument list.

    A list of named inputs to the function.  These may have default values (or not).
    There is also a special argument `...` which gathers up any other arguments
    provided by the user. Many R functions make use of `...`
    
    ```{r ch03-args}
    args(ilogit)  # one argument, called x, no default value
    ```

2. The body.

    This is the code the tells R to do when the function is executed.
    
    ```{r ch03-body}
    body(ilogit)
    ```

3. An environment where code is executed.

    Each function has its own "scratch pad" where it can do work without
    interfering with global computations.  But environments in R are 
    nested, so it is possible to reach outside of this narrow environment
    to access other things (and possible to change them).  For the most
    part we won't worry about this, but if you use a variable not defined 
    in your function but defined elsewhere, you may see unexpecte results.
    
If you type the name of a function without parenthesis, you will see all three
parts listed:

```{r ch03-ilogit-look}
ilogit
```

### The function() function has its function

To write a function we use the `function()` function to specify the 
arguments and the body.  (R will assign an environment for us.)

The general outline is 

```{r ch03-function-outline, eval = FALSE}
my_function_name <- 
  function(arg1 = default1, arg2 = default2, arg3, arg4, ...) {
    
    # stuff for my function to do
  }
```

We may include as many named arguments as we like, and some or all or none
of them may have default values.  The results of the last line of the function
are returned.  If we like, we can also use the `return()` function to make it
clear what is being returned when.

Let's write a function that adds.  (Redundant, but a useful illustration.)

```{r ch03-write-function}
foo <- function(x, y = 5) {
  x + y     # or return(x + y)
}
foo(3, 5)
foo(3)
foo(2, x = 3)   # Note: this makes y = 2
foo(x = 1:3, y = 100)
foo(x = 1:3, y = c(100, 200, 300))  # vectorized!
foo(x = 1:3, y = c(100, 200))       # You have been warned!
```

Here is a more useful example. 
Suppose we want to integrate $f(x) = x^2 (2-x)$ on the interval from 0 to 2.
Since this is such a simple function, if we are not going to reuse it,
we don't need to bother naming it, we can just create the function inside
our call to `integrate()`.

```{r ch03-integrate}
integrate(function(x) { x^2 * (2-x) }, 0, 2)
```

## Some common error messages

### object not found

If R claims some object is not found, the two most likely causes are 

* a typo -- if you spell the name of the object slightly differently, R can't figure out
what you mean

    ```{r ch03-object-not-found, error = TRUE}
    blah <- 17
    bla
    ```

* forgetting to load a package -- if the object is in a package, that package must be
loaded

    ```{r ch03-package-load-avoids-error, error = TRUE}
    detach("package:mosaic", unload = TRUE)
    detach("package:ggformula", unload = TRUE)   # without packages, no gf_line()
    gf_line()
    library(ggformula)      # reload package; now things work
    gf_line()
    ```
### Package inputenc Error: Unicode char not set up for use with LaTeX.

Sometimes if you copy and paste text from a web page or PDF document you will get
symbols knitr doesn't know how to handle.  Smart quotes, ligatures, and other special
characters are the most likely cause.

`tools::showNonASCIIfile()` can help you locate non-ASCII characters in a file.

### Any message mentioning yaml

YAML stand for yet another markup language.  The first part of an R Markdown
file is called the YAML header.  If you get a yaml error when you knit a
document, most likely you have messed up the YAML header someone.  If you don't
see the problem, you can start a new document and copy-and-paste the contents
(without the TAML header) into the new document (after its YAML header).

## Exercises {#ch03-exercises}


1. Create a function in R that converts Fahrenheit temperatures to 
Celsius temperatures.  [Hint: $C = (F-32) \cdot 5/9$.]

    What you turn in should show
    
    a. the code that defines your function.
    b. some test cases that show that your function is working.  (Show that
    -40, 32, 98.6, and 212 convert to -40, 0, 37, and 100.)
    Note: you should be able to test all these cases by calling the function 
    only once.  Use `c(-40, 32, 98.6, 212)` as the input.

2. See if you can predict the output of each line below. Then run in R to see if 
you are correct.  If you are not correct, see if you can figure out why R does what
it does (and make a note so you are not surprised the next time).

```{r ch03-guess-and-check, eval = FALSE}
odds <- 1 + 2 * (0:4); odds
primes <- c(2, 3, 5, 7, 11, 13)
length(odds)
length(primes)
odds + 1
odds + primes
odds * primes
odds > 5
sum(odds > 5)
sum(primes < 5 | primes > 9)
odds[3]
odds[10]
odds[-3]
primes[odds]
primes[primes >= 7]
sum(primes[primes > 5])
sum(odds[odds > 5])
odds[10] <- 1 + 2 * 9
odds
y <- 1:10
y <- 1:10; y
(x <- 1:5)
```

3. The problem uses the `KidsFeet` data set from the `mosaicData` package.
The hints are suggested functions that might be of use.

    a. How many kids are represented in the data set. [Hint: `nrow()` or `dim()`]
    b. Which of the variables are factors? [Hint: `glimpse()`]
    c. Add a new variable called `foot_ratio` that is equal to
    `length` divided by `width`. [Hint: `mutate()`]
    c. Add a new variable called `biggerfoot2` that has values 
    `"dom"` (if `domhand` and `biggerfoot` are the same) and 
    `"nondom"` (if `domhand` and `biggerfoot` are different).
    [Hint: `mutate()`, `==`, `ifelse()`]
    d. Create new data set called `Boys` that contains only the boys.
    [Hint: `filter()`, `==`]
    e. What is the name of the boy with the largest `foot_ratio`?
    Show how to find this programmatically, don't just scan through
    the whole data set yourself.  [Hint: `max()` or `arrange()`]
    
## Footnotes