03-data-frames.Rmd

---
layout: topic
title: Using data in data frames
author: Data Carpentry contributors
minutes: 30
---

```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```

------------

> ## Learning Objectives
>
> * Extract values from vectors and data frames.
> * Perform operations on columns in a data frame.
> * Append columns to a data frame.
> * Create subsets of a data frame.


------------

In this lesson you will learn how to extract and manipulate data stored in data frames in R. We will work with the *E. coli* metadata file that we used previously. Be sure to read this file into a dataframe named `metadata`, if you haven't already done so.

```{r, eval=TRUE,  purl=FALSE}
metadata <- read.csv('data/Ecoli_metadata.csv')
```

Because the columns of a data frame are vectors, we will first learn how to extract elements from vectors and then learn how to apply this concept to select rows and columns from a data frame.  

# Extracting values with indexing and sequences

```{r, echo=FALSE, purl=TRUE}
## Indexing and sequences
```

## Vectors

Let's create a vector containing the first ten letters of the alphabet.

```{r, purl=FALSE, eval=FALSE}
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
```

In order to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we do in math. R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.

So, to extract the 2nd element of `ten_letters` we type:
 
```{r, purl=FALSE, eval=FALSE}
ten_letters[2]
```

We can extract multiple elements at a time by specifying mulitple indices inside the square brackets as a vector. Notice how you can use `:` to make a vector of all integers two numbers. 

```{r, purl=FALSE, eval=FALSE}
ten_letters[c(1,7)]

ten_letters[3:6]

ten_letters[10:1]

ten_letters[c(2, 8:10)]

```

Quick exercise / formative assessment: Select every other element in `ten_letters`.

What if we were dealing with a much longer vector? We can use the `seq()` function to quickly create sequences of numbers.

```{r, purl=FALSE, eval=FALSE}
seq(1, 10, by = 2)
seq(20, 4, by = -3)
```

<!--
  Consider including:
  # Create sequences between two numbers, given the number of values (length.out = number of values)
  seq(1, 10, length.out = 2)
  seq(20, 4, length.out = 3)
  
  and discuss why they differ.
-->

> ## Exercise
> 
> Fill in the blank to select the even elements of ten_letters using the seq() function.
>
> ten_letters[____________]
> 
> > ## Solution
> > ten_letters[seq(2, 10, by = 2)]
> {: .solution}
{: .challenge}


## Data frames

The metadata data frame has rows and columns (it has 2 dimensions), if we want to
extract some specific data from it, we need to specify the "coordinates" we want
from it. Row numbers come first, followed by column numbers (i.e. [row, column]).

```{r, purl=FALSE, eval=FALSE}
metadata[1, 2]   # 1st element in the 2nd column 
metadata[1, 6]   # 1st element in the 6th column
metadata[1:3, 7] # First three elements in the 7th column
metadata[3, ]    # 3rd element for all columns
metadata[, 7]    # Entire 7th column
```


> ## Challenge
> 
> The function `nrow()` on a `data.frame` returns the number of rows. For example, try typing nrow(metadata)`.
> Use `nrow()` and `seq()` to create a new data frame called `meta_by_2` that includes all even numbered rows of `metadata`.
> 
> ## Solution
> > meta_data[seq(2, nrow(metadata), by = 2, ]
> > 
> >
> {: .solution}
{: .challenge}

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Sometimes the column number for a particular variable can change if your analysis adds or removes columns. The best practice when working with columns in a data frame is to refer to them by name. This also makes your code easier to read and your intentions clearer.

There are two ways to select a column by name from a data frame:

* Using `dataframe[ , "column_name"]`
* Using `dataframe$column_name`

You can do operations on a particular column, by selecting it using the `$`
sign. In this case, the entire column is a vector. You can use
`names(metadata)` or `colnames(metadata)` to remind yourself of the column names.
For instance, to extract all the strain information from our datasets:

```{r, eval=FALSE}
# Select the strain column from metadata
metadata[ , "strain"]

# Alternatively...
metadata$strain
```

The first method allows you to select multiple columns at once. Suppose we wanted strain and clade information:

```{r, eval=FALSE}
metadata[, c("strain", "clade")]
```

You can even access columns by column name _and_ select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:

```{r, eval=FALSE}
metadata[4:7, c("strain", "clade")]
```


<!--
Still need to address the following learning objectives:
 * Append columns to a data frame.
 * Create subsets of a data frame.

The following headings are just suggestions.

>

# Manipulating columns


## Mathematical operations

## Appending new columns


# Creating subsets