---
layout: topic
title: Using data in data frames
author: Data Carpentry contributors
minutes: 30
---
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
------------
> ## Learning Objectives
>
> * Extract values from vectors and data frames.
> * Perform operations on columns in a data frame.
> * Append columns to a data frame.
> * Create subsets of a data frame.
------------
In this lesson you will learn how to extract and manipulate data stored in data frames in R. We will work with the *E. coli* metadata file that we used previously. Be sure to read this file into a data frame named `metadata` if you haven't already done so.
```{r, eval=TRUE, purl=FALSE}
metadata <- read.csv('data/Ecoli_metadata.csv')
```
Because the columns of a data frame are vectors, we will first learn how to extract elements from vectors and then learn how to apply this concept to select rows and columns from a data frame.
# Extracting values with indexing and sequences
```{r, echo=FALSE, purl=TRUE}
## Indexing and sequences
```
## Vectors
Let's create a vector containing the first ten letters of the alphabet.
```{r, purl=FALSE, eval=FALSE}
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
```
To extract one or several values from a vector, we provide one or several indices in square brackets, just as we do in math. R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.
So, to extract the 2nd element of `ten_letters` we type:
```{r, purl=FALSE, eval=FALSE}
ten_letters[2]
```
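As a small sketch of what counting from 1 means in practice, the chunk below compares index 1, index 2, and index 0. The variable names `first`, `second`, and `zeroth` are made up for this illustration.

```{r, purl=FALSE, eval=FALSE}
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')

first  <- ten_letters[1]   # "a": the first element lives at index 1
second <- ten_letters[2]   # "b"
zeroth <- ten_letters[0]   # character(0): an empty vector, not an error
```

Note that asking for index 0 does not fail; R quietly returns an empty vector, which can hide mistakes carried over from 0-based languages.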
We can extract multiple elements at a time by specifying multiple indices inside the square brackets as a vector. Notice how you can use `:` to make a vector of all integers between two numbers.
```{r, purl=FALSE, eval=FALSE}
ten_letters[c(1,7)]
ten_letters[3:6]
ten_letters[10:1]
ten_letters[c(2, 8:10)]
```
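To see what the index vectors above contain before they are used for subsetting, it can help to build them on their own. This is only an illustrative sketch; the names `forward`, `backward`, `mixed`, and `picked` are hypothetical.

```{r, purl=FALSE, eval=FALSE}
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')

forward  <- 3:6          # 3 4 5 6
backward <- 10:1         # 10 9 8 7 6 5 4 3 2 1 (counts down)
mixed    <- c(2, 8:10)   # 2 8 9 10 (a single index combined with a range)

picked <- ten_letters[mixed]   # "b" "h" "i" "j"
```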
Quick exercise / formative assessment: Select every other element in `ten_letters`.
What if we were dealing with a much longer vector? We can use the `seq()` function to quickly create sequences of numbers.
```{r, purl=FALSE, eval=FALSE}
seq(1, 10, by = 2)
seq(20, 4, by = -3)
```
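The sketch below spells out what those two calls return and then applies the same idea to a longer vector. The vector `long_vector` and the choice of every third element are assumptions made for illustration only.

```{r, purl=FALSE, eval=FALSE}
odd_positions <- seq(1, 10, by = 2)    # 1 3 5 7 9
countdown     <- seq(20, 4, by = -3)   # 20 17 14 11 8 5

# A longer, made-up vector: the integers 101 to 200
long_vector <- 101:200

# Take every third element without typing the indices by hand
every_third <- long_vector[seq(1, length(long_vector), by = 3)]
head(every_third)     # 101 104 107 110 113 116
length(every_third)   # 34
```

Using `length(long_vector)` as the upper bound means the code keeps working if the vector grows or shrinks.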
<!--
Consider including:
# Create sequences between two numbers, given the number of values (length.out = number of values)
seq(1, 10, length.out = 2)
seq(20, 4, length.out = 3)
and discuss why they differ.
-->
> ## Exercise
>
> Fill in the blank to select the even elements of `ten_letters` using the `seq()` function.
>
> ten_letters[____________]
>
> > ## Solution
> > ten_letters[seq(2, 10, by = 2)]
> {: .solution}
{: .challenge}
## Data frames
The `metadata` data frame has rows and columns (it has 2 dimensions). To extract
specific data from it, we need to specify the "coordinates" we want.
Row numbers come first, followed by column numbers (i.e. `[row, column]`).
```{r, purl=FALSE, eval=FALSE}
metadata[1, 2] # 1st element in the 2nd column
metadata[1, 6] # 1st element in the 6th column
metadata[1:3, 7] # First three elements in the 7th column
metadata[3, ] # 3rd row, all columns
metadata[, 7] # Entire 7th column
```
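If you do not have the metadata file at hand, the same `[row, column]` pattern can be tried on a small stand-in. The data frame `toy` below, including its column names and values, is invented for this sketch and is not part of the lesson data.

```{r, purl=FALSE, eval=FALSE}
# Toy data frame; column names and values are made up for illustration
toy <- data.frame(
  sample_id  = c("s1", "s2", "s3", "s4"),
  generation = c(0, 500, 1000, 1500),
  stringsAsFactors = FALSE
)

toy[2, 2]      # 500: row 2, column 2
toy[1:3, 2]    # 0 500 1000: rows 1 to 3 of column 2

third_row  <- toy[3, ]   # a one-row data frame (all columns of row 3)
second_col <- toy[, 2]   # a plain numeric vector (all rows of column 2)
```

Leaving the row position empty keeps every row, and leaving the column position empty keeps every column.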
> ## Challenge
>
> The function `nrow()` on a `data.frame` returns the number of rows. For example, try typing `nrow(metadata)`.
> Use `nrow()` and `seq()` to create a new data frame called `meta_by_2` that includes all even numbered rows of `metadata`.
>
> > ## Solution
> > meta_by_2 <- metadata[seq(2, nrow(metadata), by = 2), ]
> >
> >
> {: .solution}
{: .challenge}
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Sometimes the column number for a particular variable can change if your analysis adds or removes columns. The best practice when working with columns in a data frame is to refer to them by name. This also makes your code easier to read and your intentions clearer.
There are two ways to select a column by name from a data frame:
* Using `dataframe[ , "column_name"]`
* Using `dataframe$column_name`
You can perform operations on a particular column by selecting it with the `$`
operator. In this case, the entire column is a vector. You can use
`names(metadata)` or `colnames(metadata)` to remind yourself of the column names.
For instance, to extract all the strain information from our dataset:
```{r, eval=FALSE}
# Select the strain column from metadata
metadata[ , "strain"]
# Alternatively...
metadata$strain
```
The first method allows you to select multiple columns at once. Suppose we wanted strain and clade information:
```{r, eval=FALSE}
metadata[, c("strain", "clade")]
```
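One detail worth checking for yourself is the type of the result: selecting a single column by name gives back a vector, while selecting several gives back a data frame. The sketch below uses a made-up data frame `toy`; its column names and values are assumptions for illustration.

```{r, purl=FALSE, eval=FALSE}
# Toy data frame; column names and values are made up for illustration
toy <- data.frame(
  sample_id  = c("s1", "s2", "s3", "s4"),
  generation = c(0, 500, 1000, 1500),
  group      = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

one_col  <- toy[, "sample_id"]                  # character vector
two_cols <- toy[, c("sample_id", "group")]      # data frame with 2 columns

is.data.frame(one_col)    # FALSE
is.data.frame(two_cols)   # TRUE
```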
You can even access columns by column name _and_ select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:
```{r, eval=FALSE}
metadata[4:7, c("strain", "clade")]
```
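Because a column selected with `$` is an ordinary vector, functions and arithmetic apply to it directly. The following is a minimal sketch on an invented data frame `toy`; the column names and values are hypothetical stand-ins for real metadata.

```{r, purl=FALSE, eval=FALSE}
# Toy data frame; column names and values are made up for illustration
toy <- data.frame(
  sample_id  = c("s1", "s2", "s3", "s4"),
  generation = c(0, 500, 1000, 1500),
  stringsAsFactors = FALSE
)

names(toy)                                # "sample_id" "generation"

max_gen      <- max(toy$generation)       # 1500
in_thousands <- toy$generation / 1000     # 0.0 0.5 1.0 1.5

# Both ways of selecting a column by name return the same vector
same <- identical(toy$generation, toy[, "generation"])   # TRUE
```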
<!--
Still need to address the following learning objectives:
* Append columns to a data frame.
* Create subsets of a data frame.
The following headings are just suggestions.
-->
# Manipulating columns
## Mathematical operations
## Appending new columns
# Creating subsets