-
Notifications
You must be signed in to change notification settings - Fork 0
/
in_Tutorial06DataWrangling.Rmd
251 lines (164 loc) · 6.09 KB
/
in_Tutorial06DataWrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "Tutorial 6: Filtering Summarising Data"
output:
html_document:
toc: true
toc_float: yes
toc_depth: 2
df_print: paged
theme: flatly
highlight: tango
---
```{r setup, include=FALSE}
# OPTIONS -----------------------------------------------
knitr::opts_chunk$set(echo = TRUE,
warning=FALSE,
message = FALSE)
```
```{r, include=FALSE}
# PACKAGES-----------------------------------------------
# Tutorial packages
library(vembedr)
library(skimr)
library(yarrr)
library(RColorBrewer)
library(GGally)
library(tidyverse)
library(plotly)
library(readxl)
library(rvest)
library(biscale)
library(tidycensus)
library(cowplot)
library(units)
frost <- readxl::read_excel("./Data/DataG364_frostday.xlsx")
```
<br><br>
This tutorial covers subsetting, filtering, removing missing values..
# 6.3 Removing missing values
## 6.3.1 na.omit
The na.omit command will remove any row with ANY missing value. Note I'm overwriting my variable with the smaller one. It's up to you if you want to do this or make a new variable
```{r, eval=FALSE}
frost <- na.omit(frost)
```
<br>
## 6.3.2 Missing values in single columns
If you want to only remove rows with missing values in a single column, we can use the complete cases command
```{r, eval=FALSE}
frost <- frost[complete.cases(frost$Latitude), ]
frost <- frost[complete.cases(frost$Longitude), ]
```
E.g. take the frost table and ONLY include values where the longitude is not missing. Note, this can be any column name at all.
<br><br>
## Turning values into NA
Sometimes, you have -999 or something as NA. You can apply a RULE using the filter command below or like this to make them actually NAs
```{r}
frost$Dist_to_Coast[frost$Dist_to_Coast < -0] <- NA
```
E.g. here I select all the distance to coasts that are less than 0 (e.g impossible) and set them to NA.
<br>
# 6.2 Choosing columns
Sometimes you don't want all the columns. You can subset in a few ways.
For example, here are the frost data columns
```{r}
names(frost)
```
I can choose which column NUMBERS I want like this
```{r}
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c(1,6,2)]
#print the first 2 rows.
frostnew[1:2, ]
```
or the names like this
```{r}
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c("Dist_to_Coast","State","Latitude")]
#print the first 2 rows.
frostnew[1:2, ]
```
<br><br>
# 6.3 Renaming columns
Good column names don't have spaces or special characters. You can either rename in excel or using the names command e.g.
```{r}
names(frostnew)
```
```{r}
names(frostnew)[3]
```
```{r}
names(frostnew)[3] <- "hellloooooo"
```
```{r}
summary(frostnew)
```
<br><br>
# 6.4 Choosing rows/filtering
## 6.4A The dplyr filter command (tidyverse)
Filtering means selecting rows/observations based on their values. To filter in R, use the command `filter()` from the dplyr package. I tend to write it as `dplyr:filter()` to force it to be correct. Visually, filtering rows looks like this.
```{r, echo=FALSE, fig.align='center',out.width="80%"}
knitr::include_graphics('./Figures/fig30_DplyrFilter.png')
```
First, make sure that this library is in your library code chunk and that you have run it.
```{r,eval=FALSE}
library(tidyverse)
```
Let's filter our frost data.
```{r}
frost <- read_excel("./Data/DataG364_frostday.xlsx")
names(frost)
```
```{r}
head(frost)
```
Here we can apply the filter command to choose specific rows that meet certain criteria. There are a few filter commands out there, so I'm specifying I want the dplyr one.
```{r, results="hide"}
dplyr::filter(frost, State == "FL")
```
- The double equal operator `==` means equal to. The command is telling R to keep the rows in *frost* where the *State* column equals "FL".
- Note, there's no <- so I'm printing the results out on the screen. If you want to save your result, put `answer <- dplyr::filter(frost, State == "FL")`
<br><br>
### Several categories
If you want a few categories, choose the %in% operator, using the `c()` command to stick together the categories you want. For example, here are states in Florida and Virginia.
```{r, results="hide"}
filter(frost, State %in% c("FL","VA"))
```
We can also explicitly exclude cases and keep everything else by using the not equal operator `!=`. The following code *excludes* airport stations.
```{r, results="hide"}
filter(frost, Type_Fake != "Airport")
```
### Greater or lesser
What about filtering if a row has a value greater than a specified value? For example, Stations with an elevation greater than 500 feet?
```{r, results="hide"}
filter(frost, Elevation > 500)
```
Or less-than-or-equal-to 200 feet.
```{r, results="hide"}
# or save the result to a new variable
lowland_stations <- filter(frost, Elevation < 200)
summary(lowland_stations)
```
<br>
### Complex commands
In addition to comparison operators, filtering may also utilize logical operators that make multiple selections. There are three basic logical operators: `&` (and), `|` (or), and `!` (not). We can keep Stations with an *Elevation* greater than 300 **and** *State* in Alabama `&`.
```{r, results="hide"}
filter(frost, Elevation > 300 & State == "AL")
```
Use `|` to keep Stations with a *Type_Fake* of "Airport" **or** a last spring frost date after April (\~ day 90 of the year).
```{r, results="hide"}
airportwarm <- filter(frost,
Type_Fake == "Airport" | Avg_DOY_SpringFrost > 90 )
airportwarm
```
# 6.5 Sorting data
## 6.5A The dplyr arrange command (tidyverse)
\
We use the `arrange()` function to sort a data frame by one or more variables. You might want to do this to get a sense of which cases have the highest or lowest values in your data set or sort counties by their name. For example, let's sort in ascending order by elevation.
```{r}
arrange(frost, Latitude)
```
By default, `arrange()` sorts in ascending order. We can sort by a variable in descending order by using the `desc()` function on the variable we want to sort by. For example, to sort the dataframe by *Avg_DOY_SpringFrost* in descending order we use
```{r}
arrange(frost, desc(Avg_DOY_SpringFrost))
```
<br> <br>