in_Tutorial06DataWrangling.Rmd

---
title: "Tutorial 6: Filtering Summarising Data"
output:
  html_document:
    toc: true
    toc_float: yes
    toc_depth: 2
    df_print: paged
    theme: flatly
    highlight: tango
---

```{r setup, include=FALSE}
# OPTIONS -----------------------------------------------
knitr::opts_chunk$set(echo = TRUE, 
                      warning=FALSE, 
                      message = FALSE)
```

```{r, include=FALSE}
# PACKAGES-----------------------------------------------
# Tutorial packages
library(vembedr)
library(skimr)
library(yarrr)
library(RColorBrewer)
library(GGally) 
library(tidyverse)
library(plotly)
library(readxl)
library(rvest)
library(biscale)
library(tidycensus)
library(cowplot)
library(units)
frost    <- readxl::read_excel("./Data/DataG364_frostday.xlsx")

```

<br><br>

This tutorial covers subsetting, filtering, removing missing values..

# 6.3 Removing missing values

## 6.3.1 na.omit

The na.omit command will remove any row with ANY missing value. Note I'm overwriting my variable with the smaller one. It's up to you if you want to do this or make a new variable

```{r, eval=FALSE}
frost <- na.omit(frost)
```

<br>

## 6.3.2 Missing values in single columns

If you want to only remove rows with missing values in a single column, we can use the complete cases command


```{r, eval=FALSE}
frost <- frost[complete.cases(frost$Latitude), ]
frost <- frost[complete.cases(frost$Longitude), ]

```

E.g. take the frost table and ONLY include values where the longitude is not missing.   Note, this can be any column name at all.

<br><br>


## Turning values into NA

Sometimes, you have -999 or something as NA.  You can apply a RULE using the filter command below or like this to make them actually NAs

```{r}
frost$Dist_to_Coast[frost$Dist_to_Coast < -0]   <- NA
```

E.g. here I select all the distance to coasts that are less than 0 (e.g impossible) and set them to NA.


<br>

# 6.2 Choosing columns

Sometimes you don't want all the columns. You can subset in a few ways.

For example, here are the frost data columns
```{r}
names(frost)
```

I can choose which column NUMBERS I want like this

```{r}
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c(1,6,2)]

#print the first 2 rows.
frostnew[1:2, ]
```

or the names like this

```{r}
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c("Dist_to_Coast","State","Latitude")]

#print the first 2 rows.
frostnew[1:2, ]
```

<br><br>

# 6.3 Renaming columns

Good column names don't have spaces or special characters. You can either rename in excel or using the names command  e.g.

```{r}
names(frostnew)
```


```{r}
names(frostnew)[3]
```

```{r}
names(frostnew)[3] <- "hellloooooo"
```

```{r}
summary(frostnew)
```


<br><br>


# 6.4 Choosing rows/filtering

## 6.4A The dplyr filter command (tidyverse)

Filtering means selecting rows/observations based on their values. To filter in R, use the command `filter()` from the dplyr package. I tend to write it as `dplyr:filter()` to force it to be correct. Visually, filtering rows looks like this.

```{r, echo=FALSE, fig.align='center',out.width="80%"}
knitr::include_graphics('./Figures/fig30_DplyrFilter.png')
```
First, make sure that this library is in your library code chunk and that you have run it.

```{r,eval=FALSE}
library(tidyverse)
```


Let's filter our frost data.
```{r}
frost <- read_excel("./Data/DataG364_frostday.xlsx")
names(frost)
```


```{r}
head(frost)
```

Here we can apply the filter command to choose specific rows that meet certain criteria. There are a few filter commands out there, so I'm specifying I want the dplyr one. 


```{r, results="hide"}
dplyr::filter(frost, State == "FL")
```

 - The double equal operator `==` means equal to. The command is telling R to keep the rows in *frost* where the *State* column equals "FL".

 - Note, there's no <- so I'm printing the results out on the screen. If you want to save your result, put `answer <- dplyr::filter(frost, State == "FL")`

<br><br>

### Several categories

If you want a few categories, choose the %in% operator, using the `c()` command to stick together the categories you want. For example, here are states in Florida and Virginia.

```{r, results="hide"}
filter(frost, State %in% c("FL","VA"))
```

We can also explicitly exclude cases and keep everything else by using the not equal operator `!=`. The following code *excludes* airport stations.

```{r, results="hide"}
filter(frost, Type_Fake != "Airport")
```

### Greater or lesser

What about filtering if a row has a value greater than a specified value? For example, Stations with an elevation greater than 500 feet?

```{r, results="hide"}
filter(frost, Elevation > 500)
```

Or less-than-or-equal-to 200 feet.

```{r, results="hide"}

# or save the result to a new variable
lowland_stations <- filter(frost, Elevation < 200)
summary(lowland_stations)
```

<br>

### Complex commands

In addition to comparison operators, filtering may also utilize logical operators that make multiple selections. There are three basic logical operators: `&` (and), `|` (or), and `!` (not). We can keep Stations with an *Elevation* greater than 300 **and** *State* in Alabama `&`.

```{r, results="hide"}
filter(frost, Elevation > 300 & State == "AL")
```

Use `|` to keep Stations with a *Type_Fake* of "Airport" **or** a last spring frost date after April (\~ day 90 of the year).

```{r, results="hide"}
airportwarm <- filter(frost, 
                      Type_Fake == "Airport" | Avg_DOY_SpringFrost > 90 )

airportwarm

```


# 6.5 Sorting data

## 6.5A The dplyr arrange command (tidyverse)

\

We use the `arrange()` function to sort a data frame by one or more variables. You might want to do this to get a sense of which cases have the highest or lowest values in your data set or sort counties by their name. For example, let's sort in ascending order by elevation.

```{r}
arrange(frost, Latitude)
```

By default, `arrange()` sorts in ascending order. We can sort by a variable in descending order by using the `desc()` function on the variable we want to sort by. For example, to sort the dataframe by *Avg_DOY_SpringFrost* in descending order we use

```{r}
arrange(frost, desc(Avg_DOY_SpringFrost))
```

<br> <br>