---
title: "Homework 2"
author: "Leonel Carlen, Stefan Ciric / PSTAT131"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
                      cache=FALSE,
                      fig.width=5,
                      fig.height=5,
                      fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
```
----------------------
# Spam detection with `spambase` dataset
The following packages are needed below:
```{r pkg, message=FALSE}
library(tidyverse)
library(tree)
library(plyr)
library(class)
library(rpart)
library(maptree)
library(ROCR)
```
__Data Info__: The dataset was obtained from the UCI Machine Learning Repository. From the website,
> The "spam" concept is diverse: advertisements for products/web sites, make
> money fast schemes, chain letters, pornography...
>
> Our collection of spam e-mails came from our postmaster and individuals who had
> filed spam. Our collection of non-spam e-mails came from filed work and
> personal e-mails, and hence the word 'george' and the area code '650' are
> indicators of non-spam. These are useful when constructing a personalized spam
> filter. One would either have to blind such non-spam indicators or get a very
> wide collection of non-spam to generate a general purpose spam filter.
Dataset `spambase.tab` can be read with the following code. Next, standardize
each numerical attribute in the dataset. Each standardized column should have
zero mean and unit variance.
```{r, warning=FALSE, results='hide', message=FALSE}
spam <- read_table2("spambase.tab", guess_max=2000)
spam <- spam %>%
  mutate(y = factor(y, levels=c(0,1), labels=c("good", "spam"))) %>% # label as factors
  mutate_at(.vars=vars(-y), .funs=scale)                             # scale others
```
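A quick optional check (a sketch, not required by the assignment) that each scaled predictor indeed has mean approximately 0 and standard deviation approximately 1:
```{r}
# Optional sanity check: every scaled predictor should have mean ~0 and sd ~1
summary(sapply(spam %>% select(-y), mean))
summary(sapply(spam %>% select(-y), sd))
```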
__Attribute Information__: The last column of 'spambase.tab' denotes whether
the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial
e-mail. Most of the attributes indicate whether a particular word or character
was frequently occurring in the e-mail. The run-length attributes (55-57)
measure the length of sequences of consecutive capital letters. For the
statistical measures of each attribute, see the end of this file. Here are the
definitions of the attributes:
* 48 continuous real [0,100] attributes of type `word_freq_WORD` = percentage
of words in the e-mail that match `WORD`, i.e. 100 * (number of times the
`WORD` appears in the e-mail) / total number of words in e-mail. A `WORD` in
this case is any string of alphanumeric characters bounded by
non-alphanumeric characters or end-of-string (a toy computation of this
percentage is sketched just after this list).
* 6 continuous real [0,100] attributes of type `char_freq_CHAR` = percentage of
characters in the e-mail that match `CHAR`, i.e. 100 * (number of `CHAR`
occurrences) / total characters in e-mail
* 1 continuous real [1,...] attribute of type `capital_run_length_average` =
average length of uninterrupted sequences of capital letters
* 1 continuous integer [1,...] attribute of type `capital_run_length_longest` =
length of longest uninterrupted sequence of capital letters
* 1 continuous integer [1,...] attribute of type `capital_run_length_total` =
sum of length of uninterrupted sequences of capital letters = total number of
capital letters in the e-mail
* 1 nominal {0,1} class attribute of type `spam` = denotes whether the e-mail was
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
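As a toy illustration of the `word_freq_WORD` definition above (the email text here is made up and is not from the dataset):
```{r}
# Hypothetical email text, used only to illustrate the word_freq_WORD formula
email <- "make money fast make money now"
words <- unlist(strsplit(email, "[^[:alnum:]]+"))  # split on non-alphanumeric characters
100 * sum(words == "make") / length(words)         # word_freq_make = 100 * 2 / 6
```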
**Classification Task**: We will build models to classify emails into good vs.
spam.
To this dataset we will apply several classification methods and compare their
training and test error rates. We define a new function, named
`calc_error_rate()`, that calculates the misclassification error rate. Any error in this
homework (unless specified otherwise) means the misclassification error.
```{r ter, cache=TRUE}
calc_error_rate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
```
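For instance, a toy call (not part of the assignment) with two mismatches out of four labels returns an error rate of 0.5:
```{r}
# Toy example: 2 of the 4 predictions are wrong, so the error rate is 0.5
calc_error_rate(predicted.value = c("spam", "spam", "good", "good"),
                true.value      = c("spam", "good", "spam", "good"))
```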
Throughout this homework, we will calculate the error rates to measure and
compare classification performance. To keep track of error rates of all
methods, we will create a matrix called `records`:
```{r record}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) <- c("train.error","test.error")
rownames(records) <- c("knn","tree","logistic")
```
**Training/test sets**: Randomly split the data set into a training set and a test
set:
```{r, results="hide"}
set.seed(1)
test.indices = sample(1:nrow(spam), 1000)
spam.train=spam[-test.indices,]
spam.test=spam[test.indices,]
```
**$10$-fold cross-validation**: Using the `spam.train` data, 10-fold cross
validation will be performed throughout this homework. To ensure the data
partitioning is consistent, define `folds`, which contains the fold assignment for
each observation in `spam.train`.
```{r, folds-definition}
nfold = 10
set.seed(1)
folds = seq.int(nrow(spam.train)) %>%      ## sequential obs ids
  cut(breaks = nfold, labels = FALSE) %>%  ## sequential fold ids
  sample                                   ## random fold ids
```
----------------------
## K-Nearest Neighbor Method
1. **(Selecting number of neighbors)** Use 10-fold cross validation to select
the best number of neighbors `best.kfold` out of six values of $k$ in `kvec = c(1, seq(10, 50, length.out=5))`. Use the folds defined above and use the following `do.chunk` definition in your code. Again put `set.seed(1)` before your code. What value of $k$ leads to the smallest estimated test error?
```{r 90,indent=indent1,message=F,warning=F, cache = TRUE}
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)
  Xtr = Xdat[train,]
  Ytr = Ydat[train]
  Xvl = Xdat[!train,]
  Yvl = Ydat[!train]
  ## get classifications for current training chunks
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k)
  ## get classifications for current test chunk
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k)
  data.frame(train.error = calc_error_rate(predYtr, Ytr),
             val.error = calc_error_rate(predYvl, Yvl))
}
```
Define the design matrices and true labels for the training and test sets:
```{r}
XTrain <- spam.train%>%select(-y)
YTrain <- spam.train$y
XTest <- spam.test%>%select(-y)
YTest <- spam.test$y
```
Run 10-fold cross-validation for each candidate number of neighbors:
```{r, indent=indent1, cache=TRUE}
# error.folds collects the training/validation errors from every fold and k
error.folds = NULL
# Candidate numbers of nearest neighbours to be considered
kvec = c(1, seq(10, 50, length.out=5))
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(1)
# Loop through the candidate numbers of neighbors
for (j in kvec){
  tmp = ldply(1:nfold, do.chunk, folddef=folds, Xdat=XTrain, Ydat=YTrain, k=j)
  tmp$neighbors = j                      # keep track of each value of neighbors
  error.folds = rbind(error.folds, tmp)  # combine results
}
```
Select the best $k$ (`best.kfold`) by comparing the average validation error for each candidate:
```{r, cache=TRUE}
# Average the validation error over the 10 folds for each candidate k
avg.val.error <- NULL
for (i in kvec)
  avg.val.error <- rbind(avg.val.error,
                         error.folds %>% filter(neighbors == i) %>% summarise(mean(val.error)))
ktable <- cbind(avg.val.error, kvec)
# Among ties, keep the largest k (the least flexible fit)
best.kfold <- max(ktable$kvec[avg.val.error == min(avg.val.error)])
best.kfold
```
2. **(Training and Test Errors)** Now that the best number of neighbors has
been determined, compute the training error using `spam.train` and the
test error using `spam.test` for $k =$ `best.kfold`. Use the function
`calc_error_rate()` to get the errors from the predicted class labels. Fill in
the first row of `records` with the train and test error from the `knn` fit.
```{r knn-test-error, indent=indent1, cache=TRUE}
# Set random seed to make the results reproducible
set.seed(67)
# Training error: predict the training set with the best k
pred.YTrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=best.kfold)
knntrain.err <- calc_error_rate(pred.YTrain, YTrain)
# Test error: predict the test set with the best k
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=best.kfold)
# Confusion matrix
conf.matrix = table(predicted=pred.YTest, true=YTest)
conf.matrix
# Test accuracy rate
sum(diag(conf.matrix)) / sum(conf.matrix)
# Test error rate
knntest.err <- calc_error_rate(pred.YTest, YTest)
```
Populate `records` with the training and test error rates of the kNN model:
```{r, cache=TRUE}
records[1,1] = knntrain.err
records[1,2] = knntest.err
```
-----------------------------------------
## Decision Tree Method
3. **(Controlling Decision Tree Construction)** Function `tree.control`
specifies options for tree construction: set `minsize` equal to 5 (the minimum number of observations in each leaf) and `mindev` equal to 1e-5. See the help for `tree.control` for more information. The output of `tree.control` should be passed into `tree` function in the `control` argument. Construct a decision tree using training set `spam.train`, call the resulting tree `spamtree`. `summary(spamtree)` gives some basic information about the tree. How many leaf nodes are there? How many of the training observations are misclassified?
```{r, cache = TRUE}
spamtree <- tree(y~., data = spam.train, control = tree.control(nobs=nrow(spam.train), minsize = 5, mindev = 1e-5))
summary(spamtree)
```
*There are 184 leaves. The number of misclassified observations is 48.*
4. **(Decision Tree Pruning)** We can prune a tree using the `prune.tree` function. Pruning iteratively removes the leaves that have the least effect on the overall misclassification. Prune the tree until there are only $10$ leaf nodes so that we can easily visualize the tree. Use `draw.tree` function from the `maptree` package to visualize the pruned tree. Set `nodeinfo=TRUE`.
```{r, cache = TRUE}
spamtree.pruned10 <- prune.tree(spamtree, best = 10, method = "misclass")
draw.tree(spamtree.pruned10, nodeinfo = TRUE, cex = 0.35)
```
5. In this problem we will use cross validation to prune the tree. Fortunately, the `tree` package provides an easy-to-use function to do the cross validation for us, the `cv.tree` function. Use the same fold partitioning you used in the KNN problem (refer to the `cv.tree` help page for detail about the `rand` argument). Also be sure to set `method=misclass`. Plot the misclassification as a function of tree size. Determine the optimal tree size that minimizes misclassification. __Important__: if there are multiple tree sizes that have the same minimum estimated misclassification, you should choose the smallest tree. This reflects the idea that we want to choose the simplest model that explains the data well ("Occam's razor"). Show the optimal tree size `best.size.cv` in the plot.
```{r, cache=TRUE}
set.seed(2)
newtree <- cv.tree(tree(y~., data = spam.train), FUN = prune.misclass, K = 10)
sizedev <- as.data.frame(cbind(newtree$size, newtree$dev))
sizedev <- sizedev[order(sizedev$V1),]
best.size.cv <- sizedev$V1[which.min(sizedev$V2)]
plot(newtree$dev ~ newtree$size,
     col = ifelse(newtree$size == best.size.cv, "cyan4", "red"),
     pch = ifelse(newtree$size == best.size.cv, 19, 1),
     main = "Misclassification Volume by Size of Tree",
     ylab = "Misclassification", xlab = "Size")
```
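For reference, here is a sketch (not evaluated) of how the fold assignment defined for the kNN part could be reused through the `rand` argument of `cv.tree`; it may select a different tree size than the `K = 10` call above:
```{r, eval=FALSE}
# Not evaluated: reuse the same fold assignment via the rand argument
cv.spamtree <- cv.tree(spamtree, rand = folds, FUN = prune.misclass)
# smallest tree size attaining the minimum estimated misclassification
min(cv.spamtree$size[cv.spamtree$dev == min(cv.spamtree$dev)])
```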
*The best tree has size 13, with 348 cross-validated misclassification errors.*
6. **(Training and Test Errors)**
We previously pruned the tree to a small tree so that it could be easily visualized. Now, prune the original tree to size `best.size.cv` and call the new tree `spamtree.pruned`. Calculate the training error and test error when `spamtree.pruned` is used for prediction. Use the function `calc_error_rate()` to compute the misclassification error. Also, fill in the second row of the matrix `records` with the training error rate and test error rate.
```{r, cache=TRUE}
spamtree.pruned <- prune.tree(spamtree, best = best.size.cv, method = "misclass")
draw.tree(spamtree.pruned, nodeinfo = TRUE, cex = 0.35)
# Training error: predictions on spam.train
pred.tree.train = predict(spamtree.pruned, spam.train, type = "class")
trainErr <- calc_error_rate(pred.tree.train, spam.train$y)
# Test error: predictions on spam.test
pred.tree.test = predict(spamtree.pruned, spam.test, type = "class")
testErr <- calc_error_rate(pred.tree.test, spam.test$y)
records[2,1] <- trainErr
records[2,2] <- testErr
```
----------------------
## Logistic regression
7. In a binary classification problem, let $p$ represent the probability of class
label "1", which implies that $1-p$ represents the probability of class label "0".
The *logistic function* (also called the "inverse logit") is the cumulative distribution function of the logistic
distribution, which maps a real number $z$ to the open interval $(0,1)$:
\begin{equation} p(z)=\frac{e^z}{1+e^z}. \end{equation}
It is easy to see that when $z\rightarrow-\infty$, $p(z) \rightarrow 0$, and as
$z\rightarrow\infty$, $p(z) \rightarrow 1$.
a. Show that indeed the inverse of a logistic function is the _logit_ function:
\begin{equation}
z(p)=\ln\left(\frac{p}{1-p}\right).
\end{equation}
Let $y=p(z)$. Then
\begin{align*}
y &= \frac{e^z}{1+e^z} \\
y(1+e^z) &= e^z \\
y + ye^z &= e^z \\
y &= e^z - ye^z \\
y &= e^z(1-y) \\
e^z &= \frac{y}{1-y} \\
z &= \ln\left(\frac{y}{1-y}\right).
\end{align*}
Replacing $y$ with $p$, we conclude that
\begin{equation}
z(p) = \ln\left(\frac{p}{1-p}\right),
\end{equation}
which is the logit function, as required.
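An optional one-line numeric check using base R's `plogis()` (the logistic function) and `qlogis()` (the logit) confirms that the two are inverses:
```{r}
# qlogis() is the logit and plogis() the logistic function; composing them
# recovers the original values
z <- c(-3, 0, 2.5)
all.equal(qlogis(plogis(z)), z)
```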
b. The logit function is a commonly used _link function_ for a generalized linear model of binary
data. One reason for this is that it yields interpretable coefficients. Assume that $z=\beta_0 + \beta_1 x_1$, and $p = \text{logistic}(z)$. How do the odds of the outcome change if you increase $x_1$ by two? Assume $\beta_1$ is negative: what value does $p$ approach as $x_1 \to \infty$? What value does $p$ approach as $x_1 \to -\infty$?
If you increase $x_1$ by two, the odds of the outcome are multiplied by $e^{2\beta_1}$. The odds are $\frac{p}{1-p} = e^z = e^{\beta_0 + \beta_1 x_1}$, so replacing $x_1$ with $x_1 + 2$ gives $e^{\beta_0 + \beta_1 (x_1+2)} = e^{\beta_0 + \beta_1 x_1}\,e^{2\beta_1}$; the odds therefore change by a factor of $e^{2\beta_1}$ (an increase if $\beta_1 > 0$, a decrease if $\beta_1 < 0$).
$p(z) = \frac{e^z}{1+e^z}$, where $z=\beta_0 + \beta_1 x_1$. Therefore
$$p(x_1) = \frac{e^{\beta_0 + \beta_1 x_1}}{1+e^{\beta_0 + \beta_1 x_1}}.$$
Given that $\beta_1$ is negative, let $\beta$ denote the absolute value of $\beta_1$, so that
$$p(x_1) = \frac{e^{\beta_0 - \beta x_1}}{1+e^{\beta_0 - \beta x_1}} = \frac{e^{\beta_0}}{e^{\beta x_1}+e^{\beta_0}},$$
where the last step multiplies the numerator and denominator by $e^{\beta x_1}$.
Because $e^{\beta_0}$ is a constant, $e^{\beta x_1} \rightarrow \infty$ as $x_1 \rightarrow \infty$ and $e^{\beta x_1} \rightarrow 0$ as $x_1 \rightarrow -\infty$.
Hence, with $\beta_1$ negative, $p$ approaches 0 as $x_1 \to \infty$ and $p$ approaches 1 as $x_1 \to -\infty$.
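A small numeric sketch (with made-up coefficients, used only for illustration) confirms both claims: increasing $x_1$ by two multiplies the odds by $e^{2\beta_1}$, and with $\beta_1 < 0$ the probability tends to 0 as $x_1 \to \infty$ and to 1 as $x_1 \to -\infty$:
```{r}
# Hypothetical coefficients, for illustration only
b0 <- 0.5; b1 <- -1.2
p    <- function(x1) plogis(b0 + b1 * x1)      # logistic(beta0 + beta1 * x1)
odds <- function(x1) p(x1) / (1 - p(x1))
all.equal(odds(3 + 2) / odds(3), exp(2 * b1))  # odds ratio equals exp(2 * beta1)
c(p(50), p(-50))                               # approaches 0 and 1, respectively
```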
8. Use logistic regression to perform classification. Logistic regression specifically estimates the probability that an observation has a particular class label. We can define a probability threshold for assigning class labels based on the probabilities returned by the `glm` fit.
In this problem, we will simply use the "majority rule": if the probability is larger than 50\%, classify as spam. Fit a logistic regression to predict spam given all other features in the dataset using the `glm` function. Estimate the class labels using the majority rule and calculate the training and test errors. Add the training and test errors to the third row of `records`. Print the full `records` matrix. Which method had the lowest misclassification error on the test set?
```{r, cache=TRUE}
#fit object
glm.spamfit = glm(y ~ .,
data=spam.train, family=binomial)
#predict response for training data
glm.spamtrain <- predict(glm.spamfit, type = "response")
#add interpreted prediction attribute to training dataframe
NUspam.train = spam.train %>%
mutate(predSPAM=as.factor(ifelse(glm.spamtrain<=0.5, "good", "spam")))
#make confusion matrix to get error probability
glmErr <- table(pred=NUspam.train$predSPAM, true=NUspam.train$y)
glmTrainErr <- 1-sum(diag(glmErr))/sum(glmErr)
#view error probability
# glmTrainErr
#predict response for test data
glm.spamtest <- predict(glm.spamfit, spam.test, type = "response")
#add interpreted prediction attribute to testing dataframe
NUspam.test = spam.test %>%
  mutate(predSPAM = as.factor(ifelse(glm.spamtest <= 0.5, "good", "spam")))
#make confusion matrix to get test error probability
glmErr <- table(pred=NUspam.test$predSPAM, true=NUspam.test$y)
glmTestErr <- 1-sum(diag(glmErr))/sum(glmErr)
#view test error probability
# glmTestErr
#update records
records[3,1] <- glmTrainErr
records[3,2] <- glmTestErr
records
```
*Logistic regression has the lowest misclassification error on the test set.*
----------------------
## Receiver Operating Characteristic curve
9. (ROC curve) We will construct ROC curves based on the predictions of the _test_ data from the model defined in `spamtree.pruned` and the logistic regression model above. Plot the ROC for the test data for both the decision tree and the logistic regression on the same plot. Compute the area under the curve for both models (AUC). Which classification method seems to perform the best by this metric?
## Tree plot and AUC:
```{r}
library("ROCR")
# Predicted class probabilities on the test set from the pruned tree
tree.pred <- predict(spamtree.pruned, spam.test, type="vector")
predTree <- prediction(tree.pred[,2], spam.test$y)
perfTree = performance(predTree, measure="tpr", x.measure="fpr")
plot(perfTree, col=2, lwd=3, main="ROC curve")
abline(0,1)
auc_Tree <- performance(predTree, measure = "auc")
auc_Tree <- auc_Tree@y.values[[1]]
print(auc_Tree)
```
## GLM plot and AUC:
```{r, indent=indent2, cache=TRUE}
predGLM <- prediction(glm.spamtest, spam.test$y)
perfGLM = performance(predGLM, measure="tpr", x.measure="fpr")
plot(perfGLM, col=2, lwd=3, main="ROC curve")
abline(0,1)
auc_GLM <- performance(predGLM, measure = "auc")
auc_GLM <- auc_GLM@y.values[[1]]
print(auc_GLM)
```
Note that for logistic regression the predicted probabilities must be obtained with type `response`, as was done for `glm.spamtest` above.
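Since the question asks for both curves on the same plot, here is a sketch (not evaluated) overlaying the two test-set ROC curves, assuming the `perfTree` and `perfGLM` objects created above:
```{r, eval=FALSE}
# Not evaluated: overlay both ROC curves on a single set of axes
plot(perfTree, col = "red", lwd = 3, main = "ROC curves: pruned tree vs. logistic regression")
plot(perfGLM, add = TRUE, col = "blue", lwd = 3)
abline(0, 1, lty = 2)
legend("bottomright", legend = c("Pruned tree", "Logistic regression"),
       col = c("red", "blue"), lty = 1, lwd = 3)
```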
*The GLM AUC is slightly larger, therefore the logistic regression model is preferable: it is interpreted as having a better true positive to false positive trade-off.*
10. In the SPAM example, take "positive" to mean "spam". If you are the designer of a spam filter, are you more concerned about the potential for false positive rates that are too large or true positive rates that are too small? Argue your case.
*As a designer of a spam filter, I am more concerned about the false positive rate being too large than about the true positive rate being too small. A large number of false positives means many important emails get classified as spam, which can have serious consequences for the user (missing important messages). By contrast, a true positive rate that is too small only means the user has to delete some spam that was not caught, a mild inconvenience. Because the former is far more costly than the latter, I would focus on keeping the false positive rate low while building my spam filter.*
--------------------------------------------------------------------------------
# **Problems below for 231 students only**
11. A multivariate normal distribution has density
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
In quadratic discriminant analysis with two groups we use Bayes rule to calculate the probability that $Y$ has class label "1":
$$Pr(Y=1 \mid X=x) = \frac{f_1(x)\pi_1}{\pi_1f_1(x) + \pi_2f_2(x)}$$
where $\pi_2 = 1 - \pi_1$ is the prior probability of being in group $2$. Suppose we classify $\hat Y= k$ whenever $Pr(Y=k \mid X=x) > \tau$ for some probability threshold $\tau$ and that $f_k$ is a multivariate normal density with covariance $\Sigma_k$ and mean $\mu_k$. Note that for a vector $x$ of length $p$ and a $p \times p$ symmetric matrix $A$, $x^TAx$ is the _vector quadratic form_ (the multivariate analog of $x^2$). Show that the decision boundary is indeed quadratic by showing that $\hat Y = 1$ if
$$\delta_1(x) - \delta_2(x) > M(\tau)$$
where
$$\hat\delta_k(x) = -{1\over 2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - {1\over
2}\log|\Sigma_k| + \log\pi_k$$
and $M(\tau)$ is some function of the probability threshold $\tau$. What is the decision threshold, M(1/2), corresponding to a probability threshold of 1/2?
We are given that $Pr(Y=1 \mid X=x) > \tau$, i.e.
$$\frac{f_1(x)\pi_1}{\pi_1f_1(x) + \pi_2f_2(x)} > \tau$$
$$f_1(x)\pi_1 > \tau\left(\pi_1f_1(x) + \pi_2f_2(x)\right)$$
$$f_1(x)\pi_1(1-\tau) > \tau\,\pi_2f_2(x)$$
$$\frac{f_1(x)\pi_1}{f_2(x)\pi_2} > \frac{\tau}{1-\tau}$$
Taking the log of both sides,
$$\log\left(\frac{f_1(x)\pi_1}{f_2(x)\pi_2}\right) > \log\left(\frac{\tau}{1-\tau}\right)$$
$$\log f_1(x) - \log f_2(x) + \log\pi_1 - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
Substituting the multivariate normal log-densities,
$$-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_1| - \left[-\frac{1}{2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_2|\right] + \log\pi_1 - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
The $-\frac{p}{2}\log(2\pi)$ terms cancel, so
$$-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) - \frac{1}{2}\log|\Sigma_1| + \log\pi_1 + \frac{1}{2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) + \frac{1}{2}\log|\Sigma_2| - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
Substituting the given definitions of $\delta_1(x)$ and $\delta_2(x)$, this inequality becomes
$$\delta_1(x) - \delta_2(x) > \log\left(\frac{\tau}{1-\tau}\right)$$
Hence $M(\tau) = \log\left(\frac{\tau}{1-\tau}\right)$, and the boundary is quadratic in $x$ because the quadratic forms $(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)$ do not cancel when $\Sigma_1 \neq \Sigma_2$.
Substituting $\tau=1/2$ gives a decision threshold of $M(1/2) = \log(1) = 0$.
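A small numeric sketch with toy parameters (not taken from any data) checks the equivalence for $\tau = 1/2$: the posterior probability exceeds $1/2$ exactly when $\delta_1(x) - \delta_2(x) > 0$.
```{r}
# Toy 2-dimensional check: Pr(Y=1|X=x) > 1/2  <=>  delta_1(x) - delta_2(x) > 0
mu1 <- c(1, 0); mu2 <- c(-1, 0)
Sigma1 <- diag(2); Sigma2 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
pi1 <- 0.4; pi2 <- 1 - pi1
mvn.dens <- function(x, mu, S) {      # multivariate normal density for p = 2
  drop(exp(-0.5 * t(x - mu) %*% solve(S) %*% (x - mu))) / (2 * pi * sqrt(det(S)))
}
delta <- function(x, mu, S, prior) {  # the discriminant defined above
  drop(-0.5 * t(x - mu) %*% solve(S) %*% (x - mu)) - 0.5 * log(det(S)) + log(prior)
}
x <- c(0.3, -0.2)
post1 <- mvn.dens(x, mu1, Sigma1) * pi1 /
  (mvn.dens(x, mu1, Sigma1) * pi1 + mvn.dens(x, mu2, Sigma2) * pi2)
c(posterior_rule = post1 > 0.5,
  delta_rule     = delta(x, mu1, Sigma1, pi1) - delta(x, mu2, Sigma2, pi2) > 0)
```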
**Algae Classification**
Questions 12-13 relate to the `algaeBloom` dataset. Get the dataset `algaeBloom.txt` from the homework archive file, and read it with the following code:
```{r 14-0, warning=FALSE, include=FALSE}
algae <- read_table2("algaeBloom.txt", col_names=
c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4',
'oPO4','PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na="XXXXXXX")
```
In homework 1 and homework 2, we investigated basic exploratory data analysis for the `algaeBloom` dataset. One of the explanatory variables is `a1`, which is a numerical attribute. In homework 2, we conducted linear regression for variable `a1` using the other 8 chemical variables and 3 categorical variables. Here, after standardization, we will transform `a1` into a categorical variable with 2 levels, high and low, and conduct classification predictions using those 11 variables (i.e. do not include a2, a3, ..., a7).
12. **(Variable Standardization and Discretization)** Improve the normality of the numerical attributes by taking the log of all chemical variables. _After_ log transformation, impute missing values using the median method from homework 1. Transform the variable `a1` into a categorical variable with two levels: high if `a1` is greater than 0.5, and low if `a1` is smaller than or equal to 0.5.
```{r}
algae.log <- log(algae[4:11])
#Imputing missing values with median method
algae.logmed <- algae.log%>%
mutate_at(.vars = vars(1:8), .funs = funs(ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Transforming a1 into a categorical variable
algae.categorical = algae.logmed %>%
mutate(a1_cat=as.factor(ifelse(algae$a1<=0.5, "Low", "High")))
```
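A quick optional check of the resulting class balance:
```{r}
# How many observations fall in each a1 category
table(algae.categorical$a1_cat)
```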
13. **Linear and Quadratic Discriminant Analysis**
a. In LDA we assume that $\Sigma_1 = \Sigma_2$. Use LDA to predict whether `a1` is high or low using the `MASS::lda()` function. The `CV` argument in the `MASS::lda` function uses leave-one-out cross validation (LOOCV) when estimating the fitted values to avoid overfitting. Set the `CV` argument to TRUE. Plot an ROC curve for the fitted values.
```{r}
algae.lda <- MASS::lda(a1_cat ~., algae.categorical[1:9], CV=TRUE)
predLDA <- prediction(algae.lda$posterior[,2], algae.categorical$a1_cat)
perfLDA <- performance(predLDA, measure="tpr", x.measure="fpr")
plot(perfLDA, col="turquoise1", lwd=3, main="ROC curve")
abline(0,1)
auc_LDA <- performance(predLDA, measure = "auc")
auc_LDA <- auc_LDA@y.values[[1]]
print(auc_LDA)
```
b. Quadratic discriminant analysis is strictly more flexible than LDA because it does not require $\Sigma_1 = \Sigma_2$. In this sense, LDA can be considered a special case of QDA with the covariances constrained to be the same. Use a quadratic discriminant model to predict `a1` using the function `MASS::qda`, again setting `CV=TRUE`, and plot the ROC on the same plot as the LDA ROC. Compute the area under the ROC (AUC) for each model. To get the predicted class probabilities, look at the value of `posterior` in the `lda` and `qda` objects. Which model has better performance? Briefly explain, in terms of the bias-variance tradeoff, why you believe the better model outperforms the worse model.
```{r}
algae.qda <- MASS::qda(a1_cat ~., algae.categorical[1:9], CV=TRUE)
predQDA <- prediction(algae.qda$posterior[,2], algae.categorical$a1_cat)
perfQDA <- performance(predQDA, measure="tpr", x.measure="fpr")
plot(perfQDA, col="red", lwd=3, main="ROC curve")
plot(perfLDA, add=TRUE, col = "turquoise1")
abline(0,1)
legend(0.6, 0.4, legend=c("QDA", "LDA"),
col=c("red", "turquoise1"), lty=1, cex=0.8)
auc_QDA <- performance(predQDA, measure = "auc")
auc_QDA <- auc_QDA@y.values[[1]]
print(auc_QDA)
```
*We think that QDA performs better than LDA because its area under the curve is higher: for a given false positive rate, QDA achieves a higher true positive rate. As for the risk of high variance that comes with the more flexible QDA model, we are comfortable accepting it here because the number of covariates is small (8) and the number of observations is relatively large (200), so estimating separate covariance matrices is unlikely to cause serious overfitting.*