---
title: "Homework 2"
author: "Leonel Carlen, Stefan Ciric / PSTAT131"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
graphics: yes
geometry: margin=0.75in
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE,
                      cache=FALSE,
                      fig.width=5,
                      fig.height=5,
                      fig.align='center')
indent1 = ' '
indent2 = paste(rep(indent1, 2), collapse='')
```
----------------------
# Spam detection with `spambase` dataset
The following packages are needed below:
```{r pkg, message=FALSE}
library(tidyverse)
library(tree)
library(plyr)
library(class)
library(rpart)
library(maptree)
library(ROCR)
```
__Data Info__: The dataset was obtained from the UCI Machine Learning Repository. From the website,
> The "spam" concept is diverse: advertisements for products/web sites, make
> money fast schemes, chain letters, pornography...
>
> Our collection of spam e-mails came from our postmaster and individuals who had
> filed spam. Our collection of non-spam e-mails came from filed work and
> personal e-mails, and hence the word 'george' and the area code '650' are
> indicators of non-spam. These are useful when constructing a personalized spam
> filter. One would either have to blind such non-spam indicators or get a very
> wide collection of non-spam to generate a general purpose spam filter.
Dataset `spambase.tab` can be read with the following code. Next, standardize
each numerical attribute in the dataset. Each standardized column should have
zero mean and unit variance.
```{r, warning=FALSE, results='hide', message=FALSE}
spam <- read_table2("spambase.tab", guess_max=2000)
spam <- spam %>%
  mutate(y = factor(y, levels=c(0,1), labels=c("good", "spam"))) %>% # label as factors
  mutate_at(.vars=vars(-y), .funs=scale)                             # scale others
```
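A quick optional check (a sketch, not required by the assignment) that each scaled predictor indeed has mean approximately 0 and standard deviation approximately 1:
```{r}
# Optional sanity check: every scaled predictor should have mean ~0 and sd ~1
summary(sapply(spam %>% select(-y), mean))
summary(sapply(spam %>% select(-y), sd))
```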
__Attribute Information__: The last column of 'spambase.tab' denotes whether
the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial
e-mail. Most of the attributes indicate whether a particular word or character
was frequently occurring in the e-mail. The run-length attributes (55-57)
measure the length of sequences of consecutive capital letters. For the
statistical measures of each attribute, see the end of this file. Here are the
definitions of the attributes:
* 48 continuous real [0,100] attributes of type `word_freq_WORD` = percentage
of words in the e-mail that match `WORD`, i.e. 100 * (number of times the
`WORD` appears in the e-mail) / total number of words in e-mail. A `WORD` in
this case is any string of alphanumeric characters bounded by
non-alphanumeric characters or end-of-string (a toy computation of this
percentage is sketched just after this list).
* 6 continuous real [0,100] attributes of type `char_freq_CHAR` = percentage of
characters in the e-mail that match `CHAR`, i.e. 100 * (number of `CHAR`
occurrences) / total characters in e-mail
* 1 continuous real [1,...] attribute of type `capital_run_length_average` =
average length of uninterrupted sequences of capital letters
* 1 continuous integer [1,...] attribute of type `capital_run_length_longest` =
length of longest uninterrupted sequence of capital letters
* 1 continuous integer [1,...] attribute of type `capital_run_length_total` =
sum of length of uninterrupted sequences of capital letters = total number of
capital letters in the e-mail
* 1 nominal {0,1} class attribute of type `spam` = denotes whether the e-mail was
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
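As a toy illustration of the `word_freq_WORD` definition above (the email text here is made up and is not from the dataset):
```{r}
# Hypothetical email text, used only to illustrate the word_freq_WORD formula
email <- "make money fast make money now"
words <- unlist(strsplit(email, "[^[:alnum:]]+"))  # split on non-alphanumeric characters
100 * sum(words == "make") / length(words)         # word_freq_make = 100 * 2 / 6
```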
**Classification Task**: We will build models to classify emails into good vs.
spam.
To this dataset we will apply several classification methods and compare their
training and test error rates. We define a new function, named
`calc_error_rate()`, that calculates the misclassification error rate. Any error in this
homework (unless specified otherwise) means the misclassification error.
```{r ter, cache=TRUE}
calc_error_rate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
```
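For instance, a toy call (not part of the assignment) with two mismatches out of four labels returns an error rate of 0.5:
```{r}
# Toy example: 2 of the 4 predictions are wrong, so the error rate is 0.5
calc_error_rate(predicted.value = c("spam", "spam", "good", "good"),
                true.value      = c("spam", "good", "spam", "good"))
```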
Throughout this homework, we will calculate the error rates to measure and
compare classification performance. To keep track of error rates of all
methods, we will create a matrix called `records`:
```{r record}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) <- c("train.error","test.error")
rownames(records) <- c("knn","tree","logistic")
```
**Training/test sets**: Randomly split the data set into a training set and a test
set:
```{r, results="hide"}
set.seed(1)
test.indices = sample(1:nrow(spam), 1000)
spam.train=spam[-test.indices,]
spam.test=spam[test.indices,]
```
**$10$-fold cross-validation**: Using the `spam.train` data, 10-fold cross
validation will be performed throughout this homework. To ensure the data
partitioning is consistent, define `folds`, which contains the fold assignment for
each observation in `spam.train`.
```{r, folds-definition}
nfold = 10
set.seed(1)
folds = seq.int(nrow(spam.train)) %>%      ## sequential obs ids
  cut(breaks = nfold, labels = FALSE) %>%  ## sequential fold ids
  sample                                   ## random fold ids
```
----------------------
## K-Nearest Neighbor Method
1. **(Selecting number of neighbors)** Use 10-fold cross validation to select
the best number of neighbors `best.kfold` out of six values of $k$ in `kvec = c(1, seq(10, 50, length.out=5))`. Use the folds defined above and use the following `do.chunk` definition in your code. Again put `set.seed(1)` before your code. What value of $k$ leads to the smallest estimated test error?
```{r 90,indent=indent1,message=F,warning=F, cache = TRUE}
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)
  Xtr = Xdat[train,]
  Ytr = Ydat[train]
  Xvl = Xdat[!train,]
  Yvl = Ydat[!train]
  ## get classifications for current training chunks
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k)
  ## get classifications for current test chunk
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k)
  data.frame(train.error = calc_error_rate(predYtr, Ytr),
             val.error = calc_error_rate(predYvl, Yvl))
}
```
Define the design matrices and true labels for the training and test sets:
```{r}
XTrain <- spam.train%>%select(-y)
YTrain <- spam.train$y
XTest <- spam.test%>%select(-y)
YTest <- spam.test$y
```
Run 10-fold cross-validation for each candidate number of neighbors:
```{r, indent=indent1, cache=TRUE}
# error.folds collects the training/validation errors from every fold and k
error.folds = NULL
# Candidate numbers of nearest neighbours to be considered
kvec = c(1, seq(10, 50, length.out=5))
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(1)
# Loop through the candidate numbers of neighbors
for (j in kvec){
  tmp = ldply(1:nfold, do.chunk, folddef=folds, Xdat=XTrain, Ydat=YTrain, k=j)
  tmp$neighbors = j                      # keep track of each value of neighbors
  error.folds = rbind(error.folds, tmp)  # combine results
}
```
Select the best $k$ (`best.kfold`) by comparing the average validation error for each candidate:
```{r, cache=TRUE}
# Average the validation error over the 10 folds for each candidate k
avg.val.error <- NULL
for (i in kvec)
  avg.val.error <- rbind(avg.val.error,
                         error.folds %>% filter(neighbors == i) %>% summarise(mean(val.error)))
ktable <- cbind(avg.val.error, kvec)
# Among ties, keep the largest k (the least flexible fit)
best.kfold <- max(ktable$kvec[avg.val.error == min(avg.val.error)])
best.kfold
```
2. **(Training and Test Errors)** Now that the best number of neighbors has
been determined, compute the training error using `spam.train` and the
test error using `spam.test` for $k =$ `best.kfold`. Use the function
`calc_error_rate()` to get the errors from the predicted class labels. Fill in
the first row of `records` with the train and test error from the `knn` fit.
```{r knn-test-error, indent=indent1, cache=TRUE}
# Set random seed to make the results reproducible
set.seed(67)
# Training error: predict the training set with the best k
pred.YTrain = knn(train=XTrain, test=XTrain, cl=YTrain, k=best.kfold)
knntrain.err <- calc_error_rate(pred.YTrain, YTrain)
# Test error: predict the test set with the best k
pred.YTest = knn(train=XTrain, test=XTest, cl=YTrain, k=best.kfold)
# Confusion matrix
conf.matrix = table(predicted=pred.YTest, true=YTest)
conf.matrix
# Test accuracy rate
sum(diag(conf.matrix)) / sum(conf.matrix)
# Test error rate
knntest.err <- calc_error_rate(pred.YTest, YTest)
```
Populate `records` with the training and test error rates of the kNN model:
```{r, cache=TRUE}
records[1,1] = knntrain.err
records[1,2] = knntest.err
```
-----------------------------------------
## Decision Tree Method
3. **(Controlling Decision Tree Construction)** Function `tree.control`
specifies options for tree construction: set `minsize` equal to 5 (the minimum number of observations in each leaf) and `mindev` equal to 1e-5. See the help for `tree.control` for more information. The output of `tree.control` should be passed into `tree` function in the `control` argument. Construct a decision tree using training set `spam.train`, call the resulting tree `spamtree`. `summary(spamtree)` gives some basic information about the tree. How many leaf nodes are there? How many of the training observations are misclassified?
```{r, cache = TRUE}
spamtree <- tree(y~., data = spam.train, control = tree.control(nobs=nrow(spam.train), minsize = 5, mindev = 1e-5))
summary(spamtree)
```
*There are 184 leaves. The number of misclassified observations is 48.*
4. **(Decision Tree Pruning)** We can prune a tree using the `prune.tree` function. Pruning iteratively removes the leaves that have the least effect on the overall misclassification. Prune the tree until there are only $10$ leaf nodes so that we can easily visualize the tree. Use `draw.tree` function from the `maptree` package to visualize the pruned tree. Set `nodeinfo=TRUE`.
```{r, cache = TRUE}
spamtree.pruned10 <- prune.tree(spamtree, best = 10, method = "misclass")
draw.tree(spamtree.pruned10, nodeinfo = TRUE, cex = 0.35)
```
5. In this problem we will use cross validation to prune the tree. Fortunately, the `tree` package provides an easy-to-use function to do the cross validation for us, the `cv.tree` function. Use the same fold partitioning you used in the KNN problem (refer to the `cv.tree` help page for detail about the `rand` argument). Also be sure to set `method=misclass`. Plot the misclassification as a function of tree size. Determine the optimal tree size that minimizes misclassification. __Important__: if there are multiple tree sizes that have the same minimum estimated misclassification, you should choose the smallest tree. This reflects the idea that we want to choose the simplest model that explains the data well ("Occam's razor"). Show the optimal tree size `best.size.cv` in the plot.
```{r, cache=TRUE}
set.seed(2)
newtree <- cv.tree(tree(y~., data = spam.train), FUN = prune.misclass, K = 10)
sizedev <- as.data.frame(cbind(newtree$size, newtree$dev))
sizedev <- sizedev[order(sizedev$V1),]
best.size.cv <- sizedev$V1[which.min(sizedev$V2)]
plot(newtree$dev ~ newtree$size,
     col = ifelse(newtree$size == best.size.cv, "cyan4", "red"),
     pch = ifelse(newtree$size == best.size.cv, 19, 1),
     main = "Misclassification Volume by Size of Tree",
     ylab = "Misclassification", xlab = "Size")
```
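For reference, here is a sketch (not evaluated) of how the fold assignment defined for the kNN part could be reused through the `rand` argument of `cv.tree`; it may select a different tree size than the `K = 10` call above:
```{r, eval=FALSE}
# Not evaluated: reuse the same fold assignment via the rand argument
cv.spamtree <- cv.tree(spamtree, rand = folds, FUN = prune.misclass)
# smallest tree size attaining the minimum estimated misclassification
min(cv.spamtree$size[cv.spamtree$dev == min(cv.spamtree$dev)])
```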
*The best tree has size 13, with 348 cross-validated misclassification errors.*
6. **(Training and Test Errors)**
We previously pruned the tree to a small tree so that it could be easily visualized. Now, prune the original tree to size `best.size.cv` and call the new tree `spamtree.pruned`. Calculate the training error and test error when `spamtree.pruned` is used for prediction. Use the function `calc_error_rate()` to compute the misclassification error. Also, fill in the second row of the matrix `records` with the training error rate and test error rate.
```{r, cache=TRUE}
spamtree.pruned <- prune.tree(spamtree, best = best.size.cv, method = "misclass")
draw.tree(spamtree.pruned, nodeinfo = TRUE, cex = 0.35)
# Training error: predictions on spam.train
pred.tree.train = predict(spamtree.pruned, spam.train, type = "class")
trainErr <- calc_error_rate(pred.tree.train, spam.train$y)
# Test error: predictions on spam.test
pred.tree.test = predict(spamtree.pruned, spam.test, type = "class")
testErr <- calc_error_rate(pred.tree.test, spam.test$y)
records[2,1] <- trainErr
records[2,2] <- testErr
```
----------------------
## Logistic regression
7. In a binary classification problem, let $p$ represent the probability of class
label "1", which implies that $1-p$ represents the probability of class label "0".
The *logistic function* (also called the "inverse logit") is the cumulative distribution function of the logistic
distribution, which maps a real number $z$ to the open interval $(0,1)$:
\begin{equation} p(z)=\frac{e^z}{1+e^z}. \end{equation}
It is easy to see that when $z\rightarrow-\infty$, $p(z) \rightarrow 0$, and as
$z\rightarrow\infty$, $p(z) \rightarrow 1$.
a. Show that indeed the inverse of a logistic function is the _logit_ function:
\begin{equation}
z(p)=\ln\left(\frac{p}{1-p}\right).
\end{equation}
Let $y=p(z)$. Then
\begin{align*}
y &= \frac{e^z}{1+e^z} \\
y(1+e^z) &= e^z \\
y + ye^z &= e^z \\
y &= e^z - ye^z \\
y &= e^z(1-y) \\
e^z &= \frac{y}{1-y} \\
z &= \ln\left(\frac{y}{1-y}\right).
\end{align*}
Replacing $y$ with $p$, we conclude that
\begin{equation}
z(p) = \ln\left(\frac{p}{1-p}\right),
\end{equation}
which is the logit function, as required.
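An optional one-line numeric check using base R's `plogis()` (the logistic function) and `qlogis()` (the logit) confirms that the two are inverses:
```{r}
# qlogis() is the logit and plogis() the logistic function; composing them
# recovers the original values
z <- c(-3, 0, 2.5)
all.equal(qlogis(plogis(z)), z)
```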
b. The logit function is a commonly used _link function_ for a generalized linear model of binary
data. One reason for this is that it yields interpretable coefficients. Assume that $z=\beta_0 + \beta_1 x_1$, and $p = \text{logistic}(z)$. How do the odds of the outcome change if you increase $x_1$ by two? Assume $\beta_1$ is negative: what value does $p$ approach as $x_1 \to \infty$? What value does $p$ approach as $x_1 \to -\infty$?
If you increase $x_1$ by two, the odds of the outcome are multiplied by $e^{2\beta_1}$. The odds are $\frac{p}{1-p} = e^z = e^{\beta_0 + \beta_1 x_1}$, so replacing $x_1$ with $x_1 + 2$ gives $e^{\beta_0 + \beta_1 (x_1+2)} = e^{\beta_0 + \beta_1 x_1}\,e^{2\beta_1}$; the odds therefore change by a factor of $e^{2\beta_1}$ (an increase if $\beta_1 > 0$, a decrease if $\beta_1 < 0$).
$p(z) = \frac{e^z}{1+e^z}$, where $z=\beta_0 + \beta_1 x_1$. Therefore
$$p(x_1) = \frac{e^{\beta_0 + \beta_1 x_1}}{1+e^{\beta_0 + \beta_1 x_1}}.$$
Given that $\beta_1$ is negative, let $\beta$ denote the absolute value of $\beta_1$, so that
$$p(x_1) = \frac{e^{\beta_0 - \beta x_1}}{1+e^{\beta_0 - \beta x_1}} = \frac{e^{\beta_0}}{e^{\beta x_1}+e^{\beta_0}},$$
where the last step multiplies the numerator and denominator by $e^{\beta x_1}$.
Because $e^{\beta_0}$ is a constant, $e^{\beta x_1} \rightarrow \infty$ as $x_1 \rightarrow \infty$ and $e^{\beta x_1} \rightarrow 0$ as $x_1 \rightarrow -\infty$.
Hence, with $\beta_1$ negative, $p$ approaches 0 as $x_1 \to \infty$ and $p$ approaches 1 as $x_1 \to -\infty$.
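A small numeric sketch (with made-up coefficients, used only for illustration) confirms both claims: increasing $x_1$ by two multiplies the odds by $e^{2\beta_1}$, and with $\beta_1 < 0$ the probability tends to 0 as $x_1 \to \infty$ and to 1 as $x_1 \to -\infty$:
```{r}
# Hypothetical coefficients, for illustration only
b0 <- 0.5; b1 <- -1.2
p    <- function(x1) plogis(b0 + b1 * x1)      # logistic(beta0 + beta1 * x1)
odds <- function(x1) p(x1) / (1 - p(x1))
all.equal(odds(3 + 2) / odds(3), exp(2 * b1))  # odds ratio equals exp(2 * beta1)
c(p(50), p(-50))                               # approaches 0 and 1, respectively
```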
8. Use logistic regression to perform classification. Logistic regression specifically estimates the probability that an observation has a particular class label. We can define a probability threshold for assigning class labels based on the probabilities returned by the `glm` fit.
In this problem, we will simply use the "majority rule": if the probability is larger than 50\%, classify as spam. Fit a logistic regression to predict spam given all other features in the dataset using the `glm` function. Estimate the class labels using the majority rule and calculate the training and test errors. Add the training and test errors to the third row of `records`. Print the full `records` matrix. Which method had the lowest misclassification error on the test set?
```{r, cache=TRUE}
#fit object
glm.spamfit = glm(y ~ .,
data=spam.train, family=binomial)
#predict response for training data
glm.spamtrain <- predict(glm.spamfit, type = "response")
#add interpreted prediction attribute to training dataframe
NUspam.train = spam.train %>%
mutate(predSPAM=as.factor(ifelse(glm.spamtrain<=0.5, "good", "spam")))
#make confusion matrix to get error probability
glmErr <- table(pred=NUspam.train$predSPAM, true=NUspam.train$y)
glmTrainErr <- 1-sum(diag(glmErr))/sum(glmErr)
#view error probability
# glmTrainErr
#predict response for test data
glm.spamtest <- predict(glm.spamfit, spam.test, type = "response")
#add interpreted prediction attribute to testing dataframe
NUspam.test = spam.test %>%
  mutate(predSPAM = as.factor(ifelse(glm.spamtest <= 0.5, "good", "spam")))
#make confusion matrix to get test error probability
glmErr <- table(pred=NUspam.test$predSPAM, true=NUspam.test$y)
glmTestErr <- 1-sum(diag(glmErr))/sum(glmErr)
#view test error probability
# glmTestErr
#update records
records[3,1] <- glmTrainErr
records[3,2] <- glmTestErr
records
```
*Logistic regression has the lowest misclassification error on the test set.*
----------------------
## Receiver Operating Characteristic curve
9. (ROC curve) We will construct ROC curves based on the predictions of the _test_ data from the model defined in `spamtree.pruned` and the logistic regression model above. Plot the ROC for the test data for both the decision tree and the logistic regression on the same plot. Compute the area under the curve for both models (AUC). Which classification method seems to perform the best by this metric?
## Tree plot and AUC:
```{r}
library("ROCR")
# Predicted class probabilities on the test set from the pruned tree
tree.pred <- predict(spamtree.pruned, spam.test, type="vector")
predTree <- prediction(tree.pred[,2], spam.test$y)
perfTree = performance(predTree, measure="tpr", x.measure="fpr")
plot(perfTree, col=2, lwd=3, main="ROC curve")
abline(0,1)
auc_Tree <- performance(predTree, measure = "auc")
auc_Tree <- auc_Tree@y.values[[1]]
print(auc_Tree)
```
## GLM plot and AUC:
```{r, indent=indent2, cache=TRUE}
predGLM <- prediction(glm.spamtest, spam.test$y)
perfGLM = performance(predGLM, measure="tpr", x.measure="fpr")
plot(perfGLM, col=2, lwd=3, main="ROC curve")
abline(0,1)
auc_GLM <- performance(predGLM, measure = "auc")
auc_GLM <- auc_GLM@y.values[[1]]
print(auc_GLM)
```
Note that for logistic regression the predicted probabilities must be obtained with type `response`, as was done for `glm.spamtest` above.
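Since the question asks for both curves on the same plot, here is a sketch (not evaluated) overlaying the two test-set ROC curves, assuming the `perfTree` and `perfGLM` objects created above:
```{r, eval=FALSE}
# Not evaluated: overlay both ROC curves on a single set of axes
plot(perfTree, col = "red", lwd = 3, main = "ROC curves: pruned tree vs. logistic regression")
plot(perfGLM, add = TRUE, col = "blue", lwd = 3)
abline(0, 1, lty = 2)
legend("bottomright", legend = c("Pruned tree", "Logistic regression"),
       col = c("red", "blue"), lty = 1, lwd = 3)
```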
*The GLM AUC is slightly larger, therefore the logistic regression model is preferable: it is interpreted as having a better true positive to false positive trade-off.*
10. In the SPAM example, take "positive" to mean "spam". If you are the designer of a spam filter, are you more concerned about the potential for false positive rates that are too large or true positive rates that are too small? Argue your case.
*As a designer of a spam filter, I am more concerned about the false positive rate being too large than about the true positive rate being too small. A large number of false positives means many important emails get classified as spam, which can have serious consequences for the user (missing important messages). By contrast, a true positive rate that is too small only means the user has to delete some spam that was not caught, a mild inconvenience. Because the former is far more costly than the latter, I would focus on keeping the false positive rate low while building my spam filter.*
--------------------------------------------------------------------------------
# **Problems below for 231 students only**
11. A multivariate normal distribution has density
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
In quadratic discriminant analysis with two groups we use Bayes rule to calculate the probability that $Y$ has class label "1":
$$Pr(Y=1 \mid X=x) = \frac{f_1(x)\pi_1}{\pi_1f_1(x) + \pi_2f_2(x)}$$
where $\pi_2 = 1 - \pi_1$ is the prior probability of being in group $2$. Suppose we classify $\hat Y= k$ whenever $Pr(Y=k \mid X=x) > \tau$ for some probability threshold $\tau$ and that $f_k$ is a multivariate normal density with covariance $\Sigma_k$ and mean $\mu_k$. Note that for a vector $x$ of length $p$ and a $p \times p$ symmetric matrix $A$, $x^TAx$ is the _vector quadratic form_ (the multivariate analog of $x^2$). Show that the decision boundary is indeed quadratic by showing that $\hat Y = 1$ if
$$\delta_1(x) - \delta_2(x) > M(\tau)$$
where
$$\hat\delta_k(x) = -{1\over 2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - {1\over
2}\log|\Sigma_k| + \log\pi_k$$
and $M(\tau)$ is some function of the probability threshold $\tau$. What is the decision threshold, M(1/2), corresponding to a probability threshold of 1/2?
We are given that $Pr(Y=1 \mid X=x) > \tau$, i.e.
$$\frac{f_1(x)\pi_1}{\pi_1f_1(x) + \pi_2f_2(x)} > \tau$$
$$f_1(x)\pi_1 > \tau\left(\pi_1f_1(x) + \pi_2f_2(x)\right)$$
$$f_1(x)\pi_1(1-\tau) > \tau\,\pi_2f_2(x)$$
$$\frac{f_1(x)\pi_1}{f_2(x)\pi_2} > \frac{\tau}{1-\tau}$$
Taking the log of both sides,
$$\log\left(\frac{f_1(x)\pi_1}{f_2(x)\pi_2}\right) > \log\left(\frac{\tau}{1-\tau}\right)$$
$$\log f_1(x) - \log f_2(x) + \log\pi_1 - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
Substituting the multivariate normal log-densities,
$$-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_1| - \left[-\frac{1}{2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_2|\right] + \log\pi_1 - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
The $-\frac{p}{2}\log(2\pi)$ terms cancel, so
$$-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) - \frac{1}{2}\log|\Sigma_1| + \log\pi_1 + \frac{1}{2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) + \frac{1}{2}\log|\Sigma_2| - \log\pi_2 > \log\left(\frac{\tau}{1-\tau}\right)$$
Substituting the given definitions of $\delta_1(x)$ and $\delta_2(x)$, this inequality becomes
$$\delta_1(x) - \delta_2(x) > \log\left(\frac{\tau}{1-\tau}\right)$$
Hence $M(\tau) = \log\left(\frac{\tau}{1-\tau}\right)$, and the boundary is quadratic in $x$ because the quadratic forms $(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)$ do not cancel when $\Sigma_1 \neq \Sigma_2$.
Substituting $\tau=1/2$ gives a decision threshold of $M(1/2) = \log(1) = 0$.
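A small numeric sketch with toy parameters (not taken from any data) checks the equivalence for $\tau = 1/2$: the posterior probability exceeds $1/2$ exactly when $\delta_1(x) - \delta_2(x) > 0$.
```{r}
# Toy 2-dimensional check: Pr(Y=1|X=x) > 1/2  <=>  delta_1(x) - delta_2(x) > 0
mu1 <- c(1, 0); mu2 <- c(-1, 0)
Sigma1 <- diag(2); Sigma2 <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
pi1 <- 0.4; pi2 <- 1 - pi1
mvn.dens <- function(x, mu, S) {      # multivariate normal density for p = 2
  drop(exp(-0.5 * t(x - mu) %*% solve(S) %*% (x - mu))) / (2 * pi * sqrt(det(S)))
}
delta <- function(x, mu, S, prior) {  # the discriminant defined above
  drop(-0.5 * t(x - mu) %*% solve(S) %*% (x - mu)) - 0.5 * log(det(S)) + log(prior)
}
x <- c(0.3, -0.2)
post1 <- mvn.dens(x, mu1, Sigma1) * pi1 /
  (mvn.dens(x, mu1, Sigma1) * pi1 + mvn.dens(x, mu2, Sigma2) * pi2)
c(posterior_rule = post1 > 0.5,
  delta_rule     = delta(x, mu1, Sigma1, pi1) - delta(x, mu2, Sigma2, pi2) > 0)
```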
**Algae Classification**
Questions 12-13 relate to the `algaeBloom` dataset. Get the dataset `algaeBloom.txt` from the homework archive file, and read it with the following code:
```{r 14-0, warning=FALSE, include=FALSE}
algae <- read_table2("algaeBloom.txt", col_names=
c('season','size','speed','mxPH','mnO2','Cl','NO3','NH4',
'oPO4','PO4','Chla','a1','a2','a3','a4','a5','a6','a7'),
na="XXXXXXX")
```
In homework 1 and homework 2, we investigated basic exploratory data analysis for the `algaeBloom` dataset. One of the explanatory variables is `a1`, which is a numerical attribute. In homework 2, we conducted linear regression for variable `a1` using the other 8 chemical variables and 3 categorical variables. Here, after standardization, we will transform `a1` into a categorical variable with 2 levels, high and low, and conduct classification predictions using those 11 variables (i.e. do not include a2, a3, ..., a7).
12. **(Variable Standardization and Discretization)** Improve the normality of the numerical attributes by taking the log of all chemical variables. _After_ log transformation, impute missing values using the median method from homework 1. Transform the variable `a1` into a categorical variable with two levels: high if `a1` is greater than 0.5, and low if `a1` is smaller than or equal to 0.5.
```{r}
algae.log <- log(algae[4:11])
#Imputing missing values with median method
algae.logmed <- algae.log%>%
mutate_at(.vars = vars(1:8), .funs = funs(ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Transforming a1 into a categorical variable
algae.categorical = algae.logmed %>%
mutate(a1_cat=as.factor(ifelse(algae$a1<=0.5, "Low", "High")))
```
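A quick optional check of the resulting class balance:
```{r}
# How many observations fall in each a1 category
table(algae.categorical$a1_cat)
```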
13. **Linear and Quadratic Discriminant Analysis**
a. In LDA we assume that $\Sigma_1 = \Sigma_2$. Use LDA to predict whether `a1` is high or low using the `MASS::lda()` function. The `CV` argument in the `MASS::lda` function uses leave-one-out cross validation (LOOCV) when estimating the fitted values to avoid overfitting. Set the `CV` argument to TRUE. Plot an ROC curve for the fitted values.
```{r}
algae.lda <- MASS::lda(a1_cat ~., algae.categorical[1:9], CV=TRUE)
predLDA <- prediction(algae.lda$posterior[,2], algae.categorical$a1_cat)
perfLDA <- performance(predLDA, measure="tpr", x.measure="fpr")
plot(perfLDA, col="turquoise1", lwd=3, main="ROC curve")
abline(0,1)
auc_LDA <- performance(predLDA, measure = "auc")
auc_LDA <- auc_LDA@y.values[[1]]
print(auc_LDA)
```
b. Quadratic discriminant analysis is strictly more flexible than LDA because it does not require $\Sigma_1 = \Sigma_2$. In this sense, LDA can be considered a special case of QDA with the covariances constrained to be the same. Use a quadratic discriminant model to predict `a1` using the function `MASS::qda`, again setting `CV=TRUE`, and plot the ROC on the same plot as the LDA ROC. Compute the area under the ROC (AUC) for each model. To get the predicted class probabilities, look at the value of `posterior` in the `lda` and `qda` objects. Which model has better performance? Briefly explain, in terms of the bias-variance tradeoff, why you believe the better model outperforms the worse model.
```{r}
algae.qda <- MASS::qda(a1_cat ~., algae.categorical[1:9], CV=TRUE)
predQDA <- prediction(algae.qda$posterior[,2], algae.categorical$a1_cat)
perfQDA <- performance(predQDA, measure="tpr", x.measure="fpr")
plot(perfQDA, col="red", lwd=3, main="ROC curve")
plot(perfLDA, add=TRUE, col = "turquoise1")
abline(0,1)
legend(0.6, 0.4, legend=c("QDA", "LDA"),
col=c("red", "turquoise1"), lty=1, cex=0.8)
auc_QDA <- performance(predQDA, measure = "auc")
auc_QDA <- auc_QDA@y.values[[1]]
print(auc_QDA)
```
*We think that QDA performs better than LDA because its area under the curve is higher: for a given false positive rate, QDA achieves a higher true positive rate. As for the risk of high variance that comes with the more flexible QDA model, we are comfortable accepting it here because the number of covariates is small (8) and the number of observations is relatively large (200), so estimating separate covariance matrices is unlikely to cause serious overfitting.*