-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lab_06.Rmd
290 lines (163 loc) · 15.5 KB
/
Lab_06.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
# Lab 6
```{r, echo=FALSE,message=FALSE,warning=FALSE}
library("tidyverse") # Lots of data processing commands
library("knitr") # Helps make good output files
library("rmarkdown") # Helps make good output files
library("lattice") # Makes nice plots
library("RColorBrewer") # Makes nice color-scales
library("ISLR") # contains a credit dataset
library("yarrr") # contains a toy dataset about pirates
library("skimr") # Summary statistics
library("Stat2Data") # Regression specific commands
library("nortest") # Regression specific commands
library("lmtest") # Regression specific commands
library("IMTest") # Regression specific commands
library("MASS") # Regression specific commands
#library("moderndive")# Regression specific commands
library("corrplot") # correlation plots
library("car") # this one sometimes has problems, don't panic if you get errors
library("ggpubr") # QQplots
library("olsrr") # Regression specific commands
library("plotly")
```
## Lab 6 General information
Welcome to STAT-462 lab 6. The aim of this week is to:
- Be able to identify, assess and as necessary remove residuals
- To show how transforming data can allows you to overcome broken linear regression assumptions.
This lab also provides less explicit instructions, but builds heavily on previous labs. So if you are not sure what to do, see what I was asking you to do in previous homeworks or in the lecture notes.
##### General comments {-}
- You do not have to submit things you try from the tutorials, you will just be graded on the lab 6 assignment.
- There is also a canvas discussion board for this lab which will be the fastest place to get an answer. https://psu.instructure.com/courses/2115020/discussion_topics/13706461
##### Lab 6 Tutorials {-}
It is worth reading the Lab 6 tutorials. [Tutorial 6A](#S.Tutorial.6A)) discusses Outlier diagnostics, which link with [Tutorial 5B](#S.Tutorial.5B)). [Tutorial 6B](#S.Tutorial.6B)) discusses transformations and [Tutorial 6C](#S.Tutorial.6C)) discusses goodness of fit.
## Lab 6 Setup
- Save all your work and if you haven't already, create a Lab 6 folder in your STAT-462 folder
- Create a new R-project called Lab 6 that is linked to your Lab 6 folder (instructions in [Tutorial 1B](#S.Tutorial.1B)). It will probably open a new version of R-Studio.
- Create a new Markdown file called "username_Lab6" (for me it will be hlg5155_Lab6)
- Remove the "friendly text" (see [Tutorial 1E](#S.Tutorial.1E.3) if you have no idea what I mean)
- Follow the instructions to edit your YAML code to look like my example in [Tutorial 2B](#S.Tutorial.2B.1). Choose any theme that you like.
- Leave a blank line under your YAML code, then create a level-1 Heading called "Markdown". Save and make sure that it will preview/knit.
- Make a new code chunk. Inside here, you can add all the libraries you need (if you do not have them, you can install them using the tutorial from last week). Start by entering these, but if you need any more packages you can add them here and rerun the code chunk. Run the code chunk a few times to make sure they load with no errors.
```{r, message=FALSE,warning=FALSE}
# Load libraries
library(tidyverse)
library(dplyr)
library(ggpubr)
library(Stat2Data)
library(corrplot)
library(olsrr)
library(sf)
library(tmap)
library(readxl)
library(plotly)
library(nortest)
## you may need additional libraries. Just add them to the list as you use them.
```
- Edit the "code chunk code" at the top of the code chunk so that the code is run and it is shown, but none of the warnings or messages show up (e.g. none of the the friendly text is shown).
## Diagnostic summaries
## Florida Fish Challenge
### Important background
Small amounts of the element mercury are present in many everyday foods. These do not normally affect your health, but too much mercury can be poisonous. Mercury itself is naturally occurring, but the amounts in the environment have been on the rise from industrialization. The metal can make its way into soil and water, and eventually build up in animals like fish, which are then eaten by people. More details here:
- https://www.wearecognitive.com/project/extra-narrative/bbc-mercury
- https://medium.com/predict/mercury-pollution-reaches-the-deep-sea-f59a4938dc7c
In the late 1980s, there were widespread public safety concerns in Florida about high mercury concentrations in sport fish. In 1989, the State of Florida issued an advisory urging the public to limit consumption of "top level" predatory fish from Lake Tohopekaliga and connected waters: including largemouth bass (Micropterus salmoides), bowfin (Amia calva), and gar (Lepisosteus spp.). This severely impacted tourism and the economy in the area. Urgent research was required to inform public policy about which lakes needed to be closed.
We are going to reproduce part of one study on the topic conducted by T.R. Lange in 1993. Dr Lange and their team took samples from 53 lakes in the Central Florida area. Using water samples collected from each of the lakes, the researchers measured the pH level, as well as the amount of chlorophyll, calcium and alkalinity. The Mercury concentration in the muscle tissue of lake fish was also recorded and standardized to take into account the different ages of fish caught in the lake.
```{r, out.width = "100%", echo=FALSE,fig.align='center',fig.cap="a. (Left): The mercury food chain in fish.(Wikimedia commons, Bretwood Higman, Ground Truth Trekking) b. (middle) A large bass caught and released in a central Florida lake (https://www.wired2fish.com/news/young-man-catches-releases-huge-bass-from-bank/) c. (right). The location of the lakes in Florida (Google maps)"}
knitr::include_graphics('images/Fig06_2Fish.png')
```
__You have been asked to assess whether the alkalinity levels of a lake might impact Mercury levels in largemouth bass. You will be presenting your results to the Florida State Governor in order to set new fishing regulations. She has asked for a report to provide your full thinking and workflow in how you decide on your final model.__
The units of the your dataset are:
| **Variable** | **Unit** |
|:------------:|:---------------------------:|
| Chlorophyll | micrograms/Litre, $\mu g/L$ |
| Alkalinity | miligrams/Litre, $mg/L$ |
| pH | Unitless |
| Calcium | miligrams/Litre, $mg/L$ |
| Mercury | Micrograms, $\mu g$ |
| Age | Years |
### Lab 6 Access the data
Optional Hint: If you have lots of different variables in your workspace/environment, you can clear them by clicking on the broom or by typing `rm(list=ls())` into the console.
Download __bassdata.csv__ from Canvas and read into R as a variable called bass.
### Question 1 [OPTIONAL BONUS 2%] : {-}
Within one of your answers, include a professional looking table using equations where relevant. For example this could be comparing model diagnostic outputs, talking about model parameters, writing up LINE results..
- *Hint, here is the place I make all my tables, then copy them across: https://www.tablesgenerator.com/markdown_tables.*
- *Here is the place I create my equations: https://www.codecogs.com/latex/eqneditor.php*
### Question 2 - explore data
Explore and describe the data.
If you are unsure what to do here, think about your credit analysis in lab 2. E.g. How much data is there?, how is that data distributed? What are some summary statistics? Any missing values? Is each variable normally distributed?
Remember that you are writing this as a report for the Florida State Governor, so summarise any results you find in the text. Essentially your aim is to explore/describe the data so that the Florida State Governor is happy that the data is robust and you understand it.
### Question 3 - response and predictor
Identify your response and predictor variables. Create a professional looking scatterplot (including units, axis titles, good formatting etc) and fully describe the plot (in the way that you have done in Lecture 7).
As part of your response, explain if Simple Linear Regression might/might not be appropriate in this case.
### Question 4 - fit model
Fit a Simple Linear Regression model for the problem. Re-plot your scatter plot with the line of best fit.
Clearly interpret the estimated model parameters (slope & intercept)/model summary-statistics in the the context of the problem, in a way that would be understandable to a policy maker. For example you could show the regression equation including the parameter values and units in the description. You could talk about the amount of variation explained by the model etc.
### Question 5 - regression diagnostics
Use regression diagnostics to assess whether the model is appropriate. For example, does it meet the LINE assumptions? ([Tutorial 5B](#S.Tutorial.5B)) Again, make sure to explain any plots or statistical tests that you use in words (e.g. what are you trying to test for, what did you find out, what did it mean).
Assess if there are any outliers, high leverage points [Tutorial 6A](#S.Tutorial.6A).
### Question 6 - indentify residuals
Identify:
a. The name of the lake with highest residual mercury value [4 marks]
b. The name of the lake with highest leverage [4 marks]
c. The name of the lake with highest Cook's distance [4 marks]
### Question 7 - influential points
Another policy maker suggested that there are several influential points in the data that should be investigated. From your analysis, do you agree with this comment? Explain your reasoning and provide evidence.
### Question 8 - transformation
After examining the data, the results were double-checked and it was decided to keep ALL of the data points. Another team member suggested that perhaps the observed residual diagnostics are because the there is a lack of linearity between the two variables of interest. They proposed you should apply a transformation and refit the data.
Using the lecture notes/class discussions about possible starting points for transformations, how would you proceed? (i.e. would you transform the response variable, explanatory variable or both? What transformations could you use?) Clearly explain why you came to this conclusion.
_Note, we will be discussing this in the classes on Monday-15 March and Friday-19 March. (it's a 2 week lab)_
### Question 9 - two models
After some research, two models were proposed that might fit the data:
a. A log transformation on the explanatory variable (`log()` command)
b. The square root of the explanatory variable (`sqrt()` command)
Apply the two transformations to the data and choose which one is the best fit.
Explain your reasoning referring to any goodness of fit measures that you used.
### Question 10 - log transform
After talking with some marine biologists, you decide to use the log transformation as it has more physical relevance.
Summarise your new model. E.g. make a scatterplot, plot your new line of best fit and summarise the model itself, including the new model equation IN THE CONTEXT OF THE DATA (e.g. what are your coeffients now showing?)
Examine the residual diagnostics for your new fit. Explain clearly whether the new model meets each of the assumptions needed for linear regression and assess if there are any influential outliers. Is this model a better fit to the data than the original one?
### Question 11 - new lake
The Governor recently had a question from a member if the public who went fishing in a new lake that was not part of the study. We know the alkalinity level of that lake was 40mg/L. The member of the public wants to be 99% sure that they won't exceed the Florida Health Advisory level for Mercury levels in Fish, which is 1 $\mu g$ of Mercury.
Should they eat the fish? Explain your answer and show your evidence for how you came to your conclusion.
### Question 12 - health check
_This question is designed to be more difficult and realistic. I will answer points of clarification, but I will not help anyone work through it before the labs are submitted. However I will award partial marks for workings and how far you get_
The Florida Health Advisory level for Mercury levels in Fish is 1 $\mu g$ of Mercury. The Governor has accepted your model and is requiring state-wide alkalinity tests.
What is your safety cut-off value of alkalinity for new lakes? (You would like to be 95% sure that you aren't just seeing this result by chance). Provide evidence of how you got to your answer
## Submitting Lab 6
Remember to save your work throughout and to spell check your writing (next to the save button). Now, press the knit button again. If you have not made any mistakes in the code then R should create a new html file which includes your answers. This can be found in your Lab 6 folder and have a .html ending.
Check your html is complete by double clicking on to open it in your web-browser.
Now go to Canvas and submit BOTH your html and your .Rmd file in Lab 6. (See the end of Lab 1 for a screenshot)
### Lab 6 submission check
**HTML FILE SUBMISSION - 5 marks**
**RMD CODE SUBMISSION - 5 marks**
**MARKDOWN/CODE/WRITING STYLE - 10 MARKS**
10/10 - your report is very professional. There are tables of contents, headings/subheadings, your plots look great, you answer in full sentences and have used the spell check. You have written your answers below the relevant code chunk in full sentences in a way that is easy to find and grade. It's clear you put thought and effort into writing a good markdown document. I could use this as a class example. You would be comfortable showing it in a job interview.
8/10 - your report is fine on the basics, but not quite as snazzy & is clearly a homework.
less - as your report becomes harder to read.
**QUESTION 1 - 2 marks** BONUS (capped at 100%)
Your answer to question 1 was in professional table format.
**QUESTION 2 - 5 marks**
You have clearly summmarised the data
**QUESTION 3 - 5 marks**
Your response and predictor are identified. Your plot looks professional enough that you could give it to the Florida state governor (you can also use fig.cap in the chunk description if you want to add more information that way underneath the plot). You have explained if Simple Linear regression is appropriate.
**QUESTION 4 - 10 marks**
You have correctly fitted the simple linear model to the data and interpreted the model output clearly in the text in a way that would be understandable to a policy maker.
**QUESTION 5 - 9 marks**
You have correctly used model diagnostics to assess whether the model meets LINE assumptions. You have assessed whether there are outliers, influential points or points with high leverage
**QUESTION 6 - 6 marks**
The name of the lake with highest residual mercury value [2 marks]
The name of the lake with highest leverage [2 marks]
The name of the lake with highest Cook’s distance [2 marks]
**QUESTION 7 - 5 marks**
You have discussed whether there are two influential points or what might be going on
**QUESTION 8 - 5 marks**
You have discussed which transformations you might apply and why.
**QUESTION 9 - 10 marks**
You have transformed the data and compared the two models.
**QUESTION 10 - 5 marks**
You have fully summarised the new model and discussed if it is a better fit.
**QUESTION 11 - 10 marks**
You have assessed the question from the public about whether they should eat the fish and provided evidence for your answer.
**QUESTION 12 - 10 marks**
You have assessed the question about alkalinity safety levels (or as far as you get) & explained your answer.
[100 marks total]