---
title: "Statistical Modelling of Fuel Efficiency"
author: "Brian Pan"
date: "3/10/2020"
output:
  md_document:
    variant: markdown_github
---
```{r setup, include=FALSE, message=FALSE, echo=TRUE}
library(Pmisc)
library(tidyverse)
library(lme4)
knitr::opts_chunk$set(echo = TRUE)
```
A data description is available at https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle.\
\
```{r, data loading}
cUrl = 'https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip'
cFile = file.path(tempdir(), basename(cUrl))
download.file(cUrl, cFile)
cFile2 = unzip(cFile, exdir=tempdir())
x = read.table(cFile2, sep=',', header=TRUE, stringsAsFactors=FALSE)
```
\
```{r data cleaning}
# Have a look at the raw data
head(x)
# Restrict to non-electric, non-CNG vehicles, since electric vehicles are hard to compare with.
xSub = x[grep("Electricity|CNG", x$fuelType, invert=TRUE), ]
# Create a decade variable, centered on the year 2000
xSub$decade = (xSub$year - 2000)/10
# Create a table of the car makes, ordered from most to fewest cars in the dataset
makeTable = sort(table(xSub$make), decreasing=TRUE)
# Use the table above to make a factor whose levels are ordered by frequency of make
xSub$makeFac = factor(xSub$make, levels=names(makeTable))
# Make a factor for the number of cylinders,
# with 4-cylinder vehicles as the reference group.
xSub$cylFac = relevel(factor(xSub$cylinders), '4')
# Check this worked
levels(xSub$cylFac)
# Get rid of vehicles with missing cylinder number
xSub = xSub[!is.na(xSub$cylFac), ]
# Simplify the transmission variable, using grepl to detect manual transmissions:
xSub$transmission = factor(
  grepl("Manual", xSub$trany), levels=c(FALSE,TRUE),
  labels = c('Automatic', 'Manual'))
# Make a variable indicating whether the vehicle has four-wheel drive
xSub$FWD = grepl("([[:punct:]]|[[:space:]])+4(WD|wd)", xSub$VClass)
# Make a new, simpler vehicle type variable
xSub$type = gsub("Vans.*", "Vans", xSub$VClass)
xSub$type = gsub("Vehicles", "Vehicle", xSub$type)
xSub$type = gsub("Standard[[:space:]]+", "", xSub$type)
xSub$type = factor(gsub("([[:punct:]]|[[:space:]])+(2|4)(WD|wd)", "", xSub$type))
```
\
We are interested in the following variables:\
**comb08**: A combined measure of fuel use across highway and city driving;\
**decade**: Decades since 2000, i.e. (year - 2000)/10;\
**transmission**: Whether the vehicle is manual or automatic;\
**makeFac**: The make of the car;\
**cylFac**: The number of cylinders;\
**type**: The type of the car;\
**FWD**: Whether the car has four-wheel drive.\
\
After selecting the variables, we get the final dataset:
```{r final dataset}
# final dataset
car_final <- xSub %>%
  select(comb08, decade, transmission, makeFac, cylFac, type, FWD)
```
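A quick look at the structure of the final dataset (an added check, not in the original analysis):
```{r final dataset structure}
# Confirm the number of observations and the type of each variable
dplyr::glimpse(car_final)
```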
As shown above, the final dataset has 41696 observations, with 1 numerical response variable and 6 covariates (5 of them categorical).\
\
```{r linear model}
# Fit an ordinary linear model with all covariates
lm_fit = lm(comb08 ~ ., data = car_final)
# summary(lm_fit)
```
However, I am concerned about the independence assumption of linear regression: several car makers produce multiple vehicle types (for example, Audi produces the A3, A4, A5, A7, etc.), and their fuel efficiencies are likely to be related.\
\
I would like to visualize fuel efficiency by make:
```{r diagram_fuelEffByMake}
# Using tidyverse: boxplots of fuel efficiency by make, with make means (red)
# and the grand mean (blue)
car_final %>%
  mutate(grand_mean = mean(comb08)) %>%
  group_by(makeFac) %>%
  mutate(n_make = n()) %>%
  filter(n_make >= 1000 | n_make %in% (5:15)) %>%
  mutate(mean = mean(comb08)) %>%
  ggplot(aes(x = makeFac, y = comb08)) +
  geom_boxplot() +
  geom_point(aes(y = mean), color = "red") +
  geom_hline(aes(yintercept = grand_mean), color = "blue") +
  coord_flip()
```
It is reasonable to conclude that fuel efficiencies of vehicles sharing the same make are correlated. Thus, we should fit a linear mixed model.\
\
Fitting a LMM:
```{r lmm model}
lmm_make <- lme4::lmer(comb08 ~ cylFac + decade + transmission +
                         (1 | makeFac),
                       data = car_final)
summary(lmm_make)
```
**makeFac** has a random-intercept variance ($\sigma^2$) of 3.308, which is the variance between makes. The residual variance, denoted $\tau^2$, is 9.238.\
<center>$\sigma^2/(\sigma^2+\tau^2) = 3.308/(3.308+9.238) \approx 0.26$</center>\
That means the random intercept accounts for roughly 26 percent of the total variance, which is a worthwhile gain.\
\
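The same proportion can be computed directly from the fitted model (an added check, using lme4's `VarCorr` accessor):
```{r icc check}
# Proportion of total variance attributable to the make-level random intercept
vc <- as.data.frame(lme4::VarCorr(lmm_make))
vc$vcov[vc$grp == "makeFac"] / sum(vc$vcov)
```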
Exploring the intercepts for each make:
```{r}
# Extract the estimated random intercepts and their conditional variances
my_random_effects <- lme4::ranef(lmm_make, condVar=TRUE)
ranef_df <- as.data.frame(my_random_effects)
# ggplot: point estimates with approximate 95% intervals
ranef_df %>%
  ggplot(aes(x = grp, y = condval,
             ymin = condval - 2*condsd, ymax = condval + 2*condsd)) +
  geom_point() +
  geom_errorbar() +
  coord_flip()
```
The solid black points in the diagram above represent the estimated random intercept for each car make, and the horizontal lines are the corresponding confidence intervals. Some of these intervals do not overlap, meaning at least two random intercepts differ. Thus, including the random effect is helpful.\
\
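As a quick numeric check of the non-overlap claim (an added line, not in the original analysis), we can count how many make-level intervals exclude zero:
```{r interval check}
# Number of makes whose approximate 95% interval excludes zero
sum(abs(ranef_df$condval) > 2 * ranef_df$condsd)
```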
Fuel efficiency of auto manufacturers adjusted for engine type:
```{r diagram_automanufacturers}
# Build a data frame of random-intercept estimates and their conditional
# standard errors (square root of the conditional variance)
x = data.frame(
  make = rownames(my_random_effects$makeFac),
  est = my_random_effects$makeFac[[1]],
  se = sqrt(drop(attributes(my_random_effects$makeFac)$postVar)),
  stringsAsFactors = FALSE
)
x$lower = x$est - 2*x$se
x$upper = x$est + 2*x$se
# Drop very imprecise estimates and order makes by their estimated intercept
x = x[x$se < 2, ]
x = x[order(x$est), ]
x$index = rank(x$est)
x$accurate = rank(x$se) < 40
# Colours: transparent lines for imprecise makes; point size reflects precision
x$col = rep_len(RColorBrewer::brewer.pal(8, 'Set2'), nrow(x))
x$colTrans = paste0(x$col, '40')
x$colLine = x$col
x[!x$accurate, 'colLine'] = x[!x$accurate, 'colTrans']
x$cex = -log(x$se)
x$cex = x$cex - min(x$cex)
x$cex = 3*x$cex / max(x$cex)
# Label positions: alternate sides, push imprecise makes to the plot margins
x$textpos = rep_len(c(4,2), nrow(x))
x[!x$accurate & x$est > 0, 'textpos'] = 4
x[!x$accurate & x$est < 0, 'textpos'] = 2
x$textloc = x$est
x$textCex = c(0.5, 0.9)[1 + x$accurate]
par(mar=c(4,0,0,0), bty='n')
plot(x$est, x$index, yaxt='n', xlim = range(x$est),
     #xlim = range(x[,c('lower','upper')]),
     xlab='mpg', ylab='', pch=15, col=x$colTrans, cex=x$cex)
x[!x$accurate & x$est > 0, 'textloc'] = par('usr')[1]
x[!x$accurate & x$est < 0, 'textloc'] = par('usr')[2]
abline(v=0, col='grey')
segments(x$lower, x$index, x$upper, x$index, col=x$colLine)
text(
  x$textloc,
  x$index, x$make,
  pos = x$textpos,
  col = x$col,
  cex = x$textCex, offset = 1)
```
Looking at the diagram above, the estimated intercepts follow an "S"-shaped tendency across makes, which again supports including the random intercept.\
\
After exploring the diagrams, we now focus on whether the covariates are useful.\
Run another linear mixed model, adding vehicle type and FWD:
```{r lmm_add_variables}
lmm_make_2 = lme4::lmer(comb08 ~ cylFac + type + FWD + decade + transmission +
                          (1 | makeFac),
                        data = car_final)
summary(lmm_make_2)
```
To find which model is better, we will conduct a likelihood ratio test:
```{r LRT}
lmtest::lrtest(lmm_make_2, lmm_make)
```
The p-value is significant, therefore we prefer the larger model (lmm_make_2).\
\
Fit LMM with random slope:
```{r LMM_randomSlope}
lmm_make_3 = lme4::lmer(comb08 ~ cylFac + type + FWD + decade + transmission +
                          (1 + decade | makeFac),
                        data = car_final)
summary(lmm_make_3)
```
Conduct another LRT to compare the random slope:
```{r LRT2}
lmtest::lrtest(lmm_make_2, lmm_make_3)
```
The p-value is significant, therefore we conclude that the random slope on decade is needed.\
\
**However**, there is a problem: with a sample this large, the LRT p-value is almost always significant, and the test is also on the boundary of the parameter space ($\sigma^2>0$), so the usual chi-square approximation is questionable. As a rough check, we reduce the sample size.\
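As an aside (not part of the original analysis), the boundary problem itself is often handled with a 50:50 chi-square mixture correction of the LRT p-value; a minimal sketch, reusing the two models fitted above:
```{r boundary corrected LRT}
# Refit both models by maximum likelihood so their log-likelihoods are comparable
m0 <- lme4::refitML(lmm_make_2)
m1 <- lme4::refitML(lmm_make_3)
lr_stat <- as.numeric(2 * (logLik(m1) - logLik(m0)))
# The random slope adds a variance and a covariance, so the boundary-corrected
# null distribution is approximately a 50:50 mixture of chi-square(1) and chi-square(2)
0.5 * pchisq(lr_stat, df = 1, lower.tail = FALSE) +
  0.5 * pchisq(lr_stat, df = 2, lower.tail = FALSE)
```
This correction does not address the large-sample concern, so we still refit on a reduced sample: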
```{r reduce sample size}
# Restrict the data to car makes with at least 800 vehicles
car_800 <- car_final %>%
  mutate(grand_mean = mean(comb08)) %>%
  group_by(makeFac) %>%
  mutate(n_make = n()) %>%
  filter(n_make >= 800)
# Re-fit the random-intercept model on the reduced dataset
lmm_make_reduced1 = lme4::lmer(comb08 ~ cylFac + type + FWD + decade + transmission +
                                 (1 | makeFac),
                               data = car_800)
# Re-fit the random-slope model on the reduced dataset
lmm_make_reduced2 = lme4::lmer(comb08 ~ cylFac + type + FWD + decade + transmission +
                                 (1 + decade | makeFac),
                               data = car_800)
# Redo the LRT
lmtest::lrtest(lmm_make_reduced1, lmm_make_reduced2)
```
The p-value is still significant, so we should include the random slope; that is, the decade slopes differ significantly across car makes.\
\
Check the slopes by graphing:
```{r diagram_sloperandom}
car_800 %>%
  ggplot(aes(x = decade, y = comb08)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  facet_wrap(~makeFac)
```
This diagram is somewhat surprising: the slopes look quite similar across makes.\
This is one of the issues with the LRT: for very large datasets we will often see a significant p-value even when, eyeballing the data, we would hesitate to claim that different slopes are needed.\
\
Now we use the car_final data again.\
We want to check whether the response variable needs a link function.\
```{r histogram}
# hist of response
hist(car_final$comb08)
```
As we can see, the distribution is not normal.\
Double-check the result with ggplot:
```{r ggplot response}
# ggplot
car_final %>%
  ggplot(aes(x = comb08)) +
  geom_histogram(fill = "darkgrey", color = "black", bins = 60)
```
Indeed, the response is right-skewed and continuous, so a Gamma distribution with a log link is a better choice than the normal.\
The distribution is centered around 20; because of the right skew the mean is higher than the median, and the largest values approach 60.\
This makes us reluctant to rely on a plain linear mixed model.\
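Before fitting the GLMM, a quick numeric check of these claims (an added line, not in the original analysis):
```{r response summary}
# Five-number summary plus mean of the combined fuel-efficiency measure
summary(car_final$comb08)
```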
Fit a generalized linear mixed model:
```{r GLMM}
# Gamma GLMM with a log link, fitted to the vehicles made before 2000
glmm_make = lme4::glmer(comb08 ~ cylFac + transmission + decade + (1 | makeFac),
                        family = Gamma(link = log),
                        data = xSub[xSub$year < 2000, ])
lme4::VarCorr(glmm_make)
lattice::dotplot(lme4::ranef(glmm_make), condVar = TRUE)
```
The estimated random effects differ noticeably across makes.\
\
We see that the random effect works well, so we take this model as our final one for describing the relationship between vehicle fuel efficiency and the year, the transmission type, the make, the number of cylinders, the vehicle type, and four-wheel drive.\
\
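As a small illustration of using this model (an added sketch, not in the original analysis), we can compare a few observed fuel efficiencies with the model's fitted values on the response scale:
```{r fitted values illustration}
# Observed vs model-fitted combined mpg for the first few vehicles in the
# pre-2000 fitting data
fit_data <- xSub[xSub$year < 2000, ]
head(data.frame(make = fit_data$make,
                observed = fit_data$comb08,
                fitted = fitted(glmm_make)))
```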
**End of the statistical modelling report.**