---
title: "2 Training and Variable Selection"
author: "Paul Hively"
date: "April 14, 2016"
output: html_document
---
Source can be viewed [on GitHub](https://github.com/phively/gsb-sols-proj/blob/master/2%20Training%20and%20variable%20selection.Rmd)
# Logistic modeling
## First attempt
Begin by loading the data used for the visualizations in part 1.
```{r, message=F, warning=F, cache=F}
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
```
```{r, message=F, warning=F, cache=F}
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
```
* [Rscript0 - libraries.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript0%20-%20libraries.R)
* [f.Wrangle.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Wrangle.R)
* [Rscript1a - modeling data appends.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1a%20-%20modeling%20data%20appends.R)
Recall:
> The existing model ("JMod") has two components. First, the probability a solicitation $S_{i}$ comes in before the end of the fiscal year is a function of the current stage progress $c$ and time until the end of the year $t$, $P(S_{i}=1)=f(c_{i},t)$. The expected FRP in the current fiscal year is this probability times the expected amount, $E(FRP)=P(S_{i}=1)E(S_{i})$.
>
> Focus on the probability model. I assume that there should also be some effect due to the average close rate, season (e.g. people give more in December), planned ask amount, actual ask amount, and potentially interactions. Something like this:
>
> $$P(S_{ij}=1)=f(c_{i},t,\mu,t_{i}^{*},a_{j}^{*},a_{j})$$
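As a purely hypothetical illustration of that expectation (the 1/3 weight is taken from JMod's stage-by-month table shown further down):
```{r, eval=F}
# Hypothetical example: a solicitation in the Ask stage in March carries a JMod
# close probability of 1/3, so a $100k expected amount contributes ~$33.3k of
# expected FRP for the current fiscal year
p.close <- 1/3
expected.amt <- 100000
p.close * expected.amt
```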
Start by trying to directly predict the booked-in-FY probabilities. The simplest approach is to treat everything that isn't continuous as a factor and chuck it into a logistic regression model.
$$\text{logit}(p)=\log\Big(\frac{p}{1-p}\Big)=\eta$$
$$E(\eta_{i})=c_{i}+t_{i}^{*}$$
In other words, the logit of the expected close rate depends on an intercept $c_{i}$ due to stage and a month effect $t_{i}^{*}$.
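For concreteness, a tiny sketch of the mapping between that linear predictor and the probability scale, using made-up coefficient values (`plogis()` is the inverse logit):
```{r, eval=F}
# eta = c_i + t_i*: a stage intercept plus a month effect on the logit scale.
# The inverse logit, plogis(eta) = exp(eta)/(1 + exp(eta)), maps it back to a probability.
c.stage <- 0.5   # hypothetical stage intercept
t.month <- -1.2  # hypothetical month effect
plogis(c.stage + t.month)  # implied close probability, ~0.33
```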
```{r, message=F, warning=F, cache=F}
# Load some functions I wrote to calculate model diagnostics
source("f.Diagnostics.R")
```
* [f.Diagnostics.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Diagnostics.R)
```{r, message=F, warning=F, cache=F}
## Model each stage by when it was asked
glm.pln.1 <- glm(FY.Plan.Book ~ as.factor(month(Planning.Dt)), data=mdat, family=binomial())
glm.clr.1 <- glm(FY.Clear.Book ~ as.factor(month(Clear.Dt)), data=mdat, family=binomial())
glm.ask.1 <- glm(FY.Ask.Book ~ as.factor(month(Ask.Dt)), data=mdat, family=binomial())
glm.ora.1 <- glm(FY.Oral.Book ~ as.factor(month(Oral.Dt)), data=mdat, family=binomial())
modnames <- c("Plan","Clear","Ask","Oral")
# Compare each of the models
kable(CompareDeviances(models=list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames, debug=F), digits=3)
```
Month-by-month coefficients result in an unambiguously better model compared to just an intercept per stage.
Why four separate models rather than a factor for stage? Note that the goal is to determine whether hitting a certain stage results in a solicitation closing *that fiscal year*; the target $Y$ changes from stage to stage, so the model needs to change as well.
JMod has the following structure. Each cell gives the probability of closing, given that a solicitation is in the indicated stage and month and has an expected close date in that fiscal year:
| Stage | July-Feb| Mar-Apr | May | June |
|----------|:-------:|:-------:|:-------:|:-------:|
| Plan | 1/6 | 1/8 | 0 | 0 |
| Clear | 1/3 | 1/6 | 1/8 | 0 |
| Ask | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork| 1 | 1 | 1 | 1 |
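For reference, a minimal sketch of that lookup as an R matrix (the real implementation lives in f.JMod.R and also handles expected close dates and other edge cases; the month boundaries here are only what the table states, with the fiscal year running July through June):
```{r, eval=F}
# Sketch of JMod's probability lookup by stage and fiscal month
jmod.grid <- rbind(
  Plan      = c(rep(1/6, 8), rep(1/8, 2), 0,   0),
  Clear     = c(rep(1/3, 8), rep(1/6, 2), 1/8, 0),
  Ask       = c(rep(2/3, 8), rep(1/3, 2), 1/6, 1/8),
  Oral      = c(rep(2/3, 8), rep(1/3, 2), 1/6, 1/8),
  Paperwork = rep(1, 12)
)
colnames(jmod.grid) <- c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May","Jun")
jmod.grid["Ask", "Mar"]  # 1/3, per the table above
```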
Here are the confusion matrices, with a classification threshold of $p=.5$:
```{r, message=F, warning=F, cache=F}
# Count tables
ConfusionMatrix(list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames)
# Proportion tables
ConfusionMatrix(list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames, counts=F)
```
Compare errors for JMod versus the month-intercepts logistic model, by stage. My error function is the $L_{1}$ norm:
$$L_{1}(\epsilon)=\sum_{i=1}^{n}\left|\epsilon_{i}\right| = \sum_{i=1}^{n}\left|\hat{p}_{i}-p_{i}\right|$$
I also included the squared $L_{2}$ norm (equivalent to $RSS$ in OLS regression) for comparison, but it seems like an odd metric for range-restricted data.
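The `LpNorm()` helper used below is loaded from one of the sourced scripts; a minimal equivalent of the formula above, for reference:
```{r, eval=F}
# Lp norm of a residual vector: (sum of |e_i|^p)^(1/p).
# p=1 gives the sum of absolute residuals; the squared p=2 version is the RSS analogue.
LpNormSketch <- function(e, p=2) sum(abs(e)^p)^(1/p)
LpNormSketch(c(0.2, -0.5, 0.1), p=1)    # 0.8
LpNormSketch(c(0.2, -0.5, 0.1), p=2)^2  # 0.3
```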
```{r, message=F, warning=F, cache=F}
## Load functions for JMod
source("f.JMod.R")
```
* [f.JMod.R](https://github.com/phively/gsb-sols-proj/blob/master/f.JMod.R)
```{r, echo=F, message=F, warning=F, cache=F}
## Planning stage
# Residuals
jmod.1 <- JMod(curr.stage=1, curr.dt=mdat$Planning.Dt, expected.dt=mdat$Expected.Dt)
jmod <- na.omit(JModError(probs=jmod.1$prob, truth=mdat$FY.Plan.Book))
glmod <- JModError(probs=glm.pln.1$fitted, truth=model.frame(glm.pln.1)[,1])
# Combined residuals
ggdat <- data.frame(jmod, glm=glmod, closed=na.omit(mdat$FY.Plan.Book))
ggplot(ggdat, aes(x=glm, y=jmod, color=closed)) + geom_point() + geom_hline(yintercept=mean(jmod), alpha=.3) + geom_vline(xintercept=mean(glmod), alpha=.3) + labs(title="Plan JMod versus glm residual error and bias") + xlim(c(-1,1)) + ylim(c(-1,1))
# Metrics
matrix(c(LpNorm(jmod, p=1), LpNorm(jmod)^2, LpNorm(glmod, p=1), LpNorm(glmod)^2), nrow=2, dimnames=list(c("L1","L2"),c("JMod","glm")))
```
JMod wins on the sum of absolute residuals, but it is very negatively biased, to the tune of:
```{r, echo=F, message=F, warning=F, cache=F}
print(mean(jmod))
```
```{r, echo=F, message=F, warning=F, cache=F}
## Clear stage
# Residuals
jmod.2 <- JMod(curr.stage=2, curr.dt=mdat$Clear.Dt, expected.dt=mdat$Expected.Dt)
jmod <- na.omit(JModError(probs=jmod.2$prob, truth=mdat$FY.Clear.Book))
glmod <- JModError(probs=glm.clr.1$fitted, truth=model.frame(glm.clr.1)[,1])
# Combined residuals
ggdat <- data.frame(jmod, glm=glmod, closed=na.omit(mdat$FY.Clear.Book))
ggplot(ggdat, aes(x=glm, y=jmod, color=closed)) + geom_point() + geom_hline(yintercept=mean(jmod), alpha=.3) + geom_vline(xintercept=mean(glmod), alpha=.3) + labs(title="Clear JMod versus glm residual error and bias") + xlim(c(-1,1)) + ylim(c(-1,1))
# Metrics
matrix(c(LpNorm(jmod, p=1), LpNorm(jmod)^2, LpNorm(glmod, p=1), LpNorm(glmod)^2), nrow=2, dimnames=list(c("L1","L2"),c("JMod","glm")))
```
A tiny bit closer on the sum of absolute residuals, but JMod is still negatively biased, around:
```{r, echo=F, message=F, warning=F, cache=F}
print(mean(jmod))
```
```{r, echo=F, message=F, warning=F, cache=F}
## Ask stage
# Residuals
jmod.3 <- JMod(curr.stage=3, curr.dt=mdat$Ask.Dt, expected.dt=mdat$Expected.Dt)
jmod <- na.omit(JModError(probs=jmod.3$prob, truth=mdat$FY.Ask.Book))
glmod <- JModError(probs=glm.ask.1$fitted, truth=model.frame(glm.ask.1)[,1])
# Combined residuals
ggdat <- data.frame(jmod, glm=glmod, closed=na.omit(mdat$FY.Ask.Book))
ggplot(ggdat, aes(x=glm, y=jmod, color=closed)) + geom_point() + geom_hline(yintercept=mean(jmod), alpha=.3) + geom_vline(xintercept=mean(glmod), alpha=.3) + labs(title="Ask JMod versus glm residual error and bias") + xlim(c(-1,1)) + ylim(c(-1,1))
# Metrics
matrix(c(LpNorm(jmod, p=1), LpNorm(jmod)^2, LpNorm(glmod, p=1), LpNorm(glmod)^2), nrow=2, dimnames=list(c("L1","L2"),c("JMod","glm")))
```
Clear win for JMod; bias is only:
```{r, echo=F, message=F, warning=F, cache=F}
print(mean(jmod))
```
```{r, echo=F, message=F, warning=F, cache=F}
## Oral stage
# Residuals
jmod.4 <- JMod(curr.stage=4, curr.dt=mdat$Oral.Dt, expected.dt=mdat$Expected.Dt) # Ask and Oral are considered the same stage
jmod <- na.omit(JModError(probs=jmod.4$prob, truth=mdat$FY.Oral.Book))
glmod <- JModError(probs=glm.ora.1$fitted, truth=model.frame(glm.ora.1)[,1])
# Combined residuals
ggdat <- data.frame(jmod, glm=glmod, closed=na.omit(mdat$FY.Oral.Book))
ggplot(ggdat, aes(x=glm, y=jmod, color=closed)) + geom_point() + geom_hline(yintercept=mean(jmod), alpha=.3) + geom_vline(xintercept=mean(glmod), alpha=.3) + labs(title="Oral JMod versus glm residual error and bias") + xlim(c(-1,1)) + ylim(c(-1,1))
# Metrics
matrix(c(LpNorm(jmod, p=1), LpNorm(jmod)^2, LpNorm(glmod, p=1), LpNorm(glmod)^2), nrow=2, dimnames=list(c("L1","L2"),c("JMod","glm")))
```
JMod bias is:
```{r, echo=F, message=F, warning=F, cache=F}
print(mean(jmod))
```
Basically, what this comes down to is how concerned we are about bias. The negative bias is due to overaggressive discounting as the end of the fiscal year (June 30) approaches.
One more idea for model comparison: why not look at the classification tables directly? Since JMod only outputs a handful of fixed probabilities, a classification threshold doesn't really make sense. Instead, add each observation's predicted probabilities to the corresponding cells of a confusion matrix.
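A minimal sketch of that idea (the `ConfusionMatrixWeighted()` used below lives in f.Diagnostics.R and may differ in details):
```{r, eval=F}
# Sketch of a probability-weighted confusion matrix: each observation contributes
# p to its (truth, predicted TRUE) cell and 1-p to its (truth, predicted FALSE) cell,
# rather than a hard 0/1 classification
WeightedConfusion <- function(preds, truth) {
  out <- matrix(0, nrow=2, ncol=2,
                dimnames=list(truth=c("FALSE","TRUE"), pred=c("FALSE","TRUE")))
  out["FALSE", ] <- c(sum((1 - preds)[truth == 0]), sum(preds[truth == 0]))
  out["TRUE",  ] <- c(sum((1 - preds)[truth == 1]), sum(preds[truth == 1]))
  out
}
WeightedConfusion(preds=c(.9, .2, .6), truth=c(1, 0, 1))
```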
```{r, message=F, warning=F, cache=F}
## Confusion matrix summing probabilities, rather than counting classifications
# GLM for plan stage
ConfusionMatrixWeighted(preds=fitted(glm.pln.1), truth=model.frame(glm.pln.1)[,1], dimnames=c("truth", "pred"))
# JMod for plan stage
(jmod.1$errtable <- ConfusionMatrixWeighted(preds=jmod.1$prob, truth=mdat$FY.Plan.Book, dimnames=c("truth", "pred")))
```
Note that the antidiagonal terms sum to the corresponding $L_{1}$ norm reported in the metrics under "Plan JMod versus glm residual error and bias" above.
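(A quick, unevaluated check of that equivalence for the planning-stage glm, reusing the same helpers as above:)
```{r, eval=F}
# Each observation contributes p to an off-diagonal cell if truth is 0, or 1-p if
# truth is 1 -- exactly |p - y| -- so the off-diagonal mass equals the L1 norm
glm.resid <- JModError(probs=fitted(glm.pln.1), truth=model.frame(glm.pln.1)[,1])
cm <- ConfusionMatrixWeighted(preds=fitted(glm.pln.1), truth=model.frame(glm.pln.1)[,1],
                              dimnames=c("truth", "pred"))
sum(cm) - sum(diag(cm))  # off-diagonal (antidiagonal) mass
LpNorm(glm.resid, p=1)   # L1 norm of the glm residuals -- should match
```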
Planned solicitations -- while the glm is well balanced, JMod overwhelmingly classifies planned solicitations as unlikely to come in. This could be because *JMod is tuned toward directly predicting actual amounts*, and uses its weights to discount MG rather than worrying about the close probability of every AF gift.
```{r, message=F, warning=F, cache=F}
# GLM for clear stage
ConfusionMatrixWeighted(preds=fitted(glm.clr.1), truth=model.frame(glm.clr.1)[,1], dimnames=c("truth", "pred"))
# JMod for clear stage
(jmod.2$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.2$prob), truth=na.omit(mdat$FY.Clear.Book), dimnames=c("truth", "pred")))
```
Same story -- for cleared solicitations JMod overwhelmingly categorizes them as unlikely to close.
```{r, message=F, warning=F, cache=F}
# GLM for ask stage
ConfusionMatrixWeighted(preds=fitted(glm.ask.1), truth=model.frame(glm.ask.1)[,1], dimnames=c("truth", "pred"))
# JMod for ask stage
(jmod.3$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.3$prob), truth=na.omit(mdat$FY.Ask.Book), dimnames=c("truth", "pred")))
```
Same story for asked, but look how much better JMod does in the True-True quadrant, bottom right.
```{r, message=F, warning=F, cache=F}
# GLM for oral stage
ConfusionMatrixWeighted(preds=fitted(glm.ora.1), truth=model.frame(glm.ora.1)[,1], dimnames=c("truth", "pred"))
# JMod for oral stage
(jmod.4$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.4$prob), truth=na.omit(mdat$FY.Oral.Book), dimnames=c("truth", "pred")))
```
Oral is different; JMod is much too pessimistic. Oral pledges actually behave more like paperwork in house than like asks, and should mainly be assigned $p=0$ when they are scheduled to close in a future fiscal year.
### Conclusions
* Oral stage really needs its own weights.
* The *most important* variable not currently captured by glm is whether the expected date is in a future year.
# Variable selection
Definitely include an indicator for **expected date in the future**.
From [part 1](https://github.com/phively/gsb-sols-proj/blob/master/1%20Data%20exploration%20commentary.Rmd), some interesting variables include the **expected amount**, and whether the **ask amount/planned ask** ratio is greater or less than 1.
Potentially interesting derived variables include the **number of days left in the year**, the **number of days until expected close**, and the **amount of time that's passed since the previous stage**.
Some other thoughts:
1. Is logistic regression really suitable (overdispersion)?
2. Should it be restricted to MG-range gifts, $[\$25\text{k},\$5\text{M})$?
3. What nonparametric methods are worth trying?
First, let's create those derived variables.
```{r, message=F, warning=F, cache=F}
#### Derived variables ----
source("Rscript1b - derived variables.R")
```
* [Rscript1b - derived variables.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1b%20-%20derived%20variables.R)
Think about #3 first. I'm a fan of classification trees, and that general method is the closest to the existing approach. It would also be interesting to see which of the proposed variables end up being chosen.
Try the random forest approach to select variables. First, split the variables up by the stages at which they can appropriately be used for prediction.
```{r, message=F, warning=F, cache=F}
## Identify which dependent variables contain information specific to the various stages
## Obviously shouldn't include after-the-fact information (FY. objects, actual amount, etc.)
head.all <- names(mderived)[names(mderived) %in% c("Solicitation.Type.Desc", "Planned.Fiscal.Mo", "Planned.Amt", "Expected.Fiscal.Mo", "Expected.Amt", "Planned.Ask.Band")]
# Look for "plan" and "Plan" in header ([pP] = both cases), but ignore the FY. outcome objects
head.plan <- names(mderived)[setdiff(grep("[pP]lan", names(mderived)), grep("FY\\.", names(mderived)))]
head.plan <- setdiff(head.plan, c(head.all, "plan2clear")) #take out plan2clear which wouldn't have been known
# Look for "Clear" in header
head.clear <- names(mderived)[setdiff(grep("[cC]lear", names(mderived)), grep("FY.", names(mderived)))]
head.clear <- setdiff(head.clear, c(head.all, "clear2ask")) #take out clear2ask which wouldn't have been known
# Look for "Ask" in header
head.ask <- names(mderived)[setdiff(grep("[aA]sk", names(mderived)), grep("FY.", names(mderived)))]
head.ask <- setdiff(head.ask, c(head.all, "ask2oral")) #take out ask2oral which wouldn't have been known
head.ask <- c(head.ask, "plan2clear") #add fields that would have been known
# Look for "Oral" in header
head.oral <- names(mderived)[setdiff(grep("[oO]ral", names(mderived)), grep("FY.", names(mderived)))]
head.oral <- setdiff(head.oral, head.all)
head.oral <- c(head.oral, "plan2clear", "clear2ask", "Ask.Amt", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac") #add fields that would've been known
```
Now grow a random forest for each stage.
```{r, message=F, warning=F, cache=F}
library(randomForest)
set.seed(15584)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
```
The green lines are the out-of-bag error (similar to cross-validation). (Note that `PlotForest()` is included in [f.Diagnostics.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Diagnostics.R).)
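For reference, a rough, unevaluated sketch of what such a plot involves (the real `PlotForest()` may differ):
```{r, eval=F}
# Sketch: plot the OOB and per-class error rates against the number of trees grown.
# randomForest stores these in rf$err.rate, one row per tree.
PlotForestSketch <- function(rf, title="") {
  err <- rf$err.rate
  ggdat <- data.frame(trees=rep(seq_len(nrow(err)), times=ncol(err)),
                      error=as.vector(err),
                      type=rep(colnames(err), each=nrow(err)))
  ggplot(ggdat, aes(x=trees, y=error, color=type)) + geom_line() +
    labs(title=paste("Error rates by number of trees:", title), y="Error rate")
}
```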
```{r, message=F, warning=F, cache=F}
PlotForest(rf.plan, title="Plan")
PlotForest(rf.clear, title="Clear")
PlotForest(rf.ask, title="Ask")
PlotForest(rf.oral, title="Oral")
```
Finally, compare misclassification rates, $\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{x_{i}\text{ misclassified}\}$: the GLM errors from the confusion matrix at threshold $p=0.5$, the JMod errors from the weighted confusion matrix, and the out-of-bag random forest errors:
```{r, echo=F, message=F, warning=F, cache=F}
# Put together the confusion matrices again
confuse <- ConfusionMatrix(list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames, counts=F)
# First logistic regression errors
dat1 <- CalcErrorRates(confuse, model.name="GLM")
## JMod errors
dat2 <- data.frame(Model="JMod", Plan=1-sum(diag(jmod.1$errtable))/sum(jmod.1$errtable), Clear=1-sum(diag(jmod.2$errtable))/sum(jmod.2$errtable), Ask=1-sum(diag(jmod.3$errtable))/sum(jmod.3$errtable), Oral=1-sum(diag(jmod.4$errtable))/sum(jmod.4$errtable))
## randomForest errors
dat3 <- data.frame(Model="Random forest", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1))
## error table
errors <- rbind(dat1, dat2, dat3)
kable(errors, digits=2)
```
Great out-of-bag error rates across the board. The random forests are definitely doing something right. Which variables do they find the most helpful?
```{r, echo=F, message=F, warning=F, cache=F}
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .02 above the maximum in the data
top <- ceiling(max(rf.importances$MeanDecreaseAccuracy)*50)/50
ggplot(rf.importances, aes(x=Variable, y=MeanDecreaseAccuracy, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.02))
```
* The `Stage.Future`, `stage2EOFY`, `stage.fiscal.mo`, and `stage2expect` variables show up as a distinctive double-peak pattern per stage. Note that they are all correlated -- because the forests only consider a random subset of variables at each split, correlated predictors can substitute for one another, so each shows up as fairly important (see the quick check after this list).
* Out of the four of them I would personally ditch `stage2EOFY` and consider dropping one of the `stage.fiscal.mo` or `stage2expect` variables as well.
* `Expected.Amt` and `Planned.Amt` are good predictors before the Oral stage; `Planned.Ask.Band` is worse across the board.
* `Expected.Fiscal.Mo` is important across the board, while `Planned.Fiscal.Mo` is useful before the Oral stage.
* Similarly, `Solicitation.Type.Desc` is a decent predictor before the oral stage.
* The various times between stages e.g. `plan2clear` are not good predictors.
* `Ask.Amt` is a good predictor in the Ask stage, certainly much better than `Ask.Band`; the two `Ask.Amt.Over.` fractions are useless.
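A quick, unevaluated check of that redundancy claim for the Plan-stage timing variables (column names taken from the derived-variables script; I'm assuming they are numeric or cleanly coercible):
```{r, eval=F}
# Correlations among the Plan-stage timing variables flagged above; high correlations
# would explain why they share importance in the forest
vars <- c("plan2EOFY", "planning.fiscal.mo", "plan2expect")
cor(sapply(mderived[, vars], as.numeric), use="complete.obs")
```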
Re-fit the random forests, cutting out the clearly unimportant variables.
```{r, message=F, warning=F, cache=F}
## Update the explanatory variables
exclude <- c("plan2EOFY", "clear2EOFY", "ask2EOFY", "oral2EOFY", "plan2clear", "clear2ask", "ask2oral", "Planned.Ask.Band", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac")
head.all <- setdiff(head.all, exclude)
head.plan <- setdiff(head.plan, exclude)
head.clear <- setdiff(head.clear, exclude)
head.ask <- setdiff(head.ask, exclude)
head.oral <- setdiff(head.oral, exclude)
## Regrow the random forests
set.seed(66461)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
```
Plotting the new importances:
```{r, echo=F, message=F, warning=F, cache=F}
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .02 above the maximum in the data
top <- ceiling(max(rf.importances$MeanDecreaseAccuracy)*50)/50
ggplot(rf.importances, aes(x=Variable, y=MeanDecreaseAccuracy, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.02))
```
And the new errors table:
```{r, echo=F, message=F, warning=F, cache=F}
## randomForest errors
errors <- rbind(errors, data.frame(Model="Random forest 2", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1), stringsAsFactors=F))
## error table
kable(errors, digits=2)
```
Looks like I might even be able to get away with taking out some more predictors. What if we also exclude the redundant `stage2expect`, along with `Expected.Fiscal.Mo` and `Planned.Fiscal.Mo`, which are overwhelmingly December or June?
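(A quick, unevaluated way to eyeball that December/June claim:)
```{r, eval=F}
# Distribution of expected and planned fiscal months; the claim is that most of the
# mass sits in December and June
table(mderived$Expected.Fiscal.Mo)
table(mderived$Planned.Fiscal.Mo)
```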
```{r, message=F, warning=F, cache=F}
## Update the explanatory variables
exclude <- c(exclude, "plan2expect", "clear2expect", "ask2expect", "oral2expect", "Expected.Fiscal.Mo", "Planned.Fiscal.Mo")
head.all <- setdiff(head.all, exclude)
head.plan <- setdiff(head.plan, exclude)
head.clear <- setdiff(head.clear, exclude)
head.ask <- setdiff(head.ask, exclude)
head.oral <- setdiff(head.oral, exclude)
## Regrow the random forests
set.seed(48676)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## randomForest errors
errors <- rbind(errors, data.frame(Model="Random forest 3", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1), stringsAsFactors=F))
## error table
kable(errors, digits=2)
```
Pretty much the same as the other forest methods, so I prefer this last Random forest 3 as the more parsimonious model. Finally, how do random forests compare to JMod when given the same amount of information?
```{r, message=F, warning=F, cache=F}
## Regrow the random forests
set.seed(30635)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", "planning.fiscal.mo", "plan2EOFY")])
rf.plan.jm <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", "clear.fiscal.mo", "clear2EOFY")])
rf.clear.jm <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", "ask.fiscal.mo", "ask2EOFY")])
rf.ask.jm <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", "oral.fiscal.mo", "oral2EOFY")])
rf.oral.jm <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## randomForest errors
errors <- rbind(errors, data.frame(Model="Random forest, JMod fields only", Plan=tail(rf.plan.jm$err.rate[,"OOB"],1), Clear=tail(rf.clear.jm$err.rate[,"OOB"],1), Ask=tail(rf.ask.jm$err.rate[,"OOB"],1), Oral=tail(rf.oral.jm$err.rate[,"OOB"],1), stringsAsFactors=F))
## error table
kable(errors, digits=2)
## save to disk for later
write.table(errors, "classification.errors.txt", sep="\t", row.names=F)
```
Interesting -- limiting the random forest to the information JMod uses (stage, via fitting one forest per stage, plus the corresponding `stage.fiscal.mo` and `stage2EOFY` fields) results in a model that doesn't do any better than the basic GLM. Random forest 3 is parsimonious, so it looks like we may have a final set of variables.
## Plan
```{r, echo=F, message=F, warning=F, cache=F}
c(head.all, head.plan)
```
## Clear
```{r, echo=F, message=F, warning=F, cache=F}
c(head.all, head.clear)
```
## Ask
```{r, echo=F, message=F, warning=F, cache=F}
c(head.all, head.ask)
```
## Oral
```{r, echo=F, message=F, warning=F, cache=F}
c(head.all, head.oral)
```
We see the following are consistently important:
* `Solicitation.Type.Desc`, `Planned.Amt`, and `Expected.Amt` are useful across the board
* `Stage.Future` and `stage.fiscal.mo` for the corresponding stages are useful
* `Ask.Amt` is useful in the ask and oral stages
`Stage.Future` and `stage.fiscal.mo` are both already used by JMod, so this introduces three new predictors useful for assessing a solicitation's probability of being booked in a given year (plus `Ask.Amt` for the last two stages).
```{r, echo=F, message=F, warning=F, cache=F}
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .02 above the maximum in the data
top <- ceiling(max(rf.importances$MeanDecreaseAccuracy)*50)/50
ggplot(rf.importances, aes(x=Variable, y=MeanDecreaseAccuracy, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.02))
```
# Packages used
```{r}
session_info()
```