---
title: "Booth Solicitations Write-up"
author: "Paul Hively"
date: "June 16, 2016"
output: html_document
---
# Problem statement
My goal is to find a way to accurately forecast fundraising revenue.
The existing model ("JMod") has two components. First, the probability a solicitation $Sol_{i}$ is successfully booked before the end of the fiscal year is a function of the current stage progress $Stage$ and time until the end of the year $t$, $P(Sol_{i}=1)=f(Stage_{i},t)$. The expected FRP in the current fiscal year is this probability times the expected amount, $E(FRP)=\sum_{i} P(Sol_{i}=1)E(Sol_{i})$, discounted to 0 if the expected close date is after the end of the fiscal year.
The probabilities are adjusted from year to year; these were the weights used in fiscal year 2015, 7/1/14 -- 6/30/15.
| Stage | July-Feb| Mar-Apr | May | June |
|----------|:-------:|:-------:|:-------:|:-------:|
| Plan | 1/6 | 1/8 | 0 | 0 |
| Clear | 1/3 | 1/6 | 1/8 | 0 |
| Ask | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork| 1 | 1 | 1 | 1 |
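The table lookup and discounting rule above can be sketched directly in R (illustrative only, and not evaluated as part of this report; `jmod_weights` and `jmod_expected_frp` are hypothetical names, as the actual implementation is sourced from `f.JMod.R`):

```r
# Stage-by-period weight table, as in FY15
jmod_weights <- matrix(
  c(1/6, 1/8, 0,   0,
    1/3, 1/6, 1/8, 0,
    2/3, 1/3, 1/6, 1/8,
    2/3, 1/3, 1/6, 1/8,
    1,   1,   1,   1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Plan", "Clear", "Ask", "Oral", "Paperwork"),
                  c("Jul-Feb", "Mar-Apr", "May", "Jun"))
)

# Expected FRP: look up each solicitation's weight, zero it out if the
# expected close date falls after fiscal year end, then sum weight * amount
jmod_expected_frp <- function(stage, period, expected_amt, expected_dt, fy_end) {
  p <- jmod_weights[cbind(stage, period)]
  p[expected_dt > fy_end] <- 0
  sum(p * expected_amt)
}
```

For example, a \$100k Ask-stage solicitation reported in March contributes 1/3 of its expected amount to the forecast, and nothing at all once its expected close date moves past fiscal year end.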
I investigated the following:
* How accurate is this method?
* Is each of these variables (expected amount, current date, expected date) predictive?
* Is there an alternate model for $P(Sol_{i}=1)$ that performs better?
* Are there additional variables that will improve the model's accuracy?
# Data
The response variable is whether a solicitation was booked by the end of the fiscal year given its state at some earlier point. Covariates include current stage, current date, expected close date, planned ask amount, etc.
For model fitting and comparison, data was compiled on 1089 solicitations opened and closed between 7/1/2011 and 6/30/2015 with known outcome (booked, refused, cancelled).
For initial model validation, point-in-time data was compiled from 153 historical reports run between 8/1/2011 and 2/2/2016. One solicitation can be included in consecutive months provided that it hasn't yet closed.
Final model comparisons include out-of-sample data compiled from reports run between 7/1/2015 and 6/30/2016.
# Model structure
The most straightforward statistical model extends JMod, assuming that the probability a solicitation closes is some function of the solicitation stage and time of year $t$ -- in this case, fiscal month as a factor. The logistic regression model is:
$$ logit(P(Sol_{i}=1 | Stage_{j})) = \mu_{j} + t_{i} + \epsilon_{i} $$
Note that this does not include any information about expected close date. Comparing this approach to JMod using the 1089-row dataset (i.e. an in-sample test) yields the following error rates:
```{r, echo=F, message=F, warning=F, cache=F}
## Load libraries and functions
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
## Retrieve data
source("Rscript1b - derived variables.R")
## Back up mdat for use later
mdat.bak <- mdat
## Exclude solicitations closing after 6/30/2015 so we have a fair out-of-sample comparison later
mdat <- mdat %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
mderived <- mderived %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
## Error rates for JMod
jmod.1 <- JMod(curr.stage=1, curr.dt=mdat$Planning.Dt, expected.dt=mdat$Expected.Dt)
jmod.1$errtable <- ConfusionMatrixWeighted(preds=jmod.1$prob, truth=mdat$FY.Plan.Book, dimnames=c("truth", "pred"))
jmod.2 <- JMod(curr.stage=2, curr.dt=mdat$Clear.Dt, expected.dt=mdat$Expected.Dt)
jmod.2$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.2$prob), truth=na.omit(mdat$FY.Clear.Book), dimnames=c("truth", "pred"))
jmod.3 <- JMod(curr.stage=3, curr.dt=mdat$Ask.Dt, expected.dt=mdat$Expected.Dt)
jmod.3$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.3$prob), truth=na.omit(mdat$FY.Ask.Book), dimnames=c("truth", "pred"))
jmod.4 <- JMod(curr.stage=4, curr.dt=mdat$Oral.Dt, expected.dt=mdat$Expected.Dt) # Ask and Oral are considered the same stage
jmod.4$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.4$prob), truth=na.omit(mdat$FY.Oral.Book), dimnames=c("truth", "pred"))
errors <- data.frame(Model="JMod", Plan=1-sum(diag(jmod.1$errtable))/sum(jmod.1$errtable), Clear=1-sum(diag(jmod.2$errtable))/sum(jmod.2$errtable), Ask=1-sum(diag(jmod.3$errtable))/sum(jmod.3$errtable), Oral=1-sum(diag(jmod.4$errtable))/sum(jmod.4$errtable))
## Error rates for basic GLM
## Model each stage by when it was asked
glm.pln.1 <- glm(FY.Plan.Book ~ as.factor(month(mdat$Planning.Dt)), data=mdat, family=binomial())
glm.clr.1 <- glm(FY.Clear.Book ~ as.factor(month(mdat$Clear.Dt)), data=mdat, family=binomial())
glm.ask.1 <- glm(FY.Ask.Book ~ as.factor(month(mdat$Ask.Dt)), data=mdat, family=binomial())
glm.ora.1 <- glm(FY.Oral.Book ~ as.factor(month(mdat$Oral.Dt)), data=mdat, family=binomial())
modnames <- c("Plan","Clear","Ask","Oral")
confuse <- ConfusionMatrix(list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames, counts=F)
errors <- rbind(errors, CalcErrorRates(confuse, model.name="GLM"))
## Table output
kable(errors, digits=2)
```
These are fairly close, except that using the simple logistic regression, the Oral error rates are *much* lower. This suggests that at the least, the Ask and Oral stages should be treated differently. This can be confirmed by examining close rates in the underlying data by month and stage:
```{r, echo=F, message=F, warning=F, cache=F}
# Data manipulation
ggdat <- mdat %>%
mutate(
GiftSize = ifelse(Ask.Amt >= 5000000, "PG",
ifelse(Ask.Amt >= 25000, "MG",
ifelse(Ask.Amt < 25000, "AF",
"Unasked"))),
Plan = as.numeric(month(Planning.Dt)),
Clear = as.numeric(month(Clear.Dt)),
Ask = as.numeric(month(Ask.Dt)),
Oral = as.numeric(month(Oral.Dt)),
Count = 1
) %>%
select(Count, Plan, Clear, Ask, Oral, FY.Plan.Book, FY.Clear.Book, FY.Ask.Book, FY.Oral.Book, GiftSize) %>%
gather("Stage", "Month", 2:5, factor_key=T) %>%
gather("Outcome", "Booked.In.FY", 2:5) %>%
# Need to get rid of mismatched rows, e.g. where Stage=Ask and Outcome=FY.Clear.Book
filter(
(Stage=="Plan" & Outcome=="FY.Plan.Book") |
(Stage=="Clear" & Outcome=="FY.Clear.Book") |
(Stage=="Ask" & Outcome=="FY.Ask.Book") |
(Stage=="Oral" & Outcome=="FY.Oral.Book")
) %>%
mutate(Month = factor(Month, labels=strtrim(month.name, 3))) %>%
group_by(GiftSize, Stage, Month, Booked.In.FY) %>%
summarise(Count=sum(Count)) %>%
na.omit()
# Reorder months to be in fiscal order
ggdat$Month <- factor(ggdat$Month, levels(ggdat$Month)[c(7:12,1:6)])
# Add proportions and CIs
ggdat <- ggdat %>% mutate(n = sum(Count), Prop = Count/n, ci = sqrt(Prop*(1-Prop)/n))
# # Plot counts by month
# ggplot(ggdat, aes(x=Month, y=Count, color=Booked.In.FY, group=Booked.In.FY)) + geom_point() + geom_line(alpha=.5) + facet_grid(Stage~GiftSize) + theme(text=element_text(size=12), axis.text.x=element_text(angle=90, vjust=.4))
# Plot prop by month
ggplot(ggdat, aes(x=Month, y=Prop, color=Booked.In.FY, fill=Booked.In.FY, group=Booked.In.FY)) + geom_point() + geom_line(alpha=.5) + geom_ribbon(aes(ymin=Prop-ci, ymax=Prop+ci), color=NA, alpha=.15) + labs(y="Percent") + theme(text=element_text(size=12), axis.text.x=element_text(angle=90, vjust=.4)) + facet_grid(Stage~GiftSize) + scale_y_continuous(labels=percent)
```
Confidence bands are based on the standard error for a binomial random variable, $\sqrt{\frac{p(1-p)}{n}}$ (assuming independence). The center column (MG) represents gifts in the \$25k to \$5M range and is our focus. Note that Ask and Oral do behave very differently.
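The band half-width is just the plug-in binomial standard error; a minimal standalone sketch (illustrative helper name, not part of the report's sourced functions):

```r
# Sample proportion with +/- one standard error band, assuming independent
# Bernoulli trials: se = sqrt(p * (1 - p) / n)
prop_ci <- function(count, n) {
  p  <- count / n
  se <- sqrt(p * (1 - p) / n)
  c(prop = p, lower = p - se, upper = p + se)
}

prop_ci(30, 120)  # 25% close rate, band roughly 21% to 29%
```

Note these are one-standard-error bands, matching the `Prop - ci`/`Prop + ci` ribbon in the plot code, not 95% intervals.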
# Variable selection
In order to create a drop-in JMod replacement, the set of covariates should be restricted to fields already included in the monthly solicitation pipeline reports, plus predictors derived from them. Unlike JMod, the GLM above makes no use of a solicitation's expected close date, so that is a natural first addition.
To get an idea of which covariates are most predictive, I ran them all through random forests (500 trees). We carry over stage, current fiscal month, and expected close date from JMod, and add covariates for planned ask amount, ask amount, expected amount, days between stages, and several other derived variables. Here's the out-of-bag error (similar to cross-validation) for the random forests themselves, compared to the more parsimonious models.
```{r, echo=F, message=F, warning=F, cache=F}
## Set up for Random Forests
library(randomForest)
set.seed(15584)
## Set up headers
## Obviously shouldn't include after-the-fact information (FY. objects, actual amount, etc.)
head.all <- names(mderived)[names(mderived) %in% c("Solicitation.Type.Desc", "Planned.Fiscal.Mo", "Planned.Amt", "Expected.Fiscal.Mo", "Expected.Amt", "Planned.Ask.Band")]
# Look for "plan" and "Plan" in header ([pP] = both cases), but ignore the FY. outcome objects
head.plan <- names(mderived)[setdiff(grep("[pP]lan", names(mderived)), grep("FY.", names(mderived)))]
head.plan <- setdiff(head.plan, c(head.all, "plan2clear")) #take out plan2clear which wouldn't have been known
# Look for "Clear" in header
head.clear <- names(mderived)[setdiff(grep("[cC]lear", names(mderived)), grep("FY.", names(mderived)))]
head.clear <- setdiff(head.clear, c(head.all, "clear2ask")) #take out clear2ask which wouldn't have been known
# Look for "Ask" in header
head.ask <- names(mderived)[setdiff(grep("[aA]sk", names(mderived)), grep("FY.", names(mderived)))]
head.ask <- setdiff(head.ask, c(head.all, "ask2oral")) #take out ask2oral which wouldn't have been known
head.ask <- c(head.ask, "plan2clear") #add fields that would have been known
# Look for "Oral" in header
head.oral <- names(mderived)[setdiff(grep("[oO]ral", names(mderived)), grep("FY.", names(mderived)))]
head.oral <- setdiff(head.oral, head.all)
head.oral <- c(head.oral, "plan2clear", "clear2ask", "Ask.Amt", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac") #add fields that would've been known
## Grow forests
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## Recompute errors
errors <- rbind(errors, data.frame(Model="Random forest", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1)))
kable(errors, digits=2)
```
This is a great result, of course, and represents something of a best-case cross-validation scenario. However, we're interested in explanation, not pure prediction, so random forests are not my first choice of model. They are, though, extremely useful for variable selection. Here are the variable importances for each covariate:
```{r, echo=F, message=F, warning=F, cache=F}
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .02 above the maximum in the data
top <- ceiling(max(rf.importances$MeanDecreaseAccuracy)*50)/50
ggplot(rf.importances, aes(x=Variable, y=MeanDecreaseAccuracy, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Random forest variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.02))
```
Intuitively, importance is a measure of how useful a predictor is for distinguishing between classes, so the higher on the y-axis a point falls, the better it is for prediction. Note that because each stage (color) has its own outcome variable, the stages do not share the same set of predictors, but certain predictors are comparable across stages.
For example, the distinctive double-peak pattern in each stage results from three variables related to the calendar date (`Stage.Future` = indicator for expected to close in the future, `stage2EOFY` = days left until the end of the current fiscal year, and `stage2expect` = days until the expected close date). These three variables are highly correlated -- because of how the forests are constructed (random branching) this means they all have high importance. Some observations from this plot and previous exploration:
* The `Stage.Future`, `stage2EOFY`, `stage.fiscal.mo`, and `stage2expect` variables are by far the most important. They are also highly correlated.
* Of the four, I would drop `stage2EOFY` and consider dropping either `stage.fiscal.mo` or `stage2expect` as well.
* `Expected.Amt` and `Planned.Amt` are good predictors before the Oral stage; `Planned.Ask.Band` is worse across the board.
* `Expected.Fiscal.Mo` is important across the board, while `Planned.Fiscal.Mo` is useful before the Oral stage.
* Similarly, `Solicitation.Type.Desc` is a decent predictor before the Oral stage.
* The various times between stages e.g. `plan2clear` are not good predictors.
* `Ask.Amt` is a good predictor in the Ask stage, certainly much better than `Ask.Band`; the two `Ask.Amt.Over.` fractions are useless.
Almost equally low error can be obtained with just a few predictors:
* `Solicitation.Type.Desc`, `Planned.Amt`, and `Expected.Amt` are useful across the board
* `Stage.Future` and `stage.fiscal.mo` for the corresponding stages are useful
* `Ask.Amt` is useful in the ask and oral stages
```{r, echo=F, message=F, warning=F, cache=F}
## Update the explanatory variables
exclude <- c("plan2EOFY", "clear2EOFY", "ask2EOFY", "oral2EOFY", "plan2clear", "clear2ask", "ask2oral", "Planned.Ask.Band", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac", "plan2expect", "clear2expect", "ask2expect", "oral2expect", "Expected.Fiscal.Mo", "Planned.Fiscal.Mo")
head.all <- setdiff(head.all, exclude)
head.plan <- setdiff(head.plan, exclude)
head.clear <- setdiff(head.clear, exclude)
head.ask <- setdiff(head.ask, exclude)
head.oral <- setdiff(head.oral, exclude)
## Regrow the random forests with the reduced covariate set
set.seed(48676)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## randomForest errors
errors <- rbind(errors, data.frame(Model="Random forest sm.", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1), stringsAsFactors=F))
## error table
kable(errors, digits=2)
```
# Final model
The variables from this "small" random forest (in terms of number of covariates, not number of trees or depth) were reintroduced into a second GLM. The final variables include solicitation stage, solicitation type, transformed planned ask amount, transformed expected amount, an indicator for solicitations expected to close in a future fiscal year, plus transformed ask amount in the ask and oral stages, when they would be known. Because dollar amounts are very right-skewed, I transformed them as follows:
$$ f(x) = \log_{10}(x + \alpha), \quad \alpha = 100 $$
The smallest non-zero ask amount in the data was \$100, so $\alpha$ was set to this value.
```{r, echo=F, message=F, warning=F, cache=F}
## Data transformations
# Function to transform dollar amounts as above
LogTrans <- function(x, a=100) {log10(x + a)}
# Transformed dollar amounts
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
## Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
errors <- rbind(errors, CalcErrorRates(confuse, model.name="GLM big", modnames))
# Print the error table
kable(errors, digits=2)
```
Using the variables identified by the small random forest, with a few subtractions, the "big" GLM does nearly as well (in-sample) as a random forest -- not bad at all!
Finally, here are the error rates using the withheld FY16 data:
```{r, echo=F, message=F, warning=F, cache=F}
## Import new data
datafile <- "2016-07-12 Booth solicitation history.csv" # filename for data
source("Rscript1.Merge.R")
## Take only solicitations closing between 7/1/2015 and 6/30/2016 for our out-of-sample comparison
mdat <- mdat %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015")) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2016"))
mderived <- mderived %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015")) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2016"))
# Add needed fields for GLM
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt))
confuse2 <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived, modnames=modnames, counts=F)
# Final error calculations
# Print the error table
kable(CalcErrorRates(confuse2, model.name="GLM OOS", modnames), digits=2)
```
The out-of-sample error is very much in line with the in-sample error, as hoped. Here are the models with coefficients. Note that dollar amounts are log transformed with an offset, as defined above.
### Plan stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.pln.t)$coef, digits=3)
```
### Clear stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.clr.t)$coef, digits=3)
```
### Ask stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.ask.t)$coef, digits=3)
```
### Oral stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.ora.t)$coef, digits=3)
```
# Point-in-time validation
Now that we have a satisfactory model, JMod and GLM can be compared side-by-side against historical point-in-time data.
```{r, echo=F, message=F, warning=F, cache=F}
## Load point-in-time dataset
load("pit.dat.Rdata")
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
## GLM predictions
PredictGLM <- function(model, data){
data.frame(probability = predict(model, newdata=data, type="response")) %>%
mutate(prediction = probability * data$Expected.Amt,
error = prediction - ifelse(is.na(data$Actual.Amt), 0, data$Actual.Amt) * data$Closed.In.FY,
dt = data$rpt.date,
expect = data$Expected.Amt)
}
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
# Ignore gifts over $5M
x <- 5000000
## Set up comparison data frame
ggdat <- filter(jmod.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), act = sum(actual))
## Reformat glm predictions
tmpdf <- filter(glm.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(glm.pred = sum(prediction))
## Bind tmpdf to ggdat
ggdat <- left_join(ggdat, tmpdf, by = "dt")
## Bootstrapped GLM dataset
set.seed(2358)
rep <- 1000
boot.result <- matrix(0, nrow = nrow(glm.preds), ncol = rep)
for (i in 1:rep) {
boot.result[, i] <- (glm.preds$probability >= runif(n=nrow(glm.preds), min=0, max=1)) * glm.preds$expect
}
glm.boot <- cbind(glm.preds, data.frame(boot.result))
# Summarise by date across every bootstrapped dataset
glm.boot <- filter(glm.boot, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise_each(funs(sum))
# Find empirical 2.5th, 50th, and 97.5th percentiles (95% CI)
boot.names <- paste("X", 1:rep, sep="")
ggdat <- cbind(ggdat, t(apply(glm.boot[, boot.names], MARGIN=1, FUN=quantile, probs=c(.025, .5, .975))))
## Plot the comparison
ggplot(data=ggdat, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=glm.pred, color="GLM big")) + geom_line(aes(y=glm.pred, color="GLM big")) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + geom_ribbon(aes(ymin=`2.5%`, ymax=`97.5%`), fill="green", alpha=.1) + labs(title=paste("Solicitations under", dollar(x), "with 95% GLM CI"))
```
The dots are observations and the connecting lines are point estimates for each method. The green shaded GLM confidence bands are simulated, obtained by treating each solicitation as a Bernoulli trial with probability to close $p$ as predicted by the model. 1000 trials were performed per solicitation, and results in the same month and trial were summed together. The confidence bands cover the observed .025 and .975 quantiles of the sum.
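That simulation scheme can be sketched in a few self-contained lines (toy probabilities and amounts; the report uses the fitted GLM probabilities and each solicitation's expected amount):

```r
set.seed(1)
p      <- c(0.8, 0.3, 0.5)    # predicted close probabilities
amount <- c(1e5, 5e4, 2.5e5)  # expected amounts
reps   <- 1000

# Each replicate is one simulated fiscal-year total: every solicitation
# closes independently with its own probability, contributing its amount
totals <- replicate(reps, sum((runif(length(p)) < p) * amount))

quantile(totals, probs = c(0.025, 0.5, 0.975))  # empirical 95% band
```

In the report, results are additionally grouped by report month before taking quantiles, so each month gets its own band.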
GLM is clearly superior until the beginning of 2015; after that it's more of a toss-up. The dataset did not include any solicitations closing after 2/2/2016, so I compiled some more historical data and re-created this plot using all available FY16 data.
```{r, echo=F, message=F, warning=F, cache=F}
## Generate FY16 point-in-time dataset by creating a new sols2
folder <- "CSV data"
source("Rscript - historical data generation.R")
## Look for duplicate solicitations per month; only use the first in each month, and only closed sols
sols2 <- sols2 %>% mutate(m = month(rpt.date)) %>% arrange(rpt.date) %>% distinct(Solicitation.ID, m)
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
sols3$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
plan.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
clear.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
ask.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
oral.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
paper.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
## Function to do my plots
SideBySidePlots <- function(x){
## Set up comparison data frame
ggdat <- filter(jmod.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), act = sum(actual))
## Reformat glm predictions
tmpdf <- filter(glm.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(glm.pred = sum(prediction))
## Bind tmpdf to ggdat
ggdat <- left_join(ggdat, tmpdf, by = "dt")
## Bootstrapped GLM dataset
set.seed(2358)
rep <- 1000
boot.result <- matrix(0, nrow = nrow(glm.preds), ncol = rep)
for (i in 1:rep) {
boot.result[, i] <- (glm.preds$probability >= runif(n=nrow(glm.preds), min=0, max=1)) * glm.preds$expect
}
glm.boot <- cbind(glm.preds, data.frame(boot.result)) %>% select(-id)
# Summarise by date across every bootstrapped dataset
glm.boot <- filter(glm.boot, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise_each(funs(sum))
# Find empirical 2.5th, 50th, and 97.5th percentiles (95% CI)
boot.names <- paste("X", 1:rep, sep="")
ggdat <- cbind(ggdat, t(apply(glm.boot[, boot.names], MARGIN=1, FUN=quantile, probs=c(.025, .5, .975))))
## Debugging
# jmod.preds$id <- sols3$id
# write.table(left_join(left_join(sols3, glm.preds, by="id"), jmod.preds, by="id"), "tmp.txt", row.names=F, sep="\t")
## Plot the comparison
ggplot(data=ggdat, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=glm.pred, color="GLM big")) + geom_line(aes(y=glm.pred, color="GLM big")) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + geom_ribbon(aes(ymin=`2.5%`, ymax=`97.5%`), fill="green", alpha=.1) + labs(title=paste("Solicitations under", dollar(x), "with 95% GLM CI"))
}
## Ignore gifts over $5M
SideBySidePlots(5000000)
```
Not a home run by any means, though this provides more evidence that JMod is too pessimistic in the second half of the year. But here's the interesting part:
```{r, echo=F, message=F, warning=F, cache=F}
## Increase x
g1 <- SideBySidePlots(10000000)
g2 <- SideBySidePlots(25000000)
g3 <- SideBySidePlots(50000000)
grid.arrange(g1, g2, g3, nrow=3)
```
As I include increasingly large gifts, the GLM does better and better, implying that this method *successfully models principal gifts* in the 7/1/2015--6/30/2016 time period! Luck, or skill? Here is the comparison for all past years using only the actual historical "SOL & Future FRP Projections" files available at the time.
```{r, echo=F, message=F, warning=F, cache=F}
## Generate FY16 point-in-time dataset by creating a new sols2
folder <- "CSV data 2"
source("Rscript - historical data generation.R")
## Look for duplicate solicitations per month; only use the first in each month, and only closed sols
sols2 <- sols2 %>% mutate(m = month(rpt.date)) %>% arrange(rpt.date) %>% distinct(Solicitation.ID, m, .keep_all = TRUE)  # .keep_all retains the other columns needed below
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
sols3$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
plan.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
clear.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
ask.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
oral.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
paper.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
SideBySidePlots(50000000)
```
Some of the principal gifts have been in the works since FY14 and didn't come in until FY16. As a result, GLM overestimated dollars booked in FY14 and FY15, but was almost spot on in FY16.
In summary, this model does very well predicting the probability that a solicitation will close, but not the expected solicitation amount. Formulated statistically, recall that $E(FRP)=\sum_{i} P(Sol_{i}=1)E(Sol_{i})$. From the first part, $P(Sol_{i}=1)$ is estimated accurately, but as seen from the point-in-time charts, multiplying this probability by $E(Sol_{i})$ taken from `Expected.Amt` is not a good model for $E(FRP)$. An alternate approach is to model $E(FRP)$ directly; depending on the method, this could be guaranteed to be statistically unbiased.
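To make the failure mode concrete: even with perfectly calibrated close probabilities, $\sum_{i} P(Sol_{i}=1)E(Sol_{i})$ computed with `Expected.Amt` standing in for $E(Sol_{i})$ overstates $E(FRP)$ whenever gifts tend to close below their stated expected amounts. A toy simulation (the 80% shortfall factor is an illustrative assumption, not an estimate from this data):

```r
set.seed(2)
n <- 1000
p <- runif(n)                # true close probabilities (calibrated by construction)
expect <- 10^runif(n, 4, 6)  # stated expected amounts
shortfall <- 0.8             # closed gifts average 80% of expectation (toy assumption)
closed <- runif(n) < p
actual <- ifelse(closed, expect * shortfall, 0)
sum(p * expect)  # probability-weighted expected amounts: biased high
sum(actual)      # realized total: about 80% of the figure above
```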
# JMod tweaking
In the past, JMod was tweaked at the end of the year based on how it had over- or under-estimated dollars booked. In comparing JMod to the basic GLM considering only stage and month, I noted two big opportunities:
1. The ask and oral stages need to be treated differently.
2. MG $(\$25\text{k} \le X < \$5\text{M})$ and PG $(\$5\text{M} \le X)$ need to be treated differently.
I'll test each of the following easy-to-implement tweaks on the compiled data:
1. Fixed oral discounting of 2/3 across the board
2. Fixed oral discounting of 2/3 across the board, and set post-April discounts to their April values
3. Same as #1, but exclude PG solicitations
4. Same as #2, but exclude PG solicitations
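The actual implementations (`JMod_Oral` and friends) live in the project's R scripts; the core of tweak #1 is just a flat discount on Oral-stage expectations, roughly like this (hypothetical helper for illustration only):

```r
# Sketch of tweak #1: flat 2/3 discount on Oral-stage (stage 4) expected amounts;
# other stages pass through unchanged. Hypothetical helper, not the real JMod_Oral.
DiscountOral <- function(curr.stage, expect.amt, discount = 2/3) {
  ifelse(curr.stage == 4, expect.amt * discount, expect.amt)
}
DiscountOral(curr.stage = c(3, 4), expect.amt = c(100000, 300000))
# the Oral row is discounted to 200000; the stage-3 row passes through
```

Tweak #2 additionally freezes the month-based discount schedule after April instead of letting it keep shrinking through June.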
```{r, echo=F, message=F, warning=F, cache=F}
## Original method
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>%
mutate(dt = sols3$rpt.date, expect = sols3$Expected.Amt)
jmod.preds$id <- sols3$id
## Method 1
jmod.preds.1 <- JMod_Oral(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm1.prediction = prediction)
## Method 2
jmod.preds.2 <- JMod_Oral_2step(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm2.prediction = prediction)
## Method 5
jmod.preds.5 <- JMod_Oral_2step_75(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm5.prediction = prediction)
## Combine
jmod.preds <- jmod.preds %>% bind_cols(jmod.preds.1) %>% bind_cols(jmod.preds.2) %>% bind_cols(jmod.preds.5)
jmod.preds <- full_join(jmod.preds, glm.preds %>% select(glm.pred = prediction, id), by="id")
## Summarized data, any solicitation amount
ggdat1 <- jmod.preds %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), pred1 = sum(jm1.prediction), pred2 = sum(jm2.prediction), pred5 = sum(jm5.prediction), glm.pred = sum(glm.pred), act = sum(actual))
## Summarized data, sub-$5M only
ggdat2 <- filter(jmod.preds, expect < 5000000) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), pred1 = sum(jm1.prediction), pred2 = sum(jm2.prediction), pred5 = sum(jm5.prediction), glm.pred = sum(glm.pred), act = sum(actual))
## JMod point-in-time plots, any solicitation amount
ggplot(data=ggdat1, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred1, color="JMod #1")) + geom_line(aes(y=pred1, color="JMod #1"), alpha=.5) + geom_point(aes(y=pred2, color="JMod #2")) + geom_line(aes(y=pred2, color="JMod #2"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, any solicitation amount")
## JMod point-in-time plots, sub-$5M only
ggplot(data=ggdat2, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred1, color="JMod #3")) + geom_line(aes(y=pred1, color="JMod #3"), alpha=.5) + geom_point(aes(y=pred2, color="JMod #4")) + geom_line(aes(y=pred2, color="JMod #4"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, solicitations up to $5M")
```
These tweaks around the edges are directionally helpful, but they barely make a difference to the overall predictions; the JMod variants are just a little less negatively biased at the end of each year.
Let's try one more tweak: less aggressive oral-stage discounting (75%).
5. Fixed oral discounting of 3/4 across the board, and set post-April discounts to their April values
6. Same as #5, but exclude PG solicitations
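The "freeze after April" component of tweaks #2 and #5 can be sketched the same way: with July as fiscal month 1, April is fiscal month 10, and May and June simply reuse April's discount (the schedule values below are made up):

```r
# Hypothetical declining monthly discount schedule, fiscal months 1 (Jul) to 12 (Jun)
discounts <- seq(0.95, 0.40, length.out = 12)
# Frozen version: months after April (fiscal month 10) reuse April's value
FrozenDiscount <- function(fiscal.mo) discounts[pmin(fiscal.mo, 10)]
FrozenDiscount(9:12)  # the last two entries equal the April (month 10) discount
```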
```{r, echo=F, message=F, warning=F, cache=F}
## JMod point-in-time plots, any solicitation amount
ggplot(data=ggdat1, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred5, color="JMod #5")) + geom_line(aes(y=pred5, color="JMod #5"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, any solicitation amount")
## JMod point-in-time plots, sub-$5M only
ggplot(data=ggdat2, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred5, color="JMod #6")) + geom_line(aes(y=pred5, color="JMod #6"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, solicitations up to $5M")
```
Again, it's hard to see much of a difference outside the last few months of each year. Here's a summary of the count of solicitations expected in the current fiscal year by stage, on average:
```{r, echo=F, message=F, warning=F, cache=F}
tmp <- sols3 %>% mutate(
Stage = factor(Curr.Stage, labels=c("Plan", "Clear", "Ask", "Oral", "Paperwork")),
Month = month(rpt.date) %>% factor(labels=strtrim(month.name, 3)),
Year = rpt.date %>% YToFY()) %>%
filter(!Expect.In.Future.FY)
tmp$Month <- factor(tmp$Month, levels(tmp$Month)[c(7:12,1:6)])
ggdat <- tmp %>% select(Stage, Month) %>% mutate(Count = 1) %>% group_by(Stage, Month) %>% summarise(Count = sum(Count)/length(table(tmp$Year)))
ggdat %>% ggplot(aes(x=Month, y=Count, color=Stage, group=Stage)) + geom_point(alpha=.5) + geom_line(alpha=.5) + labs(title="Mean yearly count of expected solicitations by month")
#ggdat %>% select(Count, Month, Stage) %>% xtabs(Count ~ Stage + Month, data=.) %>% kable()
```
And the associated expected dollar amounts:
```{r, echo=F, message=F, warning=F, cache=F}
ggdat <- tmp %>% group_by(Stage, Month) %>% summarise(Amount = sum(Expected.Amt)/length(table(tmp$Year)))
ggdat %>% ggplot(aes(x=Month, y=Amount, color=Stage, group=Stage)) + geom_point(alpha=.5) + geom_line(alpha=.5) + scale_y_log10(labels=scales::dollar, breaks=10^(0:7)) + labs(title="Mean yearly sum of expected amounts by month")
#ggdat %>% select(Amount, Month, Stage) %>% xtabs(Amount ~ Stage + Month, data=.) %>% kable()
```
The dollar amounts show that tweaks to Oral just won't have much of an impact, and the counts and dollar amounts together show that tweaks to Plan and Clear from February onward won't have much of an impact either.
# Directly modeling dollar amounts
The preferred method for finding $E(FRP)$ is to model it directly. This is challenging given the right-skewness of `Actual.Amt` and the large number of zero values.
```{r, echo=F, message=F, warning=F, cache=F}
# Re-import mderived
source("Rscript1.Merge.R")
# Untransformed, < $1M
mderived %>% filter(Actual.Amt < 1000000) %>% ggplot(aes(x=Actual.Amt, y=..density..)) + geom_histogram(alpha=.5, bins=25) + geom_density() + scale_x_continuous(labels=scales::dollar) + labs(title="Density of solicitation actual amount (under $1M)")
# Log transformed; remove $0 amounts
mderived %>% filter(Actual.Amt > 0) %>% ggplot(aes(x=Actual.Amt, y=..density..)) + geom_histogram(alpha=.5, bins=25) + geom_density() + scale_x_log10(labels=scales::dollar) + labs(title=expression(paste("Density of ", log[10], " solicitation actual amount")))
```
The $\text{log}_{10}$ transformation is nearly symmetric after excluding the $0 amounts, but it's unreasonable to leave those solicitations out of the analysis. Two possibilities:
1. Model the expected solicitation close amount $E(Sol_{i}\text{ value}|Sol_{i}\text{ closed})$ rather than taking `Expected.Amt` at face value.
2. Log transform the data, map all negative values to 0, and use a model such as the Tobit that allows for censored data.
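Possibility 2 can be prototyped with `tobit()` from the `AER` package (a convenience wrapper around `survival::survreg`), here on simulated data with the same left-censoring at 0 that the log transform produces. This assumes `AER` is available; the variable scales are chosen to mimic $\text{log}_{10}$ dollar amounts:

```r
library(AER)  # provides tobit() for censored-outcome regression
set.seed(3)
x <- runif(200, 3, 7)                        # stand-in for log10(Expected.Amt)
y.star <- -2 + 0.9 * x + rnorm(200, sd = 1)  # latent log10 actual amount
y <- pmax(y.star, 0)                         # left-censored at 0, like log_censor()
fit <- tobit(y ~ x, left = 0)
coef(fit)  # slope recovered near the true 0.9 despite the censoring
summary(lm(y ~ x))$coef["x", 1]  # naive OLS on the censored y attenuates the slope
```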
### Modeling close amounts
We can check whether the high-importance variables for predicting whether gifts close in the first place are still important for predicting the amount at which they close. Begin by $\text{log}_{10}$ transforming `Actual.Amt` and dropping all values $\le 0$; this restricts the data to the conditional distribution behind $E(Sol_{i}\text{ value}|Sol_{i}\text{ closed})$, as required. Now, growing regression forests on the transformed gift amounts:
```{r, echo=F, message=F, warning=F, cache=F}
# Censored log transformation
log_censor <- function(x, base=exp(1)) {
x <- ifelse(log(x, base=base) > 0, log(x, base=base), 0)
return(x)
}
# Data setup; only use through end of FY15 for fair comparisons
holdout <- mderived %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015"))
mderived <- mderived %>% mutate(Actual.Amt.lt = Actual.Amt %>% log_censor(base=10)) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
lmdat <- mderived %>% filter(Actual.Amt > 0)
## Random forests
library(randomForest)
set.seed(51906)
# Planned stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.plan)])
rf.plan <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.clear)])
rf.clear <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.ask)])
rf.ask <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.oral)])
rf.oral <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .05 above the maximum in the data
top <- ceiling(max(rf.importances$X.IncMSE)*20)/20
ggplot(rf.importances, aes(x=Variable, y=X.IncMSE, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Random forest variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.05))
```
* `Expected.Amt` is extremely important. `Planned.Amt` and `Ask.Amt` are also helpful where available.
* `Solicitation.Type.Desc` is slightly helpful in the Plan and Clear stages, but not afterward.
* The future variables, and the month the indicated stage was reached, are not helpful as continuous covariates in the tree method. This makes sense, given that we're conditioning on the solicitation having closed.
Using the `leaps` package for variable selection:
```{r, echo=F, message=F, warning=F, cache=F}
library(leaps)
# Plan
lm.leaps <- lmdat %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Plan.Future + planning.fiscal.mo + Sol.Type.Agg, data=., nvmax=16)
rsubs1 <- summary(lm.leaps)
rsubsdf <- data.frame(model="Plan", bic=rsubs1$bic, p=(1:length(rsubs1$bic)))
# Clear
lm.leaps <- lmdat %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + clear.fiscal.mo + Clear.Future, data=., nvmax=16)
rsubs2 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Clear", bic=rsubs2$bic, p=(1:length(rsubs2$bic))))
# Ask
lm.leaps <- lmdat %>% filter(Ask.Amt > 0) %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + log10(Ask.Amt) + ask.fiscal.mo + Ask.Future, data=., nvmax=16)
rsubs3 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Ask", bic=rsubs3$bic, p=(1:length(rsubs3$bic))))
# Oral
lm.leaps <- lmdat %>% filter(Ask.Amt > 0) %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + log10(Ask.Amt) + oral.fiscal.mo + Oral.Future, data=., nvmax=16)
rsubs4 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Oral", bic=rsubs4$bic, p=(1:length(rsubs4$bic))))
# Plot BIC
rsubsdf %>% ggplot(aes(x=p, y=bic, color=model)) + geom_point(alpha=.5) + geom_line(alpha=.25) + labs(title="BIC for different model sizes") + facet_grid(. ~ model)
```
Across the board, two predictors are optimal. Let's see which they are.
```{r, echo=F, message=F, warning=F, cache=F}
tmp <- colnames(rsubs1$which)[rsubs1$which[2, ]]
tmp <- tmp %>% rbind(colnames(rsubs2$which)[rsubs2$which[2, ]])
tmp <- tmp %>% rbind(colnames(rsubs3$which)[rsubs3$which[2, ]])
tmp <- tmp %>% rbind(colnames(rsubs4$which)[rsubs4$which[2, ]])
rownames(tmp) <- c("Plan", "Clear", "Ask", "Oral")
colnames(tmp) <- c("Intercept", "V1", "V2")
tmp %>% kable()
```
For simplicity's sake, we can use the transformed `Expected.Amt` and the corresponding stage's `Future` indicator in each of the models, plus their interaction. Now, calculating predicted amounts for the held-out FY16 data:
```{r, echo=F, message=F, warning=F, cache=F}
# Modeling with all data
lm.p <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Plan.Future, data=.)
lm.c <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Clear.Future, data=.)
lm.a <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Ask.Future, data=.)
lm.o <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Oral.Future, data=.)
# Predicted log10 actual amounts
holdout <- holdout %>% mutate(
lmpreds.pln = predict(lm.p, newdata=holdout),
lmpreds.pln = ifelse(lmpreds.pln == -Inf, 0, lmpreds.pln),
lmpreds.clr = predict(lm.c, newdata=holdout),
lmpreds.clr = ifelse(lmpreds.clr == -Inf, 0, lmpreds.clr),
lmpreds.ask = predict(lm.a, newdata=holdout),
lmpreds.ask = ifelse(lmpreds.ask == -Inf, 0, lmpreds.ask),
lmpreds.ora = predict(lm.o, newdata=holdout),
lmpreds.ora = ifelse(lmpreds.ora == -Inf, 0, lmpreds.ora)
)
```