---
title: "Booth Solicitations Write-up"
author: "Paul Hively"
date: "June 16, 2016"
output: html_document
---
# Problem statement
My goal is to find a way to accurately forecast fundraising revenue.
The existing model ("JMod") has two components. First, the probability a solicitation $Sol_{i}$ is successfully booked before the end of the fiscal year is a function of the current stage progress $Stage$ and time until the end of the year $t$, $P(Sol_{i}=1)=f(Stage_{i},t)$. The expected FRP in the current fiscal year is this probability times the expected amount, $E(FRP)=\sum_{i} P(Sol_{i}=1)E(Sol_{i})$, discounted to 0 if the expected close date is after the end of the fiscal year.
The probabilities are adjusted from year to year; these were the weights used in fiscal year 2015, 7/1/14 -- 6/30/15.
| Stage | July-Feb| Mar-Apr | May | June |
|----------|:-------:|:-------:|:-------:|:-------:|
| Plan | 1/6 | 1/8 | 0 | 0 |
| Clear | 1/3 | 1/6 | 1/8 | 0 |
| Ask | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork| 1 | 1 | 1 | 1 |
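The table lookup and discounting rule above can be sketched directly in R (illustrative only, and not evaluated as part of this report; `jmod_weights` and `jmod_expected_frp` are hypothetical names, as the actual implementation is sourced from `f.JMod.R`):

```r
# Stage-by-period weight table, as in FY15
jmod_weights <- matrix(
  c(1/6, 1/8, 0,   0,
    1/3, 1/6, 1/8, 0,
    2/3, 1/3, 1/6, 1/8,
    2/3, 1/3, 1/6, 1/8,
    1,   1,   1,   1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Plan", "Clear", "Ask", "Oral", "Paperwork"),
                  c("Jul-Feb", "Mar-Apr", "May", "Jun"))
)

# Expected FRP: look up each solicitation's weight, zero it out if the
# expected close date falls after fiscal year end, then sum weight * amount
jmod_expected_frp <- function(stage, period, expected_amt, expected_dt, fy_end) {
  p <- jmod_weights[cbind(stage, period)]
  p[expected_dt > fy_end] <- 0
  sum(p * expected_amt)
}
```

For example, a \$100k Ask-stage solicitation reported in March contributes 1/3 of its expected amount to the forecast, and nothing at all once its expected close date moves past fiscal year end.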
I investigated the following:
* How accurate is this method?
* Is each of these variables (expected amount, current date, expected date) predictive?
* Is there an alternate model for $P(Sol_{i}=1)$ that performs better?
* Are there additional variables that will improve the model's accuracy?
# Data
The response variable is whether a solicitation was booked by the end of the fiscal year given its state at some earlier point. Covariates include current stage, current date, expected close date, planned ask amount, etc.
For model fitting and comparison, data was compiled on 1089 solicitations opened and closed between 7/1/2011 and 6/30/2015 with known outcome (booked, refused, cancelled).
For initial model validation, point-in-time data was compiled from 153 historical reports run between 8/1/2011 and 2/2/2016. One solicitation can be included in consecutive months provided that it hasn't yet closed.
Final model comparisons include out-of-sample data compiled from reports run between 7/1/2015 and 6/30/2016.
# Model structure
The most straightforward statistical model extends JMod, assuming that the probability a solicitation closes is some function of the solicitation stage and time of year $t$ -- in this case, fiscal month as a factor. The logistic regression model is:
$$ logit(P(Sol_{i}=1 | Stage_{j})) = \mu_{j} + t_{i} + \epsilon_{i} $$
Note that this does not include any information about expected close date. Comparing this approach to JMod using the 1089-row dataset (i.e. an in-sample test) yields the following error rates:
```{r, echo=F, message=F, warning=F, cache=F}
## Load libraries and functions
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
## Retrieve data
source("Rscript1b - derived variables.R")
## Back up mdat for use later
mdat.bak <- mdat
## Exclude solicitations closing after 6/30/2015 so we have a fair out-of-sample comparison later
mdat <- mdat %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
mderived <- mderived %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
## Error rates for JMod
jmod.1 <- JMod(curr.stage=1, curr.dt=mdat$Planning.Dt, expected.dt=mdat$Expected.Dt)
jmod.1$errtable <- ConfusionMatrixWeighted(preds=jmod.1$prob, truth=mdat$FY.Plan.Book, dimnames=c("truth", "pred"))
jmod.2 <- JMod(curr.stage=2, curr.dt=mdat$Clear.Dt, expected.dt=mdat$Expected.Dt)
jmod.2$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.2$prob), truth=na.omit(mdat$FY.Clear.Book), dimnames=c("truth", "pred"))
jmod.3 <- JMod(curr.stage=3, curr.dt=mdat$Ask.Dt, expected.dt=mdat$Expected.Dt)
jmod.3$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.3$prob), truth=na.omit(mdat$FY.Ask.Book), dimnames=c("truth", "pred"))
jmod.4 <- JMod(curr.stage=4, curr.dt=mdat$Oral.Dt, expected.dt=mdat$Expected.Dt) # Ask and Oral are considered the same stage
jmod.4$errtable <- ConfusionMatrixWeighted(preds=na.omit(jmod.4$prob), truth=na.omit(mdat$FY.Oral.Book), dimnames=c("truth", "pred"))
errors <- data.frame(Model="JMod", Plan=1-sum(diag(jmod.1$errtable))/sum(jmod.1$errtable), Clear=1-sum(diag(jmod.2$errtable))/sum(jmod.2$errtable), Ask=1-sum(diag(jmod.3$errtable))/sum(jmod.3$errtable), Oral=1-sum(diag(jmod.4$errtable))/sum(jmod.4$errtable))
## Error rates for basic GLM
## Model each stage by when it was asked
glm.pln.1 <- glm(FY.Plan.Book ~ as.factor(month(mdat$Planning.Dt)), data=mdat, family=binomial())
glm.clr.1 <- glm(FY.Clear.Book ~ as.factor(month(mdat$Clear.Dt)), data=mdat, family=binomial())
glm.ask.1 <- glm(FY.Ask.Book ~ as.factor(month(mdat$Ask.Dt)), data=mdat, family=binomial())
glm.ora.1 <- glm(FY.Oral.Book ~ as.factor(month(mdat$Oral.Dt)), data=mdat, family=binomial())
modnames <- c("Plan","Clear","Ask","Oral")
confuse <- ConfusionMatrix(list(glm.pln.1, glm.clr.1, glm.ask.1, glm.ora.1), modnames=modnames, counts=F)
errors <- rbind(errors, CalcErrorRates(confuse, model.name="GLM"))
## Table output
kable(errors, digits=2)
```
These are fairly close, except that using the simple logistic regression, the Oral error rates are *much* lower. This suggests that at the least, the Ask and Oral stages should be treated differently. This can be confirmed by examining close rates in the underlying data by month and stage:
```{r, echo=F, message=F, warning=F, cache=F}
# Data manipulation
ggdat <- mdat %>%
mutate(
GiftSize = ifelse(Ask.Amt >= 5000000, "PG",
ifelse(Ask.Amt >= 25000, "MG",
ifelse(Ask.Amt < 25000, "AF",
"Unasked"))),
Plan = as.numeric(month(Planning.Dt)),
Clear = as.numeric(month(Clear.Dt)),
Ask = as.numeric(month(Ask.Dt)),
Oral = as.numeric(month(Oral.Dt)),
Count = 1
) %>%
select(Count, Plan, Clear, Ask, Oral, FY.Plan.Book, FY.Clear.Book, FY.Ask.Book, FY.Oral.Book, GiftSize) %>%
gather("Stage", "Month", 2:5, factor_key=T) %>%
gather("Outcome", "Booked.In.FY", 2:5) %>%
# Need to get rid of mismatched rows, e.g. where Stage=Ask and Outcome=FY.Clear.Book
filter(
(Stage=="Plan" & Outcome=="FY.Plan.Book") |
(Stage=="Clear" & Outcome=="FY.Clear.Book") |
(Stage=="Ask" & Outcome=="FY.Ask.Book") |
(Stage=="Oral" & Outcome=="FY.Oral.Book")
) %>%
mutate(Month = factor(Month, labels=strtrim(month.name, 3))) %>%
group_by(GiftSize, Stage, Month, Booked.In.FY) %>%
summarise(Count=sum(Count)) %>%
na.omit()
# Reorder months to be in fiscal order
ggdat$Month <- factor(ggdat$Month, levels(ggdat$Month)[c(7:12,1:6)])
# Add proportions and CIs
ggdat <- ggdat %>% mutate(n = sum(Count), Prop = Count/n, ci = sqrt(Prop*(1-Prop)/n))
# # Plot counts by month
# ggplot(ggdat, aes(x=Month, y=Count, color=Booked.In.FY, group=Booked.In.FY)) + geom_point() + geom_line(alpha=.5) + facet_grid(Stage~GiftSize) + theme(text=element_text(size=12), axis.text.x=element_text(angle=90, vjust=.4))
# Plot prop by month
ggplot(ggdat, aes(x=Month, y=Prop, color=Booked.In.FY, fill=Booked.In.FY, group=Booked.In.FY)) + geom_point() + geom_line(alpha=.5) + geom_ribbon(aes(ymin=Prop-ci, ymax=Prop+ci), color=NA, alpha=.15) + labs(y="Percent") + theme(text=element_text(size=12), axis.text.x=element_text(angle=90, vjust=.4)) + facet_grid(Stage~GiftSize) + scale_y_continuous(labels=percent)
```
Confidence bands are based on the standard error for a binomial random variable, $\sqrt{\frac{p(1-p)}{n}}$ (assuming independence). The center column (MG) represents gifts in the \$25k to \$5M range and is our focus. Note that Ask and Oral do behave very differently.
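The band half-width is just the plug-in binomial standard error; a minimal standalone sketch (illustrative helper name, not part of the report's sourced functions):

```r
# Sample proportion with +/- one standard error band, assuming independent
# Bernoulli trials: se = sqrt(p * (1 - p) / n)
prop_ci <- function(count, n) {
  p  <- count / n
  se <- sqrt(p * (1 - p) / n)
  c(prop = p, lower = p - se, upper = p + se)
}

prop_ci(30, 120)  # 25% close rate, band roughly 21% to 29%
```

Note these are one-standard-error bands, matching the `Prop - ci`/`Prop + ci` ribbon in the plot code, not 95% intervals.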
# Variable selection
In order to create a drop-in JMod replacement, the set of covariates should be restricted to fields already included in the monthly solicitation pipeline reports, plus predictors derived from them. Unlike JMod, the GLM above makes no use of a solicitation's expected close date, so that is a natural first addition.
To get an idea of which covariates are most predictive, I ran them all through random forests (500 trees). We carry over stage, current fiscal month, and expected close date from JMod, and add covariates for planned ask amount, ask amount, expected amount, days between stages, and several other derived variables. Here's the out-of-bag error (similar to cross-validation) for the random forests themselves, compared to the more parsimonious models.
```{r, echo=F, message=F, warning=F, cache=F}
## Set up for Random Forests
library(randomForest)
set.seed(15584)
## Set up headers
## Obviously shouldn't include after-the-fact information (FY. objects, actual amount, etc.)
head.all <- names(mderived)[names(mderived) %in% c("Solicitation.Type.Desc", "Planned.Fiscal.Mo", "Planned.Amt", "Expected.Fiscal.Mo", "Expected.Amt", "Planned.Ask.Band")]
# Look for "plan" and "Plan" in header ([pP] = both cases), but ignore the FY. outcome objects
head.plan <- names(mderived)[setdiff(grep("[pP]lan", names(mderived)), grep("FY.", names(mderived)))]
head.plan <- setdiff(head.plan, c(head.all, "plan2clear")) #take out plan2clear which wouldn't have been known
# Look for "Clear" in header
head.clear <- names(mderived)[setdiff(grep("[cC]lear", names(mderived)), grep("FY.", names(mderived)))]
head.clear <- setdiff(head.clear, c(head.all, "clear2ask")) #take out clear2ask which wouldn't have been known
# Look for "Ask" in header
head.ask <- names(mderived)[setdiff(grep("[aA]sk", names(mderived)), grep("FY.", names(mderived)))]
head.ask <- setdiff(head.ask, c(head.all, "ask2oral")) #take out ask2oral which wouldn't have been known
head.ask <- c(head.ask, "plan2clear") #add fields that would have been known
# Look for "Oral" in header
head.oral <- names(mderived)[setdiff(grep("[oO]ral", names(mderived)), grep("FY.", names(mderived)))]
head.oral <- setdiff(head.oral, head.all)
head.oral <- c(head.oral, "plan2clear", "clear2ask", "Ask.Amt", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac") #add fields that would've been known
## Grow forests
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## Recompute errors
errors <- rbind(errors, data.frame(Model="Random forest", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1)))
kable(errors, digits=2)
```
This is a great result, of course, and represents something of a best-case cross-validation scenario. However, we're interested in explanation, not pure prediction, so random forests are not my first choice of model. They are, though, extremely useful for variable selection. Here are the variable importances for each covariate:
```{r, echo=F, message=F, warning=F, cache=F}
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .02 above the maximum in the data
top <- ceiling(max(rf.importances$MeanDecreaseAccuracy)*50)/50
ggplot(rf.importances, aes(x=Variable, y=MeanDecreaseAccuracy, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Random forest variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.02))
```
Intuitively, importance is a measure of how useful a predictor is for distinguishing between classes, so the higher on the y-axis a point falls, the better it is for prediction. Note that because each stage (color) has its own outcome variable, the stages do not share the same set of predictors, but certain predictors are comparable across stages.
For example, the distinctive double-peak pattern in each stage results from three variables related to the calendar date (`Stage.Future` = indicator for expected to close in the future, `stage2EOFY` = days left until the end of the current fiscal year, and `stage2expect` = days until the expected close date). These three variables are highly correlated -- because of how the forests are constructed (random branching) this means they all have high importance. Some observations from this plot and previous exploration:
* The `Stage.Future`, `stage2EOFY`, `stage.fiscal.mo`, and `stage2expect` variables are by far the most important. They are also highly correlated.
* Of the four, I would drop `stage2EOFY` and consider dropping either `stage.fiscal.mo` or `stage2expect` as well.
* `Expected.Amt` and `Planned.Amt` are good predictors before the Oral stage; `Planned.Ask.Band` is worse across the board.
* `Expected.Fiscal.Mo` is important across the board, while `Planned.Fiscal.Mo` is useful before the Oral stage.
* Similarly, `Solicitation.Type.Desc` is a decent predictor before the Oral stage.
* The various times between stages e.g. `plan2clear` are not good predictors.
* `Ask.Amt` is a good predictor in the Ask stage, certainly much better than `Ask.Band`; the two `Ask.Amt.Over.` fractions are useless.
Almost equally low error can be obtained with just a few predictors:
* `Solicitation.Type.Desc`, `Planned.Amt`, and `Expected.Amt` are useful across the board
* `Stage.Future` and `stage.fiscal.mo` for the corresponding stages are useful
* `Ask.Amt` is useful in the ask and oral stages
```{r, echo=F, message=F, warning=F, cache=F}
## Update the explanatory variables
exclude <- c("plan2EOFY", "clear2EOFY", "ask2EOFY", "oral2EOFY", "plan2clear", "clear2ask", "ask2oral", "Planned.Ask.Band", "Ask.Band", "Ask.Amt.Over.Pln.Amt", "Ask.Amt.Over.Pln.Fac", "plan2expect", "clear2expect", "ask2expect", "oral2expect", "Expected.Fiscal.Mo", "Planned.Fiscal.Mo")
head.all <- setdiff(head.all, exclude)
head.plan <- setdiff(head.plan, exclude)
head.clear <- setdiff(head.clear, exclude)
head.ask <- setdiff(head.ask, exclude)
head.oral <- setdiff(head.oral, exclude)
## Regrow the random forests with the reduced covariate set
set.seed(48676)
# Planned stage forest
treedat <- na.omit(mderived[,c("FY.Plan.Book", head.all, head.plan)])
rf.plan <- randomForest(factor(FY.Plan.Book) ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(mderived[,c("FY.Clear.Book", head.all, head.clear)])
rf.clear <- randomForest(factor(FY.Clear.Book) ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(mderived[,c("FY.Ask.Book", head.all, head.ask)])
rf.ask <- randomForest(factor(FY.Ask.Book) ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(mderived[,c("FY.Oral.Book", head.all, head.oral)])
rf.oral <- randomForest(factor(FY.Oral.Book) ~ ., data=treedat, importance=T)
## randomForest errors
errors <- rbind(errors, data.frame(Model="Random forest sm.", Plan=tail(rf.plan$err.rate[,"OOB"],1), Clear=tail(rf.clear$err.rate[,"OOB"],1), Ask=tail(rf.ask$err.rate[,"OOB"],1), Oral=tail(rf.oral$err.rate[,"OOB"],1), stringsAsFactors=F))
## error table
kable(errors, digits=2)
```
# Final model
The variables from this "small" random forest (in terms of number of covariates, not number of trees or depth) were reintroduced into a second GLM. The final variables include solicitation stage, solicitation type, transformed planned ask amount, transformed expected amount, an indicator for solicitations expected to close in a future fiscal year, plus transformed ask amount in the ask and oral stages, when they would be known. Because dollar amounts are very right-skewed, I transformed them as follows:
$$ f(x) = \log_{10}(x + \alpha), \quad \alpha = 100 $$
The smallest non-zero ask amount in the data was \$100, so $\alpha$ was set to this value.
```{r, echo=F, message=F, warning=F, cache=F}
## Data transformations
# Function to transform dollar amounts as above
LogTrans <- function(x, a=100) {log10(x + a)}
# Transformed dollar amounts
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
## Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
errors <- rbind(errors, CalcErrorRates(confuse, model.name="GLM big", modnames))
# Print the error table
kable(errors, digits=2)
```
Using the variables identified by the small random forest, with a few subtractions, the "big" GLM does nearly as well (in-sample) as a random forest -- not bad at all!
Finally, here are the error rates using the withheld FY16 data:
```{r, echo=F, message=F, warning=F, cache=F}
## Import new data
datafile <- "2016-07-12 Booth solicitation history.csv" # filename for data
source("Rscript1.Merge.R")
## Take only solicitations closing between 7/1/2015 and 6/30/2016 for our out-of-sample comparison
mdat <- mdat %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015")) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2016"))
mderived <- mderived %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015")) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2016"))
# Add needed fields for GLM
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt))
confuse2 <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived, modnames=modnames, counts=F)
# Final error calculations
# Print the error table
kable(CalcErrorRates(confuse2, model.name="GLM OOS", modnames), digits=2)
```
The out-of-sample error is very much in line with the in-sample error, as hoped. Here are the models with coefficients. Note that dollar amounts are log transformed with an offset, as defined above.
### Plan stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.pln.t)$coef, digits=3)
```
### Clear stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.clr.t)$coef, digits=3)
```
### Ask stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.ask.t)$coef, digits=3)
```
### Oral stage
```{r, echo=F, message=F, warning=F, cache=F}
kable(summary(glm.ora.t)$coef, digits=3)
```
# Point-in-time validation
Now that we have a satisfactory model, JMod and GLM can be compared side-by-side against historical point-in-time data.
```{r, echo=F, message=F, warning=F, cache=F}
## Load point-in-time dataset
load("pit.dat.Rdata")
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
## GLM predictions
PredictGLM <- function(model, data){
data.frame(probability = predict(model, newdata=data, type="response")) %>%
mutate(prediction = probability * data$Expected.Amt,
error = prediction - ifelse(is.na(data$Actual.Amt), 0, data$Actual.Amt) * data$Closed.In.FY,
dt = data$rpt.date,
expect = data$Expected.Amt)
}
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
# Ignore gifts over $5M
x <- 5000000
## Set up comparison data frame
ggdat <- filter(jmod.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), act = sum(actual))
## Reformat glm predictions
tmpdf <- filter(glm.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(glm.pred = sum(prediction))
## Bind tmpdf to ggdat
ggdat <- left_join(ggdat, tmpdf, by = "dt")
## Bootstrapped GLM dataset
set.seed(2358)
rep <- 1000
boot.result <- matrix(0, nrow = nrow(glm.preds), ncol = rep)
for (i in 1:rep) {
boot.result[, i] <- (glm.preds$probability >= runif(n=nrow(glm.preds), min=0, max=1)) * glm.preds$expect
}
glm.boot <- cbind(glm.preds, data.frame(boot.result))
# Summarise by date across every bootstrapped dataset
glm.boot <- filter(glm.boot, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise_each(funs(sum))
# Find empirical 2.5th, 50th, and 97.5th percentiles (95% CI)
boot.names <- paste("X", 1:rep, sep="")
ggdat <- cbind(ggdat, t(apply(glm.boot[, boot.names], MARGIN=1, FUN=quantile, probs=c(.025, .5, .975))))
## Plot the comparison
ggplot(data=ggdat, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=glm.pred, color="GLM big")) + geom_line(aes(y=glm.pred, color="GLM big")) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + geom_ribbon(aes(ymin=`2.5%`, ymax=`97.5%`), fill="green", alpha=.1) + labs(title=paste("Solicitations under", dollar(x), "with 95% GLM CI"))
```
The dots are observations and the connecting lines are point estimates for each method. The green shaded GLM confidence bands are simulated, obtained by treating each solicitation as a Bernoulli trial with probability to close $p$ as predicted by the model. 1000 trials were performed per solicitation, and results in the same month and trial were summed together. The confidence bands cover the observed .025 and .975 quantiles of the sum.
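That simulation scheme can be sketched in a few self-contained lines (toy probabilities and amounts; the report uses the fitted GLM probabilities and each solicitation's expected amount):

```r
set.seed(1)
p      <- c(0.8, 0.3, 0.5)    # predicted close probabilities
amount <- c(1e5, 5e4, 2.5e5)  # expected amounts
reps   <- 1000

# Each replicate is one simulated fiscal-year total: every solicitation
# closes independently with its own probability, contributing its amount
totals <- replicate(reps, sum((runif(length(p)) < p) * amount))

quantile(totals, probs = c(0.025, 0.5, 0.975))  # empirical 95% band
```

In the report, results are additionally grouped by report month before taking quantiles, so each month gets its own band.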
GLM is clearly superior until the beginning of 2015; after that it's more of a toss-up. The dataset did not include any solicitations closing after 2/2/2016, so I compiled some more historical data and re-created this plot using all available FY16 data.
```{r, echo=F, message=F, warning=F, cache=F}
## Generate FY16 point-in-time dataset by creating a new sols2
folder <- "CSV data"
source("Rscript - historical data generation.R")
## Look for duplicate solicitations per month; only use the first in each month, and only closed sols
sols2 <- sols2 %>% mutate(m = month(rpt.date)) %>% arrange(rpt.date) %>% distinct(Solicitation.ID, m)
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
sols3$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
plan.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
clear.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
ask.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
oral.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
paper.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
## Function to do my plots
SideBySidePlots <- function(x){
## Set up comparison data frame
ggdat <- filter(jmod.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), act = sum(actual))
## Reformat glm predictions
tmpdf <- filter(glm.preds, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(glm.pred = sum(prediction))
## Bind tmpdf to ggdat
ggdat <- left_join(ggdat, tmpdf, by = "dt")
## Bootstrapped GLM dataset
set.seed(2358)
rep <- 1000
boot.result <- matrix(0, nrow = nrow(glm.preds), ncol = rep)
for (i in 1:rep) {
boot.result[, i] <- (glm.preds$probability >= runif(n=nrow(glm.preds), min=0, max=1)) * glm.preds$expect
}
glm.boot <- cbind(glm.preds, data.frame(boot.result)) %>% select(-id)
# Summarise by date across every bootstrapped dataset
glm.boot <- filter(glm.boot, expect < x) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise_each(funs(sum))
# Find empirical 2.5th, 50th, and 97.5th percentiles (95% CI)
boot.names <- paste("X", 1:rep, sep="")
ggdat <- cbind(ggdat, t(apply(glm.boot[, boot.names], MARGIN=1, FUN=quantile, probs=c(.025, .5, .975))))
## Debugging
# jmod.preds$id <- sols3$id
# write.table(left_join(left_join(sols3, glm.preds, by="id"), jmod.preds, by="id"), "tmp.txt", row.names=F, sep="\t")
## Plot the comparison
ggplot(data=ggdat, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=glm.pred, color="GLM big")) + geom_line(aes(y=glm.pred, color="GLM big")) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + geom_ribbon(aes(ymin=`2.5%`, ymax=`97.5%`), fill="green", alpha=.1) + labs(title=paste("Solicitations under", dollar(x), "with 95% GLM CI"))
}
## Ignore gifts over $5M
SideBySidePlots(5000000)
```
Not a home run by any means, though this provides more evidence that JMod is too pessimistic in the second half of the year. But here's the interesting part:
```{r, echo=F, message=F, warning=F, cache=F}
## Increase x
g1 <- SideBySidePlots(10000000)
g2 <- SideBySidePlots(25000000)
g3 <- SideBySidePlots(50000000)
grid.arrange(g1, g2, g3, nrow=3)
```
As I include increasingly large gifts, the GLM does better and better, implying that this method *successfully models principal gifts* in the 7/1/2015--6/30/2016 time period! Luck, or skill? Here is the comparison for all past years using only the actual historical "SOL & Future FRP Projections" files available at the time.
```{r, echo=F, message=F, warning=F, cache=F}
## Generate FY16 point-in-time dataset by creating a new sols2
folder <- "CSV data 2"
source("Rscript - historical data generation.R")
## Look for duplicate solicitations per month; only use the first in each month, and only closed sols
sols2 <- sols2 %>% mutate(m = month(rpt.date)) %>% arrange(rpt.date) %>% distinct(Solicitation.ID, m, .keep_all = TRUE)  # .keep_all retains the other columns needed below
## Make predictions for each method
sols3 <- na.omit(sols2 %>% select(Solicitation.ID, Curr.Stage, rpt.date, Expected.Dt, Expected.Amt, Actual.Amt, Closed.In.FY, Expect.In.Future.FY, Sol.Type.Agg, lt.Planned.Amt, lt.Expected.Amt, lt.Ask.Amt))
sols3$id <- as.character(paste(sols3$Solicitation.ID, sols3$rpt.date, sep="."))
## JMod predictions
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY)
# Add the report date
jmod.preds$dt <- sols3$rpt.date
jmod.preds$expect <- sols3$Expected.Amt
jmod.preds <- data.frame(jmod.preds)
# Plan
glmdat <- sols3 %>% filter(Curr.Stage == 1) %>% mutate(FY.Plan.Book = Closed.In.FY, Plan.Future = Expect.In.Future.FY)
plan.preds <- PredictGLM(glm.pln.t, glmdat)
plan.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Clear
glmdat <- sols3 %>% filter(Curr.Stage == 2) %>% mutate(FY.Clear.Book = Closed.In.FY, Clear.Future = Expect.In.Future.FY)
clear.preds <- PredictGLM(glm.clr.t, glmdat)
clear.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Ask
glmdat <- sols3 %>% filter(Curr.Stage == 3) %>% mutate(FY.Ask.Book = Closed.In.FY, Ask.Future = Expect.In.Future.FY)
ask.preds <- PredictGLM(glm.ask.t, glmdat)
ask.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Oral
glmdat <- sols3 %>% filter(Curr.Stage == 4) %>% mutate(FY.Oral.Book = Closed.In.FY, Oral.Future = Expect.In.Future.FY)
oral.preds <- PredictGLM(glm.ora.t, glmdat)
oral.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Paperwork
glmdat <- sols3 %>% filter(Curr.Stage == 5)
paper.preds <- data.frame(probability = 1, prediction = glmdat$Expected.Amt, error = glmdat$Expected.Amt - glmdat$Actual.Amt, dt = glmdat$rpt.date, expect = glmdat$Expected.Amt)
paper.preds$id <- as.character(paste(glmdat$Solicitation.ID, glmdat$rpt.date, sep="."))
# Combine
glm.preds <- rbind(plan.preds, clear.preds, ask.preds, oral.preds, paper.preds)
SideBySidePlots(50000000)
```
Some of the principal gifts have been in the works since FY14 and didn't come in until FY16. As a result, GLM overestimated dollars booked in FY14 and FY15, but was almost spot on in FY16.
In summary, this model does very well predicting the probability that a solicitation will close, but not the expected solicitation amount. Formulated statistically, recall that $E(FRP)=\sum_{i} P(Sol_{i}=1)E(Sol_{i})$. From the first part, $P(Sol_{i}=1)$ is estimated accurately, but as seen from the point-in-time charts, multiplying this probability by $E(Sol_{i})$ taken from `Expected.Amt` is not a good model for $E(FRP)$. An alternate approach is to model $E(FRP)$ directly; depending on the method, this could be guaranteed to be statistically unbiased.
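To make the failure mode concrete: even with perfectly calibrated close probabilities, $\sum_{i} P(Sol_{i}=1)E(Sol_{i})$ computed with `Expected.Amt` standing in for $E(Sol_{i})$ overstates $E(FRP)$ whenever gifts tend to close below their stated expected amounts. A toy simulation (the 80% shortfall factor is an illustrative assumption, not an estimate from this data):

```r
set.seed(2)
n <- 1000
p <- runif(n)                # true close probabilities (calibrated by construction)
expect <- 10^runif(n, 4, 6)  # stated expected amounts
shortfall <- 0.8             # closed gifts average 80% of expectation (toy assumption)
closed <- runif(n) < p
actual <- ifelse(closed, expect * shortfall, 0)
sum(p * expect)  # probability-weighted expected amounts: biased high
sum(actual)      # realized total: about 80% of the figure above
```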
# JMod tweaking
In the past, JMod was tweaked at the end of the year based on how it had over- or under-estimated dollars booked. In comparing JMod to the basic GLM considering only stage and month, I noted two big opportunities:
1. The ask and oral stages need to be treated differently.
2. MG $(\$25\text{k} \le X < \$5\text{M})$ and PG $(\$5\text{M} \le X)$ need to be treated differently.
I'll test each of the following easy-to-implement tweaks on the compiled data:
1. Fixed oral discounting of 2/3 across the board
2. Fixed oral discounting of 2/3 across the board, and set post-April discounts to their April values
3. Same as #1, but exclude PG solicitations
4. Same as #2, but exclude PG solicitations
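The actual implementations (`JMod_Oral` and friends) live in the project's R scripts; the core of tweak #1 is just a flat discount on Oral-stage expectations, roughly like this (hypothetical helper for illustration only):

```r
# Sketch of tweak #1: flat 2/3 discount on Oral-stage (stage 4) expected amounts;
# other stages pass through unchanged. Hypothetical helper, not the real JMod_Oral.
DiscountOral <- function(curr.stage, expect.amt, discount = 2/3) {
  ifelse(curr.stage == 4, expect.amt * discount, expect.amt)
}
DiscountOral(curr.stage = c(3, 4), expect.amt = c(100000, 300000))
# the Oral row is discounted to 200000; the stage-3 row passes through
```

Tweak #2 additionally freezes the month-based discount schedule after April instead of letting it keep shrinking through June.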
```{r, echo=F, message=F, warning=F, cache=F}
## Original method
jmod.preds <- JMod(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>%
mutate(dt = sols3$rpt.date, expect = sols3$Expected.Amt)
jmod.preds$id <- sols3$id
## Method 1
jmod.preds.1 <- JMod_Oral(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm1.prediction = prediction)
## Method 2
jmod.preds.2 <- JMod_Oral_2step(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm2.prediction = prediction)
## Method 5
jmod.preds.5 <- JMod_Oral_2step_75(curr.stage=sols3$Curr.Stage, curr.dt=sols3$rpt.date, expected.dt=sols3$Expected.Dt, expect.amt=sols3$Expected.Amt, act.amt=sols3$Actual.Amt, closed.in.fy=sols3$Closed.In.FY) %>% data.frame() %>% select(jm5.prediction = prediction)
## Combine
jmod.preds <- jmod.preds %>% bind_cols(jmod.preds.1) %>% bind_cols(jmod.preds.2) %>% bind_cols(jmod.preds.5)
jmod.preds <- full_join(jmod.preds, glm.preds %>% select(glm.pred = prediction, id), by="id")
## Summarized data, any solicitation amount
ggdat1 <- jmod.preds %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), pred1 = sum(jm1.prediction), pred2 = sum(jm2.prediction), pred5 = sum(jm5.prediction), glm.pred = sum(glm.pred), act = sum(actual))
## Summarized data, sub-$5M only
ggdat2 <- filter(jmod.preds, expect < 5000000) %>% mutate(dt = DateToNthOfMonth(dt, n=1)) %>% group_by(dt) %>% summarise(pred = sum(prediction), pred1 = sum(jm1.prediction), pred2 = sum(jm2.prediction), pred5 = sum(jm5.prediction), glm.pred = sum(glm.pred), act = sum(actual))
## JMod point-in-time plots, any solicitation amount
ggplot(data=ggdat1, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred1, color="JMod #1")) + geom_line(aes(y=pred1, color="JMod #1"), alpha=.5) + geom_point(aes(y=pred2, color="JMod #2")) + geom_line(aes(y=pred2, color="JMod #2"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, any solicitation amount")
## JMod point-in-time plots, sub-$5M only
ggplot(data=ggdat2, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred1, color="JMod #3")) + geom_line(aes(y=pred1, color="JMod #3"), alpha=.5) + geom_point(aes(y=pred2, color="JMod #4")) + geom_line(aes(y=pred2, color="JMod #4"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, solicitations up to $5M")
```
These tweaks around the edges are directionally helpful, but they barely make a difference to the overall predictions; the JMod variants are just a little less negatively biased at the end of each year.
Let's try one more tweak: less aggressive oral-stage discounting (75%).
5. Fixed oral discounting of 3/4 across the board, and set post-April discounts to their April values
6. Same as #5, but exclude PG solicitations
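The "freeze after April" component of tweaks #2 and #5 can be sketched the same way: with July as fiscal month 1, April is fiscal month 10, and May and June simply reuse April's discount (the schedule values below are made up):

```r
# Hypothetical declining monthly discount schedule, fiscal months 1 (Jul) to 12 (Jun)
discounts <- seq(0.95, 0.40, length.out = 12)
# Frozen version: months after April (fiscal month 10) reuse April's value
FrozenDiscount <- function(fiscal.mo) discounts[pmin(fiscal.mo, 10)]
FrozenDiscount(9:12)  # the last two entries equal the April (month 10) discount
```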
```{r, echo=F, message=F, warning=F, cache=F}
## JMod point-in-time plots, any solicitation amount
ggplot(data=ggdat1, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred5, color="JMod #5")) + geom_line(aes(y=pred5, color="JMod #5"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, any solicitation amount")
## JMod point-in-time plots, sub-$5M only
ggplot(data=ggdat2, aes(x=dt, y=act)) + geom_point(aes(color="Actual")) + geom_line(aes(color="Actual"), size=1.5, alpha=.5) + geom_point(aes(y=glm.pred, color="GLM")) + geom_line(aes(y=glm.pred, color="GLM")) + geom_point(aes(y=pred, color="JMod")) + geom_line(aes(y=pred, color="JMod")) + geom_point(aes(y=pred5, color="JMod #6")) + geom_line(aes(y=pred5, color="JMod #6"), alpha=.5) + scale_y_continuous(name="Amount", labels=scales::dollar) + scale_x_date(name="Date", date_minor_breaks="month") + labs(title="JMod methods comparison, solicitations up to $5M")
```
Again, it's hard to see much of a difference outside the last few months of each year. Here's a summary of the count of solicitations expected in the current fiscal year by stage, on average:
```{r, echo=F, message=F, warning=F, cache=F}
tmp <- sols3 %>% mutate(
Stage = factor(Curr.Stage, labels=c("Plan", "Clear", "Ask", "Oral", "Paperwork")),
Month = month(rpt.date) %>% factor(labels=strtrim(month.name, 3)),
Year = rpt.date %>% YToFY()) %>%
filter(!Expect.In.Future.FY)
tmp$Month <- factor(tmp$Month, levels(tmp$Month)[c(7:12,1:6)])
ggdat <- tmp %>% select(Stage, Month) %>% mutate(Count = 1) %>% group_by(Stage, Month) %>% summarise(Count = sum(Count)/length(table(tmp$Year)))
ggdat %>% ggplot(aes(x=Month, y=Count, color=Stage, group=Stage)) + geom_point(alpha=.5) + geom_line(alpha=.5) + labs(title="Mean yearly count of expected solicitations by month")
#ggdat %>% select(Count, Month, Stage) %>% xtabs(Count ~ Stage + Month, data=.) %>% kable()
```
And the associated expected dollar amounts:
```{r, echo=F, message=F, warning=F, cache=F}
ggdat <- tmp %>% group_by(Stage, Month) %>% summarise(Amount = sum(Expected.Amt)/length(table(tmp$Year)))
ggdat %>% ggplot(aes(x=Month, y=Amount, color=Stage, group=Stage)) + geom_point(alpha=.5) + geom_line(alpha=.5) + scale_y_log10(labels=scales::dollar, breaks=10^(0:7)) + labs(title="Mean yearly sum of expected amounts by month")
#ggdat %>% select(Amount, Month, Stage) %>% xtabs(Amount ~ Stage + Month, data=.) %>% kable()
```
The dollar amounts show that tweaks to Oral just won't have much of an impact, and the counts and dollar amounts together show that tweaks to Plan and Clear from February onward won't have much of an impact either.
# Directly modeling dollar amounts
The preferred method for finding $E(FRP)$ is to model it directly. This is challenging given the right-skewness of `Actual.Amt` and the large number of zero values.
```{r, echo=F, message=F, warning=F, cache=F}
# Re-import mderived
source("Rscript1.Merge.R")
# Untransformed, < $1M
mderived %>% filter(Actual.Amt < 1000000) %>% ggplot(aes(x=Actual.Amt, y=..density..)) + geom_histogram(alpha=.5, bins=25) + geom_density() + scale_x_continuous(labels=scales::dollar) + labs(title="Density of solicitation actual amount (under $1M)")
# Log transformed; remove $0 amounts
mderived %>% filter(Actual.Amt > 0) %>% ggplot(aes(x=Actual.Amt, y=..density..)) + geom_histogram(alpha=.5, bins=25) + geom_density() + scale_x_log10(labels=scales::dollar) + labs(title=expression(paste("Density of ", log[10], " solicitation actual amount")))
```
The $\text{log}_{10}$ transformation is nearly symmetric after excluding the $0 amounts, but it's unreasonable to leave those solicitations out of the analysis. Two possibilities:
1. Model the expected solicitation close amount $E(Sol_{i}\text{ value}|Sol_{i}\text{ closed})$ rather than taking `Expected.Amt` at face value.
2. Log transform the data, map all negative values to 0, and use a model such as the Tobit that allows for censored data.
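Possibility 2 can be prototyped with `tobit()` from the `AER` package (a convenience wrapper around `survival::survreg`), here on simulated data with the same left-censoring at 0 that the log transform produces. This assumes `AER` is available; the variable scales are chosen to mimic $\text{log}_{10}$ dollar amounts:

```r
library(AER)  # provides tobit() for censored-outcome regression
set.seed(3)
x <- runif(200, 3, 7)                        # stand-in for log10(Expected.Amt)
y.star <- -2 + 0.9 * x + rnorm(200, sd = 1)  # latent log10 actual amount
y <- pmax(y.star, 0)                         # left-censored at 0, like log_censor()
fit <- tobit(y ~ x, left = 0)
coef(fit)  # slope recovered near the true 0.9 despite the censoring
summary(lm(y ~ x))$coef["x", 1]  # naive OLS on the censored y attenuates the slope
```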
### Modeling close amounts
We can check whether the high-importance variables for predicting whether gifts close in the first place are still important for predicting the amount at which they close. Begin by $\text{log}_{10}$ transforming `Actual.Amt` and dropping all values $\le 0$; this restricts the data to the conditional distribution behind $E(Sol_{i}\text{ value}|Sol_{i}\text{ closed})$, as required. Now, growing regression forests on the transformed gift amounts:
```{r, echo=F, message=F, warning=F, cache=F}
# Censored log transformation
log_censor <- function(x, base=exp(1)) {
x <- ifelse(log(x, base=base) > 0, log(x, base=base), 0)
return(x)
}
# Data setup; only use through end of FY15 for fair comparisons
holdout <- mderived %>% filter(Final.Sol.Stage.Dt >= mdy("07/01/2015"))
mderived <- mderived %>% mutate(Actual.Amt.lt = Actual.Amt %>% log_censor(base=10)) %>% filter(Final.Sol.Stage.Dt <= mdy("06/30/2015"))
lmdat <- mderived %>% filter(Actual.Amt > 0)
## Random forests
library(randomForest)
set.seed(51906)
# Planned stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.plan)])
rf.plan <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Clear stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.clear)])
rf.clear <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Ask stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.ask)])
rf.ask <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
# Oral stage forest
treedat <- na.omit(lmdat[,c("Actual.Amt.lt", head.all, head.oral)])
rf.oral <- randomForest(Actual.Amt.lt ~ ., data=treedat, importance=T)
## Coerce importances into a data frame
imp.plan <- importance(rf.plan, scale=F, type=1)
imp.clear <- importance(rf.clear, scale=F, type=1)
imp.ask <- importance(rf.ask, scale=F, type=1)
imp.oral <- importance(rf.oral, scale=F, type=1)
rf.importances <- rbind(data.frame(Model="Plan", imp.plan, Variable=rownames(imp.plan)), data.frame(Model="Clear", imp.clear, Variable=rownames(imp.clear)), data.frame(Model="Ask", imp.ask, Variable=rownames(imp.ask)), data.frame(Model="Oral", imp.oral, Variable=rownames(imp.oral)), make.row.names=F)
## Plot importances
# y axis will end at nearest .05 above the maximum in the data
top <- ceiling(max(rf.importances$X.IncMSE)*20)/20
ggplot(rf.importances, aes(x=Variable, y=X.IncMSE, color=Model, group=Model)) + geom_point() + geom_line() + theme(axis.text.x=element_text(angle=90, hjust=1)) + labs(title="Random forest variable importances", y="Mean Decrease in Accuracy") + scale_y_continuous(breaks=seq(0, top, by=.05))
```
* `Expected.Amt` is extremely important. `Planned.Amt` and `Ask.Amt` are also helpful where available.
* `Solicitation.Type.Desc` is slightly helpful in the Plan and Clear stages, but not afterward.
* The future variables, and the month the indicated stage was reached, are not helpful as continuous covariates in the tree method. This makes sense, given that we're conditioning on the solicitation having closed.
Using the `leaps` package for variable selection:
```{r, echo=F, message=F, warning=F, cache=F}
library(leaps)
# Plan
lm.leaps <- lmdat %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Plan.Future + planning.fiscal.mo + Sol.Type.Agg, data=., nvmax=16)
rsubs1 <- summary(lm.leaps)
rsubsdf <- data.frame(model="Plan", bic=rsubs1$bic, p=(1:length(rsubs1$bic)))
# Clear
lm.leaps <- lmdat %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + clear.fiscal.mo + Clear.Future, data=., nvmax=16)
rsubs2 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Clear", bic=rsubs2$bic, p=(1:length(rsubs2$bic))))
# Ask
lm.leaps <- lmdat %>% filter(Ask.Amt > 0) %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + log10(Ask.Amt) + ask.fiscal.mo + Ask.Future, data=., nvmax=16)
rsubs3 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Ask", bic=rsubs3$bic, p=(1:length(rsubs3$bic))))
# Oral
lm.leaps <- lmdat %>% filter(Ask.Amt > 0) %>% regsubsets(Actual.Amt.lt ~ log10(Expected.Amt) + log10(Planned.Amt) + Sol.Type.Agg + log10(Ask.Amt) + oral.fiscal.mo + Oral.Future, data=., nvmax=16)
rsubs4 <- summary(lm.leaps)
rsubsdf <- rbind(rsubsdf, data.frame(model="Oral", bic=rsubs4$bic, p=(1:length(rsubs4$bic))))
# Plot BIC
rsubsdf %>% ggplot(aes(x=p, y=bic, color=model)) + geom_point(alpha=.5) + geom_line(alpha=.25) + labs(title="BIC for different model sizes") + facet_grid(. ~ model)
```
Across the board, two predictors are optimal. Let's see which they are.
```{r, echo=F, message=F, warning=F, cache=F}
tmp <- colnames(rsubs1$which)[rsubs1$which[2, ]]
tmp <- tmp %>% rbind(colnames(rsubs2$which)[rsubs2$which[2, ]])
tmp <- tmp %>% rbind(colnames(rsubs3$which)[rsubs3$which[2, ]])
tmp <- tmp %>% rbind(colnames(rsubs4$which)[rsubs4$which[2, ]])
rownames(tmp) <- c("Plan", "Clear", "Ask", "Oral")
colnames(tmp) <- c("Intercept", "V1", "V2")
tmp %>% kable()
```
For simplicity's sake, we can use the transformed `Expected.Amt` and the corresponding stage's `Future` indicator in each of the models, plus their interaction. Now, calculating predicted amounts for the held-out FY16 data:
```{r, echo=F, message=F, warning=F, cache=F}
# Modeling with all data
lm.p <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Plan.Future, data=.)
lm.c <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Clear.Future, data=.)
lm.a <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Ask.Future, data=.)
lm.o <- lmdat %>% lm(Actual.Amt.lt ~ log10(Expected.Amt) * Oral.Future, data=.)
# Predicted log10 actual amounts
holdout <- holdout %>% mutate(
lmpreds.pln = predict(lm.p, newdata=holdout),
lmpreds.pln = ifelse(lmpreds.pln == -Inf, 0, lmpreds.pln),
lmpreds.clr = predict(lm.c, newdata=holdout),
lmpreds.clr = ifelse(lmpreds.clr == -Inf, 0, lmpreds.clr),
lmpreds.ask = predict(lm.a, newdata=holdout),
lmpreds.ask = ifelse(lmpreds.ask == -Inf, 0, lmpreds.ask),
lmpreds.ora = predict(lm.o, newdata=holdout),
lmpreds.ora = ifelse(lmpreds.ora == -Inf, 0, lmpreds.ora)
)
```