random_forest.Rmd

# Random forests

## Data business

Load the libraries and data files we need.
```{r}
library(tidyverse)
library(tidymodels)
library(feather)
library(magrittr)
library(skimr)
library(vip)
per <- read_feather("data/simulation_data/all_persons.feather")
```

Compute some summary statistics for each client.
```{r}
clients <-
  per %>%
  group_by(client) %>%
  summarize(
    zip3 = first(zip3),
    size = n(),
    volume = sum(FaceAmt),
    avg_qx = mean(qx),
    avg_age = mean(Age),
    per_male = sum(Sex == "Male") / size,
    per_blue_collar = sum(collar == "blue") / size,
    expected = sum(qx * FaceAmt),
    actual_2020 = sum(FaceAmt[year == 2020], na.rm = TRUE),
    ae_2020 = actual_2020 / expected,
    adverse = as.factor(ae_2020 > 1.1)
  ) %>%
  relocate(adverse, ae_2020, .after = zip3)
```
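
To make the A/E definition concrete, here is a minimal sketch on a made-up table with two clients. The `toy_per` data and all its values are invented for illustration; only the column names mirror `per`. We also assume that `year` holds the year of death and is `NA` for people who are still alive, which is why `na.rm = TRUE` is needed.

```{r}
library(dplyr)

# Hypothetical toy data: two clients, three policies each.
# Assumption: `year` is the year of death (NA if the person is still alive).
toy_per <- tibble(
  client  = c(1, 1, 1, 2, 2, 2),
  FaceAmt = c(100, 200, 300, 100, 100, 100),
  qx      = c(0.01, 0.02, 0.01, 0.10, 0.10, 0.10),
  year    = c(2020, NA, NA, NA, NA, NA)
)

toy_clients <-
  toy_per %>%
  group_by(client) %>%
  summarize(
    expected = sum(qx * FaceAmt),                            # expected claims
    actual_2020 = sum(FaceAmt[year == 2020], na.rm = TRUE),  # claims from 2020 deaths
    ae_2020 = actual_2020 / expected,                        # actual-to-expected ratio
    adverse = ae_2020 > 1.1
  )
toy_clients
```

In this toy table the first client has one large claim against small expected claims and is flagged as adverse; the second has no 2020 claims, so its A/E is zero.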

We can add some demographic information based on zip3.
```{r}
zip_data <-
  read_feather("data/data.feather") %>%
  mutate(
    density = POP / AREALAND,
    AREALAND = NULL,
    AREA = NULL,
    HU = NULL,
    vaccinated = NULL,
    per_lib = NULL,
    per_green = NULL,
    per_other = NULL,
    per_rep = NULL,
    unempl_2020 = NULL,
    poverty = NULL,
    deaths_covid = NULL,
    deaths_all = NULL
  ) %>%
  rename(
    unemp = unempl_2019,
    hes_uns = hes_unsure,
    str_hes = strong_hes,
    income = Median_Household_Income_2019
  )
```
There seem to be some clients whose zip codes we cannot deal with. These are the ones:
```{r}
clients %>%
  anti_join(zip_data, by = "zip3") %>%
  select(zip3)
```
These correspond to the following areas:

ZIP3 | Area       |
-----|------------|
969  | Guam, Palau, Federated States of Micronesia, Northern Mariana Islands, Marshall Islands |
093  | Military bases in Iraq and Afghanistan |
732  | Not in use |
872  | Not in use |
004  | Not in use |
202  | Washington DC, Government 1 |

We ignore clients with these zip codes. There are also two clients in DC for which we're missing election data. We will ignore those as well.
```{r}
clients %<>%
  inner_join(zip_data, by = "zip3") %>%
  drop_na()
```
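
The `anti_join()` / `inner_join()` pair used above can be sketched on a tiny made-up example. The tables `toy_clients_zip` and `toy_zip_data` and their values are invented for illustration and are not taken from the real data.

```{r}
library(dplyr)

# Hypothetical clients and zip-level data (values invented)
toy_clients_zip <- tibble(client = 1:4, zip3 = c("100", "969", "606", "093"))
toy_zip_data <- tibble(zip3 = c("100", "606"), POP = c(1500000, 2700000))

# Rows of the left table with no match on the right:
# these are the clients an inner join would silently drop.
toy_unmatched <- toy_clients_zip %>% anti_join(toy_zip_data, by = "zip3")
toy_unmatched

# The inner join keeps only the matched clients and adds the zip-level columns.
toy_matched <- toy_clients_zip %>% inner_join(toy_zip_data, by = "zip3")
toy_matched
```

Checking the anti join first is a cheap way to see exactly which rows disappear before committing to the inner join.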

We now have our full dataset. Behold!
```{r}
skim(clients)
```

## First model
We will use a random forest using the tidymodels framework.

We start by creating a recipe. We won't use zip3, client ID, actual claims, or ae_2020 as predictors. The clients in DC with missing election data were already removed by `drop_na()` above.
```{r}
ranger_recipe <-
  recipe(adverse ~ ., data = clients) %>%
  update_role(zip3, ae_2020, new_role = "diagnostic") %>%
  step_rm(actual_2020, client)
```

We use the ranger engine for our random forest. We could tune the parameters as well.
```{r}
ranger_spec <-
  rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger", num.threads = 8, importance = "impurity", seed = 123)
```

Wrap the recipe and model into a workflow
```{r}
ranger_workflow <-
  workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(ranger_spec)
```

Create an initial test-train split
```{r}
set.seed(1111)
init_split <-
  clients %>%
  initial_split(strata = adverse)

clients_test <- init_split %>% testing()
clients_test %>% count(adverse)
clients_train <- init_split %>% training()
clients_train %>% count(adverse)
```
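
As a rough illustration of what `strata = adverse` is meant to achieve, the sketch below splits a made-up imbalanced dataset and compares class shares in the two parts. The `toy_data` table is invented; the expectation is that both parts keep roughly the original 80/20 balance.

```{r}
library(rsample)
library(dplyr)

# Hypothetical imbalanced outcome: 160 TRUE, 40 FALSE
toy_data <- tibble(
  id = 1:200,
  adverse = factor(rep(c("TRUE", "FALSE"), times = c(160, 40)),
                   levels = c("FALSE", "TRUE"))
)

set.seed(1111)
toy_split <- initial_split(toy_data, strata = adverse)

# Class shares in each part of the split
training(toy_split) %>% count(adverse) %>% mutate(share = n / sum(n))
testing(toy_split) %>% count(adverse) %>% mutate(share = n / sum(n))
```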

Train the workflow
```{r}
ranger_trained <-
  ranger_workflow %>%
  fit(clients_train)
```

And we predict
```{r}
predictions <-
  ranger_trained %>%
  predict(clients_test)
```

Compute the confusion matrix
```{r}
predictions %>%
  # rows with missing values were already dropped from `clients`,
  # so the test set lines up row by row with the predictions
  bind_cols(clients_test %>% select(adverse)) %>%
  conf_mat(adverse, .pred_class)
```

It looks like the model performs well, but it is basically predicting that every company will have adverse deaths.
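
To see why a model like this can look good and still be useless, here is a minimal sketch with made-up predictions: 95 adverse clients, 5 non-adverse ones, and a classifier that always answers "adverse". The counts are invented for illustration. Accuracy comes out at 0.95, while the J index (sensitivity + specificity - 1) is 0.

```{r}
library(yardstick)
library(tibble)

# Hypothetical outcome and a classifier that always predicts TRUE
toy_preds <- tibble(
  truth = factor(c(rep("TRUE", 95), rep("FALSE", 5)), levels = c("FALSE", "TRUE")),
  estimate = factor(rep("TRUE", 100), levels = c("FALSE", "TRUE"))
)

# By default yardstick treats the first factor level ("FALSE") as the event,
# so sensitivity here is the share of non-adverse clients that are found.
toy_metrics <- metric_set(accuracy, sens, spec, j_index)
toy_result <- toy_preds %>% toy_metrics(truth = truth, estimate = estimate)
toy_result
```

This is why the chunks that follow report `sens`, `spec` and `j_index` alongside `roc_auc` instead of relying on accuracy.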

Another way to do this, which also automates the computation of metrics, is `last_fit()`.
```{r}
ranger_last_fit <-
  ranger_workflow %>%
  last_fit(
    split = init_split,
    metrics = metric_set(sens, spec, roc_auc, j_index)
  )

ranger_last_fit %>% collect_metrics()

ranger_last_fit %>%
  collect_predictions() %>%
  roc_curve(adverse, .pred_FALSE) %>%
  autoplot()
```


### Subsampling
We will try to compensate for the class imbalance by using *subsampling*. See e.g. [here](https://www.tidymodels.org/learn/models/sub-sampling/) for a nice introduction.
```{r}
library(themis)
set.seed(222)
subsample_recipe <-
  ranger_recipe %>%
  step_rose(adverse)
subsample_workflow <-
  ranger_workflow %>%
  update_recipe(subsample_recipe)
subsample_last_fit <-
  subsample_workflow %>%
  last_fit(
    split = init_split,
    metrics = metric_set(sens, spec, roc_auc, j_index)
  )

subsample_last_fit %>% collect_metrics()

subsample_last_fit %>%
  collect_predictions() %>%
  roc_curve(adverse, .pred_FALSE) %>%
  autoplot()
```

The predictions look a bit more balanced, but the fit is much worse.
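
For intuition about what rebalancing does to the training data, here is a minimal sketch of plain random over-sampling with replacement. This is a simpler technique than ROSE, which generates synthetic examples, and it is not what `step_rose()` does internally. The `toy_imbalanced` data are invented.

```{r}
library(dplyr)

set.seed(1)
# Hypothetical imbalanced data: 90 TRUE, 10 FALSE
toy_imbalanced <- tibble(
  adverse = factor(c(rep("TRUE", 90), rep("FALSE", 10)), levels = c("FALSE", "TRUE")),
  x = rnorm(100)
)

# Size of the largest class
n_max <- toy_imbalanced %>% count(adverse) %>% pull(n) %>% max()

# Resample every class, with replacement, up to the size of the largest class
toy_balanced <-
  toy_imbalanced %>%
  group_by(adverse) %>%
  slice_sample(n = n_max, replace = TRUE) %>%
  ungroup()

toy_balanced %>% count(adverse)
```

The minority rows are simply repeated, so no new information is added; the model is only pushed to weigh both classes equally.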


## Changing the outcome variable
With this dataset, a threshold of AE > 1.1 is too low: too few clients have an AE below it in 2020.
```{r}
clients$ae_2020 %>% summary()
```

Let's say instead that a client experiences adverse deaths if AE > 3, which is about the 1st quartile.
```{r}
clients %<>%
  mutate(adverse = as.factor(ae_2020 > 3))
```
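
The effect of the threshold on class balance can be sketched with a made-up vector of A/E values. The numbers in `toy_ae` and the candidate thresholds other than 1.1 and 3 are invented for illustration.

```{r}
# Hypothetical A/E values, invented for illustration only
toy_ae <- c(0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 5.0, 6.3, 7.7, 9.4, 12.0, 15.2)

# Share of clients labelled adverse under each candidate threshold
toy_thresholds <- c(1.1, 3, 5)
toy_share <- sapply(toy_thresholds, function(t) mean(toy_ae > t))
setNames(toy_share, toy_thresholds)
```

The same one-liner, `mean(clients$ae_2020 > t)`, gives a quick check of how balanced the classes would be for any threshold `t` on the real data.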

We can apply the same workflow as before
```{r}
set.seed(333)
new_split <-
  clients %>%
  initial_split()

ranger_last_fit <-
  ranger_workflow %>%
  last_fit(
    split = new_split,
    metrics = metric_set(sens, spec, roc_auc, j_index)
  )

ranger_last_fit %>% collect_metrics()

ranger_last_fit %>%
  collect_predictions() %>%
  roc_curve(adverse, .pred_FALSE) %>%
  autoplot()
```

Better!

Can we tune hyperparameters to get even better results? Let's check
```{r message = FALSE, warning = FALSE}
tune_spec <-
  ranger_spec %>%
  update(mtry = tune(), min_n = tune())

tune_workflow <-
  ranger_workflow %>%
  update_model(tune_spec)

set.seed(444)
tune_split <- initial_split(clients)
set.seed(555)
tune_resamples <-
  vfold_cv(training(tune_split))

param_grid <-
  grid_regular(mtry(c(1, 23)),
               min_n(),
               levels = 5)

tune_res <-
  tune_workflow %>%
  tune_grid(
    resamples = tune_resamples,
    grid = param_grid,
    metrics = metric_set(sens, spec, roc_auc, j_index, accuracy)
  )

autoplot(tune_res)
```
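
To get a feel for the size of this search, the sketch below rebuilds the same regular grid on its own and inspects it. With 5 levels for each of the 2 parameters there are 25 candidate combinations, so with the default 10 folds of `vfold_cv()` the tuning step should amount to 250 model fits.

```{r}
library(dials)

# Same call as in the tuning chunk: 5 levels for each of the 2 parameters
toy_grid <- grid_regular(mtry(c(1, 23)), min_n(), levels = 5)

nrow(toy_grid)          # number of candidate combinations
unique(toy_grid$mtry)   # candidate values for mtry
unique(toy_grid$min_n)  # candidate values for min_n (default range from dials)
```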

I chose mtry = 12, min_n = 21.
```{r}
best <- tibble(mtry = 12, min_n = 21)
final_wf <-
  tune_workflow %>%
  finalize_workflow(best)

final_wf_fit <-
  final_wf %>%
  last_fit(
    tune_split,
    metrics = metric_set(sens, spec, roc_auc, j_index, accuracy)
    )

final_wf_fit %>%
  collect_metrics()

final_wf_fit %>%
  collect_predictions() %>%
  roc_curve(adverse, .pred_FALSE) %>%
  autoplot()

final_wf_fit %>%
  collect_predictions() %>%
  conf_mat(adverse, .pred_class)
```
Cool stuff. How does this compare to logistic regression by month?

We can also check variable importance
```{r}
final_wf_fit %>%
  pluck(".workflow", 1) %>%
  # pull_workflow_fit() is deprecated in recent versions of workflows
  extract_fit_parsnip() %>%
  vip(num_features = 30)
```

It looks like population is the overwhelming winner, followed by unemployment percentage, percentage of non-high-school graduates, and population density.