---
title: 'Learning Lab 3 Case Study'
author: ""
date: "`r format(Sys.Date(),'%B %e, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    code_folding: show
    code_download: TRUE
editor_options:
  markdown:
    wrap: 72
bibliography: lit/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Even after feature engineering, machine learning approaches can often (but not always) be improved by choosing a more sophisticated model type. Note how we used a regression model in the first two case studies; here, we explore a considerably more sophisticated model, a random forest.

Choosing a more sophisticated model adds some complexity to the modeling. Notably, more complex models have _tuning parameters_ - parts of the model that are not estimated from the data. In addition to using feature engineering in a way akin to how we did in the last case study, Bertolini et al. (2021) use tuning parameters to improve the performance of their predictive model.

Our driving question is: How much of a difference does a more complex model make? In answering this question, we focus not only on estimating a complex model, but also on tuning it.

We use data from the #NGSSchat community on Twitter so that we can compare the performance of this tuned, complex model to the initial model we used in the first case study.

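To make the idea of a tuning parameter concrete, here is a minimal sketch (not part of the case study code below) contrasting a random forest specification whose tuning parameters are fixed by hand with one whose values are left to be found by tuning. Here, `mtry` is the number of predictors sampled at each split and `min_n` is the minimum number of observations in a terminal node.

```{r, eval = FALSE}
# a random forest with its tuning parameters set by hand
rand_forest(mtry = 3, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# the same model with tune() placeholders; the values are chosen later by
# searching over a grid of candidates (as we do in Step 4 below)
rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```
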
## Step 0: Loading and setting up

First, let's load the packages we'll use---the familiar {tidyverse} and several others focused on modeling. As in earlier learning labs, click the green arrow to run each code chunk.

```{r}
library(tidyverse)
library(here)
library(tidymodels)
library(vip)    # variable importance plots
library(ranger) # the engine we use for random forests
```

Next, we'll load two data sources: one with the tweets, the other with our qualitative codes.

*Note*: We created a means of visualizing the threads to make coding them easier; it also provides a way to see what the raw data is like: https://jmichaelrosenberg.shinyapps.io/ngsschat-shiny/

```{r}
d <- read_rds(here("data", "ngsschat-data.rds"))
codes <- read_csv(here("data", "ngsschat-qualitative-codes.csv"))
```

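Optionally, take a quick look at what was loaded before moving on; `glimpse()` lists each variable along with a few of its values (exactly which columns appear depends on the data files):

```{r, eval = FALSE}
glimpse(d)
glimpse(codes)
```
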
## Step 1: Split data

```{r}
library(tidymodels) # note that {tidymodels} does not load {forcats}, {stringr}, or {readr} from the {tidyverse}
library(readr)
library(here)
library(vip)

# read the data we will model; drop time_spent, which is another continuous outcome
d <- read_csv(here("spring-workshop", "data-to-model.csv"))
d <- select(d, -time_spent)

# split the data into training (70%) and testing (30%) sets
train_test_split <- initial_split(d, prop = .70)
data_train <- training(train_test_split)

# create cross-validation folds from the training data
kfcv <- vfold_cv(data_train)
```

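As an optional sanity check (assuming the chunk above has been run), you can confirm how many rows landed in each partition and inspect the resamples:

```{r, eval = FALSE}
# rows in the training and testing partitions
nrow(training(train_test_split))
nrow(testing(train_test_split))

# the cross-validation folds created from the training data (10 by default)
kfcv
```
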
## Step 2: Engineer features

```{r panel-chunk-2, echo = TRUE, eval = FALSE}
# pre-processing/feature engineering
# an earlier step selected the contextual/demographic and survey variables, e.g.:
# d <- select(d, student_id:final_grade, subject:percomp)

# drop the student identifier before modeling
d <- d %>% select(-student_id)

sci_rec <- recipe(final_grade ~ ., data = d) %>%
  add_role(course_id, new_role = "ID variable") %>%   # the new role can be any string
  step_novel(all_nominal_predictors()) %>%            # handle factor levels not seen in training
  step_normalize(all_numeric_predictors()) %>%        # center and scale numeric predictors
  step_dummy(all_nominal_predictors()) %>%            # dummy-code nominal predictors
  step_nzv(all_predictors()) %>%                      # remove near-zero-variance predictors
  step_impute_knn(all_predictors(), all_outcomes())   # impute missing values with k-nearest neighbors
```

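If you want to see what the recipe actually does to the data, one option (a sketch, assuming the recipe above has been defined) is to `prep()` it and then `bake()` it, which returns the processed training data:

```{r, eval = FALSE}
# estimate the preprocessing steps, apply them to the data the recipe was
# defined with, and peek at the engineered features
sci_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```
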
## Step 3: Specify recipe, model, and workflow

```{r panel-chunk-3, echo = TRUE, eval = FALSE}
# specify model: a random forest with two tuning parameters to be tuned -
# mtry (the number of predictors sampled at each split) and min_n (the minimum node size)
rf_mod_many <-
  rand_forest(mtry = tune(),
              min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% # impurity-based variable importance
  set_mode("regression")

# specify workflow: bundle the model specification with the recipe from Step 2
rf_wf_many <-
  workflow() %>%
  add_model(rf_mod_many) %>%
  add_recipe(sci_rec)
```

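Printing the workflow object is a quick way to confirm that the model and the recipe were combined as intended:

```{r, eval = FALSE}
rf_wf_many
```
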
## Step 4: Fit model

```{r panel-chunk-4, echo = TRUE, eval = FALSE}
# inspect the possible ranges for the tuning parameters, given the training data
finalize(mtry(), data_train)
finalize(min_n(), data_train)

# specify the tuning grid: 10 candidate combinations of mtry and min_n
tree_grid <- grid_max_entropy(mtry(range = c(1, 15)),
                              min_n(range = c(2, 40)),
                              size = 10)

# fit the model for every candidate combination with tune_grid(), using the
# cross-validation resamples from Step 1
tree_res <- rf_wf_many %>%
  tune_grid(
    resamples = kfcv,
    grid = tree_grid,
    metrics = metric_set(rmse, mae, rsq)
  )

# examine the best sets of tuning parameters (consider repeating the tuning
# with a refined grid if these results suggest it)
show_best(tree_res, metric = "rmse", n = 10)

# select the best set of tuning parameters
best_tree <- tree_res %>%
  select_best(metric = "rmse")

# finalize the workflow with the best set of tuning parameters
final_wf <- rf_wf_many %>%
  finalize_workflow(best_tree)

# fit the finalized workflow to the training data and evaluate it on the held-out testing data
final_fit <- final_wf %>%
  last_fit(train_test_split, metrics = metric_set(rmse, mae, rsq))
```

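Beyond `show_best()`, it can help to look at performance across the whole tuning grid. A brief sketch, assuming the chunk above has been run:

```{r, eval = FALSE}
# all resampled performance estimates: one row per combination of tuning
# parameters and metric
collect_metrics(tree_res)

# a quick visual summary of how rmse, mae, and rsq vary with mtry and min_n
autoplot(tree_res)
```
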
## Step 5: Interpret accuracy

```{r panel-chunk-5, echo = TRUE, eval = FALSE}
# variable importance plot: which features contribute most to the predictions?
final_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>% # extract_fit_parsnip() supersedes pull_workflow_fit()
  vip(num_features = 10)

# fit statistics (rmse, mae, and rsq) for the held-out testing data
final_fit %>%
  collect_metrics()

# predictions for the testing data
final_fit %>%
  collect_predictions()
```

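One further way to interpret the model (a sketch, assuming the final fit above) is to plot the test-set predictions against the observed final grades; points close to the dashed diagonal are cases the model predicted accurately:

```{r, eval = FALSE}
final_fit %>%
  collect_predictions() %>%                 # includes .pred and the observed final_grade
  ggplot(aes(x = final_grade, y = .pred)) +
  geom_point(alpha = .5) +                  # one point per observation in the testing set
  geom_abline(linetype = "dashed") +        # the line of perfect prediction
  labs(x = "Observed final grade", y = "Predicted final grade")
```
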
### 🧶 Knit & Check ✅

Congratulations - you've completed the Machine Learning Learning Lab 3 Guided Practice! Consider moving on to the independent practice next.