---
title: 'Learning Lab 3 Case Study'
author: ""
date: "`r format(Sys.Date(),'%B %e, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    code_folding: show
    code_download: TRUE
editor_options:
  markdown:
    wrap: 72
bibliography: lit/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Even after feature engineering, machine learning approaches can often (but not always) be improved by choosing a more sophisticated model type. Note how we used a regression model in the first two case studies; here, we explore a considerably more sophisticated model, a random forest.

Choosing a more sophisticated model adds some complexity to the modeling. Notably, more complex models have _tuning parameters_ - parts of the model that are not estimated from the data. In addition to using feature engineering in a way akin to how we did in the last case study, Bertolini et al. (2021) use tuning parameters to improve the performance of their predictive model.

Our driving question is: How much of a difference does a more complex model make? In answering this question, we focus not only on estimating a complex model, but also on tuning it.

We use data from the #NGSSchat community on Twitter so that we can compare the performance of this tuned, complex model to the initial model we used in the first case study.

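To make the idea of a tuning parameter concrete, here is a minimal sketch (not part of the case study code below) contrasting a random forest specification whose tuning parameters are fixed by hand with one whose values are left to be found by tuning. Here, `mtry` is the number of predictors sampled at each split and `min_n` is the minimum number of observations in a terminal node.

```{r, eval = FALSE}
# a random forest with its tuning parameters set by hand
rand_forest(mtry = 3, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# the same model with tune() placeholders; the values are chosen later by
# searching over a grid of candidates (as we do in Step 4 below)
rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```
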
## Step 0: Loading and setting up

First, let's load the packages we'll use---the familiar {tidyverse} and several others focused on modeling. As in earlier learning labs, click the green arrow to run each code chunk.

```{r}
library(tidyverse)
library(here)
library(tidymodels)
library(vip)    # variable importance plots
library(ranger) # the engine we use for random forests
```

Next, we'll load two data sources: one with the tweets, the other with our qualitative codes.

*Note*: We created a means of visualizing the threads to make coding them easier; it also provides a way to see what the raw data is like: https://jmichaelrosenberg.shinyapps.io/ngsschat-shiny/

```{r}
d <- read_rds(here("data", "ngsschat-data.rds"))
codes <- read_csv(here("data", "ngsschat-qualitative-codes.csv"))
```

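Optionally, take a quick look at what was loaded before moving on; `glimpse()` lists each variable along with a few of its values (exactly which columns appear depends on the data files):

```{r, eval = FALSE}
glimpse(d)
glimpse(codes)
```
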
## Step 1: Split data

```{r}
library(tidymodels) # note that {tidymodels} does not load {forcats}, {stringr}, or {readr} from the {tidyverse}
library(readr)
library(here)
library(vip)

# read the data we will model; drop time_spent, which is another continuous outcome
d <- read_csv(here("spring-workshop", "data-to-model.csv"))
d <- select(d, -time_spent)

# split the data into training (70%) and testing (30%) sets
train_test_split <- initial_split(d, prop = .70)
data_train <- training(train_test_split)

# create cross-validation folds from the training data
kfcv <- vfold_cv(data_train)
```

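As an optional sanity check (assuming the chunk above has been run), you can confirm how many rows landed in each partition and inspect the resamples:

```{r, eval = FALSE}
# rows in the training and testing partitions
nrow(training(train_test_split))
nrow(testing(train_test_split))

# the cross-validation folds created from the training data (10 by default)
kfcv
```
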
## Step 2: Engineer features

```{r panel-chunk-2, echo = TRUE, eval = FALSE}
# pre-processing/feature engineering
# an earlier step selected the contextual/demographic and survey variables, e.g.:
# d <- select(d, student_id:final_grade, subject:percomp)

# drop the student identifier before modeling
d <- d %>% select(-student_id)

sci_rec <- recipe(final_grade ~ ., data = d) %>%
  add_role(course_id, new_role = "ID variable") %>%   # the new role can be any string
  step_novel(all_nominal_predictors()) %>%            # handle factor levels not seen in training
  step_normalize(all_numeric_predictors()) %>%        # center and scale numeric predictors
  step_dummy(all_nominal_predictors()) %>%            # dummy-code nominal predictors
  step_nzv(all_predictors()) %>%                      # remove near-zero-variance predictors
  step_impute_knn(all_predictors(), all_outcomes())   # impute missing values with k-nearest neighbors
```

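If you want to see what the recipe actually does to the data, one option (a sketch, assuming the recipe above has been defined) is to `prep()` it and then `bake()` it, which returns the processed training data:

```{r, eval = FALSE}
# estimate the preprocessing steps, apply them to the data the recipe was
# defined with, and peek at the engineered features
sci_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```
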
## Step 3: Specify recipe, model, and workflow

```{r panel-chunk-3, echo = TRUE, eval = FALSE}
# specify model: a random forest with two tuning parameters to be tuned -
# mtry (the number of predictors sampled at each split) and min_n (the minimum node size)
rf_mod_many <-
  rand_forest(mtry = tune(),
              min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% # impurity-based variable importance
  set_mode("regression")

# specify workflow: bundle the model specification with the recipe from Step 2
rf_wf_many <-
  workflow() %>%
  add_model(rf_mod_many) %>%
  add_recipe(sci_rec)
```

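Printing the workflow object is a quick way to confirm that the model and the recipe were combined as intended:

```{r, eval = FALSE}
rf_wf_many
```
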
## Step 4: Fit model

```{r panel-chunk-4, echo = TRUE, eval = FALSE}
# inspect the possible ranges for the tuning parameters, given the training data
finalize(mtry(), data_train)
finalize(min_n(), data_train)

# specify the tuning grid: 10 candidate combinations of mtry and min_n
tree_grid <- grid_max_entropy(mtry(range = c(1, 15)),
                              min_n(range = c(2, 40)),
                              size = 10)

# fit the model for every candidate combination with tune_grid(), using the
# cross-validation resamples from Step 1
tree_res <- rf_wf_many %>%
  tune_grid(
    resamples = kfcv,
    grid = tree_grid,
    metrics = metric_set(rmse, mae, rsq)
  )

# examine the best sets of tuning parameters (consider repeating the tuning
# with a refined grid if these results suggest it)
show_best(tree_res, metric = "rmse", n = 10)

# select the best set of tuning parameters
best_tree <- tree_res %>%
  select_best(metric = "rmse")

# finalize the workflow with the best set of tuning parameters
final_wf <- rf_wf_many %>%
  finalize_workflow(best_tree)

# fit the finalized workflow to the training data and evaluate it on the held-out testing data
final_fit <- final_wf %>%
  last_fit(train_test_split, metrics = metric_set(rmse, mae, rsq))
```

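Beyond `show_best()`, it can help to look at performance across the whole tuning grid. A brief sketch, assuming the chunk above has been run:

```{r, eval = FALSE}
# all resampled performance estimates: one row per combination of tuning
# parameters and metric
collect_metrics(tree_res)

# a quick visual summary of how rmse, mae, and rsq vary with mtry and min_n
autoplot(tree_res)
```
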
## Step 5: Interpret accuracy

```{r panel-chunk-5, echo = TRUE, eval = FALSE}
# variable importance plot: which features contribute most to the predictions?
final_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>% # extract_fit_parsnip() supersedes pull_workflow_fit()
  vip(num_features = 10)

# fit statistics (rmse, mae, and rsq) for the held-out testing data
final_fit %>%
  collect_metrics()

# predictions for the testing data
final_fit %>%
  collect_predictions()
```

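One further way to interpret the model (a sketch, assuming the final fit above) is to plot the test-set predictions against the observed final grades; points close to the dashed diagonal are cases the model predicted accurately:

```{r, eval = FALSE}
final_fit %>%
  collect_predictions() %>%                 # includes .pred and the observed final_grade
  ggplot(aes(x = final_grade, y = .pred)) +
  geom_point(alpha = .5) +                  # one point per observation in the testing set
  geom_abline(linetype = "dashed") +        # the line of perfect prediction
  labs(x = "Observed final grade", y = "Predicted final grade")
```
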
### 🧶 Knit & Check ✅

Congratulations - you've completed the Machine Learning Learning Lab 3 Guided Practice! Consider moving on to the independent practice next.