---
title: "Machine Learning Learning Lab 2"
subtitle: "Overview Presentation"
author: "**Dr. Joshua Rosenberg**"
institute: "LASER Institute"
date: '`r format(Sys.time(), "%B %d, %Y")`'
output:
  xaringan::moon_reader:
    css:
      - default
      - css/laser.css
      - css/laser-fonts.css
    lib_dir: libs                       # creates directory for libraries
    seal: false                         # false: custom title slide
    nature:
      highlightStyle: default           # highlighting syntax for code
      highlightLines: true              # true: enables code line highlighting
      highlightLanguage: ["r"]          # languages to highlight
      countIncrementalSlides: false     # false: disables counting of incremental slides
      ratio: "16:9"                     # 4:3 for standard size, 16:9 for widescreen
      slideNumberFormat: |
        <div class="progress-bar-container">
          <div class="progress-bar" style="width: calc(%current% / %total% * 100%);">
          </div>
        </div>
---
class: clear, title-slide, inverse, center, top, middle
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
```{r, echo=FALSE}
# load (and, via {pacman}, install if needed) all the relevant packages
pacman::p_load(pacman, knitr, tidyverse, readxl)
```
```{r xaringan-panelset, echo=FALSE}
xaringanExtra::use_panelset()
```
```{r xaringanExtra-clipboard, echo=FALSE}
# these allow any code snippets to be copied to the clipboard so they
# can be pasted easily
htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clipboard\"></i>",
    success_text = "<i class=\"fa fa-check\" style=\"color: #90BE6D\"></i>"
  ),
  rmarkdown::html_dependency_font_awesome()
)
```
```{r xaringan-extras, echo=FALSE}
xaringanExtra::use_tile_view()
```
# `r rmarkdown::metadata$title`
----
### `r rmarkdown::metadata$author`
### `r format(Sys.time(), "%B %d, %Y")`
---
# Background
- Our model does well, but we want to make it even better
- Here, our aim is to fit many models using the training data and see how well they do on the test data; but, if we do this more than once, we may over-fit to that particular sample
- A way around this is resampling, through *k*-fold cross-validation
- Driving question: What's the best model I can create to predict students' chance of passing a course?
---
# Agenda
.pull-left[
## Part 1: Core Concepts
- Feature engineering
- Cross-validation
]
.pull-right[
## Part 2: R Code Examples
- MVS data
- comparing fit measures over time
]
---
class: clear, inverse, center, middle
# Core Concepts
---
# Feature engineering
- In the last learning lab, we did a nice job of training a model that was _pretty good_
- But, can we do better?
- This question motivates this learning lab, specifically:
- Answering it
- And, developing a _means_ to answer it in a way that does not introduce bias into our analysis
- Feature engineering is a key way we can improve our model
- Feature engineering refers to the part of the machine learning process in which we decide which variables to include, and in what form
---
# Why feature engineering matters
- Let's consider a very simple data set, one with _time series_ (or longitudinal) data for a single student
```{r, include = FALSE}
d <- tibble(student_id = "janyia", time_point = 1:10, var_a = c(0.01, 0.32, 0.32, 0.34, 0.04, 0.54, 0.56, 0.75, 0.63, 0.78))
```
```{r}
d
```
- **How can we include a variable, `var_a`, in a machine learning model?**
---
# How can we include a single variable?
Let's take a look at the variable
```{r}
d %>%
ggplot(aes(x = time_point, y = var_a)) +
geom_point()
```
- Well, what if we just added the values of this variable directly?
- But, that would ignore that they are measured at different time points
- We could include the time point variable, but that is (about) the same for every student
*What are some other ideas?*
---
# A few options
- Raw data points
- Their mean
- Their maximum
- Their variability (standard deviation)
- Their linear slope
- Their quadratic slope
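
Below is a minimal sketch of how these features might be computed with {dplyr} (loaded above via {tidyverse}), using the small `d` tibble from the previous slide; the column names are illustrative, not from the lab code:

```{r, echo = TRUE, eval = FALSE}
# collapse the time series into one row of engineered features per student
d %>%
  group_by(student_id) %>%
  summarize(
    var_a_mean  = mean(var_a),                    # their mean
    var_a_max   = max(var_a),                     # their maximum
    var_a_sd    = sd(var_a),                      # their variability
    var_a_slope = coef(lm(var_a ~ time_point))[2] # their linear slope
  )
```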
---
# What else should we consider?
## For other types of variables
- Categorical variables
  - Combining categories
- Numeric variables
  - Normalizing ("standardizing")
## For all variables
- Removing those with "near-zero variance"
- Removing ID variables and others that _should not be_ informative
- Imputing missing values
- _Many_ other options
---
# How to do this?
- We can do all of these things manually
- But, there are also helpful functions to do this, e.g., from the {recipes} package:
  - `step_dummy()` (dummy-codes categorical variables)
  - `step_normalize()` (centers and scales numeric variables)
  - `step_nzv()` (removes near-zero-variance predictors)
  - `step_impute_knn()` (imputes missing values using k-nearest neighbors)
- We'll work on this more in the code examples
---
# The importance of training data
- Training data is what the model uses to "learn" from the data
- A "check" on your work is your model's predictions on _test_ set data
- *Training data*: Coded/outcome data that you use to train ("estimate") your model
- *Validation data<sup>1</sup>*: Data you use to select a particular algorithm
- *Test ("hold-out") data*: Data that you do not use in any way to train your model
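
As a minimal sketch of these three roles, assuming a recent version of {rsample} (`initial_validation_split()` was added in rsample 1.2.0) and illustrative proportions:

```{r, echo = TRUE, eval = FALSE}
library(rsample)

# a 60/20/20 train/validation/test split of the data `d`
split <- initial_validation_split(d, prop = c(0.6, 0.2))

data_train      <- training(split)   # for estimating the model
data_validation <- validation(split) # for selecting a particular algorithm
data_test       <- testing(split)    # held out until the very end
```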
---
# A conundrum
- A challenge arises when we wish to use our training data _more than once_
- If we repeatedly train an algorithm on the same data and then tweak it based on its performance on the test data, we may be tailoring our model to specific features of that test data
- As a result, our performance on _new_ test data may be poorer than it seems
- This is a _very common and pervasive problem_ in machine learning applications
- How do we resolve this conundrum?
---
# Resampling
- Resampling involves blurring the boundaries between training and testing data
- Specifically, it involves combining these two portions of our data into one, iteratively considering:
- Some of the data to be for "training"
- Some for "testing"
- Then, fit measures are averaged over these different samples
- The problem this solves is that of learning the specifics of a single test set, since the data treated as "testing" varies across the samples
---
# *k*-fold cross-validation
- One of the simplest forms of resampling is *k*-fold cross-validation
- Here, some of the data is considered to be a part of the *training* set
- The remaining data is a part of the *testing* set
- This process is repeated for different subsets of the data
- How many subsets (samples taken through resampling)?
  - This is determined by _k_, the number of times the data is resampled
---
# Let's consider an example
Say we have a data set, `d`, with 100 observations (rows or data points).

Let's say we determine that _k_ is set to 10.
```{r, include = FALSE}
d <- tibble(id = 1:100, var_a = runif(100), var_b = runif(100))
```
```{r}
d
```
**Using _k_ = 10, how can we split this data into ten distinct training and testing sets?**
---
# Considering *k*-fold cross-validation
## First resampling
```{r}
train <- d[1:90, ]
test <- d[91:100, ]
nrow(train)
nrow(test)
# then, train the model (using train) and calculate a fit measure (using test)
```
---
## Second resampling
```{r}
train <- d[c(1:80, 91:100), ]
test <- d[81:90, ]
nrow(train)
nrow(test)
# then, train the model (using train) and calculate a fit measure (using test)
```
---
## Third resampling
```{r}
train <- d[c(1:70, 81:100), ]
test <- d[71:80, ]
nrow(train)
nrow(test)
# then, train the model (using train) and calculate a fit measure (using test)
```
... through the tenth resampling, after which the fit measures are simply _averaged_

That's it! Thankfully, we have automated tools for this that we'll work with in the code examples
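
As a sketch of those tools, the ten manual splits above can be produced in one step with `vfold_cv()` from {rsample} (the seed is illustrative):

```{r, echo = TRUE, eval = FALSE}
library(rsample)

set.seed(2022)               # illustrative seed, for reproducible folds
folds <- vfold_cv(d, v = 10) # ten folds, like the manual splits above
folds
```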
---
# But how do you determine what _k_ should be?
- A _historically common value_ for _k_ has been 10
- But, as computers have grown in processing power, setting _k_ equal to the number of rows in the data (leave-one-out cross-validation; `loo_cv()` in {rsample}) has become more common
---
class: clear, inverse, center, middle
# Code Examples
---
# Data from online science classes
- This data comes from a study of ~700 secondary/high school-aged students in the United States
- These were "one-off" classes that helped to fill gaps in students' schedules
- The data were collected _over multiple semesters_ from _multiple classes_
- There are a number of types of variables:
  - Demographic/contextual variables, e.g., `subject` and `gender`
  - Self-report survey variables: `uv` (utility value), `percomp` (perceived competence), and `int` (interest)
  - Gradebook variables: `percentage_earned` (based on the first 20 assignments)
  - Discussion variables: `sum_discussion_posts` and `sum_n_words` (for the first 3 discussions)
  - Outcomes: `final_grade`, `passing_grade`, and `time_spent` (in minutes)
---
# Let's take a look at the data
```{r, include = FALSE}
d <- read_csv("data/data-to-model.csv")
d <- select(d, -time_spent) # this is another outcome
d <- mutate(d, passing_grade = ifelse(final_grade > 70, 1, 0),
passing_grade = as.factor(passing_grade)) %>%
select(-final_grade)
```
```{r}
d %>%
glimpse()
```
---
class: clear, inverse, center, middle
# Code Examples
*Note how these steps are the same as in the classification example for learning lab 1*
---
# Let's walk through a few steps
.panelset[
.panel[.panel-name[1]
**Split data**
```{r panel-chunk-1, echo = TRUE, eval = FALSE}
library(tidymodels) # loads {rsample}, {recipes}, {parsnip}, {workflows}, and {tune}

train_test_split <- initial_split(d, prop = .70) # d was prepared on the data slide above
data_train <- training(train_test_split)

kfcv <- vfold_cv(data_train) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
```
]
.panel[.panel-name[2]
**Engineer features**
```{r panel-chunk-2, echo = TRUE, eval = FALSE}
sci_rec <- recipe(passing_grade ~ ., data = d) %>%
  add_role(student_id, course_id, new_role = "ID variable") %>% # the new role can be any string
  step_novel(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_impute_knn(all_predictors(), all_outcomes())
```
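
To peek at what these steps actually do, one option (a hedged aside, not part of the lab code) is to `prep()` the recipe and `bake()` it on the training data:

```{r, echo = TRUE, eval = FALSE}
# apply the recipe and inspect the engineered features
sci_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```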
]
.panel[.panel-name[3]
**Specify recipe, model, and workflow**
```{r panel-chunk-3, echo = TRUE, eval = FALSE}
# specify model
rf_mod_many <-
  rand_forest(mtry = tune(),
              min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") # note this difference!

# specify workflow
rf_wf_many <-
  workflow() %>%
  add_model(rf_mod_many) %>%
  add_recipe(sci_rec)
```
]
.panel[.panel-name[4]
**Fit model**
```{r panel-chunk-4, echo = TRUE, eval = FALSE}
# tune_grid() (not fit_resamples()) is needed because the model has tune() parameters
tree_res <- tune_grid(rf_wf_many, resamples = kfcv)
# finalize with the best candidate, then train on the full training set
# and evaluate on the held-out test set
final_wf <- finalize_workflow(rf_wf_many, select_best(tree_res, metric = "roc_auc"))
final_fit <- last_fit(final_wf, train_test_split,
                      metrics = metric_set(roc_auc, accuracy, kap,
                                           sensitivity, specificity, precision))
```
]
.panel[.panel-name[5]
**Evaluate accuracy**
```{r panel-chunk-5, echo = TRUE, eval = FALSE}
# fit stats
final_fit %>%
collect_metrics()
```
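
`collect_metrics()` can also be called on `tree_res` itself; there it averages each fit measure over the *k* folds (one row per metric per candidate), which is the averaging described earlier:

```{r, echo = TRUE, eval = FALSE}
# fold-averaged fit statistics from the resampling results
tree_res %>%
  collect_metrics()
```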
]
]
---
# In the remainder of this learning lab, you'll dive deeper into resampling
- **Guided walkthrough**: You'll run all of this code, focusing on manually and then automatically calculating fit measures across the different samples
- **Independent practice**: In the independent practice, you'll explore a different kind of resampling, the bootstrap
- **Readings**: Several readings on using log-trace data like that used in this lab
---
class: clear, center
## .font130[.center[**Thank you!**]]
<br/>
.center[<img style="border-radius: 80%;" src="img/jr-cycling.jpeg" height="200px"/><br/>**Dr. Joshua Rosenberg**<br/><mailto:[email protected]>]