---
title: 'Learning Lab 1 Case Study'
author: ""
date: "`r format(Sys.Date(),'%B %e, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    code_folding: show
    code_download: TRUE
editor_options:
  markdown:
    wrap: 72
bibliography: lit/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
First, please add your name above!
In the overview presentation for this learning lab, we considered five
steps in our supervised machine learning process. Those steps are
mirrored here in this case study---with the addition of a preamble step
whereby we load and process the data. Our driving question is: Can we
predict something we would have coded by hand?
We use the #NGSSchat data set as the context in which we answer this
question. The network analysis of #NGSSchat (Rosenberg et al., 2020)
used the coding frame from van Bommel et al. (2020) (available
[here](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/van-bommel-et-al-2020-tate.pdf))
to characterize the transactional or substantive nature of social
media-based conversations. Notably, Rosenberg et al. coded *a lot* of
data by hand, and it would be quite convenient if that coding could be
automated through supervised machine learning methods. Though this case
study is tied to the #NGSSchat data, you can consider how qualitative
coding you or your colleagues have done could be automated in a similar
manner. In short, again: can we predict something we would, heretofore,
have coded by hand?
That paper, which presents the qualitative coding of a large number of
tweets sent to #NGSSchat, is available
[here](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/rosenberg-et-al-2020-jrst.pdf):

> Rosenberg, J. M., Reid, J. W., Dyer, E. B., Koehler, M. J., Fischer,
> C., & McKenna, T. J. (2020). Idle chatter or compelling conversation?
> The potential of the social media-based #NGSSchat network for
> supporting science education reform efforts. Journal of Research in
> Science Teaching, 57(9), 1322-1355.
> [Link](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/rosenberg-et-al-2020-jrst.pdf).
Conceptually, we focus on prediction and how it differs from the goals
of description or explanation.
## Step 0: Loading and setting up
First, let's load the packages we'll use---the familiar {tidyverse} and
several others focused on modeling. Like in earlier learning labs, click
the green arrow to run the code chunk.
```{r}
library(tidyverse)
library(here)
library(tidymodels)
library(janitor)
```
Next, we'll load the *already-processed* (for this lab) data set that
we'll use for our supervised machine learning modeling.
*Note*: We created a tool for visualizing the threads to make coding
them easier; it also provides a way of seeing what the raw data looks
like: <https://jmichaelrosenberg.shinyapps.io/ngsschat-shiny/>
```{r}
d <- read_csv(here("data", "ngsschat-processed-data.csv"))
d
```
The data set has only five variables:

1.  `n`: The number of tweets in the *thread* (*independent* variable)
2.  `mean_favorite_count`: The mean number of favorites for the tweets
    in the thread (*independent* variable)
3.  `mean_retweet_count`: The mean number of retweets for the tweets in
    the thread (*independent* variable)
4.  `sum_display_text_width`: The sum of the number of characters for
    the tweets in the thread (*independent* variable)
5.  `code`: The qualitative code (TS = transactional; TF =
    transformational) (*dependent* variable)
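Before examining the data more broadly, it can be useful to see how
balanced the two values of `code` are, since class balance affects how
we interpret accuracy later. Below is a minimal, optional sketch; it
assumes only the `d` object loaded above.

```{r}
# Count how many threads received each qualitative code (TS vs. TF)
d %>%
  count(code)
```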
[Your Turn]{style="color: green;"} ⤵
In the chunk below, examine the prepared data using a function or means
of your choice (such as simply *printing* the data set by typing its
name, or using the `glimpse()` function). Note its dimensions,
especially how many rows it has, and then record your observations
after the dashes beneath the chunk.
```{r}
```
*Observations*:
-
**What other variables might we include?** This is a great question to
be asking: surely, the four variables we have cannot be *that*
predictive (right?)! Let's use these few, relatively simple variables
for now, but know that we'll use far more variables when we get to the
third learning lab.
## Step 1: Split data

-   The *training set* is used to estimate, develop, and compare
    models, to engineer features, to tune models, and so on.
-   The *test set* is held in reserve until the end of the project, at
    which point there should only be one or two models under serious
    consideration. It is used as an unbiased source for measuring final
    model performance.
There are different ways to create these partitions of the data and
there is no uniform guideline for determining how much data should be
set aside for testing. The proportion of data can be driven by many
factors, including the size of the original pool of samples and the
total number of predictors.
It is common when beginning a modeling project to [separate the data
set](https://bookdown.org/max/FES/data-splitting.html) into two
partitions. Here, we split the data using our first {tidymodels}
function, `initial_split()`. Why do we choose an 80% split (see
`prop = .80` below)? This reserves a sufficient number of cases for
testing our fitted model later. You can change this proportion if you
wish.
After you decide how much to set aside, the most common approach for
actually partitioning your data is to use a random sample. For our
purposes, we'll use random sampling to select 20% of the data for the
test set and use the remainder for the training set (a 75%/25% split is
the default for the
{[rsample](https://tidymodels.github.io/rsample/)} package).
Additionally, since random sampling uses random numbers, it is important
to set the random number seed. This ensures that the random numbers can
be reproduced at a later time (if needed). We pick the first date on which we may
carry out this learning lab as the seed, but any number will work!
The `initial_split()` function from the {rsample} package takes the
original data and saves the information on how to make the partitions.
The {rsample} package also has two aptly named functions for creating
the training and testing data sets, `training()` and `testing()`,
respectively.
Run the following code to split the data:
```{r}
set.seed(20220212) # set the seed so the random split is reproducible

train_test_split <- initial_split(d, prop = .80) # 80% training, 20% testing
data_train <- training(train_test_split) # extract the training set
```
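Although we won't need it right away (as you'll see, `last_fit()` works
with the split object itself), you could also pull out the held-out
testing partition now. A minimal, optional sketch using the same split
object:

```{r}
# Optional: extract the 20% testing partition from the same split object
data_test <- testing(train_test_split)
```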
Go ahead and type `data_train` and then `d` (in the chunk below, or in
the console) to check that the training set indeed has about 80% as
many observations as the full data set.
[Your Turn]{style="color: green;"} ⤵
```{r}
```
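If you'd like to check the proportion more precisely than by eyeballing
the printed tibbles, one possible sketch (using only objects created
above) is:

```{r}
# This ratio should be close to the prop = .80 we chose above
nrow(data_train) / nrow(d)
```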
## Step 2: Engineer features and write down the recipe
We'll engage in only a very basic step here, writing down a *recipe*;
we'll do *much* more feature engineering in the next learning lab. Read
more about feature engineering
[here](https://www.tmwr.org/recipes.html).

To do this, we introduce another {tidymodels} package,
[recipes](https://recipes.tidymodels.org/), which is designed to help
you prepare your data *before* training your model, in other words, to
engage in *feature engineering*. That's all we'll say about it now;
we'll dive into feature engineering in the third learning lab. For now,
we'll just use the variables as they are, with *no* feature engineering
at this stage.

To get started, let's create a recipe for a simple logistic regression
model.
The [`recipe()`
function](https://recipes.tidymodels.org/reference/recipe.html), as we
use it here, has two arguments:

-   A **formula**. Any variable on the left-hand side of the tilde
    (`~`) is considered the model outcome (`code`, in our present
    case). On the right-hand side of the tilde are the predictors.
    Variables may be listed by name, or you can use the dot (`.`) to
    indicate all other variables as predictors.
-   The **data**. A recipe is associated with the data set used to
    create the model. This will typically be the *training* set, so
    `data = data_train` here. Naming a data set doesn't actually change
    the data itself; it is only used to catalog the names of the
    variables and their types, like factors, integers, dates, etc.
```{r}
my_rec <- recipe(code ~ ., data = data_train)
```
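If you'd like to confirm which variables the recipe treats as the
outcome and which as predictors, printing a summary of the recipe is
one optional way to check (a minimal sketch):

```{r}
# Lists each variable with its type and role (outcome vs. predictor)
summary(my_rec)
```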
## Step 3: Specify the model and workflow
Next, we specify the model:

-   using the `logistic_reg()` function to set the *model*,
-   using `set_engine("glm")` to set the *engine*, and
-   using `set_mode("classification")` to set the *mode* to
    classification; this could be changed to regression for a
    continuous/numeric outcome.
```{r}
# specify model
my_mod <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
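If you're curious how this specification maps onto the underlying call
to `glm()` from the {stats} package, {parsnip}'s `translate()` function
shows the fit template it will use (an optional sketch):

```{r}
# Show the glm() call that parsnip will construct when the model is fit
my_mod %>%
  translate()
```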
## Step 4: Fit model
We will want to use the recipe we created earlier across several steps
as we train and test our model. To simplify this process, we can use a
*model workflow*, which pairs a model and recipe together.

This is a helpful approach because different recipes are often needed
for different models; when a model and recipe are bundled, it becomes
easier to train and test *workflows*.

So, last, we'll put the pieces together: the model and the recipe.
We'll use the {[workflows](https://workflows.tidymodels.org/)} package
from tidymodels to bundle our parsnip model (`my_mod`) with our first
recipe (`my_rec`).
```{r}
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
```
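If you'd like to see what the bundled object contains (the recipe as
the preprocessor and the logistic regression specification as the
model), you can simply print it:

```{r}
# Print the workflow to see its preprocessor and model components
my_wf
```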
Next, we'll fit our model.
```{r, warning = FALSE}
fitted_model <- fit(my_wf, data = data_train)
```
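If you would like to peek at the estimated coefficients of the
underlying logistic regression (optional, and not needed for the
accuracy-focused steps below), one possible sketch is the following;
depending on your {workflows} version, you may need
`pull_workflow_fit()` instead of `extract_fit_parsnip()`.

```{r}
# Pull the fitted parsnip model out of the workflow and tidy its coefficients
fitted_model %>%
  extract_fit_parsnip() %>%
  tidy()
```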
Finally, we'll use the `last_fit()` function, which is the key step
here: note that it takes the `train_test_split` object, not just the
training data. It fits the model *using the training data set* and
evaluates its accuracy using the *testing data set* (which is not used
to train the model).
```{r, include = FALSE}
final_fit <- last_fit(fitted_model, train_test_split)
```
[Your Turn]{style="color: green;"} ⤵
Type `final_fit` below; this is the final, fitted model---one that can
be interpreted further in the next step!
```{r}
```
You may see a message/warning above or when you examine `final_fit`; you
can safely ignore that.
## Step 5: Interpret accuracy
Run the code below to examine the predictions for the *test* split of
the data. Note that a row ID appears in the output below, but it
doesn't correspond one-to-one to the ID variables used in the
presentation/Shiny app.
```{r}
# collect test split predictions
final_fit %>%
collect_predictions()
```
This is our first set of real output! Note two things:

1.  `.pred_class`: This is the *predicted* code
2.  `code`: This is the known code

When these are **the same**, the model predicted the code *correctly*;
when they aren't the same, the model was incorrect.

Importantly, we can *summarize* across all of these codes. One
straightforward way to do this is to ask how many of the predicted and
known codes were the same, as in the following chunk of code:
```{r}
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, code) %>% # just to make the output easier to view
mutate(correct = .pred_class == code) # create a new variable, correct, telling us when the model was and was not correct
```
That's helpful, but there's one more step we can take -- counting up the values of `correct`:
```{r}
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, code) %>% # just to make the output easier to view
mutate(correct = .pred_class == code) %>% # create a new variable, correct, telling us when the model was and was not correct
tabyl(correct)
```
Let's interpret the above. If the value of `correct` is `TRUE` when the predicted and known code are the same, what does the `percent` column tell us? Add one or more notes to the dashes below:
-
A shortcut to all of the above is the `collect_metrics()` function,
used below:
```{r}
final_fit %>%
collect_metrics()
```
**Observations**
-
-
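Beyond overall accuracy, you might also look at a confusion matrix,
which shows where the correct and incorrect predictions fall across the
two codes. A minimal, optional sketch using {yardstick} (loaded with
{tidymodels}):

```{r}
# Cross-tabulate the known codes against the predicted codes for the test set
final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = code, estimate = .pred_class)
```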
That's it for now! The steps you took above contain the core parts of a
supervised machine learning analysis; what we'll do after this learning
lab only adds nuance and complexity to what we've already done.
## 🧶 Knit & Check ✅
Congratulations - you've completed the Machine Learning Learning Lab 1
Case study! Move on to the Independent Practice next.