---
title: "Machine Learning Learning Lab 4"
subtitle: "Overview Presentation"
author: "**Dr. Joshua Rosenberg**"
institute: "LASER Institute"
date: '`r format(Sys.time(), "%B %d, %Y")`'
output:
xaringan::moon_reader:
css:
- default
- css/laser.css
- css/laser-fonts.css
lib_dir: libs # creates directory for libraries
seal: false # false: custom title slide
nature:
highlightStyle: default # highlighting syntax for code
highlightLines: true # true: enables code line highlighting
highlightLanguage: ["r"] # languages to highlight
countIncrementalSlides: false # false: disables counting of incremental slides
ratio: "16:9" # 4:3 for standard size,16:9
slideNumberFormat: |
<div class="progress-bar-container">
<div class="progress-bar" style="width: calc(%current% / %total% * 100%);">
</div>
</div>
---
class: clear, title-slide, inverse, center, top, middle
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
```{r, echo=FALSE}
# load all the relevant packages
pacman::p_load(pacman, knitr, tidyverse, readxl)
```
```{r xaringan-panelset, echo=FALSE}
xaringanExtra::use_panelset()
```
```{r xaringanExtra-clipboard, echo=FALSE}
# enable a copy-to-clipboard button on code chunks
xaringanExtra::use_clipboard()
```
# `r rmarkdown::metadata$title`
----
### `r rmarkdown::metadata$author`
### `r format(Sys.time(), "%B %d, %Y")`
---
# Background
- Until now, we've used coded data to _train_ an algorithm
- In short, we've used _supervised_ machine learning
- But, we may not yet have codes; what options do we have in such situations?
- We can turn to _unsupervised_ machine learning methods
- We'll use ASSISTments data to do so
---
# Agenda
.pull-left[
## Part 1: Core Concepts
- Determining the number of groups/codes in the data
- Interpreting the groups
- Computational Grounded Theory
]
.pull-right[
## Part 2: R Code-Along
- ASSISTments data
- The tidyLPA package
]
---
class: clear, inverse, center, middle
# Core Concepts
---
# Unsupervised ML
- Does not require coded data; one way to think about unsupervised ML is that its purpose is to discover codes/labels
- Is used to discover groups among observations/cases or to summarize across variables
- Can be used in an _exploratory mode_ (see [Nelson, 2020](https://journals.sagepub.com/doi/full/10.1177/0049124118769114?casa_token=EV5XH31qbyAAAAAA%3AFg09JQ1XHOOzlxYT2SSJ06vZv0jG-s4Qfz8oDIQwh2jrZ-jrHNr7xZYL2FwnZtZiokhPalvV1RL2Bw))
- **Warning**: The results of unsupervised ML _cannot_ directly be used to provide codes/outcomes for supervised ML techniques
- Can work with both continuous and dichotomous or categorical variables
- Algorithms include:
  - Cluster analysis (a quick sketch follows on the next slide)
  - [Principal Components Analysis (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
  - Latent Dirichlet Allocation (topic modeling)
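---
# A quick unsupervised sketch

As a minimal sketch of what "no codes/labels" means in practice (an aside, not part of this lab's pipeline), base R's `kmeans()` can group observations using only the variables themselves; the built-in `iris` data stand in for real data here:

```{r, eval = FALSE, echo = TRUE}
# cluster observations on two numeric variables, with no labels supplied
set.seed(2022) # k-means starts from random centers; fix the seed to reproduce
km <- iris %>%
  select(Sepal.Length, Petal.Length) %>%
  kmeans(centers = 3) # ask for three groups

km$cluster # the "discovered" group label for each observation
```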
---
# What technique should I choose?
Do you not yet have codes/outcomes -- and do you want them?
- _Achieve a starting point_ for qualitative coding, perhaps in a ["computational grounded theory"](https://journals.sagepub.com/doi/full/10.1177/0049124117729703) mode?
- _Discover groups or patterns in your data_ that may be of interest?
- _Reduce the number of variables in your dataset_ to a smaller set of variables that is perhaps nearly as explanatory/predictive?
<h4><center>Unsupervised methods may be helpful</center></h4>
---
# Range of data
- We can use unsupervised machine learning methods with a range of data types
- Structured data:
- Numeric data
- Categorical data
- Unstructured data:
- Text
- Images
- Video
**We'll focus here on structured, numeric data**
---
# LPA
- Latent Profile Analysis can be considered an unsupervised machine learning method suited to the analysis of structured, numeric data
- It is closely related to other _mixture_ models, such as Latent Class Analysis (for categorical data)
- Historically, it has been common for educational researchers (and psychologists) to estimate such models using proprietary software (Mplus)
- But, a widely-used R package is now available: [tidyLPA](https://data-edu.github.io/tidyLPA/index.html)
*We'll use tidyLPA for this learning lab*
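tidyLPA is on CRAN, so installation is a one-time step:

```{r, eval = FALSE, echo = TRUE}
# one-time installation; afterwards, load the package as usual
install.packages("tidyLPA")
library(tidyLPA)
```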
---
# Computational Grounded Theory
- To draw a connection between LPA and machine learning, we'll consider its use as part of a broader frame: Computational Grounded Theory
- Laura Nelson developed this approach in a pioneering paper ([Nelson, 2020](https://journals.sagepub.com/doi/full/10.1177/0049124117729703))
- It involves three steps:
1. Unsupervised machine learning to _explore_ the data
2. Careful qualitative analysis of the raw data _and_ the output from step 1
3. Validation of the revised codes that result from steps 1 and 2
---
# Nelson's approach
```{r, echo = FALSE}
# Figure 2 from Nelson (2020)
knitr::include_graphics("https://journals.sagepub.com/na101/home/literatum/publisher/sage/journals/content/smra/2020/smra_49_1/0049124117729703/20200108/images/medium/10.1177_0049124117729703-fig2.gif")
```
---
# An example in science education research
- In [Rosenberg and Krist (2020)](https://link.springer.com/article/10.1007/s10956-020-09862-4), students' written responses served as the raw (unstructured) data
- Though a coding frame existed, it was not well suited to the specific data that was available
- At the same time, there was _a lot_ of data available
- The three steps of computational grounded theory were carried out:
1. Unsupervised exploration of the textual data (topic modeling)
2. Careful qualitative analysis/reading of the textual data and the topics
3. Validation (using supervised machine learning methods)
---
# Using Computational Grounded Theory
- We can consider the results of LPA not as a _final_ step, but as an _initial_ step in the analysis
- After step 1 of computational grounded theory, the codes can be interrogated more deeply using _qualitative_ methods
- Then, the resulting codes can be validated:
- Expert review
- Referral to criterion/varied sources of validity evidence
- Supervised machine learning methods
---
# Back to LPA
- There are several key steps in LPA:
1. Choosing which variables to include
1. Determining the number of profiles
1. Interpreting the profiles
**We'll explore each of these in the context of the ASSISTments data next**
---
# ASSISTments
- We will use a portion of the ASSISTments data from a [data mining competition](https://sites.google.com/view/assistmentsdatamining/home?authuser=0), identifying groups/patterns in learners' interactions with ASSISTments
- [This paper](https://educationaldatamining.org/EDM2014/uploads/procs2014/short%20papers/276_EDM-2014-Short.pdf) provides helpful context
---
# ASSISTments
```{r, echo = FALSE, message = FALSE}
library(readr)
library(dplyr)
# read the competition data and keep ten numeric indicators of learners' interactions
d <- read_csv("data/dat_csv_combine_final_full.csv") %>%
select(AveCarelessness, AveKnow, AveCorrect = AveCorrect.x, AveResBored,
AveResEngcon, AveResConf, AveResFrust, AveResOfftask,
AveResGaming, NumActions) %>%
  janitor::clean_names() # convert column names to snake_case
d
```
---
class: clear, inverse, center, middle
# Code Examples
---
# Estimating 3 groups
We can see all three steps (selecting variables, estimating profiles, and interpreting) in this code chunk
```{r, eval = FALSE, echo = TRUE}
library(tidyLPA)
library(dplyr)
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>% # select three variables
estimate_profiles(3) %>% # estimate 3 profiles
  plot_profiles() # interpret
```
---
# Estimating 3 groups
```{r plot-3, echo = FALSE, message = FALSE, warning = FALSE}
library(tidyLPA)
library(dplyr)
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>% # select three variables
estimate_profiles(3) %>% # estimate 3 profiles
plot_profiles() # interpret
```
---
# Interpreting
- How might we interpret these profiles?
- Let's look at the raw data (see `get_data()`; a sketch follows below)
- How might we validate them?
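To make "look at the raw data" concrete, here is a minimal sketch (reusing the `pisaUSA15` subset from the previous slides) that stores the fitted model and pulls out the class assignments:

```{r, eval = FALSE, echo = TRUE}
# store the fitted three-profile model instead of piping straight to a plot
m3 <- pisaUSA15[1:100, ] %>%
  select(broad_interest, enjoyment, self_efficacy) %>%
  estimate_profiles(3)

get_data(m3) # raw data plus posterior probabilities and the assigned Class
```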
---
# Comparing profiles
```{r, echo = TRUE, eval = FALSE}
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(1:3,
variances = c("equal", "varying"),
covariances = c("zero", "varying")) %>%
compare_solutions(statistics = c("AIC", "BIC"))
```
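Here, `estimate_profiles(1:3, ...)` fits one- through three-profile models, and the `variances`/`covariances` arguments request several parameterizations (e.g., equal vs. varying variances across profiles); `compare_solutions()` then compares them all.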
---
# Comparing profiles
```{r, echo = FALSE, eval = TRUE}
pisaUSA15[1:100, ] %>%
select(broad_interest, enjoyment, self_efficacy) %>%
single_imputation() %>%
estimate_profiles(1:3,
variances = c("equal", "varying"),
covariances = c("zero", "varying")) %>%
compare_solutions(statistics = c("AIC", "BIC"))
```
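Lower AIC and BIC values indicate better fit (both penalize model complexity), so `compare_solutions()` favors solutions that balance fit against parsimony.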
---
# Other functions for working with LPA output
```{r, eval = FALSE}
get_estimates(m3) # parameter estimates for each profile
get_data(m3) # data with class assignments and posterior probabilities
get_fit(m3) # fit statistics (e.g., AIC, BIC)
```
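Here `m3` is a fitted model object, like the three-profile model stored earlier; `get_fit()` returns the same fit indices (AIC, BIC, and others) that `compare_solutions()` draws on.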
---
# In the remainder of this learning lab, you'll dive deeper into this approach
- **Guided walkthrough**: carrying out an LPA
- **Independent practice**: your own data
- **Readings**: Nelson's computational grounded theory paper
---
class: clear, center
## .font130[.center[**Thank you!**]]
<br/>
.center[<img style="border-radius: 80%;" src="img/jr-cycling" height="200px"/><br/>**Dr. Joshua Rosenberg**<br/><mailto:[email protected]>]