---
title: 'Learning Lab 1 Case Study'
author: ""
date: "`r format(Sys.Date(),'%B %e, %Y')`"
output:
  html_document:
    toc: yes
    toc_depth: 4
    toc_float: yes
    code_folding: show
    code_download: TRUE
editor_options:
  markdown:
    wrap: 72
bibliography: lit/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
First, please add your name above!
In the overview presentation for this learning lab, we considered five
steps in our supervised machine learning process. Those steps are
mirrored here in this case study---with the addition of a preamble step
whereby we load and process the data. Our driving question is: Can we
predict something we would have coded by hand?
We use the #NGSSchat data set as the context in which we answer this
question. The network analysis of #NGSSchat (Rosenberg et al., 2020)
used the coding frame from van Bommel et al. (2020) (available
[here](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/van-bommel-et-al-2020-tate.pdf))
to characterize the transactional or substantive nature of social
media-based conversations. Notably, Rosenberg et al. coded *a lot* of
data by hand, and it would be quite convenient if that coding could be
automated through supervised machine learning methods. Though this case
study is tied to the #NGSSchat data, you can consider how qualitative
coding you or your colleagues have done could be automated in a similar
manner. In short, again: can we predict something we would, heretofore,
have coded by hand?
That paper, which presents the qualitative coding of a large number of
tweets sent to #NGSSchat, is available
[here](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/rosenberg-et-al-2020-jrst.pdf):

> Rosenberg, J. M., Reid, J. W., Dyer, E. B., Koehler, M. J., Fischer,
> C., & McKenna, T. J. (2020). Idle chatter or compelling conversation?
> The potential of the social media-based #NGSSchat network for
> supporting science education reform efforts. Journal of Research in
> Science Teaching, 57(9), 1322-1355.
> [Link](https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-1/rosenberg-et-al-2020-jrst.pdf).
Conceptually, we focus on prediction and how it differs from the goals
of description or explanation.
## Step 0: Loading and setting up
First, let's load the packages we'll use---the familiar {tidyverse} and
several others focused on modeling. Like in earlier learning labs, click
the green arrow to run the code chunk.
```{r}
library(tidyverse)
library(here)
library(tidymodels)
library(janitor)
```
Next, we'll load the *already-processed* (for this lab) data set that
we'll use for our supervised machine learning modeling.
*Note*: We created a tool for visualizing the threads to make coding
them easier; it also provides a way of seeing what the raw data looks
like: <https://jmichaelrosenberg.shinyapps.io/ngsschat-shiny/>
```{r}
d <- read_csv(here("data", "ngsschat-processed-data.csv"))
d
```
The data set has only five variables:

1.  `n`: The number of tweets in the *thread* (*independent* variable)
2.  `mean_favorite_count`: The mean number of favorites for the tweets
    in the thread (*independent* variable)
3.  `mean_retweet_count`: The mean number of retweets for the tweets in
    the thread (*independent* variable)
4.  `sum_display_text_width`: The sum of the number of characters for
    the tweets in the thread (*independent* variable)
5.  `code`: The qualitative code (TS = transactional; TF =
    transformational) (*dependent* variable)
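Before examining the data more broadly, it can be useful to see how
balanced the two values of `code` are, since class balance affects how
we interpret accuracy later. Below is a minimal, optional sketch; it
assumes only the `d` object loaded above.

```{r}
# Count how many threads received each qualitative code (TS vs. TF)
d %>%
  count(code)
```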
[Your Turn]{style="color: green;"} ⤵
In the chunk below, examine the prepared data using a function or means
of your choice (such as simply *printing* the data set by typing its
name, or using the `glimpse()` function). Note its dimensions,
especially how many rows it has, and then record your observations
after the dashes beneath the chunk.
```{r}
```
*Observations*:
-
**What other variables might we include?** This is a great question to
be asking: surely, the four variables we have cannot be *that*
predictive (right?)! Let's use these few, relatively simple variables
for now, but know that we'll use far more variables when we get to the
third learning lab.
## Step 1: Split data

-   The *training set* is used to estimate, develop, and compare
    models, to engineer features, to tune models, and so on.
-   The *test set* is held in reserve until the end of the project, at
    which point there should only be one or two models under serious
    consideration. It is used as an unbiased source for measuring final
    model performance.
There are different ways to create these partitions of the data and
there is no uniform guideline for determining how much data should be
set aside for testing. The proportion of data can be driven by many
factors, including the size of the original pool of samples and the
total number of predictors.
It is common when beginning a modeling project to [separate the data
set](https://bookdown.org/max/FES/data-splitting.html) into two
partitions. Here, we split the data using our first {tidymodels}
function, `initial_split()`. Why do we choose an 80% split (see
`prop = .80` below)? This reserves a sufficient number of cases for
testing our fitted model later. You can change this proportion if you
wish.
After you decide how much to set aside, the most common approach for
actually partitioning your data is to use a random sample. For our
purposes, we'll use random sampling to select 20% of the data for the
test set and use the remainder for the training set (a 75%/25% split is
the default for the
{[rsample](https://tidymodels.github.io/rsample/)} package).
Additionally, since random sampling uses random numbers, it is important
to set the random number seed. This ensures that the random numbers can
be reproduced at a later time (if needed). We pick the first date on which we may
carry out this learning lab as the seed, but any number will work!
The `initial_split()` function from the {rsample} package takes the
original data and saves the information on how to make the partitions.
The {rsample} package also has two aptly named functions for creating
the training and testing data sets, `training()` and `testing()`,
respectively.
Run the following code to split the data:
```{r}
set.seed(20220212) # set the seed so the random split is reproducible

train_test_split <- initial_split(d, prop = .80) # 80% training, 20% testing
data_train <- training(train_test_split) # extract the training set
```
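Although we won't need it right away (as you'll see, `last_fit()` works
with the split object itself), you could also pull out the held-out
testing partition now. A minimal, optional sketch using the same split
object:

```{r}
# Optional: extract the 20% testing partition from the same split object
data_test <- testing(train_test_split)
```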
Go ahead and type `data_train` and then `d` (in the chunk below, or in
the console) to check that the training set indeed has about 80% as
many observations as the full data set.
[Your Turn]{style="color: green;"} ⤵
```{r}
```
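If you'd like to check the proportion more precisely than by eyeballing
the printed tibbles, one possible sketch (using only objects created
above) is:

```{r}
# This ratio should be close to the prop = .80 we chose above
nrow(data_train) / nrow(d)
```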
## Step 2: Engineer features and write down the recipe
We'll engage in only a very basic step here, writing down a *recipe*;
we'll do *much* more feature engineering in the next learning lab. Read
more about feature engineering
[here](https://www.tmwr.org/recipes.html).

To do this, we introduce another {tidymodels} package,
[recipes](https://recipes.tidymodels.org/), which is designed to help
you prepare your data *before* training your model, in other words, to
engage in *feature engineering*. That's all we'll say about it now;
we'll dive into feature engineering in the third learning lab. For now,
we'll just use the variables as they are, with *no* feature engineering
at this stage.

To get started, let's create a recipe for a simple logistic regression
model.
The [`recipe()`
function](https://recipes.tidymodels.org/reference/recipe.html), as we
use it here, has two arguments:

-   A **formula**. Any variable on the left-hand side of the tilde
    (`~`) is considered the model outcome (`code`, in our present
    case). On the right-hand side of the tilde are the predictors.
    Variables may be listed by name, or you can use the dot (`.`) to
    indicate all other variables as predictors.
-   The **data**. A recipe is associated with the data set used to
    create the model. This will typically be the *training* set, so
    `data = data_train` here. Naming a data set doesn't actually change
    the data itself; it is only used to catalog the names of the
    variables and their types, like factors, integers, dates, etc.
```{r}
my_rec <- recipe(code ~ ., data = data_train)
```
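If you'd like to confirm which variables the recipe treats as the
outcome and which as predictors, printing a summary of the recipe is
one optional way to check (a minimal sketch):

```{r}
# Lists each variable with its type and role (outcome vs. predictor)
summary(my_rec)
```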
## Step 3: Specify the model and workflow
Next, we specify the model:

-   using the `logistic_reg()` function to set the *model*,
-   using `set_engine("glm")` to set the *engine*, and
-   using `set_mode("classification")` to set the *mode* to
    classification; this could be changed to regression for a
    continuous/numeric outcome.
```{r}
# specify model
my_mod <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
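If you're curious how this specification maps onto the underlying call
to `glm()` from the {stats} package, {parsnip}'s `translate()` function
shows the fit template it will use (an optional sketch):

```{r}
# Show the glm() call that parsnip will construct when the model is fit
my_mod %>%
  translate()
```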
## Step 4: Fit model
We will want to use the recipe we created earlier across several steps
as we train and test our model. To simplify this process, we can use a
*model workflow*, which pairs a model and recipe together.

This is a helpful approach because different recipes are often needed
for different models; when a model and recipe are bundled, it becomes
easier to train and test *workflows*.

So, last, we'll put the pieces together: the model and the recipe.
We'll use the {[workflows](https://workflows.tidymodels.org/)} package
from tidymodels to bundle our parsnip model (`my_mod`) with our first
recipe (`my_rec`).
```{r}
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
```
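If you'd like to see what the bundled object contains (the recipe as
the preprocessor and the logistic regression specification as the
model), you can simply print it:

```{r}
# Print the workflow to see its preprocessor and model components
my_wf
```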
Next, we'll fit our model.
```{r, warning = FALSE}
fitted_model <- fit(my_wf, data = data_train)
```
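If you would like to peek at the estimated coefficients of the
underlying logistic regression (optional, and not needed for the
accuracy-focused steps below), one possible sketch is the following;
depending on your {workflows} version, you may need
`pull_workflow_fit()` instead of `extract_fit_parsnip()`.

```{r}
# Pull the fitted parsnip model out of the workflow and tidy its coefficients
fitted_model %>%
  extract_fit_parsnip() %>%
  tidy()
```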
Finally, we'll use the `last_fit()` function, which is the key step
here: note that it takes the `train_test_split` object, not just the
training data. It fits the model *using the training data set* and
evaluates its accuracy using the *testing data set* (which is not used
to train the model).
```{r, include = FALSE}
final_fit <- last_fit(fitted_model, train_test_split)
```
[Your Turn]{style="color: green;"} ⤵
Type `final_fit` below; this is the final, fitted model---one that can
be interpreted further in the next step!
```{r}
```
You may see a message/warning above or when you examine `final_fit`; you
can safely ignore that.
## Step 5: Interpret accuracy
Run the code below to examine the predictions for the *test* split of
the data. Note that a row ID appears in the output below, but it
doesn't correspond one-to-one to the ID variables used in the
presentation/Shiny app.
```{r}
# collect test split predictions
final_fit %>%
collect_predictions()
```
This is our first set of real output! Note two things:

1.  `.pred_class`: This is the *predicted* code
2.  `code`: This is the known code

When these are **the same**, the model predicted the code *correctly*;
when they aren't the same, the model was incorrect.

Importantly, we can *summarize* across all of these codes. One
straightforward way to do this is to ask how many of the predicted and
known codes were the same, as in the following chunk of code:
```{r}
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, code) %>% # just to make the output easier to view
mutate(correct = .pred_class == code) # create a new variable, correct, telling us when the model was and was not correct
```
That's helpful, but there's one more step we can take -- counting up the values of `correct`:
```{r}
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, code) %>% # just to make the output easier to view
mutate(correct = .pred_class == code) %>% # create a new variable, correct, telling us when the model was and was not correct
tabyl(correct)
```
Let's interpret the above. If the value of `correct` is `TRUE` when the predicted and known code are the same, what does the `percent` column tell us? Add one or more notes to the dashes below:
-
A shortcut to all of the above is the `collect_metrics()` function,
used below:
```{r}
final_fit %>%
collect_metrics()
```
**Observations**
-
-
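Beyond overall accuracy, you might also look at a confusion matrix,
which shows where the correct and incorrect predictions fall across the
two codes. A minimal, optional sketch using {yardstick} (loaded with
{tidymodels}):

```{r}
# Cross-tabulate the known codes against the predicted codes for the test set
final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = code, estimate = .pred_class)
```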
That's it for now! The steps you took above contain the core parts of a
supervised machine learning analysis; what we'll do after this learning
lab only adds nuance and complexity to what we've already done.
## 🧶 Knit & Check ✅
Congratulations - you've completed the Machine Learning Learning Lab 1
Case study! Move on to the Independent Practice next.