<!DOCTYPE html>
<html lang="en" xml:lang="en">
<head>
<title>Machine Learning Learning Lab 3</title>
<meta charset="utf-8" />
<meta name="author" content="Dr. Joshua Rosenberg" />
<meta name="date" content="2022-07-01" />
<script src="libs/header-attrs-2.14/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link href="libs/panelset-0.2.6/panelset.css" rel="stylesheet" />
<script src="libs/panelset-0.2.6/panelset.js"></script>
<script src="libs/clipboard-2.0.6/clipboard.min.js"></script>
<link href="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.css" rel="stylesheet" />
<script src="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.js"></script>
<script>window.xaringanExtraClipboard(null, {"button":"<i class=\"fa fa-clipboard\"><\/i>","success":"<i class=\"fa fa-check\" style=\"color: #90BE6D\"><\/i>","error":"Press Ctrl+C to Copy"})</script>
<link href="libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />
<link href="libs/tile-view-0.2.6/tile-view.css" rel="stylesheet" />
<script src="libs/tile-view-0.2.6/tile-view.js"></script>
<link rel="stylesheet" href="css/laser.css" type="text/css" />
<link rel="stylesheet" href="css/laser-fonts.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: clear, title-slide, inverse, center, top, middle
# Machine Learning Learning Lab 3
----
### **Dr. Joshua Rosenberg**
### July 01, 2022
---
# Background
- Once we've processed our variables in new ways and improved our model, we want to select the best model
- Arriving at the best model involves two related steps:
    - tuning: choosing values for parameters that cannot be estimated from the data
    - selection: comparing candidate models and picking the best performer
- Focus: specifying a model, setting its tuning parameters, and picking the best-performing model
- We'll focus on random forest models as the context for this, though the points apply to any complex algorithm
---
# Agenda
.pull-left[
## Part 1: Core Concepts
- tuning
- model selection
]
.pull-right[
## Part 2: R Code-Along
- NGSS data with transactional and substantive codes for one year (again)
- Different ways of estimating tuning parameters
]
---
class: clear, inverse, center, middle
# Core Concepts
---
# How do I select a model?
One general principle is to **start with the simplest useful model** and to _build toward
more complex models as helpful_.
This principle applies in multiple ways:
- To choose an algorithm, start with simpler models that you can efficiently use and understand
- To carry out feature engineering, understand your predictors well by starting with a subset
- To tune an algorithm, start with a relatively simple set of tuning parameters
This isn't just for beginners or those of us in education; [most spam filters use Support Vector Machines (and used Naive Bayes until recently)](https://vas3k.com/blog/machine_learning/) due to their combination of effectiveness and efficiency "in production."
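---
# Starting simple, in code

As a minimal sketch of this principle with tidymodels (assuming the `data_train` and `final_grade` objects from the code-along later in this lab), we might fit a plain linear regression as a baseline before reaching for a random forest:

```r
library(tidymodels)

# a simple, interpretable baseline model
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression") %>%
  fit(final_grade ~ ., data = data_train)

# if this baseline performs nearly as well as a complex model,
# the complex model may not be worth its added tuning burden
```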
---
# The bias-variance tradeoff
- An important way to achieve good performance on test data is to balance the inherent _bias_ of your algorithm against the _variance_ of its predictions; this **bias-variance** trade-off applies to _all_ models
---
# Illustrating the bias-variance tradeoff
<img src="ll-3-overview-presentation_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
---
# Strong bias
<img src="ll-3-overview-presentation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
---
# A much less-biased algorithm
<img src="ll-3-overview-presentation_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
---
# Slightly different data (right pane)
.pull-left[
<img src="img/bias-variance-data-1.png" width=400 height=400>
]
.pull-right[
<img src="img/bias-variance-data-2.png" width=400 height=400>
]
---
# Still strong bias, but low variance
.pull-left[
<img src="img/bias-variance-data-3.png" width=400 height=400>
]
.pull-right[
<img src="img/bias-variance-data-4.png" width=400 height=400>
]
---
# Low bias, but very high variance
.pull-left[
<img src="img/bias-variance-data-5.png" width=400 height=400>
]
.pull-right[
<img src="img/bias-variance-data-6.png" width=400 height=400>
]
---
# The bias-variance tradeoff
.pull-left[
#### Bias
- *Definition*: Difference between our known codes/outcomes and our predicted codes/outcomes; difference between `\(y\)` and `\(\hat{y}\)`
- How (in)correct our models' (algorithms') predictions are
- Models with high bias can fail to capture important relationships--they can be *under-fit* to our data
- In short, how well our model reflects the patterns in the data
]
.pull-right[
#### Variance
- *Definition*: How much the `\(\hat{y}\)` values would differ if we trained on a different sample of data
- How sensitive our predictions are to the specific sample on which we trained the model
- Models with high variance can fail to predict different data well--they can be *over-fit* to our data
- In short, how stable the predictions of our model are
]

<h4><center>Regardless of model, we often wish to balance between bias and variance</center></h4>
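---
# A small simulation of the tradeoff

A sketch in base R (the simulated data and the degree-15 polynomial are ours, for illustration, not from the lab): simulate the same noisy, nonlinear relationship 200 times, fit an under-fit model (a straight line) and an over-fit model (a high-degree polynomial) to each sample, and compare their predictions at one point:

```r
set.seed(2022)
x0 <- 0.1          # the point at which we compare predictions
sin(2 * pi * x0)   # the true value there (about 0.59)

preds <- replicate(200, {
  x <- runif(50)
  y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)
  c(line = predict(lm(y ~ x), data.frame(x = x0)),
    poly = predict(lm(y ~ poly(x, 15)), data.frame(x = x0)))
})

rowMeans(preds)     # the straight line is systematically off: bias
apply(preds, 1, sd) # the polynomial swings from sample to sample: variance
```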
---
# Tuning
- Many parts of a model (its _parameters_) are estimated from the data
- Other parts (its _tuning parameters_, sometimes called hyperparameters) cannot be estimated from the data and must be set by the analyst:
    - They are often set to defaults
    - But you can often improve on these defaults (see the sketch on the next slide)
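---
# Defaults versus tuning, in code

A brief sketch with parsnip: leaving an argument empty means the engine's default is used, while `tune()` marks it as a placeholder whose value we'll estimate via resampling:

```r
library(tidymodels)

# defaults: the ranger engine picks mtry and min.node.size for you
rf_default <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

# tuning: mtry and min_n are placeholders, filled in later by tune_grid()
rf_tuned <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```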
---
# Aside: Random forests and classification trees
- *Random forests* are extensions of classification trees
- _Classification trees_ are a type of algorithm that - at their core - use conditional logic ("if-then" statements) in a _nested_ manner
- For instance, here's a _very, very_ simple tree (from [APM](https://link.springer.com/book/10.1007/978-1-4614-6849-3)):
- Splits are chosen so that observations are classified into small, homogeneous groups, using measures of node purity such as the Gini index and entropy (a small sketch follows on the next slide)
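---
# Scoring a split: the Gini index

A small, hand-rolled sketch (the `gini()` helper and the pass/fail labels are ours, for illustration): a candidate split is good when it produces child nodes with low, "pure" values.

```r
# Gini index of a node: 1 minus the sum of squared class proportions
# 0 = perfectly pure; 0.5 = maximally mixed (with two classes)
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}

gini(c("pass", "pass", "pass", "pass")) # 0:     pure
gini(c("pass", "pass", "pass", "fail")) # 0.375: mostly pure
gini(c("pass", "pass", "fail", "fail")) # 0.5:   maximally mixed
```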
---
# A more complex tree
As you can imagine, with many variables, these trees can become very complex
---
# Random forests
- Random forest is an extension of decision tree modeling, whereby a collection of decision trees is estimated ("grown") and evaluated based on out-of-sample predictive accuracy
- Each tree is grown independently of every other tree, from a bootstrap sample of the data, with a random subset of predictors considered at each split
- This randomness makes the trees relatively uncorrelated, so the ensemble behaves quite differently from any single decision tree
- The final random forest model aggregates the predictions across all of the separate trees in the forest (for regression, by averaging them; see the sketch on the next slide)
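---
# A small forest, in code

A quick illustration with the ranger package directly (using the built-in mtcars data purely as a stand-in): the forest's prediction for each observation is an average over its 500 trees.

```r
library(ranger)

set.seed(2022)
rf <- ranger(mpg ~ ., data = mtcars,
             num.trees = 500,    # grow 500 trees, each on a bootstrap sample
             mtry = 3,           # 3 predictors sampled as candidates per split
             min.node.size = 5)  # stop splitting once nodes get this small

rf$prediction.error                   # out-of-bag MSE: error on data each tree never saw
head(predict(rf, mtcars)$predictions) # each value is an average over all 500 trees
```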
---
![](https://miro.medium.com/max/1184/1*i0o8mjFfCn-uD79-F1Cqkw.png)<!-- -->
[Koehrsen (2017)](https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d)
---
# Tuning parameters for random forests
- There are several important tuning parameters for these models:
    - the number of predictor columns randomly sampled as candidates for each split in a tree (`mtry` in the code we'll run later)
    - the minimum number of data points a node must contain for it to be split further (`min_n`)
    - the number of trees in the forest (`trees`; we'll use the default)
- Broadly, these tuning parameters balance predictive performance on the training data against how well the model will perform on new data (the next slide previews how these parameters are represented in code)
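---
# Tuning parameters as dials objects

A short preview of the code-along: the dials package represents each tuning parameter as an object with a range, and `finalize()` fills in bounds that depend on the data, such as the upper limit of `mtry` (mtcars is just an assumed stand-in for the predictors here).

```r
library(tidymodels)

mtry()  # range [1, ?]: the upper bound depends on how many predictors you have
min_n() # default range: [2, 40]

# finalize() sets the data-dependent upper bound from a set of predictors
finalize(mtry(), x = mtcars[, -1]) # ten predictor columns, so the range becomes [1, 10]
```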
---
# Overview of regression modeling
1. Split data (into training and testing sets, with resamples of the training set for tuning)
1. Specify recipe, model, and workflow
1. Estimate models and determine tuning parameters' values
1. Evaluate metrics and predictions
---
class: clear, inverse, center, middle
# Code Examples
---
.panelset[
.panel[.panel-name[1]
**Split data**
```r
library(tidymodels) # doesn't load forcats, stringr, readr from tidyverse
library(readr)
library(here)
library(vip)

d <- read_csv(here("spring-workshop", "data-to-model.csv"))
d <- select(d, -time_spent) # this is another continuous outcome

set.seed(2022) # so the split and the folds below are reproducible
train_test_split <- initial_split(d, prop = .70)
data_train <- training(train_test_split)
kfcv <- vfold_cv(data_train) # 10 folds by default
```
]
.panel[.panel-name[2]
**Engineer features**
```r
# pre-processing/feature engineering
# d <- select(d, student_id:final_grade, subject:percomp) # selecting the contextual/demographic
# variables and the survey variables
d <- d %>% select(-student_id)

sci_rec <- recipe(final_grade ~ ., data = d) %>%
  add_role(course_id, new_role = "ID variable") %>%  # the new role can be any string
  step_novel(all_nominal_predictors()) %>%           # handle factor levels unseen in training
  step_normalize(all_numeric_predictors()) %>%       # center and scale numeric predictors
  step_dummy(all_nominal_predictors()) %>%           # dummy-code nominal predictors
  step_nzv(all_predictors()) %>%                     # drop near-zero-variance predictors
  step_impute_knn(all_predictors(), all_outcomes())  # impute missing values via k-nearest neighbors
```
]
.panel[.panel-name[3]
**Specify recipe, model, and workflow**
```r
# specify model
rf_mod_many <-
  rand_forest(mtry = tune(),   # tune() marks values to be estimated via resampling
              min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# specify workflow
rf_wf_many <-
  workflow() %>%
  add_model(rf_mod_many) %>%
  add_recipe(sci_rec)
```
]
.panel[.panel-name[4]
**Fit model**
```r
# inspect the (data-dependent) ranges of the tuning parameters
finalize(mtry(), data_train)
finalize(min_n(), data_train)

# specify tuning grid: 10 candidate sets spread across the parameter space
tree_grid <- grid_max_entropy(mtry(range = c(1, 15)),
                              min_n(range = c(2, 40)),
                              size = 10)

# fit model with tune_grid
tree_res <- rf_wf_many %>%
  tune_grid(
    resamples = kfcv,
    grid = tree_grid,
    metrics = metric_set(rmse, mae, rsq)
  )
```
]
.panel[.panel-name[5]
**Fit model (part 2)**
```r
# examine the best sets of tuning parameters; tune again if needed
show_best(tree_res, metric = "rmse", n = 10)

# select the best set of tuning parameters
best_tree <- tree_res %>%
  select_best(metric = "rmse")

# finalize the workflow with the best set of tuning parameters
final_wf <- rf_wf_many %>%
  finalize_workflow(best_tree)

# fit to the training data and evaluate on the held-out test data
final_fit <- final_wf %>%
  last_fit(train_test_split, metrics = metric_set(rmse, mae, rsq))
```
]
.panel[.panel-name[6]
**Evaluate accuracy**
```r
# variable importance plot
final_fit %>%
  extract_workflow() %>%     # replaces the deprecated pluck()/pull_workflow_fit() pattern
  extract_fit_parsnip() %>%
  vip(num_features = 10)

# fit statistics
final_fit %>%
  collect_metrics()

# test set predictions
final_fit %>%
  collect_predictions()
```
]
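.panel[.panel-name[7]
**Visualize tuning results (optional)**

A short, optional addition: tune provides an `autoplot()` method for `tune_grid()` results, plotting each metric against the candidate parameter values.

```r
# one panel per metric, across the 10 candidate parameter sets
autoplot(tree_res)

# the same information, as a tibble (one row per candidate x metric)
tree_res %>%
  collect_metrics()
```
]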
]
---
# We'll next dive deeper
- **Guided walkthrough**: work through this code to tune a model
- **Independent practice**: try tuning a neural network
- **Readings**: tuning an advanced model - deep learning
---
class: clear, center
## .font130[.center[**Thank you!**]]
<br/>
.center[<img style="border-radius: 80%;" src="img/jr-cycling.jpeg" height="200px"/><br/>**Dr. Joshua Rosenberg**<br/><mailto:[email protected]>]
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "default",
"highlightLines": true,
"highlightLanguage": "r",
"countIncrementalSlides": false,
"ratio": "16:9",
"slideNumberFormat": "<div class=\"progress-bar-container\">\n <div class=\"progress-bar\" style=\"width: calc(%current% / %total% * 100%);\">\n </div>\n</div>"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
let res = {};
d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
const t = tr.querySelector('td:nth-child(2)').innerText;
tr.querySelectorAll('td:first-child .key').forEach(key => {
const k = key.innerText;
if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
});
});
d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>