---
title: "02_resampling"
output: html_document
---
layout: false
class: middle, center, inverse
# 4.1 `rsample`:<br><br>General Resampling Infrastructure
---
background-image: url(https://www.tidymodels.org/images/rsample.png)
background-position: 97.5% 2.5%
background-size: 7%
layout: true
---
name: data-split
## 4.1 `rsample`: Resampling Infrastructure
`rsample` provides methods for data partitioning (i.e. splitting the data into a training and a test set) and resampling (i.e. drawing repeated samples from the training set to obtain sampling distributions).
<br>
--
**Data Partitioning:** First, let's divide our data into a training and test set via `initial_split()`. The resulting `rsplit` object indexes the original data points according to their data set membership.
```{r}
set.seed(2021)
climbers_split <- initial_split(climbers_df, prop = 0.8, strata = died)
climbers_split
```
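For a numeric outcome, stratification is based on a binned version of the variable. A minimal sketch (the column `peak_height` is hypothetical; `breaks` sets the number of quantile bins):
```{r, eval=F}
# Hypothetical numeric outcome: stratify on quartiles of the variable
initial_split(climbers_df, prop = 0.8, strata = peak_height, breaks = 4)
```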
???
- for imbalanced samples, a purely random split can lead to a catastrophically bad model
- strata: conducts a stratified split -> keeps the class proportions (i.e. the imbalance) in both the training and the test set (1.5% death cases) -> since sampling is random, an unstratified split might otherwise create an even more severe or a milder imbalance
- for regression problems, stratified samples can be drawn based on a binned outcome (e.g., quartiles)
- storing row indexes instead of data copies is more memory efficient
---
## 4.1 `rsample`: Resampling Infrastructure
To extract the training and test data, we can use the `training()` and `testing()` functions.
.panelset[
.panel[.panel-name[Train Set]
```{r}
train_set <- training(climbers_split)
train_set
```
]
.panel[.panel-name[Test Set]
```{r}
test_set <- testing(climbers_split)
test_set
```
]
]
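As a quick sanity check, the stratified split should preserve the share of `died` cases in both partitions. A minimal sketch (assuming `dplyr` is loaded, as elsewhere in the deck):
```{r, eval=F}
# Class balance should be (roughly) identical in both partitions
train_set %>% count(died) %>% mutate(prop = n / sum(n))
test_set %>% count(died) %>% mutate(prop = n / sum(n))
```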
---
## 4.1 `rsample`: Resampling Infrastructure
**Resampling:** Training predictive models that involve hyperparameters requires a three-way data split:
- The *Training Set*, which is used for model training (i.e. estimating model coefficients).
- The *Validation Set*, which is used for parameter tuning (i.e. finding optimal hyperparameters).
- The *Test Set*, which is used for computing an unbiased estimate of model performance.
--
<br>
.panelset[
.panel[.panel-name[Option 1]
.pull-left[
**Validation Split:** Partition the initial `train_set` into a smaller training set and a validation set using `validation_split()`.
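A minimal sketch (the `prop` value is illustrative):
```{r, eval=F}
set.seed(2021)
climbers_val <- validation_split(
  train_set, prop = 0.8, strata = died)
```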
]
.pull-right[
```{r, echo=F, out.width='40%', out.height='40%', fig.align='center'}
knitr::include_graphics("https://www.tmwr.org/premade/validation-alt.svg")
```
]
]
.panel[.panel-name[Option 2]
.pull-left[
**Resampling:** Use a resampling approach, such as cross-validation (CV) or the bootstrap, to create resamples from our initial training set.
A **resample** is the outcome of a resampling method, e.g., a fold resulting from $k$-fold cross-validation, or a bootstrap sample together with its out-of-bag observations resulting from the bootstrap.
]
.pull-right[
```{r, echo=F, out.width='70%', out.height='70%', fig.align='center'}
knitr::include_graphics("https://www.tmwr.org/premade/resampling.svg")
```
]
]
]
???
Data sets:
- training to optimize model coefs, validation to optimize hyperparameters (as part of model tuning as well as feature engineering), test to evaluate the model
- refit the optimal model on training and validation set and evaluate on the test set
- I prefer the terms training and validation set over analysis and assessment set in the context of resampling (test, validation and hold-out set are often used interchangeably)
CV vs. train-test-split:
- we usually prefer the former, as we would like to generate a distribution of our error measure and account for the uncertainty in the estimate
- You may increase the number of folds as your sample size decreases to retain more datapoints for model training.
---
## 4.1 `rsample`: Resampling Infrastructure
Here we implement a 10-fold CV approach using the `vfold_cv()` function. It returns a `tibble` containing the indexes of 10 separate splits.
```{r}
set.seed(2021)
climbers_folds <- train_set %>%
vfold_cv(v = 10, repeats = 1, strata = died)
climbers_folds
```
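Each element of the `splits` column is an `rsplit` object; printing one shows the sizes of its analysis and assessment partitions. A quick sketch:
```{r, eval=F}
# First fold: roughly 9/10 of the rows for analysis, 1/10 for assessment
climbers_folds$splits[[1]]
```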
---
## 4.1 `rsample`: Resampling Infrastructure
To extract the training and validation data of a given fold, we can again use `training()` and `testing()`.
.panelset[
.panel[.panel-name[Train Set]
```{r}
climbers_folds %>% purrr::pluck("splits", 1) %>% training()
```
]
.panel[.panel-name[Validation Set]
```{r}
climbers_folds %>% purrr::pluck("splits", 1) %>% testing()
```
]
]
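Alternatively, `rsample` provides `analysis()` and `assessment()` as the generic accessors for resamples; a sketch equivalent to the panels above:
```{r, eval=F}
climbers_folds %>% purrr::pluck("splits", 1) %>% analysis()
climbers_folds %>% purrr::pluck("splits", 1) %>% assessment()
```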
???
- usually, you don't use `training()` and `testing()` as you let higher level functions access the individual resamples during hyperparameter tuning
---
## 4.1 `rsample`: Resampling Infrastructure
**Alternative resampling approaches:** In addition to $k$-fold CV, `rsample` implements various alternative resampling schemes for producing a more robust estimate of model performance.
.panelset[
.panel[.panel-name[Repeated k-fold CV]
For `repeats > 1`, `vfold_cv()` repeats the CV approach to reduce the standard error of the estimate, at the cost of higher computational demand ($k \times R$ folds).
```{r, eval=F}
set.seed(2021)
train_set %>%
vfold_cv(v = 10, repeats = 2, strata = died)
```
]
.panel[.panel-name[The Bootstrap]
`bootstraps()` conducts sampling with replacement, whereby model performance is estimated based on the "out-of-bag" observations.
```{r, eval=F}
set.seed(2021)
train_set %>%
bootstraps(times = 25, strata = died)
```
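Since each bootstrap sample draws `nrow(train_set)` rows with replacement, on average only about 63.2% of the distinct rows enter a given resample; the rest are out-of-bag. A sketch for checking the out-of-bag share (for bootstrap resamples, `assessment()` returns the out-of-bag rows):
```{r, eval=F}
set.seed(2021)
boots <- bootstraps(train_set, times = 25, strata = died)
nrow(assessment(boots$splits[[1]])) / nrow(train_set)  # roughly 0.368
```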
]
.panel[.panel-name[Monte Carlo CV (MCCV)]
`mc_cv()` lies somewhere between $k$-fold CV and the bootstrap: since each resample is generated anew, the assessment sets may partly overlap.
```{r, eval=F}
set.seed(2021)
train_set %>%
mc_cv(prop = 0.9, times = 25, strata = died)
```
]
.panel[.panel-name[Time-Series Resampling]
```{r, echo=F, out.width='40%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://i.stack.imgur.com/fXZ6k.png")
```
For temporally correlated data, `rsample` also provides a suitable partitioning and resampling infrastructure.
For example, use `initial_time_split()` to conduct a non-random early-late split, and `rolling_origin()` or the `slide_*()` functions to generate time-series resamples.
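A minimal sketch (the data frame `expeditions_ts` is hypothetical and assumed to be sorted by date; `initial`, `assess` and `cumulative` control the rolling window):
```{r, eval=F}
ts_split <- initial_time_split(expeditions_ts, prop = 0.8)
rolling_origin(training(ts_split),
               initial = 100, assess = 10, cumulative = TRUE)
```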
<br><br><br><br>
.right[
*Source: [Stack Exchange](https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection)*
]
]
]
.footnote[
*Note: Find more information about the resampling approaches implemented in `rsample` in [Kuhn/Silge (2021), ch. 10](https://www.tmwr.org/resampling.html#resampling-methods).*
]
???
- repeated CV: reduces the standard error of the validation set error estimate independent of the way the folds were resampled -> increases computational demand -> I simply obtain more estimates of the validation set error that I can average
- MC CV:
- create one resample by sampling 90% of the rows as training and 10% as validation set data
- create another resample by repeating the previous step
- repeat until all `times` resamples are created
- with time-dependent data, random sampling can lead to disastrous models