---
title: "02_resampling"
output: html_document
---
layout: false
class: middle, center, inverse
# 4.1 `rsample`:<br><br>General Resampling Infrastructure
---
background-image: url(https://www.tidymodels.org/images/rsample.png)
background-position: 97.5% 2.5%
background-size: 7%
layout: true
---
name: data-split
## 4.1 `rsample`: Resampling Infrastructure
`rsample` provides methods for data partitioning (i.e. splitting the data into a training and a test set) and resampling (i.e. drawing repeated samples from the training set to obtain sampling distributions).
<br>
--
**Data Partitioning:** First, let's divide our data into a training and test set via `initial_split()`. The resulting `rsplit` object indexes the original data points according to their data set membership.
```{r}
set.seed(2021)
climbers_split <- initial_split(climbers_df, prop = 0.8, strata = died)
climbers_split
```
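For a numeric outcome, stratification is based on a binned version of the variable. A minimal sketch (the column `peak_height` is hypothetical; `breaks` sets the number of quantile bins):
```{r, eval=F}
# Hypothetical numeric outcome: stratify on quartiles of the variable
initial_split(climbers_df, prop = 0.8, strata = peak_height, breaks = 4)
```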
???
- for imbalanced samples, a purely random split can lead to a catastrophically bad model
- strata: conducts a stratified split -> keeps the class proportions (i.e. the imbalance) in both the training and the test set (1.5% death cases) -> since sampling is random, an unstratified split might otherwise create an even more severe or a milder imbalance
- for regression problems, stratified samples can be drawn based on a binned outcome (e.g., quartiles)
- storing row indexes instead of data copies is more memory efficient
---
## 4.1 `rsample`: Resampling Infrastructure
To extract the training and test data, we can use the `training()` and `testing()` functions.
.panelset[
.panel[.panel-name[Train Set]
```{r}
train_set <- training(climbers_split)
train_set
```
]
.panel[.panel-name[Test Set]
```{r}
test_set <- testing(climbers_split)
test_set
```
]
]
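As a quick sanity check, the stratified split should preserve the share of `died` cases in both partitions. A minimal sketch (assuming `dplyr` is loaded, as elsewhere in the deck):
```{r, eval=F}
# Class balance should be (roughly) identical in both partitions
train_set %>% count(died) %>% mutate(prop = n / sum(n))
test_set %>% count(died) %>% mutate(prop = n / sum(n))
```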
---
## 4.1 `rsample`: Resampling Infrastructure
**Resampling:** Training predictive models that involve hyperparameters requires a three-way data split:
- The *Training Set*, which is used for model training (i.e. estimating model coefficients).
- The *Validation Set*, which is used for parameter tuning (i.e. finding optimal hyperparameters).
- The *Test Set*, which is used for computing an unbiased estimate of model performance.
--
<br>
.panelset[
.panel[.panel-name[Option 1]
.pull-left[
**Validation Split:** Partition the initial `train_set` into a smaller training set and a validation set using `validation_split()`.
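A minimal sketch (the `prop` value is illustrative):
```{r, eval=F}
set.seed(2021)
climbers_val <- validation_split(
  train_set, prop = 0.8, strata = died)
```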
]
.pull-right[
```{r, echo=F, out.width='40%', out.height='40%', fig.align='center'}
knitr::include_graphics("https://www.tmwr.org/premade/validation-alt.svg")
```
]
]
.panel[.panel-name[Option 2]
.pull-left[
**Resampling:** Use a resampling approach, such as cross-validation (CV) or the bootstrap, to create resamples from our initial training set.
A **resample** is the outcome of a resampling method, e.g., a fold resulting from $k$-fold cross-validation, or a bootstrap sample together with its out-of-bag observations resulting from the bootstrap.
]
.pull-right[
```{r, echo=F, out.width='70%', out.height='70%', fig.align='center'}
knitr::include_graphics("https://www.tmwr.org/premade/resampling.svg")
```
]
]
]
???
Data sets:
- training to optimize model coefs, validation to optimize hyperparameters (as part of model tuning as well as feature engineering), test to evaluate the model
- refit the optimal model on training and validation set and evaluate on the test set
- I prefer the terms training and validation set over analysis and assessment set in the context of resampling (test, validation and hold-out set are often used interchangeably)
CV vs. train-test-split:
- we usually prefer the former, as we would like to generate a distribution of our error measure and account for the uncertainty in the estimate
- You may increase the number of folds as your sample size decreases to retain more datapoints for model training.
---
## 4.1 `rsample`: Resampling Infrastructure
Here we implement a 10-fold CV approach using the `vfold_cv()` function. It returns a `tibble` containing the indexes of 10 separate splits.
```{r}
set.seed(2021)
climbers_folds <- train_set %>%
vfold_cv(v = 10, repeats = 1, strata = died)
climbers_folds
```
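Each element of the `splits` column is an `rsplit` object; printing one shows the sizes of its analysis and assessment partitions. A quick sketch:
```{r, eval=F}
# First fold: roughly 9/10 of the rows for analysis, 1/10 for assessment
climbers_folds$splits[[1]]
```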
---
## 4.1 `rsample`: Resampling Infrastructure
To extract the training and validation data of a given fold, we can again use `training()` and `testing()`.
.panelset[
.panel[.panel-name[Train Set]
```{r}
climbers_folds %>% purrr::pluck("splits", 1) %>% training()
```
]
.panel[.panel-name[Validation Set]
```{r}
climbers_folds %>% purrr::pluck("splits", 1) %>% testing()
```
]
]
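Alternatively, `rsample` provides `analysis()` and `assessment()` as the generic accessors for resamples; a sketch equivalent to the panels above:
```{r, eval=F}
climbers_folds %>% purrr::pluck("splits", 1) %>% analysis()
climbers_folds %>% purrr::pluck("splits", 1) %>% assessment()
```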
???
- usually, you don't use `training()` and `testing()` as you let higher level functions access the individual resamples during hyperparameter tuning
---
## 4.1 `rsample`: Resampling Infrastructure
**Alternative resampling approaches:** In addition to $k$-fold CV, `rsample` implements various alternative resampling schemes for producing a more robust estimate of model performance.
.panelset[
.panel[.panel-name[Repeated k-fold CV]
For `repeats > 1`, `vfold_cv()` repeats the CV approach to reduce the standard error of the estimate, at the cost of higher computational demand ($k \times R$ folds).
```{r, eval=F}
set.seed(2021)
train_set %>%
vfold_cv(v = 10, repeats = 2, strata = died)
```
]
.panel[.panel-name[The Bootstrap]
`bootstraps()` conducts sampling with replacement, whereby model performance is estimated based on the "out-of-bag" observations.
```{r, eval=F}
set.seed(2021)
train_set %>%
bootstraps(times = 25, strata = died)
```
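Since each bootstrap sample draws `nrow(train_set)` rows with replacement, on average only about 63.2% of the distinct rows enter a given resample; the rest are out-of-bag. A sketch for checking the out-of-bag share (for bootstrap resamples, `assessment()` returns the out-of-bag rows):
```{r, eval=F}
set.seed(2021)
boots <- bootstraps(train_set, times = 25, strata = died)
nrow(assessment(boots$splits[[1]])) / nrow(train_set)  # roughly 0.368
```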
]
.panel[.panel-name[Monte Carlo CV (MCCV)]
`mc_cv()` lies somewhere between $k$-fold CV and the bootstrap: since each resample is generated anew, the assessment sets may partly overlap.
```{r, eval=F}
set.seed(2021)
train_set %>%
mc_cv(prop = 0.9, times = 25, strata = died)
```
]
.panel[.panel-name[Time-Series Resampling]
```{r, echo=F, out.width='40%', out.extra='style="float:right; padding:10px"'}
knitr::include_graphics("https://i.stack.imgur.com/fXZ6k.png")
```
For temporally correlated data, `rsample` also provides a suitable partitioning and resampling infrastructure.
For example, use `initial_time_split()` to conduct a non-random early-late split, and `rolling_origin()` or the `slide_*()` functions to generate time-series resamples.
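A minimal sketch (the data frame `expeditions_ts` is hypothetical and assumed to be sorted by date; `initial`, `assess` and `cumulative` control the rolling window):
```{r, eval=F}
ts_split <- initial_time_split(expeditions_ts, prop = 0.8)
rolling_origin(training(ts_split),
               initial = 100, assess = 10, cumulative = TRUE)
```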
<br><br><br><br>
.right[
*Source: [Stack Exchange](https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection)*
]
]
]
.footnote[
*Note: Find more information about the resampling approaches implemented in `rsample` in [Kuhn/Silge (2021), ch. 10](https://www.tmwr.org/resampling.html#resampling-methods).*
]
???
- repeated CV: reduces the standard error of the validation set error estimate independent of the way the folds were resampled -> increases computational demand -> I simply obtain more estimates of the validation set error that I can average
- MC CV:
- create one resample by sampling 90% of the rows as training and 10% as validation set data
- create another resample by repeating the previous step
- repeat until all `times` resamples are created
- with time-dependent data, random sampling can lead to disastrous models