# Cross-validation {#origami}
_Ivana Malenica_
Based on the [`origami` `R` package](https://github.com/tlverse/origami)
by _Jeremy Coyle, Nima Hejazi, Ivana Malenica and Rachael Phillips_.
`r if (knitr::is_latex_output()) '\\begin{VT1}\n\\VH{Learning Objectives}'`
`r if (knitr::is_html_output()) '## Learning Objectives{-}'`
By the end of this chapter you will be able to:
1. Differentiate between training, validation and test sets.
2. Understand the concept of a loss function, risk and cross-validation.
3. Select a loss function that is appropriate for the functional parameter to be
estimated.
4. Understand and contrast different cross-validation schemes for i.i.d. data.
5. Understand and contrast different cross-validation schemes for time dependent
data.
6. Set up the proper fold structure, build a custom fold-based function, and
cross-validate the proposed function using the `origami` `R` package.
7. Set up the proper cross-validation structure for use by the Super Learner
using the `origami` `R` package.
`r if (knitr::is_latex_output()) '\\end{VT1}'`
<!--
RP:
suggestion to modify some LOs (I am also open to not having LOs).
LOs should be expressed in terms of the reader and use action verbs.
Some ideas for action verbs related to "understanding" are here, within
"Action Verbs Aligned with Blooms Taxonomy" section:
https://academiceffectiveness.gatech.edu/assessment-toolkit/developing-student-learning-outcome-statements
Here's another helpful article:
https://www.thoughtco.com/common-mistakes-when-writing-learning-objectives-7786
As an example, for LO 2, it can be rephrased by thinking about the answers to
these questions and targeting the LO towards them. What specifically does a
reader need to understand about a loss, risk, CV? Why is this important for the
reader to understand?
-->
## Introduction
Following the [_Roadmap for Targeted Learning_](#roadmap), we start to elaborate
on the estimation step in the current chapter. In order to generate an estimate
of the target parameter, we need to decide how to evaluate the quality of our
estimation procedure's performance. The performance, or error, of any algorithm
(estimator) corresponds to its generalizability to independent datasets arising
from the same data-generating process. Assessment of the performance of an
algorithm is extremely important --- it provides a quantitative measure of how
well the algorithm performs, and it guides the selection among a set
(or "library") of algorithms. In order to assess the performance of an
algorithm, or a library of them, we introduce the concept of a **loss
function**, which defines the **risk** or the **expected prediction error**.
Our goal is to estimate the true performance (risk) of our estimator. In the
next chapter, we elaborate on how to estimate the performance of a library of
algorithms in order to choose the best-performing one. In the following, we
propose a method to do so using the observed data and **cross-validation**
procedures implemented in the `origami` package [@coyle2018origami;
@coyle-cran-origami].
## Background
Ideally, in a data-rich scenario (i.e., one with unlimited observations), we
would split our dataset into three parts:
1. the training set,
2. the validation set, and
3. the test (or holdout) set.
The training set is used to fit algorithm(s) of interest; we evaluate the
performance of the fit(s) on a validation set, which can be used to estimate
prediction error (e.g., for algorithm tuning or selection). The final error of
the selected algorithm is obtained by using the test (or holdout) set, which is
kept entirely separate such that the algorithms never encounter these
observations until the final model evaluation step. One might wonder, with
training data readily available, why not use the training error to evaluate the
proposed algorithm's performance? Unfortunately, the training error is a biased
estimate of a fitted algorithm's generalizability, since it uses the same data
for fitting and evaluation.
Since data are often scarce, separating a dataset into training, validation and
test sets can prove too limiting, on account of decreasing the available data
for use in training by too much. In the absence of a large dataset and a
designated test set, we must resort to methods that estimate the algorithm's
true performance by efficient sample re-use. Re-sampling methods, like the
bootstrap, involve repeatedly sampling from the training set and fitting the
algorithms to these samples. While often computationally intensive, re-sampling
methods are particularly useful for evaluating an algorithm and selecting among
a set of them. In addition, they provide more insight on the variability and
robustness of a fitted algorithm, relative to fitting an algorithm only once to
all of the training data.
<!--
RP:
What is meant by "scarce", "by too much", "large dataset" here? We also might
want to use CV even when we have thousands of observations, so our assessment
of the algorithm isn't hinging on a single split. Is the data's size the
motivating reason for re-sampling? Ah-ha! I knew you had it somewhere! I think
the (some of) message that you're getting across in the paragraph around L380
should be included here.
-->
### Introducing: cross-validation
In this chapter, we focus on **cross-validation** --- an essential tool for
evaluating how any algorithm extends from a sample of data to the target
population from which it arose. Cross-validation has seen widespread
application in all facets of modern statistics, and perhaps most notably in
statistical machine learning. The cross-validation procedure has been proven to
be optimal for algorithm selection in large samples, i.e., asymptotically. In
particular, the cross-validated risk estimates the true risk of a fitted
algorithm when that algorithm is applied to an independent sample from the
joint distribution of the predictors and outcome.
When used for model selection, cross-validation has powerful optimality
properties. The asymptotic optimality results state that the cross-validated
selector performs (in terms of risk) asymptotically as well as an optimal oracle
selector --- a hypothetical procedure with free access to the true, unknown
data-generating distribution. For further details on the theoretical results, we
suggest consulting @vdl2003unified, @vdl2004asymptotic, @dudoit2005asymptotics
and @vaart2006oracle.
The `origami` package provides a suite of tools for cross-validation. In the
following, we describe different types of cross-validation schemes readily
available in `origami`, introduce the general structure of the `origami`
package, and demonstrate the use of these procedures in various applied
settings.
## Estimation Roadmap: How does it all fit together?
Similarly to how we defined the [_Roadmap for Targeted Learning_](#roadmap), we
can define the **Estimation Roadmap** as a guide for the estimation process. In
particular, the unified loss-based estimation framework [@vdl2003unified;
@vdl2004asymptotic; @dudoit2005asymptotics; @vaart2006oracle; @vdl2007super],
which relies on cross-validation for estimator construction,
selection, and performance assessment, consists of three main steps:
1. **Loss function**:
Define the target parameter as the minimizer of the expected loss (risk) for a
full data loss function chosen to represent the desired performance measure. By
full data, we refer to the hypothetical complete data (e.g., the data we would
observe in the absence of missingness). Map the full data loss function into an
observed data loss function, having the same expected value and leading to an
estimator of risk.
2. **Algorithms**:
Construct a finite collection of candidate estimators of the parameter of
interest.
3. **Cross-validation scheme**:
Apply appropriate cross-validation, and use the cross-validated risk in
order to select the best performing estimator among the candidates. Assess
the overall performance of the resulting estimator.
## Example: Cross-validation and Prediction
Having introduced the [Estimation Roadmap](#roadmap), we can more precisely
define our objective using prediction as an example. Let the observed data be
defined as $O = (W, Y)$, where a unit specific data structure can be written as
$O_i = (W_i, Y_i)$, for $i = 1, \ldots, n$. We denote $Y_i$ as the
outcome/dependent variable of interest, and $W_i$ as a $p$-dimensional set of
covariate (predictor) variables. We assume the $n$ units are independent, or
conditionally independent, and identically distributed. Let $\psi_0(W)$ denote
the target parameter of interest, the quantity we wish to estimate (estimand).
For this example, we are interested in estimating the conditional expectation of
the outcome given the covariates, $\psi_0(W) = \E(Y \mid W)$. Following the
[Estimation Roadmap](#roadmap), we choose the appropriate loss function, $L$,
such that $\psi_0(W) = \text{argmin}_{\psi} \E_0[L(O, \psi(W))]$. Note that
$\psi_0(W)$, the true target parameter, is a minimizer of the risk (expected
value of the chosen loss function). For the conditional expectation of a
continuous outcome, an appropriate choice is the squared error loss, whose risk
is the mean squared error; that is, we can define $L$ as
$L(O, \psi(W)) = (Y - \psi(W))^2$. Note
that there can be many different algorithms which estimate the estimand (many
different $\psi$s). How do we know how well each of the candidate estimators of
$\psi_0(W)$ are doing? To pick the best-performing candidate estimator and
assess its overall performance, we use cross-validation. Observations in the
training set are used to fit (or train) the estimator, while those in validation
set are used to assess the risk of (or validate) it.
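
To make the loss and risk concrete, here is a small, self-contained sketch (not
part of the WASH Benefits analysis): we simulate data from a hypothetical
data-generating process, take an imperfect candidate `psi_hat` as given, and
compute its empirical risk under the squared error loss.

```{r loss_risk_sketch}
# simulate a toy data-generating process (purely illustrative)
set.seed(57213)
n <- 100
W <- rnorm(n)
Y <- 2 * W + rnorm(n)

# a hypothetical (deliberately imperfect) candidate estimator of E(Y | W)
psi_hat <- function(W) 1.5 * W

# squared error loss L(O, psi) = (Y - psi(W))^2, evaluated at each unit
loss <- (Y - psi_hat(W))^2

# the empirical risk (mean squared error) of this candidate
mean(loss)
```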
Next, we introduce notation flexible enough to represent any cross-validation
scheme. In particular, we define a **split vector**, $B_n = (B_n(i): i = 1,
\ldots, n) \in \{0,1\}^n$.
<!--
Note that such a split vector is independent of the empirical distribution
$P_n$, as in $B_n$ is not a function of $P_n$, but $P_0$.
-->
A realization of $B_n$ defines a random split of the data into training and
validation subsets such that if
$$B_n(i) = 0, \ \ \text{the } i\text{th unit is in the training set}$$
$$B_n(i) = 1, \ \ \text{the } i\text{th unit is in the validation set.}$$
We can further define $P_{n, B_n}^0$ and $P_{n, B_n}^1$ as the empirical
distributions of the training and validation sets, respectively. Then, $n_0 =
\sum_i (1 - B_n(i))$ and $n_1 = \sum_i B_n(i)$ denote the number of samples in
the training and validation sets, respectively. The particular distribution
of the split vector $B_n$ defines the type of cross-validation scheme, tailored
to the problem and dataset at hand.
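
To make the notation concrete, the following small sketch (using simulated
indices, not the WASH Benefits data) constructs one realization of a split
vector $B_n$ for $n = 10$ units, corresponding to a single random split with
roughly $30\%$ of the units in the validation set:

```{r split_vector_sketch}
# one realization of the split vector B_n for n = 10 units:
# 1 indicates membership in the validation set, 0 in the training set
set.seed(12345)
n <- 10
validation_idx <- sample(n, size = 3)
B_n <- as.integer(seq_len(n) %in% validation_idx)
B_n

# the training and validation sample sizes implied by this realization
c(n0 = sum(1 - B_n), n1 = sum(B_n))
```

Each of the cross-validation schemes described below simply corresponds to a
different way of generating (a collection of) such split vectors.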
<!--
nh: high-level comment -- it's helpful to define the splitting vector notation
with B_n and to explain it, but I think it could be made even clearer by
explicitly writing the forms of the splitting vector for each of the CV schemes
discussed below. this helps to make it concrete, since the notation is quite
general and exact but simultaneously cumbersome (a complaint I've heard in
seminars and agree with, personally -- e.g., it's much easier to write V-fold
CV in simpler notation than this). it seems it shouldn't be much work to write
examples of the splitting vector notation for the more common CV schemes, but
maybe it gets annoying for time-series examples. just a thought...
-->
## Cross-validation schemes in `origami`
A variety of different partitioning schemes exist, each tailored to the salient
details of the problem of interest, including data size, prevalence of the
outcome, and dependence structure (between units or across time). In the
following, we describe different cross-validation schemes available in the
`origami` package, and we go on to demonstrate their use in practical data
analysis examples.
### WASH Benefits Study Example {-}
In order to illustrate different cross-validation schemes, we will be using the
WASH Benefits example dataset (detailed information can be found in
[Chapter 3](#wash)). In particular, we are interested in predicting
weight-for-height Z-score (`whz`) using the available covariate data. For this
illustration, we will start by treating the data as independent and identically
distributed (i.i.d.) random draws from an unknown distribution $P_0$. To
see what each cross-validation scheme is doing, we will subset the data to only
$n=30$. Note that each row represents an i.i.d. sampled unit, indexed by the row
number.
```{r setup}
library(data.table)
library(origami)
library(knitr)
library(dplyr)
# load data set and take a peek
washb_data <- fread(
paste0(
"https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
"wash-benefits/washb_data.csv"
),
stringsAsFactors = TRUE
)
```
```{r origami_washb_example_table1, echo=FALSE}
n_samp <- 30
washb_data <- washb_data[seq_len(n_samp), ]
if (knitr::is_latex_output()) {
head(washb_data) %>%
kable(format = "latex")
} else if (knitr::is_html_output()) {
head(washb_data) %>%
kable() %>%
kableExtra::kable_styling(fixed_thead = TRUE) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
}
```
Above is a look at the first few rows of the `r n_samp`-unit subset of the data.
### Cross-validation for i.i.d. data
#### Re-substitution
The re-substitution method is perhaps the simplest strategy for estimating the
risk of a fitted algorithm. With
this cross-validation scheme, all observed data units are used in both the
training and validation set.
We illustrate the usage of the re-substitution method with `origami` below,
using the function `folds_resubstitution`. In order to set up
`folds_resubstitution`, we need only specify the total number of sampled
units that we want to allocate to the training and validation sets; remember
that each row of the dataset is a unique i.i.d. sampled unit. Also, notice the
structure of the `origami` output:
1. **v:** the cross-validation fold
2. **training_set:** the indices of the samples in the training set
3. **validation_set:** the indices of the samples in the validation set.
The structure of the `origami` output, a `list` of fold(s), holds across all of
the cross-validation schemes presented in this chapter. Below, we show the fold
generated by the re-substitution method:
```{r resubstitution}
folds <- folds_resubstitution(nrow(washb_data))
folds
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### Holdout
The holdout method, or the validation set approach, consists of randomly
dividing the available data into training and validation (holdout) sets. The
model is then fitted (i.e., "trained") using the observations in the training
set and subsequently evaluated (i.e., "validated") using the observations in the
validation set. Typically, the dataset is split into $60:40$, $70:30$, $80:20$
or even $90:10$ training-to-validation splits.
The holdout method is intuitive and computationally inexpensive; however, it
does carry a disadvantage: If we were to repeat the process of randomly
splitting the data into training and validation sets, we could get very
different cross-validated estimates of the empirical risk. In particular, the
empirical mean of the loss function (i.e., the empirical risk) evaluated over
the validation set(s) could be highly variable, depending on which samples were
included in the training and validation splits. Overall, the cross-validated
empirical risk for the holdout method is more variable, since it includes
variability of the random split as well --- this is not desirable. For
classification problems (with a binary or categorical outcome variable), there
is an additional disadvantage: it is possible for the training and validation
sets to end up with uneven distributions of the two (or more) outcome classes,
leading to better training and poor validation, or vice-versa --- though this may
be corrected by incorporating stratification into the cross-validation process.
Finally, note that we are not using all of the data in training or in evaluating
the performance of the proposed algorithm, which could itself introduce bias.
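
`origami` does not appear to include a dedicated `folds_*` generator named
after the holdout method, but a single random training/validation split can be
produced with the Monte Carlo scheme (described below) by requesting just one
fold. The sketch below assumes a hypothetical 70:30 training-to-validation
split of the 30-unit subset:

```{r holdout_sketch}
# a single Monte Carlo fold acts as a random holdout split:
# here, 30% of the sampled units are held out for validation
holdout_fold <- folds_montecarlo(
  nrow(washb_data),
  V = 1, pvalidation = 0.3
)
holdout_fold[[1]]
```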
<!--
nh: is there no folds_* function for this in origami? it seems to be the only
cross-validation scheme for which we don't demonstrate fold construction
-->
#### Leave-one-out
The leave-one-out cross-validation scheme is closely related to the holdout
method, as it also involves splitting the dataset into training and validation
sets; however, instead of partitioning the dataset into sets of similar size, a
single observation is used as the validation set. In doing so, the vast majority
of the sampled units are employed for fitting (or training) the candidate
learning algorithm. Since only a single sampled unit (for example $O_1 = (W_1,
Y_1)$) is left out of the fitting process, leave-one-out cross-validation can
result in a less biased estimate of the risk. Typically, the
leave-one-out approach will not overestimate the risk as much as the holdout
method does. On the other hand, since the estimate of risk is based on a single
sampled unit, it is usually a highly variable estimate.
We can repeat the process of splitting the dataset into training and validation
sets until all of the sampled units have had a chance to act as the validation
set. Continuing the example above, a subsequent iteration of the leave-one-out
cross-validation scheme may use $O_2 = (W_2, Y_2)$ as the validation set (where,
before, $O_1 = (W_1, Y_1)$ played that role), while the remaining $n-1$ sampled
units are included in the training set. Repeating this approach $n$ times
results in $n$ risk estimates, for example, $MSE_1, MSE_2, \ldots, MSE_n$ (note
that these are the mean squared error (MSE) estimates when unit $i$ is the
validation set). The estimate of the true risk is then the average over the $n$
leave-one-out risk estimates. While the leave-one-out cross-validation scheme
results in a less biased (albeit, more variable) estimate of risk than the
holdout method, it can be computationally very expensive to implement when $n$
is large.
We illustrate the usage of the leave-one-out cross-validation scheme with
`origami` below, using the `folds_loo(n)` function. In order to set up
`folds_loo(n)`, similarly to the case of the re-substitution method, we need
only the total number of sampled units over which the cross-validation procedure
is to operate. We show the first two folds generated by leave-one-out
cross-validation below.
```{r loo}
folds <- folds_loo(nrow(washb_data))
folds[[1]]
folds[[2]]
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### $V$-fold
An alternative to the leave-one-out scheme is $V$-fold cross-validation. This
cross-validation scheme randomly divides the dataset into $V$ splits of
equal (or approximately equal) size. For each split $v=1,\ldots,V$, the
$v$th fold is defined by the $v$th split (which defines the validation set for
fold $v$) and the complement of the $v$th split (which defines the training set
for fold $v$). The algorithms are fit $V$ times separately, to each of the
$V$ training sets. The risk of each fold's fitted algorithms is then evaluated
via predictions obtained from the validation set. The cross-validated risk for
a fitted algorithm, for example the MSE, is its average risk across all folds.
With $V$-fold cross-validation, all of the observations
are used in the training and validation stages, preventing the candidate
learning algorithm from overfitting to only a subset of the data (e.g., a given
training set).
For a dataset with $n$ sampled units, $V$-fold cross-validation with $V = n$
merely reduces to leave-one-out. Similarly, using only a single one of the $V$
training/validation splits amounts to the holdout method's estimate of the
candidate learning algorithm's performance.
Beyond its computational advantages, $V$-fold cross-validation often yields more
accurate estimates of the true, underlying risk. This is rooted in the differing
bias-variance trade-offs associated with these two cross-validation schemes:
While the leave-one-out scheme may be less biased, it has much greater variance
(since only a single unit is included in the validation set). This difference
becomes more obvious as $n$ becomes much greater than $V$. With the $V$-fold
cross-validation scheme, we end up averaging risk estimates across the $V$
validation folds, which are typically less correlated than the risk estimates
from the leave-one-out fits. Since the mean of many highly correlated
quantities has higher variance than the mean of less correlated ones,
leave-one-out estimates of the risk will have higher variance than the
corresponding estimates based on $V$-fold cross-validation.
Now, let's see $V$-fold cross-validation with `origami` in action! In the next
chapter, we will turn to studying the Super Learner algorithm --- an algorithm
capable of selecting the "best" algorithm from among a large library of candidate
learning algorithms --- which we'd like to fit _and_ evaluate the performance of.
The Super Learner algorithm relies on $V$-fold cross-validation as its default
cross-validation scheme. In order to set up $V$-fold cross-validation, we need
to call `origami`'s `folds_vfold(n, V)` function. The two required arguments for
`folds_vfold(n, V)` are the total number of sample units to be cross-validated
and the number of folds we wish to have.
For example, at $V=2$, we will get two folds, each with approximately $n/2$
sampled units in the training and validation sets.
```{r cv}
folds <- folds_vfold(nrow(washb_data), V = 2)
folds[[1]]
folds[[2]]
```
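
As a quick, optional check of the earlier claim that $V$-fold cross-validation
with $V = n$ reduces to leave-one-out, we can set $V$ to the number of sampled
units and confirm that each resulting fold holds out exactly one unit (the
ordering of folds may differ from `folds_loo()` due to random assignment):

```{r vfold_equals_loo}
# V-fold cross-validation with V = n: one fold per sampled unit
folds_vn <- folds_vfold(nrow(washb_data), V = nrow(washb_data))
length(folds_vn)

# every validation set contains exactly one sampled unit, as in leave-one-out
all(lengths(lapply(folds_vn, `[[`, "validation_set")) == 1)
```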
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### Monte Carlo
In the Monte Carlo cross-validation scheme, we randomly select some fraction of
the data, _without replacement_, to form the training set, assigning the
remainder of the sampled units to the validation set. In this way, the dataset
is randomly divided into two independent splits: A training set of $n_0 = n
\cdot (1 - p)$ observations and a validation set of $n_1 = n \cdot p$
observations. By repeating this procedure many times, the Monte Carlo
cross-validation scheme generates, at random, many training and validation
partitions of the dataset.
Since the partitions are independent across folds, the same observational unit
can appear in the validation set multiple times; note that this is a stark
difference between the Monte Carlo and $V$-fold cross-validation schemes. For a
given sampling fraction $p$, the Monte Carlo cross-validation scheme would be
optimal if repeated infinitely many times --- of course, this is not
computationally feasible. With Monte Carlo cross-validation, it is possible to
explore many more partitions of the dataset than with $V$-fold cross-validation,
resulting in (possibly) less variable estimates of the risk (across partitions),
though this comes at the cost of an increase in bias (because the splits are
correlated). Because Monte Carlo cross-validation generates many splits with
overlaps in the sampled units, more splits (and thus more computational time)
will be necessary to achieve the level of performance (in terms of unbiasedness)
that the $V$-fold cross-validation scheme achieves with only $V$ splits.
We illustrate the usage of the Monte Carlo cross-validation scheme with
`origami` below, using the `folds_montecarlo(n, V, pvalidation)` function. In
order to set up `folds_montecarlo(n, V, pvalidation)`, we need the following:
1. the total number of observations we wish to cross-validate;
2. the number of folds; and
3. the proportion of observations to be placed in the validation set.
For example, setting $V = 2$ and `pvalidation = 0.2`, we obtain two folds, each
with approximately $6$ sampled units in the validation set.
```{r montecarlo}
folds <- folds_montecarlo(nrow(washb_data), V = 2, pvalidation = 0.2)
folds[[1]]
folds[[2]]
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### Bootstrap
Like the Monte Carlo cross-validation scheme, the bootstrap cross-validation
scheme also consists of randomly selecting sampled units, _with replacement_,
for the training set; the rest of the sampled units are allocated to the
validation set. This process is then repeated multiple times, generating (at
random) new training and validation partitions of the dataset each time. In
contrast to the Monte Carlo cross-validation scheme, the total number of sampled
units in training and validation sets (i.e., the sizes of the two partitions)
across folds is not held constant. Also, as the name suggests, sampling is
performed with replacement (as in the bootstrap [@davison1997bootstrap]), hence
the exact same observational units may be included in multiple training sets.
The proportion of observational units in the validation set (i.e., the units
never drawn into the bootstrap training sample) is a random variable, with
expectation approximately $(1 - 1/n)^n \approx e^{-1} \approx 0.368$.
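
To see where this number comes from, note that each of the $n$ units is
excluded from a bootstrap training sample of size $n$ with probability
$(1 - 1/n)^n$, which converges to $e^{-1}$; a quick computation for our $n = 30$
subset:

```{r bootstrap_oob}
# expected proportion of units left out of a bootstrap training sample
n <- nrow(washb_data)
(1 - 1 / n)^n

# the limiting value as n grows
exp(-1)
```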
<!--
nh: I don't follow the last bit about the proportion -- might be that I'm
missing something but it seems to be coming out of nowhere
-->
We illustrate the usage of the bootstrap cross-validation scheme with `origami`
below, using the `folds_bootstrap(n, V)` function. In order to set up
`folds_bootstrap(n, V)`, we need to specify the following arguments:
1. the total number of observations we wish to cross-validate; and
2. the number of folds.
For example, setting $V=2$, we obtain two folds, each with different numbers of
sampled units in the validation sets across the folds.
```{r bootstrap}
folds <- folds_bootstrap(nrow(washb_data), V = 2)
folds[[1]]
folds[[2]]
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
<!--
RP:
Should we add stratified cross-validation and clustered cross-validation
examples with origami? I think these are both pretty common
-->
### Cross-validation for Time-series Data
The `origami` package also supports numerous cross-validation schemes for
time-series data, for both single and multiple time-series
with arbitrary time and network dependence.
### `AirPassengers` Data Example {-}
In order to illustrate different cross-validation schemes for time-series, we
will be using the _AirPassengers_ data; this is a widely used, freely available
dataset. The _AirPassengers_ dataset, included in `R`, provides monthly totals
of international airline passengers between the years 1949 and 1960.
**Goal:** we want to forecast the number of airline passengers at a horizon of
$h$ time points ahead, using the historical data from 1949 to 1960.
```{r plot_airpass}
library(ggfortify)
data(AirPassengers)
AP <- AirPassengers
autoplot(AP) +
labs(
x = "Date",
y = "Passenger numbers (1000's)",
title = "Air Passengers from 1949 to 1961"
)
t <- length(AP)
```
#### Rolling origin
The rolling origin cross-validation scheme lends itself to "online" learning
algorithms, in which large streams of data have to be fit continually
(respecting time), where the fit of the learning algorithm is (constantly)
updated as more data accrues. In general, the rolling origin scheme defines an
initial training set and, with each iteration, grows the training set by a
batch of $m$ observations while the size of the validation set remains
constant; there may be a gap of size $h$ (a lag window) between the training
and validation times, and new folds are added until time $t$ is reached in the
validation set. The time points included in the training set always lag behind
those in the validation set.
To further illustrate rolling origin cross-validation, we show below an example
that yields three folds. Here, the first window size is fifteen time points, on
which we first train the candidate learning algorithm. We then evaluate its
performance on ten time points with a gap ($h$) of five time points between the
training and validation sets.
For the subsequent fold, we train the learning algorithm on a longer stream of
data, 25 time points, including the original fifteen with which we initially
started. We then evaluate its performance on ten time points further ahead in
time.
```{r, fig.cap="Rolling origin CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/png/rolling_origin.png")
```
We illustrate the usage of the rolling origin cross-validation scheme with
`origami` below, using the `folds_rolling_origin(n, first_window,
validation_size, gap, batch)` function. In order to set up
`folds_rolling_origin(n, first_window, validation_size, gap, batch)`, we need
the following:
1. the total number of time points we wish to cross-validate (`n`);
2. the size of the first training set (`first_window`);
3. the size of the validation set (`validation_size`);
4. the gap between the training and validation sets (`gap`); and
5. the size of the training set update per iteration of cross-validation (`batch`).
Our time-series has $t=144$ time points. Setting `first_window` to $50$,
`validation_size` to 10, `gap` to 5, and `batch` to 20 yields four time-series
folds; we show the first two below.
```{r rolling_origin}
folds <- folds_rolling_origin(
n = t,
first_window = 50, validation_size = 10, gap = 5, batch = 20
)
folds[[1]]
folds[[2]]
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### Rolling window
Rather than adding more and more time points to the training set in each
iteration of cross-validation (as under the rolling origin scheme), the rolling
window cross-validation scheme "rolls" the training sample forward in time by
$m$ units (of time). This strategy can be useful, for example, in settings with
parametric learning algorithms, which are often very sensitive to moment (e.g.,
mean, variance) or parameter drift, which is itself challenging to explicitly
account for in the model construction step. The rolling window scheme is also
computationally more efficient than the rolling origin scheme, and it may be
preferred in streaming data analyses, where the full training history can be
too large for convenient access. In contrast to the rolling origin scheme, the
number of sampled units in the training set is the same for each iteration of
the rolling window scheme.
The illustration below depicts rolling window cross-validation using three
time-series folds. The first window size is 15 time points, on which we first
train the candidate learning algorithm. As in the previous illustration, we
evaluate its performance on 10 time points, with a gap of size 5 time points
between the training and validation sets. However, for the next fold, we train
the learning algorithm on time points further away from the origin (here, 10
time points). Note that the size of the training set in the new fold is the same
as in the first fold (both include 15 time points). This setup keeps the
training sets comparable over time (and across folds), unlike under the rolling
origin cross-validation scheme. We then evaluate the performance of the
candidate learning algorithm on 10 time points in the future.
```{r, fig.cap="Rolling window CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/png/rolling_window.png")
```
We demonstrate the usage of the rolling window cross-validation scheme with
`origami` below, using the `folds_rolling_window(n, window_size,
validation_size, gap, batch)` function. In order to set up
`folds_rolling_window(n, window_size, validation_size, gap, batch)`, we need to
specify the following arguments:
1. the total number of time points we wish to cross-validate (`n`);
2. the size of the training sets (`window_size`);
3. the size of the validation set (`validation_size`);
4. the gap between training and validation set (`gap`); and
5. the size of the training set update per iteration of cross-validation (`batch`).
Setting the `window_size` to $50$, `validation_size` to 10, `gap` to 5 and
`batch` to 20, we also get 4 time-series folds; we show the first two below.
```{r rolling_window}
folds <- folds_rolling_window(
n = t,
window_size = 50, validation_size = 10, gap = 5, batch = 20
)
folds[[1]]
folds[[2]]
```
<!--
nh: should we comment briefly on the displayed structure of the training
and validation folds?
-->
#### Rolling origin with $V$-fold
A variant of the rolling origin cross-validation scheme, accounting for sample
dependence, is the rolling-origin-$V$-fold cross-validation scheme. In contrast
to the canonical rolling origin scheme, under this hybrid scheme, sampled units
in the training and validation sets are _not_ the same, which this scheme
accomplishes by incorporating $V$-fold cross-validation within the time-series
setup. Here, the learning algorithm's predictions are evaluated on the future
time points of the time-series observational units excluded from the training
step, accommodating potential dependence not only across time but also across
observational units. To use the rolling-origin-$V$-fold cross-validation scheme
with `origami`, we can invoke the `folds_vfold_rolling_origin_pooled(n, t, id,
time, V, first_window, validation_size, gap, batch)` function. In the figure
below, we show $V=2$ folds, alongside two time-series (rolling origin)
cross-validation folds.
```{r, fig.cap="Rolling origin V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/png/rolling_origin_v_fold.png")
```
#### Rolling window with $V$-fold
Just like the rolling origin approach described above, the rolling window
approach can be extended to support multiple time-series with arbitrary
sample-level dependence by incorporating a $V$-fold splitting component. This
rolling-window-$V$-fold cross-validation scheme can be used
through `origami` via the `folds_vfold_rolling_window_pooled(n, t, id, time, V,
window_size, validation_size, gap, batch)` function. The figure below displays
$V=2$ folds and two time-series (rolling window) cross-validation folds.
```{r, fig.cap="Rolling window V-fold CV", results="asis", echo=FALSE}
knitr::include_graphics(path = "img/png/rolling_window_v_fold.png")
```
## General workflow of `origami`
Before we dive into more details, let's take a moment to review some of the
basic functionality in the `origami` `R` package. The main workhorse function in
`origami` is `cross_validate()`. To start off, the user must define the fold
structure and a function that operates on each fold (this `cv_fun()`, in
`origami`'s parlance, usually dictates how the candidate learning algorithm is
trained and its predictions validated).
Once passed to `cross_validate()`, the workhorse function will iteratively apply
the specified function (i.e., `cv_fun()`) to each fold, combining the
fold-specific results in a meaningful way. We will see this in action in later
sections --- for now, we provide specific details on each step of this
process below.
### (1) Define folds
The `folds` object passed to `cross_validate` is a `list` of folds; such `list`
objects are generated using the `make_folds()` helper function. Each fold
consists of a `list` with a `"training"` index vector, a `"validation"` index
vector, and a `"fold_index"` (its order in the overall `list` of folds). The
`make_folds()` function supports a variety of cross-validation schemes,
described in the preceding section. The `make_folds()` function can also ensure
balance across levels of a given variable (through the `strata_ids` argument),
and it can also keep all observations on the same independent unit together (via
the `cluster_ids` argument).
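
As a brief sketch of these two arguments (using hypothetical cluster
memberships and strata, since the 30-unit subset is treated as i.i.d. here), we
can pass `cluster_ids` and `strata_ids` to `make_folds()` as follows:

```{r make_folds_cluster_strata}
# hypothetical cluster memberships: 15 clusters of 2 units each
clusters <- rep(seq_len(15), each = 2)

# hypothetical two-level stratifying variable
strata <- rep(c("a", "b"), length.out = nrow(washb_data))

# keep all units within a cluster together in training or validation
folds_clustered <- make_folds(washb_data, V = 5, cluster_ids = clusters)

# balance each fold with respect to the strata variable
folds_stratified <- make_folds(washb_data, V = 5, strata_ids = strata)
```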
### (2) Define the fold function
The `cv_fun` argument to `cross_validate()` is a custom function that performs
some operation on each fold (again, _usually_ this specifies the training of the
candidate learning algorithm and its evaluation on a given training/validation
split, i.e., in a single fold). The first argument to this function is the
`fold`, which specifies the indices of the units in a given training/validation
split (note that this first argument is automatically passed to the `cv_fun()`
by `cross_validate()`, which queries the folds object from `make_folds()` in
doing so). Additional arguments can be passed to the `cv_fun()` through the
`...` argument to `cross_validate()`. Within this function, the convenience
functions `training()`, `validation()` and `fold_index()` can be used to return
the various components of a fold object. When the `training()` or
`validation()` functions are passed an object of a particular class, they will
index that object in a sensible way. For instance, if the input object is a
vector, these helper functions will index the vector directly, but if the input
object is a `data.frame` or `matrix`, these functions will automatically index
the rows. This allows the user to easily partition data into training and
validation sets. The fold function must return a named `list` of results
containing whatever fold-specific outputs are desired.
### (3) Apply `cross_validate()`
After defining the folds, the `cross_validate()` function can be used to map the
`cv_fun()` across the `folds`; internally, this uses either `lapply()` or
`future_lapply()` (a parallelized variant of the same). In this way,
`cross_validate()` can be easily parallelized by specifying a parallelization
scheme (i.e., a `plan` from the [future parallelization framework for
`R`](https://Cran.R-project.org/package=future) [@bengtsson2021unifying]). The
application of `cross_validate()` generates a list of results, matching the
customized `list` specified in the relevant `cv_fun()`. As noted above, each
call to `cv_fun()` itself returns a `list` of results, with different named
slots for each type of result we wish to store. The main `cross_validate()`
loop generates a `list` of these individual, fold-specific `list`s of results (a
`list` of `list`s or "meta-list"). Internally, this "meta-list" is cleaned up
(by concatenation) such that only a single slot per type of result specified by
the `cv_fun()` is returned (this too is a `list` of the results for each fold).
By default, the `combine_results()` helper function is used to combine the
individual, fold-specific `list`s of results in a sensible manner. How results
are combined is determined automatically by examining the data types of the
results from the first fold. This can be modified by specifying a `list` of
arguments in the `.combine_control` argument.
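
Before turning to real data, here is a minimal toy sketch of the full workflow,
with a hypothetical `cv_mean()` fold function: it uses the mean of the training
split to predict the validation split and returns the resulting fold-specific
mean squared error, which `cross_validate()` then collects across folds.

```{r origami_workflow_toy}
# a toy cv_fun: predict the validation split with the training-split mean
cv_mean <- function(fold, x) {
  train_x <- training(x)
  valid_x <- validation(x)
  list(
    fold = fold_index(),
    mse = mean((valid_x - mean(train_x))^2)
  )
}

# simulate a simple numeric outcome and define 5 folds
set.seed(67391)
x <- rnorm(100)
folds <- make_folds(n = length(x), V = 5)

# map cv_mean over the folds and combine the fold-specific results
results <- cross_validate(
  cv_fun = cv_mean, folds = folds, x = x, use_future = FALSE
)
results$mse
mean(results$mse)
```

The same pattern --- define folds, define a `cv_fun()`, call
`cross_validate()` --- underlies the more realistic examples that follow.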
## Cross-validation in action
We've waited long enough. Now, let's see `origami` in action! In the next
chapter, we will learn how to use cross-validation with the Super Learner
algorithm, and how we can utilize the power of cross-validation to build optimal
ensembles of algorithms --- going far beyond the application of cross-validation
to a single statistical learning method.
### Cross-validation with linear regression
First, let's load the relevant `R` packages, set a seed (for reproducibility),
and once again load the WASH Benefits example dataset. For illustrative
purposes, we'll examine the application of cross-validation to simple linear
regression with `origami`, focusing on predicting the weight-for-height Z-score
(`whz`) using all of the other available covariates in the dataset. As mentioned
before, we will assume the dataset contains only independent and identically
distributed units, ignoring the clustering structure imposed by the trial
design. For the sake of illustration, we will work with only a subset of the
data, removing all observational units with missing covariate data from the
analysis-ready dataset. In the prior chapter, we discussed how to deal with
missingness.
```{r setup_ex}
library(stringr)
library(dplyr)
library(tidyr)
# load data set and take a peek
washb_data <- fread(
paste0(
"https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
"wash-benefits/washb_data.csv"
),
stringsAsFactors = TRUE
)
# remove missing data with drop_na(), then pick just the first 500 rows
washb_data <- washb_data %>%
drop_na() %>%
slice(1:500)
# specify the outcome and covariates as character vectors
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]
```
Here's a look at the data:
```{r origami_washb_example_table2, echo=FALSE}
if (knitr::is_latex_output()) {
head(washb_data) %>%
kable(format = "latex")
} else if (knitr::is_html_output()) {
head(washb_data) %>%
kable() %>%
kableExtra::kable_styling(fixed_thead = TRUE) %>%
kableExtra::scroll_box(width = "100%", height = "300px")
}
```
Let's remind ourselves of the covariates to be used in the prediction step:
```{r covariates}
covars
```
Next, let's fit a simple main-terms linear regression model to the
analysis-ready dataset. Here, our goal is to predict the weight-for-height
Z-score (`"whz"`, which we assigned to the variable `outcome`) using all of the
available covariate data. Let's try it out:
```{r linear_mod}
lm_mod <- lm(whz ~ ., data = washb_data)
summary(lm_mod)
```
We can assess the quality of the model fit on the dataset by comparing the
linear model's predictions of the weight-for-height Z-score against the
observations of the same in the dataset. This is the well-known, and standard,
mean squared error (MSE). We can extract this summary measure from the `lm`
model object like so
```{r get_naive_mse}
(err <- mean(resid(lm_mod)^2))
```
The MSE estimate is `r err`, which, from examination of the above, is merely the
mean of the squared residuals of the model fit. An important problem arises
when we assess the learning algorithm's quality in this way --- that is, because
we have trained our linear regression model on the complete analysis-ready
dataset and then assessed its performance (the MSE) on the same dataset, all of
the data is used for both model training and validation. Unfortunately, this
simple estimate of the MSE is overly optimistic. Why? The linear regression
model is trained on the same dataset used in its evaluation, not unlike reusing
problems from a homework assignment in a course examination. Of course, we are
generally not interested in how well the algorithm explains variation in the
observed dataset; rather, we are interested in how well the explanations
provided by the learning algorithm generalize to a target population from which
this particular sample is drawn. By using all of the data available to us for
training the learning algorithm, we are left unable to honestly evaluate how
well the algorithm fits (and, thus, explains) variation at the level of the
target population.
<!--
RP:
suggestion to make the above paragraph more concise
-->
To resolve this issue, cross-validation allows for a particular procedure (e.g.,
linear regression) to be implemented over training and validation splits of the
dataset, evaluating how well the procedure fits on a holdout (or validation)
set. This evaluation of the learning algorithm's quality on data unseen during
the training phase provides an honest evaluation of the algorithm's
generalization error.
We can easily incorporate cross-validation into our linear regression procedure
using `origami`. First, let's define a new function to perform linear regression
on a specific partition of the dataset (i.e., a fold):
```{r define_fun_cv_lm}
cv_lm <- function(fold, data, reg_form) {
# get name and index of outcome variable from regression formula
out_var <- as.character(unlist(str_split(reg_form, " "))[1])
out_var_ind <- as.numeric(which(colnames(data) == out_var))
# split up data into training and validation sets
train_data <- training(data)
valid_data <- validation(data)
# fit linear model on training set and predict on validation set
mod <- lm(as.formula(reg_form), data = train_data)
preds <- predict(mod, newdata = valid_data)
valid_data <- as.data.frame(valid_data)
# capture results to be returned as output
out <- list(
coef = data.frame(t(coef(mod))),
SE = (preds - valid_data[, out_var_ind])^2
)
return(out)
}
```
Our `cv_lm()` function is quite simple: It merely splits the available data into
distinct training and validation sets (using the eponymous functions provided in
`origami`), fits the linear model on the training set, and evaluates the quality
of the trained linear regression model on the validation set. This is a simple
example of what `origami` considers to be a `cv_fun()` --- functions for applying
a particular routine over an input dataset in a cross-validated manner.
Having defined such a function, we can simply generate a set of partitions using
`origami`'s `make_folds()` function and apply our `cv_lm()` function over the
resultant `folds` object using `cross_validate()`. Below, we replicate the
re-substitution estimate of the error --- we did this "by hand" above --- using the
functions `make_folds()` and `cv_lm()`.
```{r cv_lm_resub}
# re-substitution estimate
resub <- make_folds(washb_data, fold_fun = folds_resubstitution)[[1]]
resub_results <- cv_lm(fold = resub, data = washb_data, reg_form = "whz ~ .")
mean(resub_results$SE, na.rm = TRUE)
```
This (nearly) matches the estimate of the error that we obtained above.
We can more honestly evaluate the error by $V$-fold cross-validation, which
partitions the dataset into $V$ subsets, fitting the algorithm on $V - 1$ of the
subsets (training) and evaluating on the subset that was held out from
fitting (validation). This is repeated such that each holdout subset
takes a turn being used for validation. We can easily apply our `cv_lm()`
function in this way using `origami`'s `cross_validate()` (note that by default
this function performs $10$-fold cross-validation):
```{r cv_lm_cross_valdate}
# cross-validated estimate
folds <- make_folds(washb_data)
cvlm_results <- cross_validate(
cv_fun = cv_lm, folds = folds, data = washb_data, reg_form = "whz ~ .",
use_future = FALSE
)
mean(cvlm_results$SE, na.rm = TRUE)
```
Having performed $V$-fold cross-validation with 10 folds (the default), we
quickly notice that our previous
estimate of the model error (by re-substitution) was a bit optimistic. The honest
estimate of the linear regression model's error is larger!
### Cross-validation with random forests
To examine `origami` further, let's return to our example analysis using the
WASH Benefits dataset. Here, we will write a new `cv_fun()` function. As an
example, we will use Breiman's random forest algorithm [@breiman2001random],
implemented in the `randomForest()` function (from the `randomForest` package):
```{r cv_fun_randomForest}
# make sure to load the package!
library(randomForest)
cv_rf <- function(fold, data, reg_form) {
# get name and index of outcome variable from regression formula
out_var <- as.character(unlist(str_split(reg_form, " "))[1])
out_var_ind <- as.numeric(which(colnames(data) == out_var))
# define training and validation sets based on input object of class "folds"
train_data <- training(data)
valid_data <- validation(data)
# fit Random Forest regression on training set and predict on holdout set
mod <- randomForest(formula = as.formula(reg_form), data = train_data)
preds <- predict(mod, newdata = valid_data)
valid_data <- as.data.frame(valid_data)
# define output object to be returned as list (for flexibility)
out <- list(
coef = data.frame(mod$coefs),
SE = ((preds - valid_data[, out_var_ind])^2)
)