---
title: "Targeted Learning in R"
subtitle: "Causal Data Science with the tlverse Software Ecosystem"
author: "Mark van der Laan, Jeremy Coyle, Nima Hejazi, Ivana Malenica, Rachael
Phillips, Alan Hubbard"
date: "`r format(Sys.time(), '%B %d, %Y')`"
documentclass: krantz
lof: yes
fontsize: '12pt, krantz2'
monofontoptions: "Scale=0.7"
bibliography: [book.bib, packages.bib]
csl: jrss.csl
link-citations: yes
colorlinks: yes
site: bookdown::bookdown_site
description: "An open source handbook for causal machine learning and data science with the Targeted Learning framework using the [`tlverse` software ecosystem](https://github.com/tlverse)."
favicon: "img/favicons/favicon.png"
github-repo: tlverse/tlverse-handbook
graphics: yes
url: 'https\://tlverse.org/tlverse-handbook/'
twitter-handle: tlverse
---
# Welcome {-}
_Targeted Learning in `R`: Causal Data Science with the `tlverse` Software
Ecosystem_ is a fully reproducible, open source, electronic handbook for
applying Targeted Learning methodology in practice using the software stack
provided by the [`tlverse` ecosystem](https://github.com/tlverse). This work is
in a draft phase and is publicly available to solicit input from the community.
To view or contribute, visit the [GitHub
repository](https://github.com/tlverse/tlverse-handbook).
<!--- For HTML Only --->
`r if (knitr::is_latex_output()) '<!--'`
<img style="float: left; margin-right: 1%; margin-bottom: 0.01em"
src="img/logos/tlverse-logo.svg" width="30%" height="30%">
<img style="float: center; margin-right: 1%; margin-bottom: 0.01em"
src="img/logos/Rlogo.svg" width="35%" height="35%">
<img style="float: right; margin-right: 1%; margin-bottom: 0.01em"
src="img/logos/vdl-logo-transparent.svg" width="30%" height="30%">
<p style="clear: both;">
<br>
`r if (knitr::is_latex_output()) '-->'`
## Outline {-}
The contents of this handbook are meant to serve as a reference guide for both
applied research and for the teaching of short courses illustrating successful
applications of the Targeted Learning statistical paradigm. Each section
introduces a set of distinct causal inference questions, often motivated by a
case study, alongside statistical methodology and open source software for
assessing the scientific (causal) claim of interest. The set of materials
currently includes
* Motivation: [Why we need a statistical
revolution](https://senseaboutscienceusa.org/super-learning-and-the-revolution-in-knowledge/)
* The Roadmap and introductory case study: the WASH Benefits Bangladesh dataset
* Introduction to the [`tlverse` software
ecosystem](https://tlverse.org)
* Cross-validation with the [`origami`](https://github.com/tlverse/origami)
package
* Ensemble machine learning with the
[`sl3`](https://github.com/tlverse/sl3) package
* Targeted learning for causal inference with the
[`tmle3`](https://github.com/tlverse/tmle3) package
* Optimal treatment regimes and the
[`tmle3mopttx`](https://github.com/tlverse/tmle3mopttx) package
* Stochastic treatment regimes and the
[`tmle3shift`](https://github.com/tlverse/tmle3shift) package
* Causal mediation analysis with the
[`tmle3mediate`](https://github.com/tlverse/tmle3mediate) package
* _Coda_: [Why we need a statistical
revolution](https://senseaboutscienceusa.org/super-learning-and-the-revolution-in-knowledge/)
## What this book is not {-}
This book does __not__ focus on providing in-depth, technically sophisticated
descriptions of modern statistical methodology or recent advancements in
Targeted Learning. Instead, the goal is to convey key details of these
state-of-the-art statistical techniques in a manner that is clear, complete, and
intuitive, while simultaneously avoiding the cognitive burden carried by
extraneous details (e.g., mathematically niche theoretical arguments). Our aim
is for the presentations herein to serve as a coherent reference for researchers
-- applied methodologists and domain specialists alike -- that empowers them to
deploy the central statistical tools of Targeted Learning efficiently in their
scientific pursuits. For a mathematically sophisticated treatment of some of
these topics, inclusive of in-depth technical details, the interested reader is
invited to consult @vdl2011targeted and @vdl2018targeted, among numerous other
works, as appropriate. The primary literature in causal inference, machine
learning, and non/semi-parametric statistical theory includes many of the most
recent advances in Targeted Learning and related areas. For background in causal
inference, @hernan2022causal serves as an introductory modern reference.
## About the authors {-}
### Mark van der Laan {-}
Mark van der Laan is Professor of Biostatistics and of Statistics at UC Berkeley
and co-director of UC Berkeley's [Center for Targeted Machine Learning and
Causal Inference](https://ctml.berkeley.edu/). His research interests include
statistical methods in computational biology, survival analysis, censored data,
adaptive designs, targeted maximum likelihood estimation, causal inference,
data-adaptive loss-based learning, and multiple testing. His research group
developed loss-based super learning in semiparametric models, based on
cross-validation, as a generic optimal tool for the estimation of
infinite-dimensional parameters, such as nonparametric density estimation and
prediction with both censored and uncensored data. Building on this work, his
research group developed targeted maximum likelihood estimation for a target
parameter of the data-generating distribution in arbitrary semiparametric and
nonparametric models, as a generic optimal methodology for statistical and
causal inference. Since mid-2017, Mark's group has focused in part on the
development of a centralized, principled set of software tools for targeted
learning, the `tlverse`.
### Jeremy Coyle {-}
Jeremy Coyle, PhD, is a consulting data scientist and statistical programmer,
currently leading the software development effort that has produced the
`tlverse` ecosystem of R packages and related software tools. Jeremy earned his
PhD in Biostatistics from UC Berkeley in 2017, primarily under the supervision
of Alan Hubbard.
### Nima Hejazi {-}
[Nima Hejazi](https://nimahejazi.org) is an Assistant Professor of Biostatistics
at the [Harvard T.H. Chan School of Public
Health](https://www.hsph.harvard.edu/biostatistics/). He obtained his PhD in
Biostatistics at UC Berkeley, working with Mark van der Laan and Alan Hubbard,
and held an NSF Mathematical Sciences Postdoctoral Research Fellowship
afterwards. Nima's research interests blend causal inference, machine learning,
non- and semi-parametric inference, and computational statistics, with areas of
recent emphasis having included causal mediation analysis; efficient estimation
under biased, outcome-dependent sampling designs; and sieve estimation for
causal machine learning. His methodological work is motivated principally by
scientific collaborations in clinical trials and observational studies of
infectious diseases, in infectious disease epidemiology, and in computational
biology. Nima is also passionate about high-performance statistical computing
and open source software design for applied statistics and statistical data
science.
### Ivana Malenica {-}
[Ivana Malenica](https://imalenica.github.io/) is a Postdoctoral Researcher in
the [Department of Statistics](https://statistics.fas.harvard.edu/) at Harvard
and a Wojcicki and Troper Data Science Fellow at the [Harvard Data Science
Initiative](https://datascience.harvard.edu/). She obtained her PhD in
Biostatistics at UC Berkeley working with Mark van der Laan, where she was a
Berkeley Institute for Data Science and an NIH Biomedical Big Data Fellow. Her
research interests span non/semi-parametric theory, causal inference and machine
learning, with emphasis on personalized health and dependent settings. Most of
her current work involves causal inference with time and network dependence,
online learning, optimal individualized treatment, reinforcement learning, and
adaptive sequential designs.
### Rachael Phillips {-}
Rachael Phillips is a PhD student in biostatistics, advised by Alan Hubbard and
Mark van der Laan. She has an MA in Biostatistics, BS in Biology, and BA in
Mathematics. As a student of targeted learning, Rachael integrates causal
inference, machine learning, and statistical theory to answer causal questions
with statistical confidence. She is motivated by issues arising in healthcare,
and is especially interested in clinical algorithm frameworks and guidelines.
Related to this, she is also interested in experimental design;
human-computer interaction; statistical analysis pre-specification, automation,
and reproducibility; and open-source software.
### Alan Hubbard {-}
Alan Hubbard is Professor of Biostatistics at UC Berkeley, current chair of the
Division of Biostatistics of the UC Berkeley School of Public Health, head of
the data analytics core of UC Berkeley's SuperFund research program, and
co-director of UC Berkeley's [Center for Targeted Machine Learning and Causal
Inference](https://ctml.berkeley.edu/). His current research interests include
causal inference, variable importance analysis, statistical machine learning,
estimation of and inference for data-adaptive statistical target parameters, and
targeted minimum loss-based estimation. Research in his group is generally
motivated by applications to problems in computational biology, epidemiology,
and precision medicine.
<!--
# Acknowledgements {-}
-->
`r if (knitr::is_latex_output()) '<!--'`
## Reproducibility {-}
The `tlverse` software ecosystem is a growing collection of packages, several of
which are quite early on in the software lifecycle. The team does its best to
maintain backwards compatibility. Once this work reaches completion, the
specific versions of the `tlverse` packages used to produce it will be archived
and tagged.
This book was written using [bookdown](http://bookdown.org/), and the complete
source is available on [GitHub](https://github.com/tlverse/tlverse-handbook).
This version of the book was built with `r R.version.string`,
[pandoc](https://pandoc.org/) version `r rmarkdown::pandoc_version()`, and the
following packages:
```{r pkg-list, echo = FALSE, results="asis"}
# borrowed from https://github.com/tidymodels/TMwR/blob/master/index.Rmd
deps <- desc::desc_get_deps()
pkgs <- sort(deps$package[deps$type == "Imports"])
pkgs <- sessioninfo::package_info(pkgs, dependencies = FALSE)
df <- tibble::tibble(
  package = pkgs$package,
  version = pkgs$ondiskversion,
  source = gsub("@", "\\\\@", pkgs$source)
)
knitr::kable(df, format = "markdown")
```
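
To compare a local installation against the versions listed above, something
like the following could be used. This is only a minimal sketch:
`local_versions()` is a hypothetical helper written for illustration (it is not
part of the `tlverse` or of this book's build), and the package names are simply
those mentioned in the outline of this handbook.

```r
# Hypothetical helper (an assumption for illustration, not a tlverse function):
# return the locally installed version of each named package, or NA if the
# package is not installed. Uses only base R and does not load any package.
local_versions <- function(pkgs) {
  vapply(pkgs, function(pkg) {
    # system.file() returns "" when the package cannot be found
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
}

# Package names taken from the outline of this handbook
tlverse_pkgs <- c(
  "origami", "sl3", "tmle3", "tmle3mopttx", "tmle3shift", "tmle3mediate"
)

versions <- local_versions(tlverse_pkgs)
print(data.frame(package = names(versions), version = unname(versions)))
```

Packages reported as `NA` are not installed locally; any others can be checked
by eye against the table above.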
`r if (knitr::is_latex_output()) '-->'`
## Learning resources {-}
The reader need not be a fully trained statistician to begin understanding and
applying the methods in this handbook. However, it is highly recommended that
the reader have an understanding of basic statistical
concepts such as confounding, probability distributions, confidence intervals,
hypothesis tests, and regression. Advanced knowledge of mathematical statistics
may be useful but is not necessary. Familiarity with the `R` programming
language will be essential. We also recommend an understanding of introductory
causal inference.
For learning the `R` programming language we recommend the following (free)
introductory resources:
* [Software Carpentry's _Programming with
`R`_](http://swcarpentry.github.io/r-novice-inflammation/)
* [Software Carpentry's _`R` for Reproducible Scientific
Analysis_](http://swcarpentry.github.io/r-novice-gapminder/)
* [Garrett Grolemund and Hadley Wickham's _`R` for Data
Science_](https://r4ds.had.co.nz)
For a general, modern introduction to causal inference, we recommend
* [Miguel A. Hernán and James M. Robins' _Causal Inference: What If_
(2022)](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)
* [Jason A. Roy's _A Crash Course in Causality: Inferring Causal Effects from
Observational Data_ on
Coursera](https://www.coursera.org/learn/crash-course-in-causality)
Feel free to [suggest a
resource](https://github.com/tlverse/tlverse-handbook/issues)!
## Want to help? {-}
Any feedback on the book is very welcome. Feel free to [open an
issue](https://github.com/tlverse/tlverse-handbook/issues), or to make a Pull
Request if you spot a typo.