-
Notifications
You must be signed in to change notification settings - Fork 15
/
index.Rmd
380 lines (266 loc) · 17 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
---
title: "Generate ROC Curve Charts for Print and Interactive Use"
author: "Michael C Sachs"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_width: 5
fig_height: 5
vignette: >
%\VignetteIndexEntry{plotROC Examples}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, warning = FALSE}
library(knitr)
knit_hooks$set(source = function(x, options){
if (!is.null(options$verbatim) && options$verbatim){
opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src)
opts = gsub(",\\s*eval\\s*=\\s*FALSE\\s*", "", opts)
bef = sprintf('\n\n ```{r %s}\n', opts)
stringr::str_c(
bef,
knitr:::indent_block(paste(x, collapse = '\n'), " "),
"\n ```\n"
)
} else {
stringr::str_c("\n\n```", tolower(options$engine), "\n",
paste(x, collapse = '\n'), "\n```\n\n"
)
}
})
```
# Introduction
## About ROC Curves
The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. In medicine, ROC curves have a long history of use for evaluating diagnostic tests in radiology and general diagnostics. ROC curves have also been used for a long time in signal detection theory.
The accuracy of a diagnostic test can be evaluated by considering the two possible types of errors: false positives, and false negatives. For a continuous measurement that we denote as $M$, convention dictates that a test positive is defined as $M$ exceeding some fixed threshold $c$: $M > c$. In reference to the binary outcome that we denote as $D$, a good outcome of the test is when the test is positive among an individual who truly has a disease: $D = 1$. A bad outcome is when the test is positive among an individual who does not have the disease $D = 0$.
Formally, for a fixed cutoff $c$, the true positive fraction is the probability of a test positive among the diseased population:
$$ TPF(c) = P\{ M > c | D = 1 \} $$
and the false positive fraction is the probability of a test positive among the healthy population:
$$ FPF(c) = P\{ M > c | D = 0 \} $$
Since the cutoff $c$ is not usually fixed in advance, we can plot the TPF against the FPF for all possible values of $c$. This is exactly what the ROC curve is, $FPF(c)$ on the $x$ axis and $TPF(c)$ along the $y$ axis.
## Motivation
In the medical literature, ROC curves are commonly plotted without the cutoff values displayed. Other problems with ROC curve plots are abundant in the medical literature. We aim to solve some of these problems by providing a plotting interface for the ROC curve that comes with sensible defaults. It is easy to create interactive ROC curves for local or web-based use. The next section details the usage of the `plotROC` package.
# Usage
## Shiny application
I created a [shiny application](https://shiny.rstudio.com) in order to make the features more accessible to non-R users. A limited subset of the functions of the plotROC package can be performed on an example dataset or on data that users upload to the website. Resulting plots can be saved to the users' machine as a pdf or as a stand-alone html file. It can be used in any modern web browser with no other dependencies at the website here: https://sachsmc.shinyapps.io/plotROC.
## Installation and loading
**plotROC** can be installed from github or CRAN. It requires a recent version of **ggplot2** (>2.0.0).
```{r load, eval = FALSE}
install.packages("ggplot2")
install.packages("plotROC")
library(plotROC)
```
## Quick start
After installing, the interactive Shiny application can be run locally.
```{r shiny, eval = FALSE}
shiny_plotROC()
```
## Command line basic usage
I start by creating an example data set. There are 2 markers, one that is moderately predictive and one that is not as predictive.
```{r dataset, echo = -1}
library(plotROC)
set.seed(2529)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
M2 <- rnorm(200, mean = D.ex, sd = 1.5)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
M1 = M1, M2 = M2, stringsAsFactors = FALSE)
```
### The Roc Geom
Next I use the `ggplot` function to define the aesthetics, and the `geom_roc` function to add an ROC curve layer. The `geom_roc` function requires the aesthetics `d` for disease status, and `m` for marker. The disease status need not be coded as 0/1, but if it is not, `stat_roc` assumes (with a warning) that the lowest value in sort order signifies disease-free status. `stat_roc` and `geom_roc` are linked by default, with the stat doing the underlying computation of the empirical ROC curve, and the geom consisting of the ROC curve layer.
```{r calc}
basicplot <- ggplot(test, aes(d = D, m = M1)) + geom_roc()
basicplot
```
The disease status aesthetic can be specified as a string or factor, but with a warning.
```{r warn}
ggplot(test, aes(d = D.str, m = M1)) + geom_roc()
```
The `geom_roc` layer includes the ROC curve line combined with points and labels to display the values of the biomarker at the different cutpoints. It accepts the argument `n.cuts` to define the number of cutpoints to display along the curve. Labels can be suppressed by using `n.cuts = 0` or `labels = FALSE`. The size of the labels and the number of significant digits can be adjusted with `labelsize` and `labelround`, respectively.
```{r opts}
ggplot(test, aes(d = D, m = M1)) + geom_roc(n.cuts = 0)
ggplot(test, aes(d = D, m = M1)) + geom_roc(n.cuts = 5, labelsize = 5, labelround = 2)
ggplot(test, aes(d = D, m = M1)) + geom_roc(n.cuts = 50, labels = FALSE)
```
We provide a function `style_roc` that can be added to a ggplot that contains an ROC curve layer. This adds a diagonal guideline, sets the axis labels, and adjusts the major and minor grid lines. The `direct_label` function operates on a ggplot object, adding a direct label to the plot. It attempts to intelligently select an appropriate location for the label, but the location can be adjusted with `nudge_x, nudge_y` and `label.angle`. If the `labels` argument is NULL, it will take the name from the mapped aesthetic.
```{r style}
styledplot <- basicplot + style_roc()
styledplot
basicplot + style_roc(theme = theme_grey, xlab = "1 - Specificity")
direct_label(basicplot) + style_roc()
direct_label(basicplot, labels = "Biomarker", nudge_y = -.1) + style_roc()
```
### Confidence regions and the Rocci Geom
It is common to compute confidence regions for points on the ROC curve using the Clopper and Pearson (1934) exact method. Briefly, exact confidence intervals are calculated for the $FPF$ and $TPF$ separately, each at level $1 - \sqrt{1 - \alpha}$. Based on result 2.4 from Pepe (2003), the cross-product of these intervals yields a $100 * (1 - \alpha)$ percent rectangular confidence region for the pair.
This is implemented in the `stat_rocci` and displayed as a `geom_rocci` layer. These both require the same aesthetics as the ROC geom, `d` for disease status and `m` for marker. By default, a set of 3 evenly spaced points along the curve are chosen to display confidence regions. You can select points by passing a vector of values in the range of `m` to the `ci.at` argument. By default, the significance level $\alpha$ is set to 0.05, this can be changed using the `sig.level` option.
```{r test-a-ci}
styledplot + geom_rocci()
styledplot + geom_rocci(sig.level = .01)
ggplot(test, aes(d = D, m = M1)) + geom_roc(n.cuts = 0) +
geom_rocci(ci.at = quantile(M1, c(.1, .4, .5, .6, .9))) + style_roc()
```
### Interactive Plots
Ggplot objects that contain a GeomRoc layer can be used to create an interactive plot and display it in the Rstudio viewer or default web browser by passing it to the `plot_interactive_roc`, or `export_interactive_roc` function. The style_roc function is applied by default. Give the function an optional path to an html file as an argument called `file` to save the interactive plot as a complete web page. By default, any existing Rocci layers are removed and replaced with a dense layer of confidence regions so that the user can click anywhere for a confidence region. This can be suppressed by `add.cis = FALSE`. Furthermore, the points layer of the Roc geom can be hidden by using the `hide.points` option.
Hovering over the display shows the cutoff value at the point nearest to the cursor. Clicking makes the cutoff label stick until the next click, and if confidence regions are available, clicks will also display those as grey rectangles. The confidence regions are automatically detected. When the user clicks on the ROC curve, the confidence region for the TPF and FPF is overlaid using a grey rectangle. The label and region stick until the next click.
```{r inter, eval = FALSE}
plot_interactive_roc(basicplot)
```
An interactive ROC plot can be exported by using the `export_interactive_roc` function, which returns a character string containing the necessary `HTML` and `JavaScript`. The character string can be copy-pasted into an html document, or better yet, incorporated directly into a dynamic document using `knitr` ([knitr homepage](https://yihui.org/knitr/)).
In a `knitr` document, it is necessary to use the `cat` function on the results and use the chunk options `results = 'asis'` and `fig.keep='none'` so that the interactive plot is displayed correctly. For documents that contain multiple interactive plots, it is necessary to assign each plot a unique name using the `prefix` argument of `export_interactive_roc`. This is necessary to ensure that the JavaScript code manipulates the correct svg elements. The next code block shows an example `knitr` chunk that can be used in an .Rmd document to display an interactive plot.
```{r int-no, fig.keep='none', results = 'asis', eval = FALSE, verbatim = TRUE}
cat(
export_interactive_roc(basicplot,
prefix = "a")
)
```
The result is shown below:
```{r int-yes, fig.keep='none', results = 'asis', fig.width=7, fig.height=7}
cat(
export_interactive_roc(basicplot,
prefix = "a")
)
```
Click for confidence regions.
### Multiple ROC curves
If you have grouping factors in your dataset, or you have multiple markers measured on the same subjects, you may wish to plot multiple ROC curves on the same plot. **plotROC** fully supports faceting and grouping done by ggplot2. In out example dataset, we have 2 markers measured in a paired manner:
```{r head}
head(test)
```
These data are in wide format, with the 2 markers going across 2 columns. ggplot requires long format, with the marker result in a single column, and a third variable identifying the marker. We provide the function `melt_roc` to perform this transformation. The arguments are the data frame, a name or index identifying the disease status column, and a vector of names or indices identifying the the markers. Optionally, the names argument gives a vector of names to assign to the marker, replacing their column names. The result is a data frame in long format.
```{r}
longtest <- melt_roc(test, "D", c("M1", "M2"))
head(longtest)
```
Then, the dataset can be passed to the ggplot function, with the marker name given as a grouping or faceting variable.
```{r group}
ggplot(longtest, aes(d = D, m = M, color = name)) + geom_roc() + style_roc()
ggplot(longtest, aes(d = D, m = M, color = name)) + geom_roc(n.cuts = 0) + style_roc()
ggplot(longtest, aes(d = D, m = M)) + geom_roc() + facet_wrap(~ name) + style_roc()
ggplot(longtest, aes(d = D, m = M, linetype = name)) + geom_roc() + geom_rocci()
ggplot(longtest, aes(d = D, m = M, color = name)) + geom_roc() + style_roc()
pairplot <- ggplot(longtest, aes(d = D, m = M, color = name)) +
geom_roc(show.legend = FALSE) + style_roc()
direct_label(pairplot)
pairplot + geom_rocci()
pairplot + geom_rocci(linetype = 1)
pairplot + geom_rocci()
```
Interactive versions of the plots are fully supported.
```{r grp1, fig.keep='none', results = 'asis', fig.width=7, fig.height=7}
cat(
export_interactive_roc(direct_label(pairplot),
prefix = "b")
)
```
Click for confidence regions.
```{r grp3, fig.keep='none', results = 'asis', fig.width=12, fig.height=6}
cat(
export_interactive_roc(
ggplot(longtest, aes(d = D, m = M)) + geom_roc() + facet_wrap(~ name),
prefix = "c", width = 10, height = 5)
)
```
Click for confidence regions.
Showing multiple curves is also useful when there is a factor that affects the classification accuracy of the test. Let's create another example dataset.
```{r ex2}
D.cov <- rbinom(400, 1, .5)
gender <- c("Male", "Female")[rbinom(400, 1, .49) + 1]
M.diff <- rnorm(400, mean = D.cov, sd = ifelse(gender == "Male", .5, 1.5))
test.cov <- data.frame(D = D.cov, gender = gender, M = M.diff)
```
```{r covplot}
bygend <- ggplot(test.cov, aes(d = D, m = M, color = gender)) + geom_roc(show.legend = FALSE)
direct_label(bygend) + style_roc()
```
## New Features
### AUC tables
```{r auctab}
test.cov2 <- data.frame(D = D.cov, gender = gender, M = M.diff,
group = c("A", "B")[rbinom(400, 1, .4) + 1])
bygend2 <- ggplot(test.cov2, aes(d = D, m = M, color = gender)) +
geom_roc() + facet_wrap(~ group)
calc_auc(bygend2)
```
Also works for multiple panels, and layers.
```{r auctab2}
set.seed(123)
x <- rnorm(100)
y <- round(plogis(3*x + rnorm(100, sd = 5)))
df <- data.frame(x, y, gp = c("A", "B"), pan = sample(c("Z", "W"), 100, replace = TRUE))
x2 <- rnorm(100)
y2 <- round(plogis(3*x + rnorm(100, sd = 2)))
df2 <- data.frame(x2, y2, gp = c("X", "Y"), pan = sample(c("Z", "W"), 100, replace = TRUE))
p3 <- ggplot(df, aes(d = y, m = x, color = gp)) + geom_roc() +
geom_roc(data = df2, aes(d = y2, m = x2, color = gp))
calc_auc(p3)
p4 <- ggplot(df, aes(d = y, m = x, color = gp)) + geom_roc() +
geom_roc(data = df2, aes(d = y2, m = x2, color = gp)) +
facet_wrap(~ pan)
calc_auc(p4)
```
### Custom labels and locations
```{r cuslab}
ggplot(test, aes(d = D, m = M1)) + geom_roc(cutoffs.at = c(2, 1, .5, 0, -.5, -1))
ggplot(test, aes(d = D, m = M1)) + geom_roc(cutoffs.at = c(2, 1, .5, 0, -.5, -1), cutoff.labels = letters[1:6])
```
### Increasing parameter
By default, the ROC calculation assumes that larger M is associated with a larger Pr(D = 1), you can now change this by setting `increasing = FALSE`.
```{r increasing}
ggplot(test, aes(d = D, m = M1)) + geom_roc(increasing = FALSE)
ggplot(test, aes(d = D, m = -M1)) + geom_roc(increasing = FALSE)
```
## Advanced options
### Themes and annotations
plotROC uses the `ggplot2` package to create the objects to be plotted. Therefore, themes and annotations can be added in the usual ggplot2 way, or with external libraries such as `ggthemes`. The area under the ROC curve can be calculated by passing the ggplot object to the `calc_auc` function.
```{r themes}
basicplot +
style_roc(theme = theme_grey) +
theme(axis.text = element_text(colour = "blue")) +
ggtitle("Themes and annotations") +
annotate("text", x = .75, y = .25,
label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2))) +
scale_x_continuous("1 - Specificity", breaks = seq(0, 1, by = .1))
```
### Other estimation methods
By default `calculate_roc` computes the empirical ROC curve. There are other estimation methods out there, as I have summarized in the introduction. Any estimation method can be used, as long as the cutoff, the TPF and the FPF are returned. Then you can simply pass those values in a data frame to the `ggroc` function, and overriding the default stat to `identity`.
```{r binorm}
D.ex <- test$D
M.ex <- test$M1
mu1 <- mean(M.ex[D.ex == 1])
mu0 <- mean(M.ex[D.ex == 0])
s1 <- sd(M.ex[D.ex == 1])
s0 <- sd(M.ex[D.ex == 0])
c.ex <- seq(min(M.ex), max(M.ex), length.out = 300)
binorm.roc <- data.frame(c = c.ex,
FPF = pnorm((mu0 - c.ex)/s0),
TPF = pnorm((mu1 - c.ex)/s1)
)
binorm.plot <- ggplot(binorm.roc, aes(x = FPF, y = TPF, label = c)) +
geom_roc(stat = "identity") + style_roc(theme = theme_grey)
binorm.plot
```
Interactive plots with `stat = "identity"` are not currently supported.
Another potential use of this approach is for plotting time-dependent ROC curves for time-to-event outcomes estimated as described in Heagerty, Lumley, and Pepe (2000).
```{r surv}
library(survivalROC)
survT <- rexp(350, 1/5)
cens <- rbinom(350, 1, .1)
M <- -8 * sqrt(survT) + rnorm(350, sd = survT)
sroc <- lapply(c(2, 5, 10), function(t){
stroc <- survivalROC(Stime = survT, status = cens, marker = M,
predict.time = t, method = "NNE",
span = .25 * 350^(-.2))
data.frame(TPF = stroc[["TP"]], FPF = stroc[["FP"]],
c = stroc[["cut.values"]],
time = rep(stroc[["predict.time"]], length(stroc[["FP"]])))
})
sroclong <- do.call(rbind, sroc)
ggplot(sroclong, aes(x = FPF, y = TPF, label = c, color = time)) +
geom_roc(labels = FALSE, stat = "identity") + style_roc(theme = theme_gray)
```
# Acknowledgements
This package would not be possible without the following:
- [ggplot2](https://ggplot2.tidyverse.org/)
- [gridSVG](https://sjp.co.nz/projects/gridsvg/)
- [d3.js](https://d3js.org/)