forked from LucyMcGowan/dissertation-toolkit
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-chapter2.Rmd
144 lines (80 loc) · 26.5 KB
/
02-chapter2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
# Adaptive Monitoring using Second Generation p-Values Draft
## Introduction
Conclusive clinical trials either rule out clinically trivial or clinically actionable treatment effects. Like driving blind, this is a hard ideal to achieve without adaptive monitoring. The costs of ending too soon, when resources and knowledge were available otherwise, can be extraordinary [@Pocock:2016ca]. On the other hand, a significant yet non-clinically actionable study can also be costly [@Pocock:2016ey]. For these reasons, and to the extent possible, investigators turn to adaptive monitoring designs because the sample size is “not too big, not too small, but just right [@Broglio:2014fr].” To this we add the imperative: _to the point of being clinically conclusive_.
Unlike many study design aspects, clinical relevance is not an unknown assumption. A-priori scientific relevance informs which treatment effects are trivially null effects and which are clinically actionable enough to change clinical practice. Most sample size estimates already incorporate the latter into the study design. Though some study designs rule out a set of trivially null effects [@Kruschke:2013jy; @Hobbs:2008ce; @Freedman:1984wz], more common practice is to test for any difference from the point null.
Trivial effects surround the point null and include, in the least, indistinguishable treatment effects due to rounding error [@Blume:SGPV]. They may also include clinically irrelevant changes in a biomarker. This set of effects has many names including but not limited to _Indifference Zone_ and _Region of Practical Equivalence_ [@Blume:SGPV; @Kruschke:2013jy]; we call this set of effects the _Trivial Zone_. Switching from a point null to interval null carries intuitive clinical interpretation and endows a study with desirable statistical benefits.
Interval null hypotheses reduce family-wise Type I Error rates [@Blume:SGPV; @Kruschke:2013jy]. False discoveries most frequently occur closest to the point null hypothesis, and an interval null provides a stricter rejection criteria buffer that eliminates many false discoveries. For the same reason, the interval null provides a natural Type I Error adjustment for multiple looks / comparisons. Inferential approaches to evaluating interval hypotheses include Bayesian, Likelihood, and second-generation _p_-value inference. For any estimated interval (i.e. Confidence Interval, Credible Interval, Support Interval, etc.), the second-generation _p_-value draws conclusions from how much of the interval overlaps an interval hypothesis [@Blume:SGPV].
Many Bayesian adaptive trial designs incorporate interval null hypotheses [@Freedman:1984wz; @Berry:2010fr; @Hobbs:2008ce; @Kruschke:2013jy], and we develop an analogous Second Generation p-value adaptive design. The rest of the paper follows as: establishing a-priori clinically relevant guideposts, introducing the second generation _p_-value, adaptively monitoring with with the second generation p-value, comparing the second generation _p_-value and Bayesian adaptive monitoring with the REACH clinical trial data, and drawing final conclusions.
## Clinically Relevant Guideposts
In a two-sided study, four boundaries provide clinical guideposts. Highly actionable treatment effects are of a magnitude of at least of $\delta_E$ for benefit or of $\delta_H$ for harm. Trivial treatment effects are between $\delta_{TE}$ and $\delta_{TH}$ and encompass the point null (Figure 1). The remaining effects are moderately actionable; moderate treatment benefits are between $\delta_{TE}$ and $\delta_{E}$ and moderate treatment harms are between $\delta_{TH}$ and $\delta_{H}$. In any study, the implementation of an intervention takes into account secondary measures such as side effects, costs, etc.. Hence, a highly actionable effect is likely to outweigh the off-setting benefits / costs. A moderately actionable effect has greater equipoise with offsetting benefits / costs.A one sided study omits $\delta_H$ and $\delta_{TH}$ to focus only on $\delta_TE$ and $\delta_{E}$.
```{r clinicalGuideposts, echo=FALSE, fig.pos = "H", fig.align='center',fig.cap="Clinically relevant guideposts are determined during study design, based on scientific context, and ought to be incorporated into the final study inference. In a one-sided study (left figure), the clinical guideposts create the regions: no more than trivial effect, moderately actionable effect, and highly actionable effect. In a two-sided study (right figure), three regions are created: trivial effects, moderately actionable effects, and highly actionable effects."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/clinicalGuideposts/Slidev03_01.png")
```
A similar set of clinical guideposts, the _Region of Equivalence_, uses only two boundaries[@Freedman:1984wz; @Hobbs:2008ce]. Treatment effects outside the _Region of Equivalence_ deem the novel intervention as either clinically superior or inferior to the standard of care. Under certain conditions (a one-sided study with a buffer around the point null), these guideposts match the one-sided guideposts we propose. The _Trivial Zone_ is necessary to rule out effects trivial to the point null and receive the benefits of an interval null such as reducing the False Discovery Probability.
Setting a-priori clinical guideposts brings transparency to the study design. Most study designs already incorporate $\delta_E$ when determining an adequate study sample size. Yet, in many instances, end trial inference does not incorporate $\delta_E$ (or other clinical guidepost decisions). For example, without knowing $\delta_E$, none of a Confidence, Credible, or Support Interval inform original study design intentions.
Excellent references are available for helping establish clinical guideposts [@Freedman:1984wz; @Spiegelhalter:1994cn; @Kruschke:2018bz; @Blume:SGPV].
## Second Generation p-value
An inferential metric, the second-generation _p_-value indicates when the data are compatible with the alternative hypothesis, the _Trivial Zone_ null hypothesis, or when the data are inconclusive [@Blume:SGPV]. More generally, it may be used for any interval hypothesis. The second-generation _p_-value calculates the overlap between an interval _I_ (any interval including but not limited to a Confidence, Credible, Support Interval, etc.) and the set of effects $\Delta_H$ in the hypothesis _H_. The interval includes [_a_, _b_] where _a_ and _b_ are real numbers such that _a_ < _b_, and the length of the interval is _b_ - _a_ and denoted |_I_|. The overlap between the interval and the set $\Delta_H$ is $|I \bigcap \Delta_H |$. The second generation p-value is then calculated as
$$
p_{H} = \frac{| I \bigcap \Delta_H |}{|I|} \times max\left\{\frac{| I |}{2 | \Delta_H |}, 1 \right\}.
$$
The multiplicative factor, $max\left\{\frac{| I |}{2 | \Delta_H |}, 1 \right\}$, provides a small sample size correction -- setting $p_H$ to 0.5 when an interval overwhelms $\Delta_H$ by at least twice the length. For a _Trivial Zone_ null hypothesis _T_, trivially null effects are ruled out when the $p_T = 0$ whereas non-trivial effects are ruled out when $p_T = 1$. The data are inconclusive when $0 < p_T < 1$.
Based on the four clinical guideposts, we define and focus on a Highly Actionable Hypothesis, _HA_, and Trivial Hypothesis, _T_ .
* Hypothesis _HA_: The treatment effect lies within a _Region of Clinically Highly Actionable Effects_ $\Delta_{HA} = (-\infty, \delta_{H}] \cup [\delta_{E}, \infty)$ for a two-sided study and $[\delta_{E}, \infty)$ for a one-sided study where a positive benefit is beneficial. Where a negative effect is beneficial, the one-sided study sets $\Delta_{HA} = (-\infty, \delta_{E}]$.
* Hypothesis _T_: The treatment effect lies within a _Region of Clinically Trivial Effects_ $\Delta_T = [\delta_{TH}, \delta_{TE}]$ for a two-sided study. In a one-sided study where a positive effect is beneficial, $\Delta_T = (\infty, \delta_{TE}]$ and is called a _Region of At Most Clinically Trivial Effects_. The bounds are mirrored when a negative effect is benificial.
Neither of these two hypotheses include moderately actionable treatment effects. When applied to the four clinical guideposts, nine conclusions may be drawn from the second-generation _p_-value and the Regions of Clinically Highly Actionable Effects and Clinically Trivial Effects (Figure 2). To motivate our adaptive monitoring design, we reduce to three conclusions:
* When $p_{HA} = 0$, the treatment effect is not clinically highly actionable.
* When $p_{T} = 0$, the treatment effect is not trivial different from the point null.
* Otherwise, the treatment effect is inconclusive
Again, the interpretation of these regions closely relate to the interpretations of the _Region of Equivalence_. They match exactly when in the case of a one-sided study with a _Trivial Zone_ around the point null.
```{r figs, echo=FALSE, fig.pos="H", fig.align='center',fig.cap="Final study inference ought to incorporate a-priori clinical guideposts. The second generation p-value draws inference based on hypothesized sets of treatment effects. We focus on two sets of hypotheses that form the Region of Clinically Trivial Effects and the Region of Clinically Highly Actionable Effects. (This figure focuses on two-sided studies yet similar conclusions are drawn for one-sided studies). With the two sets of hypotheses, the second-generation p-value can rule out trivial effects (the top four conclusions), rule out non-highly actionable effects (the middle five conclusions), or declare the study yet inconclusive. Confidence Intervals that correspond to a p-value close to but not exceeding 0.05 would be declared inconclusive."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/clinicalGuideposts/Slidev03_02.png")
#plot(x=1,y=1)
```
## Adaptive Monitoring Rules / Guidance
A study implements the following rules for adaptive monitoring with the second-generation _p_-value:
* Design: Investigators a-priori determine the four clinical guideposts (two guideposts in the case of a one-sided study).
* Wait to Monitor: Enroll _B_ patients before applying monitoring.
* Monitor: Calculate the second-generation p-value using an inferential interval of choice. Raise an alert when $p_{HA} = 0$ or $p_T = 0$. Continue monitoring until affirming the _same_ alert (i.e. that again $p_{HA} = 0$ or $p_T = 0$) _K_ patients later.
* Stop: Stop once affirming an alert or at the end of resources.
* Report: Report only the interval at stopping.
Finding the appropriate _B_ and _K_ is done through simulations in the study design stage; both help protect against Type I Errors and bias. Chapter 3 provides practical guidance for determining _B_ and _K_. Type I Errors more commonly occur early in adaptive monitoring while estimated statistics are yet unstable, and they may occur randomly in the discrete Brownian motion of the Monitoring Interval. Bias is inherent in all adaptive monitoring schemes that stop at the first instance an alert is raised. Requiring _K_ patients to affirm an alert allows regression to the true treatment effect and improves the reported interval’s coverage rate. To draw emphasis, we call intervals used in the monitoring phase Monitoring Intervals as they are not to be interpreted.
When the true effect is moderately actionable, the study may stop for concluding the effect to be either not-trivial or not-highly actionable. This may bring consertnation; however, it is an important study design feature. Without a region of clinically moderately actionable effects, a highly actionable effect would border trivial effects and have a 50-50 chance of stopping for being actionable or trivial. A study that stops for concluding a non-trivial effect may include moderately actionable effects in the inferential interval. And similarly, moderately actionable effects may be included in inferential intervals when stopping to conclude a non-actionable effect. This behavior reflects the greater degree of equipoise between the moderately actionable effects and the off-setting harms/benefits.
Moderately actionable effects have a greater chance of raising conflicting alerts -- for example to raise an alert for a non-trivial effect then _K_ patients later alert for a non-highly actionable effect. For this reason, we require the affirmation alert to be the same as the alert it affirms. Only the final interval’s operating characteristics are relevant.
With only adapting the monitoring rules, the remaining rules apply to monitoring with Bayesian Credible Intervals. An alert for a non-highly actionable effect when P(Treatment Effect $\notin \Delta_{HA}$ | Data) > 1 – $\alpha_{criteria-not-actionable}$ or for a non-trivial effect when P(Treatment Effect $\notin \Delta_{T}$ | Data) > 1 – $\alpha_{criteria-not-trivial}$. $\alpha_{criteria-not-actionable}$ and $\alpha_{criteria-not-trivial}$ are study design tuning parameters based on simulations to achieve a desired end of study Type I Error and Power.
## Adaptively Monitoring the REACH clinical trial
### Context
The Rapid Education/Encouragement And Communications for Health (REACH) randomized clinical trial [@Nelson:2018bw] is designed to help patients with diabetes better manage glycemic control (as measured by Percent Hemoglobin A1c) and adhere to medication. Patients randomized to the intervention receive text message-delivered diabetes support for over 12 months. Now with complete enrollment, 512 patients are currently being followed up through 3, 6, 12, and 15 months.
In this population, lower Hemoglobin A1c reflects improved glycemic control. A change in Hemoglobin A1c of +/- of 0.15 (median REACH baseline Hemoglobin A1c of 8.20 [IQR of 7.20, 9.53]) is clinically trivial, whereas a decrease of Hemoglobin A1c of 0.5 is highly actionable to the point of adopting this novel intervention. While we anticipate the intervention to improve Hemoglobin A1c, we designed the study to be two-sided. That is, $\delta_E$ = -0.50, $\delta_{TE}$ = -0.15, $\delta_{TH}$ = 0.15, and $\delta_H$ = 0.50.
### Simulation
The performance of adaptive monitoring using the second generation p-value and posterior probabilities was observed from 20,000 bootstrap samples REACH patients. With block two randomization half of the patients were randomized to receive the REACH intervention. For outcomes, we used three month predicted Hemoglobin A1c from an ordinary least square model adjusting for baseline covariates including Hemoglobin A1c, age at enrollment, gender, years of education, years of diabetes, race / ethnicity, medication type, income, and type of insurance. All continuous baseline covariates flexibly allowed for non-linear associations through restricted cubic splines. Predicted Hemoglobin A1c changed by a treatment effect for those randomized to the REACH intervention. We simulated multiple treatment effect settings {Treatment Effect: -1, -0.75, -0.50, -0.375, -0.15, 0}. Only negative effects were investigated but by symmetry, increases in Hemoglobin A1c of the same magnitude perform similarly. In these simulations we assumed instantaneous outcomes.
Monitoring began after the 40th enrolled patient and continued every 20th patient until either affirming an alert 40 patients later or until reaching the end of resources (512 patients). For each second-generation _p_-value Monitoring Interval, we fit a marginal ordinary least squares regression of outcome Hemoglobin A1c given treatment assignment and obtained the 95% Confidence Interval for the treatment effect. And, for adaptive monitoring using posterior probabilities, we used a Bayesian alternative to the t-test: two-sample difference of means where the intervention group mean, $Y_I$, and control group mean, $Y_C$, were both distributed $t(\mu, \sigma, \nu)$ [@Kruschke:2013jy].
For Bayesian priors, we set $\mu$ ~ $N(\bar{y}$, 1 / $\phi$-1(.9)) as a fairly flat prior centered on the average mean Hemoglobin A1c of all observations with a 0.10 probability of observing an absolute change in Hemoglobin A1c greater than 1. For a skeptical prior on $\mu$, we mixed the flat prior 1:1 with $\mu$ ~ $N(\bar{y}$, 0.15 / $\phi$-1(0.95)). We set $\sigma$ ~ Gamma($s_y$ / 1000, $s_y$ 1000) where $s_y$ was the non-pooled standard deviation of all Hemoglobin A1c outcomes, and $\nu$ ~ $Gamma(1 / 600, 30) + 1$. The $\nu$ parameter places a prior on the skewness of the data, and was chosen such that with the flat prior yielded group means similar to raw means.
The two Bayesian adaptive designs (one with a flat and the other skeptical prior) were calibrated to the second generation _p_-value design by finding the $\alpha_{criteria-not-actionable} = \alpha_{criteria-not-trivial}$ that achieve the same Type I Error given the true effect is zero and that achieve the same probability of stopping for a not-trivial effect given the true effect is -0.15 (the border of trivial effects). The later calibration was then used to compare adaptive designs in terms of reasons for stopping, average trial sample size, bias, and coverage of the reported interval. Operating characteristics are estimated for given treatment effects. When allowing for distributional assumptions of observing the treatment effect, a correctly specified prior has zero bias. <!-- For a point of reference, we include the Power curve for a fixed, one-look, hypothesis test given the data’s variability and the Type I Error of the equally calibrated adaptive designs. -->
### Results
The flat and skeptical prior Credible Interval adaptive monitoring designs were calibrated to have the same Type I Error as adaptive monitoring with the second-generation _p_-value (Type I Error = 0.033; Figure \@ref(fig:Power)). While this was close to 0.05, the Type I Error changes depending on the wait time until monitoring and the number of looks. The flat and skeptical priors as specified both favor, at least some degree the null hypothesis which is reflected as a decrease in Power.
The probability of stopping for being not-trivial provides an adaptive monitoring analog for a Type I error under an interval null hypothesis. When the true treatment effect was truly -0.15 (i.e. the boundary of trivial effects), and when adaptively monitoring with second-generation _p_-values, the probability of stopping for concluding not-trivial was 0.063 (\@ref(fig:state)). Posterior probability monitoring designs were again calibrated to this same probability for this and the remaining figure results. The probability of stopping for being not-trivial was slightly higher when monitoring using posterior probabilities conditioned on the flat prior. However, for a clinically meaningful treatment effect of -0.50, monitoring on posterior probabilities conditioned the flat had a higher probability of stopping for concluding a non-highly-actionable effect than monitoring with the second generation _p_-value. The two designs that monitored on Credible Intervals were more likely to stop for concluding not-highly actionable effects than monitoring on the second-generation _p_-value; this is consistent with both priors being at least slightly favorable to the point null.
Trials monitored with the second-generation p-value using monitoring conifence intervals lasted longest when the treatment effect was moderately-actionable (Figure \@ref(fig:sampleSize)). As posterior probabilities condition on priors more strongly favoring the null, the longer trials occur for effects deemed moderately- to highly-actionable effects.
The bias was well mitigated when adaptively monitoring with the second generation _p_-value and a posterior conditioned on a flat prior (Figure \@ref(fig:bias)). The flat prior Credible Interval pulls estimates toward the point null, which tended to slightly help when the treatment effect was clinically meaningful and slightly hurt when the trivially null. Neither the flat nor skeptical prior are biased if they are each reflective of the true distribution of observing treatment effects. Coverage rates were increasingly better when monitoring with posterior probabilities compared to the second generation _p_-value. Of the treatment effects investigated, the worst coverage was *0.93* when monitoring with the second generation _p_-value.
```{r Power, echo=FALSE, out.width = "3.5in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="In the REACH target population, a decrease in Hemoglobin A1c reflects desireable improvement in glycemic control. The Power curve (i.e. the probability of rejecting the point null of 0), was estimated from 20,000 adaptive monitoring simulations when monitoring using the second generation p-value (red circle) and posteriors conditioning on a flat prior (blue triangle) and skeptical prior (purple square). The intervention was simulated to have an effect of -1 (highly beneficial), -0.75, -0.50, -0.375, -0.15, to 0 effect (the point null). Bayesian adaptive monitoring designs were calibrated to have the same Type I Error as the second generation p-value adaptive design."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/powerCalibratedLooks20.jpg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r state, echo=FALSE, out.width= "1.8in", fig.align='center', fig.pos="H", fig.show="hold", fig.cap="In adaptive monitoring, a study can end in one of three states: concluding non-trivial effect, concluding non-highly-actionable effect, or ending at the end of resources. Above are estimated probabilities of ending in each of these states for each design and treatment effect. The probability of concluding non-trivial is an interval null analog to Power. The Bayesian adaptive monitoring designs were calibrated to have the same probability of concluding a non-trivial effect as adaptive monitoring with the second generation p-value. All following results are based off this calibration."}
knitr::include_graphics(c("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/probEffLooks20.jpeg","~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/probFutLooks20.jpeg","~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/probMoreLooks20.jpeg"))
```
```{r sampleSize, echo=FALSE, out.width= "3.5in", fig.align='center', fig.pos="H", fig.cap="Across simulations, the average time to stopping for the three adaptive designs is shown above. In these simulations the earliest possible stopping time was the 80th patient. Monitoring began at the 40th patient and continued at every 20th patient. Stopping required raising an alert and affirming the alert 40 patients later. Moderately actionable treatment effects required the greater sample size, while highly actionable and trivial effects required smaller sample sizes to suggest stopping."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/averageNLooks20.jpeg")
```
```{r bias, echo=FALSE, out.width="3.5in", fig.align='center', fig.pos="H", fig.cap="At the study end, the final estimate and interval are reported. Across simulations, the bias was well mitagated when adaptively monitoring using the second generation p-value and posterior probability conditioned on a flat prior. A positive bias occurs when being pulled toward the null, as happens with the skeptical prior."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/biasLooks20.jpeg")
```
```{r coverage, eval=TRUE, echo=FALSE, out.width="3.5in", fig.align='center', fig.pos="H", fig.cap="At the study end, the final estimate and interval are reported. In these simulations, a 95\\% Confidence Interval was reported when adaptively monitoring with the second-generation p-value (though any other interval could have been used for monitoring and reporting). And, Credible Intervals are reported for the Bayesian adaptive designs."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/coverageLooks20.jpeg")
```
## Discussion
We developed an adaptive monitoring scheme, using the second generation _p_-value, to follow studies until either ruling out trivially null or clinically highly actionable treatment effects. Two major contributions come from this scheme: (1) the monitoring scheme incorporates clinically relevant guideposts that may help trivially null and clinically highly actionable studies stop quickly while still reaching a clear clinical conclusion and (2) the easy-to-calculate second generation _p_-value allows an investigator to adaptively monitor with their inferential interval of choice (including but not limited to Confidence Intervals, Credible Intervals, and Support Intervals). While not explored explicity in this paper, this design is well suited for following multiple outcomes and following subgroups until they reach clear clinical conclusions.
In any adaptive monitoring implementation, the adoption of a novel agent depends on multiple outcomes and reaching concluding an effect is highly actionable may not mean the agent should be adopted. Data Safety Monitoring Boards are encouraged to think of stopping rules more as guidelines when weighing the totality of evidence.
As seen in other similar designs, the use of an interval null decreases the False Discovery Rate, making the results more reproducible [@Blume:SGPV; @Kruschke:2013jy]. Incorporating clinical relevance into the inference brings study design transparency that otherwise may get lost. For example, a Confidence, Credible, or Support Interval alone do not inform of the targeted treatment effect when choosing a sample size.
With data from the REACH randomized clinical trial, we simulated adaptive monitoring using rules based on the second-generation _p_-value compared to rules based on the posterior probabilities. Monitoring with the second-generation _p_-value and posterior probabilities conditioned on a flat prior performed similarly in terms of Power, probability of stopping for efficacy, average sample size, bias, and reported interval coverage. Monitoring using a flat prior Credible Interval had a slight edge in these metrics but also a slightly worse probability of concluding highly-actionable effects were not-highly actionable. The flat prior was not uniformly flat, it slightly favored the point null.
All of the adaptive designs studied may be modified to achieve a desired Type I Error and Power. When adaptively monitoring with the second generation _p_-value, one may change the wait time before monitoring and number of looks. We encourage clinical guideposts to be anchored on clinical relevance and not change solely for the purpose of achieving statistical properties. The same encouragement to not change anchors denoting clinical relevance hold true for monitoring with posterior probabilities.