# The sgpvAM Package and Practical Recommendations for Adaptive Monitoring with the Second Generation p-value
## Introduction
In Chapter 2 we present an adaptive monitoring scheme that follows studies until evidence supports either a non-trivial or a non-highly actionable treatment effect. The design is easy to implement: it can be carried out by anyone who can think through the clinical interpretation of possible effect sizes and calculate an interval estimate for the effect, such as a credible, support, or confidence interval (CI). Estimating the operating characteristics of a given study design, however, is not as easy, and without tools to assist the trialist this step could become a barrier to adopting the method.
In practice, the trialist will want to know the operating characteristics under the following adaptive monitoring design features.
* Frequency of looks at the data, i.e. recalculate the CI at every jth subject.
* Minimum precision requirement before applying monitoring rules, i.e. check the monitoring rules only if |CI| < w.
* Required observations between an alert and affirmation to stop, i.e. evaluate the stopping rule j*k subjects after an alert.
* Anticipated maximum amount of data that could be collected, maxN, i.e. the sample size at which the study will cease collection regardless of the stopping rules.
* Lag time between enrolling a subject and observing their outcome measured in the number of subjects recruited during the lag, i.e. m additional subjects will be recruited in the time between one subject being recruited and that one subject’s outcome being observed.
The traditional trialist will mainly be looking for the point null Type I error probability and power, i.e. the probability of concluding the effect is non-trivial for a given true effect size. They may also want simple summary statistics for the potential sample sizes. We hope to provide access to these and to a richer set of operating characteristics. For a given set of design features, the operating characteristics we can potentially estimate include the following.
* Distribution of potential sample sizes.
* Point Null Type I error, when the point null is true, i.e. probability of excluding the point null from the final interval estimate. Classical trialists will want reassurance this is < 5%.
* Interval Null Type I error, when the point null is true, i.e. probability of excluding the entire trivial zone from the final interval estimate. This will be less than or equal to the point null Type I error.
* Power vs the point null, i.e. probability the final interval estimate excludes the point null for a given true effect size. This is akin to classical statistical power.
* Power vs the interval null, i.e. the probability the final interval estimate excludes the entire null zone for a given true effect size. This will be less than or equal to the point null power but is conceptually the preferable quantity: it is the probability of concluding the effect is non-trivial for a given true effect size.
* Probability of concluding effect is non-highly actionable for given true effect sizes.
* Probability of an inconclusive finding at the end of resources. Note a clinically inconclusive finding is a possibility whenever maxN is finite and/or m > 0.
* Bias, MSE, and interval coverage probability from a frequentist perspective.
* False confirmation probability under a specified prior distribution.
To address these needs, we provide an R function, sgpvAM, that simulates the above operating characteristics for normal and binomial outcomes, and we offer practical advice on setting the minimum wait time (in terms of inferential interval width) and the number of looks required before affirming an alert.
## sgpvAM Package
The sgpvAM package allows the user to obtain study design operating characteristics under a variety of settings for adaptive monitoring using the second generation p-value.
### MCMC Replicates
The user may use the sgpvAM function to generate MCMC replicates of outcomes and intervention assignments along with an estimate of the effect and its lower and upper interval bounds; replicates are generated using parallel computing. Alternatively, the user may provide previously generated data together with an estimated effect and interval bounds. When using the sgpvAM function, the user specifies the data generation function (any of the r[dist] functions, such as rnorm) along with its arguments, and similarly specifies how effects are generated. Currently, only fixed effects have been thoroughly tested. However, by specifying a distribution for the effects, the user may explore false discovery probabilities and other operating characteristics, such as bias, that depend on distributional assumptions about the effect.
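To make the interface concrete, the snippet below sketches how outcome and effect generation might be specified. The argument names (dataGeneration, dataGenArgs, effectGeneration, effectGenArgs) are those documented in the Inputs section below; the specific values are hypothetical and chosen only for illustration.
```{r, eval = FALSE}
# Illustrative (not prescriptive) specification of data and effect generation:
# standard normal outcomes and a fixed intervention effect of 0.15.
dataGeneration   <- rnorm
dataGenArgs      <- list(n = 500, mean = 0, sd = 1)
effectGeneration <- 0.15
# To study characteristics under distributional assumptions about the effect,
# a distribution could be supplied instead, e.g. effectGeneration = rnorm with
# effectGenArgs = list(n = 1, mean = 0, sd = 0.1).
```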
### One- vs Two-Sided Hypotheses
Clinical guideposts defining the regions of Trivial and Highly Actionable Effects must be provided, though the hypotheses may be one- or two-sided. The point null must lie within, and not on a boundary of, the Trivial Region. The inputs that define the regions are: deltaL2 (the Clinically Highly Actionable boundary less than the point null), deltaL1 (the Trivial Region boundary less than the point null), deltaG1 (the Trivial Region boundary greater than the point null), and deltaG2 (the Clinically Highly Actionable boundary greater than the point null). See Chapter 2 for a thorough discussion of these regions.
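As a hypothetical illustration (the numeric values below are ours, not a recommendation), a symmetric two-sided specification with a point null of zero might look like the following; a one-sided design would instead place the clinically meaningful boundaries on the relevant side of the point null.
```{r, eval = FALSE}
# Hypothetical symmetric two-sided guideposts around a point null of zero:
# effects within (-0.1, 0.1) are At Most Trivial, and effects at or beyond
# +/- 0.2 are Clinically Highly Actionable.
pointNull <- 0
deltaL2   <- -0.2   # Highly Actionable boundary below the point null
deltaL1   <- -0.1   # Trivial Region boundary below the point null
deltaG1   <-  0.1   # Trivial Region boundary above the point null
deltaG2   <-  0.2   # Highly Actionable boundary above the point null
```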
### Tuning study parameters
To tune operating characteristics for a given sample size, the sgpvAM function allows the user to specify multiple wait times, frequencies of looks, and numbers of steps before affirming a stopping rule. The wait time is the number of observations required until the expected Margin of Error reaches a specified length or less, where the Margin of Error is one-half the interval width.
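To give a sense of what a wait width implies for sample size, the back-of-the-envelope calculation below is our own illustration rather than a package computation; it assumes a two-arm comparison of means with equal allocation, a known outcome standard deviation of 1, and a 95% interval on the difference in means.
```{r, eval = FALSE}
# Approximate per-arm sample size at which the expected Margin of Error
# (half of a 95% CI for a difference in means, known sd = 1) reaches a
# target width. The assumptions are ours, for illustration only.
targetMoE <- 0.15
sd        <- 1
z         <- qnorm(0.975)
nPerArm   <- ceiling(2 * (z * sd / targetMoE)^2)
nPerArm   # roughly 342 observations per arm before monitoring would begin
```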
### Operating characteristics under normal outcomes
After generating the operating characteristics for a fixed effect with normal outcomes, the user may use the locationShift function to obtain operating characteristics under a range of fixed treatment effects. The function reuses the saved MCMC replicates and adds to them when additional monitoring is needed.
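A hypothetical call is sketched below; the locationShift function name comes from the package, but the argument names used here are placeholders, so the exact interface should be confirmed in the package documentation.
```{r, eval = FALSE}
# Re-use the saved replicates (generated at a single fixed effect) to obtain
# operating characteristics over a grid of shifted fixed effects.
# "am" is an object returned by a previous sgpvAM() call; the argument name
# "shift" is a placeholder, not necessarily the package's actual argument.
amShifted <- locationShift(am, shift = seq(-0.1, 0.4, by = 0.05))
```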
### ECDF of sample size and bias
Once a study design has been selected based on average performance (sample size, bias, and error probabilities), the user may use the ecdf.sgpv function to see the empirical cumulative distribution of sample size and bias across the MCMC replicates for a specific design. This provides an estimate of the probability that a study does not exceed a given maximum sample size.
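Again as a sketch with placeholder argument names (only the ecdf.sgpv function name is taken from the text above), inspecting the distribution for one candidate design might look like the following.
```{r, eval = FALSE}
# Empirical CDFs of sample size and bias for a single candidate design.
# "am" is an object returned by sgpvAM(); the arguments below are placeholders
# for the chosen wait width and number of affirmation steps; see the package
# documentation for the exact interface.
ecdf.sgpv(am, waitWidth = 0.15, steps = 50)
```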
### General suggestions
Computations may be time consuming. We recommend starting with 1,000 replicates to get a general sense of the average sample size and error probabilities across the investigated wait times and affirmation steps; investigating many wait times increases the computational burden. When generating data that allow for a location shift, we recommend generating the MCMC replicates at a midpoint between the Clinically Trivial and Highly Actionable Regions. This is where the expected sample size is greatest, which reduces how often the locationShift function must generate additional data.
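For instance, with the hypothetical guideposts from the earlier sketch, generating replicates at the upper midpoint would amount to the following.
```{r, eval = FALSE}
# Generate replicates at the midpoint between the upper Trivial and Highly
# Actionable boundaries, where the expected sample size is largest; this
# minimizes how often locationShift must generate additional data.
effectGeneration <- (deltaG1 + deltaG2) / 2   # (0.1 + 0.2) / 2 = 0.15
```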
### Inputs
**mcmcData** Previously generated data. Default (NULL) uses MCMC generation inputs to generate new or additional data.
**nreps** Number of MCMC replicates to generate.
**waitWidths** Wait times, in terms of the Margin of Error (one half the confidence interval width), required before monitoring the data.
**dataGeneration** Function (such as rnorm) to generate outcomes.
**dataGenArgs** Arguments for the dataGeneration function. At a minimum, this includes 'n', the number of observations to generate. If 'n' is insufficient for unrestricted adaptive monitoring, additional data will be generated.
**effectGeneration** Function (such as rnorm) or fixed value to generate intervention effect (theta).
**effectGenArgs** Arguments for the effectGeneration function (if any).
**modelFit** An existing or user-defined function to obtain intervals. Two existing functions are provided: 1) lmCI, which obtains a confidence interval from a linear model and has class 'normal', indicating normal data; and 2) lrCI, which obtains a Wald confidence interval from a logistic regression model and has class 'binomial', indicating binomial data.
**pointNull** Point null.
**deltaL2** Clinical guidepost less than and furthest from point null.
**deltaL1** Clinical guidepost less than and closest to point null.
**deltaG1** Clinical guidepost greater than and closest to point null.
**deltaG2** Clinical guidepost greater than and furthest from point null.
**lookSteps** The frequency with which the data are monitored (defaults to 1, i.e. fully sequential).
**kSteps** The affirmation steps considered range from 0 to maxAlertSteps in increments of kSteps.
**maxAlertSteps** Maximum number of steps before affirming an alert.
**maxN** Maximum number of patients with observed outcomes; total enrollment equals maxN plus lagOutcomeN.
**lagOutcomeN** Number of patients enrolled whose outcomes have not yet been observed; total enrollment equals maxN plus lagOutcomeN.
**monitoringIntervalLevel** Traditional (1 - alpha) level used for the monitoring intervals.
**printProgress** Prints a message when additional data are generated so that the MCMC replicates have sufficient observations to monitor until a conclusion. Defaults to TRUE.
**outData** Returns the MCMC-generated data. This can produce an output object with a large memory footprint, yet for location-shift data it can be reused to obtain operating characteristics of shifted effects.
**getECDF** Returns the ECDF of sample size and bias for each wait width and number of steps before affirming the end of study.
**cores** Number of cores used in parallel computing. The default (NULL) does not run on parallel cores.
**fork** Fork clustering; works on POSIX systems (Mac, Linux, Unix, BSD) but not Windows. Defaults to TRUE.
**socket** Socket clustering. Defaults to TRUE yet only applies when fork = FALSE.
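Putting these inputs together, the call below is a sketch rather than a worked example from this chapter: the argument names are those documented above, but the specific values (and the choice of lmCI) are illustrative only.
```{r, eval = FALSE}
library(sgpvAM)

# Hypothetical design: standard normal outcomes, fixed effect at the midpoint
# between the upper Trivial and Highly Actionable boundaries, three candidate
# wait widths, and a 50-subject outcome lag.
am <- sgpvAM(nreps            = 1000,
             waitWidths       = c(0.10, 0.15, 0.20),
             dataGeneration   = rnorm,
             dataGenArgs      = list(n = 800, mean = 0, sd = 1),
             effectGeneration = 0.15,
             modelFit         = lmCI,
             pointNull        = 0,
             deltaL2 = -0.2, deltaL1 = -0.1,
             deltaG1 =  0.1, deltaG2 =  0.2,
             lookSteps        = 1,     # fully sequential monitoring
             kSteps           = 10,    # affirmation steps considered: 0, 10, ..., maxAlertSteps
             maxAlertSteps    = 100,
             maxN             = 2000,
             lagOutcomeN      = 50,
             monitoringIntervalLevel = 0.95,
             getECDF          = TRUE,
             cores            = 4)

# End-of-study operating characteristics for each combination of wait width
# and number of affirmation steps:
am$mcmcEndOfStudy
```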
### Return values
The sgpvAM function returns a list with three elements:
1. mcmcMonitoring – the MCMC replicates (returned when outData is TRUE);
2. mcmcEndOfStudy – average operating characteristics and the ECDF for each combination of wait time and number of steps before affirming a stopping rule;
3. inputs – the inputs passed to the sgpvAM function.
As supporting material for the package, we have developed an extensive vignette that illustrates using sgpvAM to estimate and explore the impact of study design choices on point null Type I error, power across a range of true effect sizes, the average sample size, the distribution of possible sample sizes, and more. These include the types of figures and calculations that will be of particular interest to the traditional trialist. The full vignette may be found at https://github.com/chipmanj/sgpvAM or by loading the sgpvAM package and calling vignette(package = "sgpvAM", topic = "README"). Three of the example figures are presented briefly below.
Figure \@ref(fig:exampleRejPN) is a classical power curve showing the probability that the final CI excludes the point null at various true effect sizes. The power when the true effect size equals the point null is the classical point null Type I error probability. The figure illustrates the impact of various wait times on the power curve. The second generation p-value adaptive design allows for designing trials with finite and infinite maximum sample sizes in mind, and we recommend presenting both. Figure \@ref(fig:exampleRejPN) illustrates the infinite case. Notice the point null Type I error remains bounded below 5% even when the maximum sample size is theoretically infinite. Under this framework, a study that stopped for reaching a maximum sample size could be restarted without concern for point null Type I error control.
```{r exampleRejPN, echo=FALSE, out.width = "3.5in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Power curve across treatment effects for rejecting the point null in a one-sided study when requiring different wait times before monitoring. The horizontal line at 0.05 indicates the alpha level corresponding to the final reported confidence interval. The first vertical line denotes the upper boundary of At Most Trivial Effects, and the second vertical line denotes the boundary of the Highly Actionable Effects. The wait times are the expected sample sizes required to achieve a given confidence interval width. In this figure there is neither a restriction on sample size nor a lag in observing outcomes."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/sgpvAMexampleRejPN.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
Figure \@ref(fig:exampleAveN) displays the impact of increasing the affirmation steps on the average sample size. Increasing the affirmation steps has benefits in reducing bias, increasing interval coverage probabilities, and increasing the stability of conclusions particularly in the presence of a lag between recruitment and outcome observation. These benefits come with an increase in average sample size, particularly in between the bounds of the Trivial and Highly Actionable regions, i.e. where erroneous conclusions due to stopping too early are most likely.
```{r exampleAveN, echo=FALSE, out.width = "3.5in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="The average sample size across treatment effects in a one-sided study when requiring different wait times before monitoring. The first vertical line denotes the upper boundary of At Most Trivial Effects, and the second vertical line denotes the boundary of the Highly Actionable Effects. The wait times are the expected sample sizes required to achieve a given confidence interval width. In this figure there is neither a restriction on sample size nor a lag in observing outcomes."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/sgpvAMexampleAveN.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
Non-adaptive studies are typically designed with a high probability of an inconclusive finding. Consider a non-adaptive study designed to have 80% power at a clinically highly actionable effect size. Such a study will include the point null in its final interval 20% of the time, and it will include values from the null region with an even higher probability. Although designed to stop only when a clinically conclusive finding has been reached, second generation p-value adaptively monitored trials may still yield a clinically inconclusive finding if the trial has a fixed maximum sample size and/or a lag between recruitment and outcome observation. This probability is highest between the bounds of the Trivial and Highly Actionable regions.
Figure \@ref(fig:exampleMore) illustrates how increasing the affirmation steps controls this probability in the presence of a 50-subject lag. Even with this large lag, a relatively small affirmation step requirement bounds the probability of an inconclusive finding below 20%.
```{r exampleMore, echo=FALSE, out.width = "3.5in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="The probability of an inconclusive study (i.e. not ruling out Trivial or Highly Actionable effects) when outcomes have a lag time until being observed relative to enrollment. The study stops based upon conclusions drawn from observed data, yet after observing the remaining lagged outcomes (50 in this figure), the study may suggest more observations are needed to rule out Trivial or Highly Actionable effects. The risk of being inconclusive is greatest at the midpoint between the Trivial and Highly Actionable Regions (the boundaries of the regions are denoted by the two vertical lines). Requiring fifty observations to affirm an alert reduces the worst-case risk of being inconclusive to 20\\% in this setting."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/sgpvAMexampleProbMore.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
## Practical Recommendations
Of key importance to classical trialists is controlling the point null Type I error and achieving a high probability of excluding the point null when the true effect size is clinically highly actionable. Of key importance to the medical researcher is completing the study with a clinically highly actionable finding. We show that focusing on the latter will achieve the former. When a study cannot observe outcomes immediately, care should be taken to reduce the risk of stopping and then becoming inconclusive after the remaining outcomes are observed.
To control error rates, one may change the clinical guideposts, reduce how often the study is monitored, and/or increase the number of steps before affirming a stopping rule. Ideally, the clinical guideposts are chosen for their clinical interpretation, and we discourage altering them for the sake of operating characteristics. Instead, we encourage optimizing operating characteristics by waiting a period of time before monitoring and by requiring a number of observations to pass before affirming an alert to stop the study.
We consider various wait times based upon the expected confidence interval width under an assumed outcome standard deviation. Twenty thousand MCMC replicates of a study with standard normal outcomes are generated using the sgpvAM package under five one-sided hypothesis settings and five symmetric two-sided hypothesis settings. The settings reflect situations where (Setting 1) the Trivial and Highly Actionable Effect bounds are both close to the point null of zero, (Settings 2-4) the Trivial Effect bound is close to the null and the Highly Actionable Effect bound is far from the point null, and (Setting 5) the Highly Actionable Effect bound is far from the null and the Trivial Effect bound is increasingly close to it. These results generalize to normal outcomes with clinical guideposts expressed relative to the standard deviation.
For both one- and symmetric two-sided hypotheses, the probability of a Type I Error is minimized by waiting until the Margin of Error equals the midpoint between the Trivial and Highly Actionable Zones (Figure \@ref(fig:power)). For example, a one-sided study with At Most Trivial Effects defined as (-$\infty$, 0.1] and Highly Actionable Effects as [0.2, $\infty$) reduces the probability of a Type I Error by waiting until the Margin of Error is 0.15. In these simulations, the probability of a Type I Error remained less than or equal to 0.05 when waiting longer, until the Margin of Error equaled the positive boundary of the Trivial Effects.
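To make the trade-off concrete under the same illustrative assumptions used earlier (a two-arm comparison of means with known standard deviation 1 and a 95% interval; these assumptions are ours and not part of the simulations), the minimum per-arm sample sizes implied by waiting until the midpoint (Margin of Error 0.15) versus the Trivial boundary (Margin of Error 0.10) are roughly as follows.
```{r, eval = FALSE}
# Rough per-arm sample size needed before monitoring begins for each candidate
# wait, assuming a two-arm comparison of means with known sd = 1 and a 95% CI.
z   <- qnorm(0.975)
moe <- c(midpoint = 0.15, trivialBoundary = 0.10)
ceiling(2 * (z / moe)^2)   # about 342 per arm at MoE 0.15, 769 at MoE 0.10
```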
```{r power, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon Type I Error. Studies observe a minimum sample size to achieve an expected minimum Margin of Error (half-width of a 95\\% confidence interval) under an assumed outcome standard deviation of 1. Five one-sided (A) and five symmetric two-sided (B) hypotheses are investigated. The upper bound for the Trivial Effect is denoted by the first vertical line, and the second vertical line denotes the minimal highly actionable effect greater than zero. Five combinations of effect size boundaries are shown. The lower bounds for the symmetric two-sided hypotheses are not shown. For each minimum sample size / minimum CI width, operating characteristics are provided with varying steps before affirming the stopping rule."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeRejPN.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
The following operating characteristics benefit from a longer wait time (i.e. waiting until the Margin of Error is narrower): power against the point null (Figure \@ref(fig:powerImpact)); the interval null analogs of the Type I error (Figures \@ref(fig:powInterval) and \@ref(fig:typeiTrivial)) and of power (Figure \@ref(fig:typeii)); and the probability of stopping on conclusive observed outcomes yet becoming inconclusive after observing the remaining unobserved outcomes (Figure \@ref(fig:inconclusive)). On the other hand, a shorter wait time yields smaller average sample sizes (Figure \@ref(fig:n)). The gains in sample size diminish once the wait time reaches the midpoint between the Trivial and Highly Actionable Zones.
```{r powerImpact, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon power, i.e. probability of rejecting the point null when the true effect is equal to the boundary of the highly actionable effect zone, i.e. the vertical line on the right. All other features of the figure mirror those of Figure 4.4. "}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeRejectPointNullThetaDeltaG2.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r n, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon the average observed sample size given the true treatment effect equals the average of the absolute trivial and highly actionable effect boundaries, i.e. the middle of the two vertical lines. When the observation lag is zero, the observed sample size equals the total sample size. All other features of the figure mirror those of Figure 4.4."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeNThetaMidPoint.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r powInterval, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon the probability of concluding the treatment effect is not trivial given the true treatment effect is null. This is an interval null analog to the Type I Error for a given effect. All other features of the figure mirror those of Figure 4.4."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeStopNotTrivialTheta0.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r typeiTrivial, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon the probability of concluding the treatment effect is not trivial given the true effect equals the upper bound defining Trivial effects. This is an interval null analog to Type I error for a given effect. This figure focuses on the boundary of the Trivial effects where the error probability is greatest. All other features of the figure mirror those of Figure 4.4. "}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeStopNotTrivialThetaDeltaG1.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r typeii, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon the probability of concluding the treatment effect is not trivial given the true effect equals the boundary of the highly actionable effect zone, i.e. the vertical line on the right. This is an interval null analog to power for a given effect. All other features of the figure mirror those of Figure 4.4."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeStopNotTrivialThetaDeltaG2.jpeg")
#knitr::include_graphics("figs/probEff.jpeg")
```
```{r inconclusive, echo=FALSE, out.width = "6in", fig.align='center', fig.pos="H", fig.show = "hold", fig.cap="Impact of different wait times upon the probability of being inconclusive after stopping based on observed outcomes and then observing the remaining observations (50 observations in these simulations). In this figure, the true effect is the absolute average of the Trivial and Highly Actionable boundaries (i.e. the midpoint between the two regions); this is the treatment effect with the worst probability of being inconclusive. Refer to Figure 3 for the probability of being inconclusive across the range of treatment effects under Setting 3 (A). All other features of this figure mirror those of Figure 4.4."}
knitr::include_graphics("~/Dropbox/statistics/work/vandy/AdaptiveMonitoring/manuscript/figs/waitTimeInconclusiveThetaDeltaMidPoint.jpeg")
```
On the whole, this suggests waiting at least until the Margin of Error reaches this midpoint, and it may motivate waiting slightly longer (i.e. for a smaller Margin of Error). In certain settings, when the boundary of the trivial effects is close to half the highly actionable boundary (Settings 3 and 4), waiting for a slightly smaller Margin of Error adds little burden to the average sample size (Figure \@ref(fig:n); compare Settings 3 and 4 to Setting 2). Across these simulations, waiting for a slightly tighter Margin of Error strikes a good balance, improving operating characteristics without a substantive increase in the average observed sample size.
As a separate practical recommendation, we suggest increasing the number of affirmation steps before stopping when outcomes are not observed immediately upon enrollment. For effects that are neither Trivial nor Highly Actionable, there is a substantial chance the study ends conclusively on the observed outcomes yet becomes inconclusive once the outcomes of the remaining study observations are observed (Figure \@ref(fig:inconclusive)). Increasing the number of steps required to affirm a stopping rule reduces this risk.
## Conclusion
We provide the sgpvAM package to offer greater ease of access for developing and studying the operating characteristics of adaptive monitoring with the second generation p-value. Based on simulations of normally distributed data, we recommend waiting until the expected confidence interval width is one quarter the absolute distance between the Trivial and Highly Actionable Region boundaries. When outcomes are not observed immediately, there is a risk of the study stopping and then being inconclusive once all observations are observed. This risk is reduced by requiring an increased number of steps before affirming an alert to stop the study.