The SeqSGPV package is used to design a study with sequential monitoring of scientifically meaningful hypotheses using the second generation p-value (SGPV).
It supports the paper “Sequential monitoring using the Second Generation P-Value with Type I error controlled by monitoring frequency” which advances how to:
- Specify scientifically meaningful hypotheses using a constrained Region of Equivalence (Freedman et al., 1984). The constrained set of hypotheses is called Pre-Specified Regions Indicating Scientific Merit (PRISM).
- Sequentially monitor the SGPV (SeqSGPV) for scientifically meaningful hypotheses.
- Control error rates using monitoring frequency mechanisms, including an affirmation step.
Monitoring until establishing statistical significance does not provide information on whether an effect is scientifically meaningful. Establishing scientific relevance requires pre-specifying which effects are scientifically meaningful and an end-study inference that evaluates these effects.
For intervention studies, Freedman et al. (1984) categorize effects as universally acceptable for adopting the intervention, universally unacceptable compared to standard of care, or scientifically ambiguous as to whether the intervention should be adopted over the standard of care. This last set of scientifically ambiguous effects forms the Region of Equivalence (ROE).
Strategies for specifying a ROE include setting the point null as a ROE boundary [Hobbs & Carlin (2008); Section 2.2], setting the ROE away from the point null [Freedman et al. (1984); Figure 1], and surrounding the point null (Kruschke, 2013). Kruschke (2013) calls the last strategy a Region of Practical Equivalence (ROPE); see footnote 1 for the distinction between the ROPE and the ROE.
The PRISM constrains the ROE to be set away from the point null (similar in spirit to [Freedman et al. (1984); Figure 1]) but with a more explicit constraint. The PRISM divides the parameter space into three exhaustive, non-empty, and mutually exclusive regions:
- The ROPE as defined by Kruschke (2013) to include effects practically equivalent to the point null.
- The ROE as defined by [Freedman et al. (1984); Figure 1] to include scientifically ambiguous effects.
- The Region of Meaningful Effects (ROME) to include effects that are scientifically meaningful.
In a 1-sided hypothesis, the ROPE is replaced by a ROWPE denoting a Region of Worse or Practically Equivalent effects.
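The one-sided partition described above can be sketched in a few lines. This is a minimal illustration, not part of the SeqSGPV package: the function name, the argument names `null`, `rowpe_upper`, and `rome_lower`, and the choice of which region owns each boundary point are all assumptions made for the example, and larger effects are assumed to be better.

```python
def classify_effect(effect, null, rowpe_upper, rome_lower):
    """Assign an effect to a region of a one-sided PRISM (larger = better).

    All names are hypothetical. Boundary ownership (ROWPE includes its
    upper bound, ROME includes its lower bound) is an assumption.
    """
    # The indifference zone must extend beyond the point null ...
    if not null < rowpe_upper:
        raise ValueError("ROWPE must extend beyond the point null")
    # ... and the ROE between ROWPE and ROME must be non-empty.
    if not rowpe_upper < rome_lower:
        raise ValueError("ROE must be non-empty: need rowpe_upper < rome_lower")
    if effect <= rowpe_upper:
        return "ROWPE"  # worse than, or practically equivalent to, the null
    if effect < rome_lower:
        return "ROE"    # scientifically ambiguous
    return "ROME"       # scientifically meaningful


for x in (-0.2, 0.0, 0.3, 0.5, 1.2):
    print(x, classify_effect(x, null=0.0, rowpe_upper=0.1, rome_lower=0.5))
```

The two checks at the top encode the PRISM constraints from the text: the indifference zone surrounds the point null, and the three regions are non-empty and mutually exclusive.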
Figure: Pre-Specified Regions Indicating Scientific Merit (PRISM) for one- and two-sided hypotheses. The PRISM always includes an indifference zone that surrounds the point null hypothesis (i.e., ROPE/ROWPE).

In the context of interval monitoring, error rates and sample size are affected by the ROE specification.
Compared to ROPE monitoring, PRISM monitoring reduces the risk of Type I error and resolves the issue of indefinite monitoring at ROPE boundaries.
Compared to null-bound ROE monitoring, PRISM monitoring reduces the risk of Type I error for the same monitoring frequency, allows for earlier monitoring, and yields a smaller average sample size to achieve the same Type I error.
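The SGPV (Blume et al., 2018) is an evidence-based metric that quantifies the overlap between an inferential interval $I$ and the set of effects in a scientifically meaningful interval hypothesis $H$. Described as ‘method agnostic’ (Stewart & Blume, 2019), the SGPV may be calculated for any inferential interval (e.g., Bayesian, frequentist, or likelihood). For an interval hypothesis $H$, the SGPV is

$$p_H = \frac{|I \cap H|}{|I|} \times \max\left\{\frac{|I|}{2|H|},\ 1\right\}.$$

The adjustment, $\max\{|I| / (2|H|),\ 1\}$, is a small-sample correction. When the interval is more than twice as long as the hypothesis ($|I| > 2|H|$), it replaces the denominator $|I|$ with $2|H|$, so an interval that fully covers $H$ yields $p_H = 1/2$ rather than a value near 0.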
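The calculation can be sketched as follows, assuming the definition in Blume et al. (2018): the fraction of the inferential interval that overlaps the hypothesis, multiplied by the correction max{|I| / (2|H|), 1}. The function name `sgpv` and the tuple representation of intervals are choices made for this example and are not the SeqSGPV package interface.

```python
import math


def sgpv(interval, hypothesis):
    """Second-generation p-value for an interval hypothesis.

    Both arguments are (lower, upper) tuples. The inferential interval
    must be finite; the hypothesis may use +/- math.inf for one-sided
    regions, in which case the correction term reduces to 1.
    """
    i_lo, i_hi = interval
    h_lo, h_hi = hypothesis
    i_len = i_hi - i_lo
    h_len = h_hi - h_lo
    if not (0 < i_len < math.inf) or not h_len > 0:
        raise ValueError("need a finite interval and a non-degenerate hypothesis")
    # Length of the intersection of the two intervals (0 if disjoint).
    overlap = max(0.0, min(i_hi, h_hi) - max(i_lo, h_lo))
    # Small-sample correction: active only when |I| > 2|H|.
    correction = max(i_len / (2 * h_len), 1.0)
    return (overlap / i_len) * correction


rope = (-0.1, 0.1)
print(sgpv((0.2, 0.6), rope))     # interval disjoint from the ROPE
print(sgpv((-0.05, 0.05), rope))  # interval inside the ROPE
print(sgpv((-1.0, 1.0), rope))    # interval much wider than the ROPE
```

An SGPV of 0 means the interval rules out the hypothesis, 1 means the interval lies entirely inside it, and a very wide interval that covers the hypothesis gives 1/2, reflecting inconclusive data.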
In “Sequential monitoring using the Second Generation P-Value with Type I error controlled by monitoring frequency”, foundational likelihood, frequentist, and Bayesian metrics are compared and contrasted in terms of minimal assumptions, handling of composite hypotheses, and conclusions that can be drawn. The SGPV makes no further assumptions beyond those that may be inherited by the inferential interval. It does not require a likelihood, prior, study design, or error rates.
See the examples for interpreting possible end-of-study conclusions using the SGPV.
Controlling the design-based Type I error is recommended for trials seeking regulatory approval (US Food and Drug Administration, 2010). In SeqSGPV, error rates can be controlled through PRISM specification and/or monitoring frequency (see footnote 2).
Since scientific relevance (i.e., the PRISM) is considered fixed, monitoring frequency is a targetable means for controlling error rates. Monitoring frequency mechanisms include a wait time until stopping rules are first evaluated, the frequency of evaluations, a maximum sample size, and an affirmation rule. The affirmation rule is used in dose-escalation trials once a number of patients have been consecutively enrolled at the recommended maximum tolerable dose. We use it here to further control error rates and frequency properties after setting a practical wait time, monitoring frequency, and maximum sample size.
Synergy between 1-sided PRISM and monitoring frequency: On their own, the PRISM and monitoring frequency strategies each help reduce the risk of Type I error. When used together, the 1-sided PRISM and monitoring frequency can notably reduce the average sample size needed to achieve a given Type I error. When outcomes are delayed, the risk of reversing a decision on the null hypothesis decreases more when monitoring a 1-sided PRISM than when monitoring a 1-sided null-bound ROE. Additional strategies, such as posterior predictive probabilities, could be considered to further inform decisions under delayed outcomes.
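To make the four mechanisms concrete, here is a minimal sketch of monitoring a single one-arm study with a wait time `W`, steps between evaluations `S`, maximum sample size `N`, and affirmation step `A`. Several things here are assumptions made for illustration rather than the package's implementation: the stopping rule (stop once the interval no longer overlaps the ROWPE or the ROME, i.e. the corresponding SGPV is 0), the reading of affirmation as "the same alert must still hold after `A` further observations", and the helper names `wald_interval`, `rules_out`, and `monitor`.

```python
import math
from statistics import NormalDist, mean, stdev


def wald_interval(xs, level=0.95):
    """Normal-approximation interval for a mean (needs at least 2 values)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * stdev(xs) / math.sqrt(len(xs))
    return mean(xs) - half, mean(xs) + half


def rules_out(interval, region):
    """True when interval and region do not overlap (the SGPV is 0)."""
    return interval[1] <= region[0] or interval[0] >= region[1]


def monitor(xs, rowpe, rome, W, S, N, A=0):
    """Return (sample size at stopping, conclusion) for one data stream."""
    if W < 2 or S < 1:
        raise ValueError("need W >= 2 and S >= 1")
    N = min(N, len(xs))

    def alert(n):
        ci = wald_interval(xs[:n])
        out_rowpe, out_rome = rules_out(ci, rowpe), rules_out(ci, rome)
        if out_rowpe and out_rome:
            return "ROWPE and ROME ruled out"
        if out_rowpe:
            return "ROWPE ruled out"
        if out_rome:
            return "ROME ruled out"
        return None

    # Looks occur at W, W + S, W + 2S, ... and never beyond N.
    for n in range(W, N + 1, S):
        first = alert(n)
        if first is None:
            continue
        if A == 0:
            return n, first
        # Affirmation (assumed reading): same alert after A more observations.
        if n + A <= N and alert(n + A) == first:
            return n + A, first
    return N, "inconclusive at maximum sample size"


ROWPE = (-math.inf, 0.1)  # hypothetical one-sided PRISM boundaries
ROME = (0.5, math.inf)
data = [1.0 + 0.1 * (-1) ** i for i in range(40)]  # deterministic toy data
print(monitor(data, ROWPE, ROME, W=5, S=5, N=40))
print(monitor(data, ROWPE, ROME, W=5, S=5, N=40, A=3))
```

Increasing `W`, `S`, or `A` reduces the number of opportunities to stop on a chance excursion, which is the lever the text describes for controlling error rates while the PRISM stays fixed.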
Trial designs are evaluated through simulation via the SeqSGPV function until achieving desirable operating characteristics such as error rates, sample size, bias, and coverage. The SeqSGPV function can be used to assess departures from modelling assumptions.
Outcomes may be generated from any r[dist] distribution, a user-supplied data generation function, or pre-existing data. Study designs with Bernoulli and normally distributed outcomes have been evaluated more extensively, and extra care should be taken when designing a study with outcomes from other distribution families.
The user provides a function for obtaining the interval of interest. Some functions have been built for common interval estimations: binomial credible and confidence intervals using binom::binom.confint, Wald confidence intervals using the lm function for normal outcomes, and Wald confidence intervals using the glm function with the binomial family for Bernoulli outcomes.
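As a rough illustration of what an interval function does, the sketch below computes a Wald confidence interval for a difference in means between two arms. It is an assumption-laden stand-in: the name `wald_ci_diff` is hypothetical, it uses a normal quantile with separate arm variances, and it will therefore not match an lm-based interval exactly (which pools the variance and uses a t quantile).

```python
import math
from statistics import NormalDist, mean, variance


def wald_ci_diff(treated, control, level=0.95):
    """Wald interval for mean(treated) - mean(control).

    Hypothetical helper: normal quantile, unpooled sample variances.
    Each arm needs at least two observations.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    estimate = mean(treated) - mean(control)
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return estimate - z * se, estimate + z * se


lower, upper = wald_ci_diff([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0])
print(round(lower, 3), round(upper, 3))
```

Any function with the same shape (data in, lower and upper bound out) could play this role, which is what makes the SGPV usable with Bayesian, frequentist, or likelihood intervals alike.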
Depending on the computing environment, simulations may be time-consuming when many (tens of thousands of) replicates are needed, and more so for Bernoulli outcomes. The user may consider starting with a small number of replicates (200 to 2000) to get a sense of design operating characteristics. Sample size estimates for a single-look trial may also inform design parameters.
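The trade-off between replicate count and precision can be seen in a small, self-contained Monte Carlo sketch. It is deliberately simplified and is not the SeqSGPV function: outcomes are standard normal with known standard deviation, the study has one arm, and the "rejection" rule is a null-bound one (lower interval bound above 0) rather than a PRISM. The names `rejects_null`, `type1_rate`, and `mc_se` are hypothetical. The same seed is reused so both designs see identical simulated data.

```python
import math
import random
from statistics import NormalDist


def rejects_null(xs, looks, z):
    """True if, at any look, the lower bound of the known-sd interval exceeds 0."""
    look_set = set(looks)
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        if n in look_set and total / n - z / math.sqrt(n) > 0:
            return True
    return False


def type1_rate(looks, n_max, reps, seed):
    """Share of null-effect replicates in which the rule fires at some look."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        hits += rejects_null(xs, looks, z)
    return hits / reps


def mc_se(rate, reps):
    """Monte Carlo standard error of an estimated proportion."""
    return math.sqrt(rate * (1 - rate) / reps)


reps = 500
single = type1_rate([100], n_max=100, reps=reps, seed=1)
frequent = type1_rate(range(10, 101), n_max=100, reps=reps, seed=1)
print("single look at n=100:", single, "+/-", round(mc_se(single, reps), 3))
print("every look from n=10:", frequent, "+/-", round(mc_se(frequent, reps), 3))
```

With a few hundred replicates the Monte Carlo standard error is on the order of a percentage point, enough to see that frequent looks inflate the error rate relative to a single look, but not enough to certify a design; that is the reason for scaling up replicates once candidate design parameters look promising.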
Study designs and interpretations of a single trial are provided below for two-arm trials with Bernoulli or normally distributed outcomes.
Blume, J. D., D’Agostino McGowan, L., Dupont, W. D., & Greevy Jr., R. A. (2018). Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLoS ONE, 13(3), e0188299.
Freedman, L. S., Lowe, D., & Macaskill, P. (1984). Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40(3), 575–586.
Hobbs, B. P., & Carlin, B. P. (2008). Practical Bayesian design and analysis for drug and device clinical trials. Journal of Biopharmaceutical Statistics, 18(1), 54–80.
Jennison, C., & Turnbull, B. W. (1989). Interim analyses: the repeated confidence interval approach. Journal of the Royal Statistical Society: Series B (Methodological), 51(3), 305–334.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.
Stewart, T. G., & Blume, J. (2019). Second-generation p-values, shrinkage, and regularized models. Frontiers in Ecology and Evolution, 7, 486.
US Food and Drug Administration. (2010). Guidance for the use of Bayesian statistics in medical device clinical trials.
Footnotes
1. Kruschke (2013) uses ‘equivalence’ to refer to effects indifferent to the point null, whereas Freedman et al. (1984) use ‘equivalence’ to refer to effects for which it is ambiguous whether the intervention is clearly superior to standard of care. Hence, ROE is the broader term. A ROPE encompasses the point null and could be considered a ROE, whereas there are no constraints on a ROE.
2. Jennison & Turnbull (1989) provide error rate control for intervals that adjust for frequency properties using group sequential methods. For intervals that do not adjust for frequency properties, Type I error can be controlled through a combination of the PRISM’s ROPE and the monitoring frequency mechanisms of wait time (W), frequency of evaluation (S, for steps between evaluations), maximum sample size (N), and affirmation steps (A).