---
title: "Power Analysis Primer"
author: "Matt Schuelke"
date: '(`r format(Sys.time(), "%B %d, %Y")`)'
bibliography: bibliography.bibtex
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    css: css/slucor.css
  html_notebook: default
  github_document: default
---
## Introduction
This notebook introduces the concept of statistical power through explanation, example, and practice using one-sample, two-sample, and paired variants of the t-test.
## Load Dependencies
The following code loads the package dependencies for our work.
```{r}
library(readr) # import csv data files
library(tibble) # make fancy data frames
```
Now you try loading an additional package - `here` (for working with file paths):
```{r}
library(here)
```
### Package `pwr`
The [pwr package](https://cran.r-project.org/package=pwr) is a handy package for performing power analyses for various common statistical procedures. It is not part of base R, so you will need to install it before you can use it. You only have to install it once, so I'm preventing this chunk from running automatically by setting the chunk option `eval = FALSE`. The only way this chunk runs is if I specifically put my cursor in it and tell it to run manually.
If you do not have this package installed, you will need to do likewise, but only once.
```{r, eval=FALSE}
install.packages("pwr")
```
## Power Primer Refresh
Power is the probability of detecting an effect, given that the effect is really there (i.e., the probability of rejecting the null hypothesis when it is in fact false).
$$power = \mathbb{P}(\text{reject } H_o \mid H_a \text{ true})$$
As such, power is defined under the alternative distribution and is the complement of $\beta$, the probability of committing a Type II error:
$$power = 1 - \beta$$
A commonly accepted cut-off for $\beta$ is 0.20, which means folks often seek power greater than or equal to 0.80 (although 0.90 is also common). Power depends on a constellation of factors (demonstrated in the sketch after this list):
* effect size
    + larger effects are easier to detect
* sample size
    + larger samples generate more power
* directional hypotheses
    + specifying direction allows one to place all of $\alpha$ in one tail
* significance level
    + greater $\alpha$ means more probability to reject
* variability
    + experimental design can control variability and thus yield more power
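To make these dependencies concrete, here is a minimal sketch using `pwr::pwr.t.test()` from the `pwr` package installed above; the effect sizes, sample sizes, and alpha levels are arbitrary illustrative values, not from any dataset in this notebook.
```{r}
# baseline: two-sided, two-sample t-test, medium effect (d = 0.5), n = 30 per group
pwr::pwr.t.test(n = 30, d = 0.5, sig.level = 0.05)$power
# a larger effect is easier to detect
pwr::pwr.t.test(n = 30, d = 0.8, sig.level = 0.05)$power
# a larger sample generates more power
pwr::pwr.t.test(n = 60, d = 0.5, sig.level = 0.05)$power
# a directional hypothesis places all of alpha in one tail
pwr::pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, alternative = "greater")$power
# a more lenient alpha means more probability to reject
pwr::pwr.t.test(n = 30, d = 0.5, sig.level = 0.10)$power
```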
There are several reasons why one might perform a power analysis, the most common being to determine the number of observations required to detect a given effect size. Another common reason is to determine whether one can achieve sufficient power given limited resources. Power analyses are often required for project proposals, and they are one step in performing quality research because they force one to think through a project in detail.
## One-sample t-test
First, we read in some data, calculate some descriptive statistics, and then perform a one-sample t-test.
Gettelfinger and Cussler [-@gettelfinger_will_2004] wondered whether swimmers would go faster or slower if the fluid viscosity was increased. To test the question, they collected swim times for each person in both syrup and water and compared the ratio of syrup- to water-time to a constant of 1.
### Hypotheses
$H_o: \mu = 1$
$H_a: \mu \ne 1$
```{r}
swimmers.csv <- read_csv(here("data", "swimmers.csv"))
(n <- nrow(swimmers.csv))
(m <- mean(swimmers.csv$speed))
(s <- sd(swimmers.csv$speed))
t.test(swimmers.csv$speed, mu = 1)
```
### Interpretation
A one-sample t-test suggested fluid viscosity has no effect on swim speed, as the ratio of syrup to water times (M = 1.01, SD = 0.04) was not significantly different from 1, t(17) = 1.17, p = 0.26.
### Post-hoc Power Analysis
Because we failed to reject the null, our test was under-powered by definition. Therefore, we might be interested in performing a post-hoc power analysis to determine what level of power we did achieve.
We can calculate this using `pwr.t.test()` in the pwr package, but first we need to calculate some argument values to pass to the function. Let's check the help files for the function to know what the required arguments are:
```{r, eval=FALSE}
?pwr::pwr.t.test
```
Here we see that we need to pass all of the following arguments (except for the one we want to calculate, which we pass as NULL):
* `n`
    + the number of observations per sample
    + in this case we only have one sample, so the number of observations (rows) in the data frame
* `d`
    + the effect size
    + [Cohen's d](https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d) in this case
* `sig.level`
    + the alpha level
* [`power`](https://en.wikipedia.org/wiki/Statistical_power)
    + the power of the test
    + which is equal to $1 - \beta$, where $\beta$ is the probability of committing a Type II error
    + powers of 0.80 and 0.90 are the most commonly sought
* `type`
    + the type of t-test as a character string, `"one.sample"` in this case
* `alternative`
    + another character string specifying the alternative hypothesis, `"two.sided"` in this case
Because we wish to calculate the observed power of our test, we need to figure out values for all of the arguments except `power`, which we pass as `NULL` so that the function calculates it from the other argument values.
Cohen's d is actually a family of effect sizes describing the difference between groups in standard deviation units (very similar to the formula for a z-score). The formula for the one sample case given by Cohen [-@cohen_statistical_1988] (f. 2.5.7, p. 72) is:
$$d_s = \frac{\bar{X} - c}{s}$$
where $\bar{X}$ is the mean under our alternative hypothesis (i.e., the mean of our sample), $c$ is the hypothesized constant, and $s$ is the standard deviation of our sample.
```{r}
(cohens_d <- (m - 1) / s)
```
We compute the observed power thusly:
```{r}
pwr::pwr.t.test(n = n,
                d = cohens_d,
                sig.level = 0.05,
                power = NULL,
                type = "one.sample",
                alternative = "two.sided")
```
So our observed power was only 0.20, which is much smaller than our minimally acceptable power of 0.80.
#### Power via Simulation
![](img/advanced.png){width=100px}
We could also estimate power via simulation: repeatedly generate samples from the alternative distribution, run the test on each, and record the proportion of significant results.
```{r}
mean(replicate(10000, {
  # draw a sample of the observed size from the alternative distribution,
  # test it against the null constant, and record whether p < alpha
  t.test(rnorm(n = n,
               mean = m,
               sd = s),
         mu = 1)$p.value < 0.05
}))
```
### A Priori Power Analysis
Because the observed power of our test was quite low, the next logical question is: what minimum sample size would be required to attain a significant p-value in this situation, so that we might do better next time?
We simply pass `NULL` for the `n` argument and now pass 0.80 as the value for the `power` argument.
```{r}
(p <- pwr::pwr.t.test(n = NULL,
                      d = cohens_d,
                      sig.level = 0.05,
                      power = 0.80,
                      type = "one.sample",
                      alternative = "two.sided"))
```
So we would need 106 swimmers (round up, because we can't have 0.23 of an observation) at a theoretical minimum to detect a Cohen's d of 0.28 using a one-sample t-test to test a two-tailed hypothesis.
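Since we saved the result to `p`, we can also extract the computed sample size from the returned `power.htest` object and round it up programmatically:
```{r}
# extract the computed n and round up to a whole observation
ceiling(p$n)
```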
And here is a visualization of power as a function of sample size.
```{r}
plot(p)
```
### Practice
@brick_association_2010 explored the effect of many variables (exercise, caffeine intake, etc.) on medical students' sleep. One criterion variable in the study was the Pittsburgh Sleep Quality Index (PSQI). This measure produces higher scores for people with more sleep problems (i.e., lower sleep quality). The researchers hypothesized that "medical students would report worse sleep quality in comparison to published normative samples of healthy, young adults" (p. 114). The mean of their 293 student sample was 6.37 and the standard deviation was 2.57. Imagine they wished to compare their sample to a population normative score of 5.60 as reported by @carney_daily_2006.
As a hint, the hypotheses are:
$H_o: \mu \le 5.60$
$H_a: \mu > 5.60$
First compute Cohen's $d_s$.
```{r}
cohens_d <- (6.37 - 5.60) / 2.57
```
Next conduct a post-hoc power analysis to compute the observed power.
```{r}
pwr::pwr.t.test(n = 293,
                d = cohens_d,
                sig.level = 0.05,
                power = NULL,
                type = "one.sample",
                alternative = "greater")
```
Finally conduct an a priori power analysis to determine the minimum number of subjects required.
```{r}
pwr::pwr.t.test(n = NULL,
                d = cohens_d,
                sig.level = 0.05,
                power = 0.80,
                type = "one.sample",
                alternative = "greater")
```
## Two-sample t-test
Let's do the same things for a two-sample t-test. First we read in some data, then calculate some descriptive statistics, and finally perform the two-sample t-test.
This data is taken from Altman [-@altman_practical_1999], where energy expenditures were observed for a group of lean women and a group of obese women. The question was whether mass has any effect on energy expenditure.
### Hypotheses
$H_o: \mu_1 = \mu_2$
$H_a: \mu_1 \ne \mu_2$
```{r}
energy.csv <- read_csv(here("data", "energy.csv"))
(n_lean <- nrow(energy.csv[energy.csv$stature == "lean", ]))
(n_obese <- nrow(energy.csv[energy.csv$stature == "obese", ]))
(m_lean <- mean(energy.csv[energy.csv$stature == "lean", ]$expend))
(m_obese <- mean(energy.csv[energy.csv$stature == "obese", ]$expend))
(s_lean <- sd(energy.csv[energy.csv$stature == "lean", ]$expend))
(s_obese <- sd(energy.csv[energy.csv$stature == "obese", ]$expend))
t.test(expend ~ stature, data = energy.csv)
```
### Interpretation
A Welch two-sample t-test suggested the energy expenditures for those with a stature of lean (M = 8.07, SD = 1.24) were significantly different from those with a stature of obese (M = 10.30, SD = 1.40), t(15.92) = -3.86, p = .001.
### Post-hoc Power Analysis
While the earlier example was an under-powered one-sample t-test, this example for a two-sample t-test is already significant. Therefore, we have achieved adequate power by definition.
Even though the significance of the test tells us that our observed power is greater than our minimally acceptable 0.80, we might wish to calculate what it was.
To do so we need to calculate a different version of Cohen's d. The formula for a two-sample test [@cohen_statistical_1988] (f. 2.5.1, p. 66) is:
$$d_s = \frac{\bar{X}_a - \bar{X}_b}{s}$$
In this case $\bar{X}_a$ and $\bar{X}_b$ are going to be equal to the means of our samples.
However, there is a new issue here that results in two practical considerations we must address before we can perform that calculation.
The new issue is that because we have two groups, the sizes of those two groups can differ (and they do differ in this case). Therefore, it is a good idea to weight any descriptive statistics such that statistics from larger groups get more weight and vice versa.
The first practical consideration that results from this idea is that we should use a different function from the `pwr` package to help deal with this weighting issue.
Let's look at the help page for `pwr::pwr.t2n.test()`.
```{r, eval=FALSE}
?pwr::pwr.t2n.test
```
We see most of the same required arguments as before, but now the `n` argument is split into `n1` and `n2`.
The second practical consideration that results from the new idea of differing groups sizes is that our effect size for t-tests (i.e., Cohen's d) requires a single estimate of the spread of the data. Therefore, we must calculate a [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance).
While pooled standard deviations can be calculated for any number of groups, a formula for just two groups [@cohen_statistical_1988] (f. 2.5.2, p. 67) is:
$$s = \sqrt{\frac{\sum{(X_A - \bar{X}_A)^2} + \sum{(X_B - \bar{X}_B)^2}}{n_A + n_B - 2}}$$
but you will often see it written thusly:
$$s_{pooled} = \sqrt{\frac{(n_1 - 1) s^2_1 + (n_2 - 1) s^2_2}{n_1 + n_2 - 2}}$$
and here is a custom R function to perform this calculation:
```{r}
# pooled standard deviation for two groups (Cohen, 1988, f. 2.5.2)
sd_pooled <- function (sd1, n1, sd2, n2) {
  sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
}
```
We compute this pooled standard deviation and then call `pwr::pwr.t2n.test()` with all the requisite argument values except for power which we pass as `NULL` because that is the one we wish to compute.
```{r}
(s_pooled <- sd_pooled(sd1 = s_lean,
                       n1 = n_lean,
                       sd2 = s_obese,
                       n2 = n_obese))
(cohens_d <- (m_lean - m_obese) / s_pooled)
pwr::pwr.t2n.test(n1 = n_lean,
                  n2 = n_obese,
                  d = cohens_d,
                  sig.level = 0.05,
                  power = NULL,
                  alternative = "two.sided")
```
Our observed power was quite high at 0.96, which is much higher than our minimally acceptable power of 0.80 and very close to the maximum power of 1. In other words, we collected more data than was required.
#### Power via Simulation
![](img/advanced.png){width=100px}
```{r}
mean(replicate(10000, {
  # simulate a dataset with the observed group sizes, means, and SDs
  sim_data <- tibble(expend = c(rnorm(n = n_lean,
                                      mean = m_lean,
                                      sd = s_lean),
                                rnorm(n = n_obese,
                                      mean = m_obese,
                                      sd = s_obese)),
                     stature = c(rep(x = "lean",
                                     times = n_lean),
                                 rep(x = "obese",
                                     times = n_obese)))
  t.test(expend ~ stature, data = sim_data)$p.value < 0.05
}))
```
### A Priori Power Analysis
Because the observed power of our test was quite high, we might again be interested in knowing the minimum sample size required to attain a significant p-value in this situation, if only to inform future studies.
For this we return to `pwr::pwr.t.test()`, which removes the unequal sample size issue by assuming two equally sized groups, and simply pass a `NULL` value for the `n` argument and 0.80 as the value for the `power` argument.
```{r}
(p <- pwr::pwr.t.test(n = NULL,
                      d = cohens_d,
                      sig.level = 0.05,
                      power = 0.80,
                      type = "two.sample",
                      alternative = "two.sided"))
```
So we only required 7 observations per sample at a theoretical minimum to detect a Cohen's d of 1.71 using a two-sample t-test to test a two-tailed hypothesis. Because we have two samples in this case, the minimum total sample size required in this situation was 14 observations evenly distributed across the two groups.
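As before, we can compute this programmatically; `p$n` holds the per-group sample size, so the total is twice its ceiling:
```{r}
# per-group n rounded up, times two groups
ceiling(p$n) * 2
```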
And here is a visualization of power as a function of sample size.
```{r}
plot(p)
```
### Practice
This practice problem is inspired by @stephens_swearing_2009. Pain researchers have some standardized ways of testing how long people can tolerate pain. One way of testing pain tolerance is to time how long a person can stand to keep their hand submerged in freezing water. Imagine we wish to test whether there is any effect of cursing (i.e., swearing) on pain tolerance. For the swearing group we observed M = 190.63 (SD = 82.81) seconds on 11 observations. For the control group we observed M = 83.28 (SD = 61.42) seconds on 9 observations.
First compute the pooled standard deviation using our custom function `sd_pooled()`.
```{r}
pooled_sd <- sd_pooled(sd1 = 82.81, n1 = 11, sd2 = 61.42, n2 = 9)
```
Next compute Cohen's d.
```{r}
cohens_d <- (190.63 - 83.28) / pooled_sd
```
Now perform a post-hoc power analysis to determine the observed power using `pwr::pwr.t2n.test()`.
```{r}
pwr::pwr.t2n.test(n1 = 11,
                  n2 = 9,
                  d = cohens_d,
                  sig.level = 0.05,
                  power = NULL,
                  alternative = "two.sided")
```
Finally, compute the minimum sample size required in each of two equally sized groups to show this effect is statistically significant.
```{r}
pwr::pwr.t.test(n = NULL,
                d = cohens_d,
                sig.level = 0.05,
                power = 0.80,
                type = "two.sample",
                alternative = "two.sided")
```
## Paired-samples t-test
Now, we apply the same ideas to a paired t-test example.
This data was taken from Altman [-@altman_practical_1999]. Here we compare energy intake for each woman both pre- and post-menstruation to see if there is any effect of menstruation on intake.
### Hypotheses
$H_o: \mu_{post} = \mu_{pre}$
$H_a: \mu_{post} \ne \mu_{pre}$
```{r}
intake.csv <- read_csv(here("data", "intake.csv"))
(n <- n_pre <- n_post <- nrow(intake.csv))
(m_post <- mean(intake.csv$post))
(m_pre <- mean(intake.csv$pre))
(s_post <- sd(intake.csv$post))
(s_pre <- sd(intake.csv$pre))
t.test(intake.csv$post, intake.csv$pre, paired = TRUE)
```
### Interpretation
A paired-samples t-test suggested there was a significant difference in intake scores such that post-scores (M = 5,433.18, SD = 1,216.83) were lower than pre-scores (M = 6,753.64, SD = 1,142.12), t(10) = -11.94, p < .001.
### Post-hoc Power Analysis
Because our test was significant, we already know our observed power was adequate, but let us calculate what it was anyway.
In this example, we now have to compute a different effect size. This time the effect is known as Cohen's dz [@cohen_statistical_1988] (f. 2.5.10, p. 73), and here is the formula:
$$d_z = \frac{\bar{Z}}{s}$$
where $Z$ denotes the difference scores. Perhaps the easiest way to compute this effect is to first compute difference scores by subtracting the pre from the post scores, then compute the mean and standard deviation of those difference scores, and finally divide the mean by the standard deviation.
```{r}
intake.csv$difference <- intake.csv$post - intake.csv$pre
(m_diffs <- mean(intake.csv$difference))
(s_diffs <- sd(intake.csv$difference))
(cohens_dz <- m_diffs / s_diffs)
```
Note that this order of subtraction (post minus pre) provides for an easier interpretation of the effect: a positive effect suggests an increase from pre to post, while a negative effect suggests a decrease from pre to post.
Let us quickly check the help document for `pwr::pwr.t.test()` to re-familiarize ourselves with the argument requirements.
```{r, eval=FALSE}
?pwr::pwr.t.test
```
This time, `d` will be Cohen's dz as opposed to Cohen's d, and we need to specify `type` as `"paired"`.
```{r}
pwr::pwr.t.test(n = n,
                d = cohens_dz,
                sig.level = 0.05,
                power = NULL,
                type = "paired",
                alternative = "two.sided")
```
Our observed power was essentially at its maximum of 1. In other words, we collected *WAY* more data than was required.
#### Power via Simulation
![](img/advanced.png){width=100px}
```{r}
# paired t is equivalent to one-sample on the differences
mean(replicate(10000, {
  t.test(rnorm(n = n,
               mean = m_diffs,
               sd = s_diffs),
         mu = 0)$p.value < 0.05
}))
```
### A Priori Power Analysis
Because the observed power of our test was *SO* high, we might be especially interested in again knowing the minimum sample size required to attain a significant p-value in this situation to inform future studies.
For this we again use `pwr::pwr.t.test()` and simply pass a `NULL` value for the `n` argument, `0.80` for the power argument, and remember to pass `"paired"` for `type`.
```{r}
(p <- pwr::pwr.t.test(n = NULL,
                      d = cohens_dz,
                      sig.level = 0.05,
                      power = 0.80,
                      type = "paired",
                      alternative = "two.sided"))
```
So we only required 3 observations per sample at a theoretical minimum to detect a Cohen's dz of 3.60 using a paired-samples t-test to test a two-tailed hypothesis. Because our "samples" are paired, there is only one sample and so the minimum total sample size required in this situation is just 3, with each observation having both a pre and post score.
And here is a visualization of power as a function of sample size.
```{r}
plot(p)
```
### Practice
@fayers_vibration-assisted_2010 conducted a study of pairs of pain ratings from patients who underwent injections of anesthesia (pain-blocking medicine) before surgery on their upper eyelids. They hypothesized that vibration might reduce the perceived pain of the injection. Vibration was applied to the forehead between the eyes before an injection in one eyelid but not the other. After each injection, the patients rated their pain from 0 (no pain) to 10 (worst pain imaginable). Simulated data based on the descriptive statistics reported in the article are provided in the `pain.csv` file in the data folder.
This time, you will do everything starting with reading in the data.
First, read in the data.
```{r}
pain.csv <- read_csv(here("data", "pain.csv"))
```
Second, calculate the
* difference scores, subtracting control from vibration
* sample size
* mean of the difference scores
* standard deviation of the difference scores
* effect size Cohen's dz
```{r}
pain.csv$difference <- pain.csv$vibr - pain.csv$ctrl
(n <- length(pain.csv$difference))
(m_diffs <- mean(pain.csv$difference))
(s_diffs <- sd(pain.csv$difference))
(cohens_dz <- m_diffs / s_diffs)
```
Third, conduct a post-hoc power analysis to calculate the observed power.
```{r}
pwr::pwr.t.test(n = n,
                d = cohens_dz,
                sig.level = 0.05,
                power = NULL,
                type = "paired",
                alternative = "less")
```
Finally, conduct an a priori power analysis to calculate the minimum sample size required to detect this effect at a 5% significance level and 80% power.
```{r}
pwr::pwr.t.test(n = NULL,
                d = cohens_dz,
                sig.level = 0.05,
                power = 0.80,
                type = "paired",
                alternative = "less")
```
## Going Further
There are power functions in the base R `stats` package such as `power.t.test`, but the `pwr` package both replicates and extends this functionality.
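For example, here is a minimal sketch reproducing the one-sample a priori analysis from above with `stats::power.t.test()`. Note that it parameterizes the effect as a raw mean difference `delta` plus a standard deviation `sd`, so setting `sd = 1` makes `delta` numerically equivalent to Cohen's d:
```{r}
# solve for n; with sd = 1, delta is on the same scale as Cohen's d
power.t.test(delta = 0.28,
             sd = 1,
             sig.level = 0.05,
             power = 0.80,
             type = "one.sample",
             alternative = "two.sided")
```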
To learn some more about the `pwr` package, you might check out the vignette by typing `??pwr` in the console and clicking on the vignette link in the help viewer.
I also use one other power-calculating package enough to warrant mention: `powerSurvEpi` offers functions for additional tests, such as odds ratios and comparisons of two survival curves.
Finally, because many designs and situations are complicated, computers are fast, and tools like `R` are freely available, it is often easiest to perform power analysis through simulation.
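As a closing sketch, here is one way to wrap the simulation approach used throughout this notebook into a reusable helper. The function name `sim_power` and its arguments are illustrative choices of mine, not part of any package:
```{r}
# estimate power by simulation: repeatedly generate data under the
# alternative, run the test, and return the proportion of rejections
sim_power <- function(generate, p_value, reps = 10000, alpha = 0.05) {
  mean(replicate(reps, p_value(generate()) < alpha))
}

# example: the one-sample swimmers analysis from above (n = 18, M = 1.01, SD = 0.04)
sim_power(generate = function() rnorm(n = 18, mean = 1.01, sd = 0.04),
          p_value = function(x) t.test(x, mu = 1)$p.value)
```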
## References