# Chapter 13: Multiple Testing
When considering $m$ null hypotheses $H_{01}, \cdots, H_{0m}$, we need to be careful to avoid incorrectly rejecting too many of them, i.e., having too many false positives.
## Review of Hypothesis Testing
Divide the world into *null* and *alternative* hypotheses
- Define the null hypothesis ($H_0$, the default state of belief about the world) and the alternative hypothesis ($H_a$ or $H_1$, something different and unexpected).
- Construct the test statistic under $H_0$: the test statistic summarizes the extent to which our data are consistent with $H_0$.
For example, to test $H_0: \ \mu_t =\mu_c$, use the $t$-statistic
$$
T=\frac{\hat{\mu}_t - \hat{\mu}_c}{s\sqrt{\frac{1}{n_c}+ \frac{1}{n_t}}}
$${#eq-two-sample-test}
where $s$ is a *pooled standard deviation*, assumed to be common to both groups. If the two groups have different standard deviations, the formula is slightly different. If $n_c$ and $n_t$ are large (e.g., larger than 40), then $T$ is approximately $\mathcal{N}(0,1)$.
- Compute the $p$-value: the probability of observing a test statistic at least as extreme as the observed one, *under the assumption that $H_0$ is true*. A *small* $p$-value provides evidence *against* $H_0$.
- Decide whether to reject $H_0$: if the $p$-value is small, we will want to *reject* $H_0$ (and therefore make a potential "discovery").
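As a minimal illustration of @eq-two-sample-test, the following sketch runs a two-sample $t$-test with `scipy.stats.ttest_ind` on simulated (made-up) data:
```
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
x_t = rng.normal(loc=0.5, scale=1, size=50)  # simulated treatment group
x_c = rng.normal(loc=0.0, scale=1, size=50)  # simulated control group

# two-sample t-test with a pooled standard deviation (equal_var=True)
result = ttest_ind(x_t, x_c, equal_var=True)
print(result.statistic, result.pvalue)
```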
## How Small Is Small for a $p$-value? Type I Error
| Decision \ Truth    | $H_0$ is true     | $H_a$ is true       |
|:--------------------|:-----------------:|:-------------------:|
| Reject $H_0$        | Type I Error (FP) | Correct (TP)        |
| Do Not Reject $H_0$ | Correct (TN)      | Type II Error (FN)  |

: {.striped .hover .bordered}
The *power* of a hypothesis test is the probability of not making a Type II error given that $H_a$ is correct, i.e., power = 1 − Type II error rate = true positive rate = recall = sensitivity.
- Type I error rate: the probability of making a Type I error: if we only reject $H_0$ when the $p$-value is less than $\alpha$, the type I error rate is at most $\alpha$.
- Type I error and Type II error move in opposite directions. We can make the Type I error rate small by rejecting $H_0$ only when we are quite sure it does not hold, that is, with a smaller threshold $\alpha$; but this results in an *increase* in the Type II error rate, because we fail to reject $H_0$ more often. Conversely, we can make the Type II error rate small by rejecting $H_0$ even with modest evidence against it, but this inflates the Type I error rate.
- In practice, we view a Type I error as more *serious* than a Type II error, as the former involves declaring a scientific finding that is not correct. Hence we typically require a Type I error rate of at most $\alpha$, and try to make the Type II error rate as small as possible.
## Multiple Testing
Suppose we wish to test $m$ null hypotheses $H_{01}, \cdots, H_{0m}$. Can we reject all null hypotheses for which the corresponding $p$-value falls below $\alpha$, say 0.01?
The answer is "No", because we are almost certain to get one very small $p$-value by chance.
For example, suppose we wish to test $H_0$: the coin is fair, and we flip 1024 *fair* coins ten times each. Since the probability that a given coin comes up all tails is $\left(\frac{1}{2}\right)^{10} = \frac{1}{1024}$, we would expect on average one coin to come up all tails. For that coin, the two-sided $p$-value is less than 0.002! We would conclude it is **not fair**, i.e., we would *reject $H_0$*, even though it is a fair coin.
More generally, when testing $m$ true null hypotheses at level $\alpha=0.01$, we expect to falsely reject about $m\alpha=0.01m$ of them.
### The Family-wise Error Rate (FWER)
FWER is the probability of making *at least one* Type I error when conducting $m$ hypothesis tests.
FWER = Pr($V\ge 1$), where $V$ is the number of false rejections, defined in the table below.
|                     | $H_0$ is True | $H_0$ is False | Total |
|:-------------------:|:-------------:|:--------------:|:-----:|
| Reject $H_0$        | $V$           | $S$            | $R$   |
| Do Not Reject $H_0$ | $U$           | $W$            | $m-R$ |
| Total               | $m_0$         | $m-m_0$        | $m$   |

: {.striped .hover .bordered}
Note that power $=\frac{S}{m-m_0}$ and Type I error rate = FPR $=\frac{V}{V+U}=\frac{V}{m_0}$.
$$
\begin{array}{l l l}
\text{FWER}
&=& 1-\text{Pr(do not falsely reject any null hypotheses)} \\
&= & 1- \text{Pr}\left( \cap_{j=1}^m \left\{\text{do not falsely reject } H_{0j} \right\} \right).
\end{array}
$$
If the tests are independent and all $H_{0j}$ are true then
$$
\text{FWER} = 1- \prod_{j=1}^m (1-\alpha) = 1- (1-\alpha)^m.
$$
From the above formula, when $\alpha =0.05$ and $m=50$, $\text{FWER} = 1-(0.95)^{50} \approx 0.92$.
When $m$ is large, controlling the FWER may force us to be extremely conservative (to very rarely reject).
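A quick numerical check of this formula, assuming independent tests:
```
import numpy as np

for alpha in [0.05, 0.01, 0.001]:
    for m in [1, 10, 50, 100, 500]:
        fwer = 1 - (1 - alpha)**m  # FWER for m independent level-alpha tests
        print(f"alpha={alpha}, m={m}: FWER={fwer:.3f}")
```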
### The Bonferroni Correction
Let $A_j$ be the event that the $j$-th null hypothesis is falsely rejected. Then
$$
\begin{array}{l l l}
\text{FWER} &=& \text{Pr(falsely reject at least one null hypothesis)} \\
&= & \text{Pr}\left( \cup_{j=1}^m A_j \right)\\
& \le & \sum_{j=1}^m \text{Pr}(A_j)
\end{array}
$$
If we only reject hypotheses when the $p$-value is less than $\alpha/m$, then
$$
\text{FWER} \le \sum_{j=1}^m \text{Pr}(A_j) \le \sum_{j=1}^m \frac{\alpha}{m} = \alpha.
$$
The Bonferroni correction ensures that not too many null hypotheses are rejected falsely, and hence yields a small Type I error rate, but it may be too conservative and lead to a higher Type II error rate.
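A minimal numpy sketch of the Bonferroni rule; the `p_values` here are hypothetical (every null is true), so we expect few or no rejections:
```
import numpy as np

rng = np.random.default_rng(0)
p_values = rng.uniform(size=100)  # hypothetical p-values; every H0 is true here
alpha, m = 0.05, len(p_values)

reject = p_values < alpha / m     # Bonferroni: compare each p-value to alpha/m
print(reject.sum(), "rejections")
```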
## Holm's Step-down Method for Controlling the FWER
Holm's method is more complex than Bonferroni but leads to more rejections while still controlling the FWER! It always rejects at least as many null hypotheses as Bonferroni, so it makes fewer Type II errors and hence has greater power. *Holm is a better choice*, as it is uniformly more powerful than the Bonferroni method (a numpy sketch follows the steps below).
1. Compute the $p$-values, $p_1, \cdots, p_m$ for the $m$-null hypotheses $H_{01}, \cdots, H_{0m}$.
2. Order the $m$ $p$-values so that $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
3. Define
$$
L=\min\left\{ j: p_{(j)}>\frac{\alpha}{m+1-j} \right\}
$$
4. Reject all null hypotheses $H_{0j}$ for which $p_j <p_{(L)}$. Note $p_{(L)}$ depends on the values of all $m$ $p$-values.
5. Holm's method controls the FWER at level $\alpha$.
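A sketch of these steps in numpy (hypothetical `p_values`); in practice, `multipletests(..., method="holm")` from `statsmodels`, used in the snippets below, implements the same procedure:
```
import numpy as np

def holm_reject(p_values, alpha=0.05):
    m = len(p_values)
    order = np.argsort(p_values)                  # indices that sort the p-values
    sorted_p = p_values[order]
    # 0-indexed position of the first p_(j) exceeding alpha / (m + 1 - j)
    exceeds = np.nonzero(sorted_p > alpha / (m + 1 - np.arange(1, m + 1)))[0]
    k = exceeds[0] if exceeds.size > 0 else m     # number of rejections
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                      # reject the k smallest p-values
    return reject
```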
## Other Methods
Bonferroni and Holm are general procedures that work in most settings. In certain special cases, the following methods may give better results: more rejections (higher power) while maintaining FWER control.
- Tukey's Method: for pairwise comparisons of the differences in expected means among a number of groups. Tukey's method makes at least as many rejections as Bonferroni's method, and hence has greater power.
- Scheffé's Method: for testing arbitrary linear combinations of a set of expected means, e.g.,
$$
H_0: \frac{1}{2}(\mu_1+\mu_3) =\frac{1}{3}(\mu_2+\mu_4+\mu_5)
$$
## Controlling the *False Discovery Rate* (FDR) (Contemporary Approach)
$$
\text{FDR} = E(V/R)=E\left(\frac{\text{number of false rejections}}{\text{total number of rejections}}\right)
$$
$\text{FDR} = 1-\frac{S}{V+S} = 1- \text{precision}$
FWER controls Pr(at least one falsely rejected $H_0$) and may be too conservative; FDR controls the fraction of false rejections among the rejected set. E.g., a scientist tests $m=20{,}000$ drug candidates and wants to identify a smaller set of promising candidates to investigate further. She wants reassurance that this smaller set is really "promising", i.e., that it does not contain too many falsely rejected $H_0$'s. She can control the FDR. This may be achieved by the *Benjamini–Hochberg procedure* (a numpy sketch follows the steps below).
1. Specify $q$, the level at which to control the FDR.
2. Compute the $p$-values, $p_1, \cdots, p_m$ for the $m$-null hypotheses $H_{01}, \cdots, H_{0m}$.
3. Order the $m$ $p$-values so that $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
4. Define
$$
L=\max\left\{ j: p_{(j)}<\frac{qj}{m} \right\}
$$
5. Reject all null hypotheses $H_{0j}$ for which $p_j \le p_{(L)}$.
Then $\text{FDR} \le q$.
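A numpy sketch of the Benjamini–Hochberg steps (hypothetical `p_values`); note that the `np.where` one-liner in the Numpy snippet at the end of the chapter computes the same comparison:
```
import numpy as np

def bh_reject(p_values, q=0.1):
    m = len(p_values)
    order = np.argsort(p_values)
    sorted_p = p_values[order]
    # positions j (0-indexed) with p_(j) < q * (j+1) / m
    below = np.nonzero(sorted_p < q * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        L = below[-1] + 1             # largest such j, as a count of rejections
        reject[order[:L]] = True      # reject the L smallest p-values
    return reject
```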
## Resampling Approach (Contemporary Approach)
When the *theoretical null distribution* is unknown or requires stringent assumptions, one can use a *resampling* or *permutation* approach to approximate it.
For example, suppose that $H_0: E(X)= E(Y)$ and $H_a: E(X)\ne E(Y)$, with $n_X$ and $n_Y$ independent observations of $X$ and $Y$. The $t$-statistic in @eq-two-sample-test can be rewritten as
$$
T=\frac{\hat{\mu}_X - \hat{\mu}_Y}{s\sqrt{\frac{1}{n_X}+ \frac{1}{n_Y}}}
$$
If $X$ and $Y$ are normally distributed, then $T$ follows a $t$-distribution with $n_X+n_Y-2$ degrees of freedom, and $T$ is approximately $\mathcal{N}(0,1)$ when $n_X$ and $n_Y$ are large. But when $n_X$ and $n_Y$ are small and the data are not normal, the theoretical null distribution is not necessarily known. In this case the $p$-value can be approximated using the following *resampling approach*. The idea is that under $H_0$: $E(X)=E(Y)$, if we make the stronger assumption that the distribution of $X$ is the same as that of $Y$, then the distribution of $T$ is **invariant** if we swap some observations of $X$ with those of $Y$.
1. Compute the two-sample $t$-statistic $T$ on the original data $x_1, \cdots, x_{n_X}$ and $y_1, \cdots, y_{n_Y}$.
2. For $b=1, \cdots, B$ (where $B$ is large, e.g., 1000):
2.1 Randomly shuffle (permute) the $n_X+n_Y$ observations.
2.2 Call the first $n_X$ shuffled observations $x_1^*, \cdots, x_{n_X}^*$ and call the remaining observations $y_1^*, \cdots, y_{n_Y}^*$.
2.3 Compute the two-sample $t$-statistic on the shuffled data, and call it $T^{*b}$.
3. The $T^{*b}$'s may be considered an approximation to the distribution of $T$ under $H_0$. The $p$-value is given by
$$
\frac{\sum_{b=1}^B 1_{|T^{*b}|\ge |T|}}{B}
$$
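A self-contained sketch of this permutation test on simulated data; the snippets at the end of the chapter apply the same idea to the Khan gene-expression data:
```
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
x = rng.standard_normal(15)         # simulated sample from X
y = rng.standard_normal(12) + 0.8   # simulated sample from Y (shifted mean)

T_obs = ttest_ind(x, y, equal_var=True).statistic
D_ = np.hstack([x, y])
B, n_x = 1000, len(x)
T_null = np.empty(B)
for b in range(B):
    rng.shuffle(D_)                 # permute the n_X + n_Y observations
    T_null[b] = ttest_ind(D_[:n_x], D_[n_x:], equal_var=True).statistic

print((np.abs(T_null) >= np.abs(T_obs)).mean())   # permutation p-value
```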
## Resampling Approach for FDR
First note
$$
\text{FDR} =E\left( \frac{V}{R} \right) \approx \frac{E(V)}{R}.
$$
Suppose we reject any null hypothesis for which the test statistic exceeds $c$ in absolute value; then
$R=\sum_{j=1}^m 1_{|T_j|\ge c}$. $E(V)$ is the expected number of false positives associated with this rejection rule. Since we do not know which $H_{0j}$, $j=1, \cdots, m$, are true, we do not know which rejected $H_{0j}$'s are false positives. The resampling approach simulates data under each $H_{0j}$, $j=1, \cdots, m$, and then computes the resulting test statistics. The number of resampled test statistics exceeding $c$, averaged over the resamples, provides an estimate of $E(V)$.
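A sketch of this plug-in estimate at a single cutoff $c$, assuming `T_vals` (the $m$ observed statistics) and `Tnull_vals` (an $m \times B$ array of permutation statistics, as in the resampling snippet near the end of the chapter) are available:
```
import numpy as np

def plugin_fdr(T_vals, Tnull_vals, c):
    B = Tnull_vals.shape[1]
    R = np.sum(np.abs(T_vals) >= c)              # number of rejections at cutoff c
    V_hat = np.sum(np.abs(Tnull_vals) >= c) / B  # estimate of E(V), averaged over permutations
    return V_hat / max(R, 1)                     # estimated FDR
```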
## Homework:
- Conceptual: 1--
- Applied: At least one.
## Code Snippet
### Numpy
```
# indices j (0-indexed) with p_(j) < q*(j+1)/m, for sorted p-values `sorted_`
np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
```
### Pandas
```
decision = pd.cut(p_values,
[0, 0.05, 1],
labels=['Reject H0',
'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
categories=[True, False], # specify order: True=0, False=1
ordered=True)
pd.crosstab(decision, #row
truth, # col
rownames=['Decision'],
colnames=['H0'])
pd.concat([Khan['xtrain'], Khan['xtest']]) # concat vertically along rows
```
### Graphics
```
[ax.plot(m,
1 - (1 - alpha)**m,
label=r'$\alpha=%s$' % str(alpha))
for alpha in [0.05, 0.01, 0.001]] #plot multiple curves
ax.set_xscale('log')
ax.axhline(0.05, c='k', ls='--');
ax.axline((0, 0), (1, q/m), c='k', ls='--', linewidth=3); # line through (0,0) and (1, q/m)
```
### ISLP and statsmodels
#### 1-sample $t$-test
```
from scipy.stats import \
(ttest_1samp,
ttest_rel,
ttest_ind,
t as t_dbn)
from statsmodels.stats.multicomp import \
pairwise_tukeyhsd
from statsmodels.stats.multitest import \
multipletests as mult_test
result = ttest_1samp(X[:,0], 0) # test H0: mean of the first column of data matrix X is 0
result.pvalue
```
#### Two-sample t-test
```
ttest_rel(fund_mini['Manager1'],
          fund_mini['Manager2']).pvalue # ttest_rel is a paired t-test; requires two vectors of the same length
observedT, pvalue = ttest_ind(D2[gene_11],
D4[gene_11],
equal_var=True) #ttest_ind allows two vectors with different lengths
ttest_ind(D2[gene_11],
D4[gene_11],
equal_var=True).statistic
```
#### Multiple test
```
# obtain individual p-values
fund_mini = Fund.iloc[:,:5]
fund_mini_pvals = np.empty(5)
for i in range(5):
fund_mini_pvals[i] = ttest_1samp(fund_mini.iloc[:,i], 0).pvalue
reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
mult_test(fund_pvalues, method = "fdr_bh") # for controlling FDR; fund_pvalues: p-values for all funds, computed analogously
fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
(fund_qvalues <= 0.1).sum()
```
#### Tukey HSD Test
```
returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)])
managers = np.hstack([[i+1]*50 for i in range(5)])
tukey = pairwise_tukeyhsd(returns, managers)
print(tukey.summary())
fig, ax = plt.subplots(figsize=(8,8))
tukey.plot_simultaneous(ax=ax);
```
#### Plug-in Resampling FDR
```
m, B = 100, 10000 # m genes tested, B permutations each
idx = rng.choice(Khan['xtest'].columns, m, replace=False) # rng: a numpy Generator, e.g. np.random.default_rng(1)
T_vals = np.empty(m)
Tnull_vals = np.empty((m, B))
for j in range(m):
    col = idx[j]
    T_vals[j] = ttest_ind(D2[col],
                          D4[col],
                          equal_var=True).statistic # observed t-statistic
    D_ = np.hstack([D2[col], D4[col]])
    D_null = D_.copy()
    n_ = D2[col].shape[0] # size of the first group
    for b in range(B):
        rng.shuffle(D_null)
        ttest_ = ttest_ind(D_null[:n_],
                           D_null[n_:],
                           equal_var=True)
        Tnull_vals[j,b] = ttest_.statistic
cutoffs = np.sort(np.abs(T_vals))
FDRs, Rs, Vs = np.empty((3, m))
for j in range(m):
    R = np.sum(np.abs(T_vals) >= cutoffs[j]) # number of rejections; note cutoffs is sorted and increasing
    V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B # estimated E(V): count over all of Tnull_vals, averaged over B
    Rs[j] = R
    Vs[j] = V
    FDRs[j] = V / R
fig, ax = plt.subplots()
ax.plot(Rs, FDRs, 'b', linewidth=3)
ax.set_xlabel("Number of Rejections")
ax.set_ylabel("False Discovery Rate");
```
### Useful code snippets
#### Fisher-Yates Shuffling Algorithm
```
Tnull = np.empty(B) # permutation null statistics
D_null = D_.copy()
for b in range(B):
    rng.shuffle(D_null) # rng.shuffle uses the Fisher-Yates algorithm
    ttest_ = ttest_ind(D_null[:n_],
                       D_null[n_:],
                       equal_var=True)
    Tnull[b] = ttest_.statistic # test statistic on the shuffled data
(np.abs(Tnull) > np.abs(observedT)).mean() # permutation p-value
```
#### Plot histogram and theoretical pdf
```
fig, ax = plt.subplots(figsize=(8,8))
ax.hist(Tnull,
bins=100,
density=True, # show probability density rather than the raw count
facecolor='y',
label='Null')
xval = np.linspace(-4.2, 4.2, 1001)
ax.plot(xval,
        t_dbn.pdf(xval, D_.shape[0]-2), # theoretical null distribution: t with n_X+n_Y-2 dof
c='r')
ax.axvline(observedT,
c='b',
label='Observed')
ax.legend()
ax.set_xlabel("Null Distribution of Test Statistic");
```