# Chapter 13: Multiple Testing
When considering $m$ null hypotheses $H_{01}, \cdots, H_{0m}$, we need to be careful to avoid incorrectly rejecting too many of them, i.e., having too many false positives.
## Review of Hypothesis Testing
Divide the world into *null* and *alternative* hypotheses
- Define the null hypothesis ($H_0$, the default state of belief about the world) and the alternative hypothesis ($H_a$ or $H_1$, something different and unexpected).
- Construct the test statistic under $H_0$: the test statistic summarizes the extent to which our data are consistent with $H_0$.
For example, to test $H_0: \ \mu_t =\mu_c$, use the $t$-statistic
$$
T=\frac{\hat{\mu}_t - \hat{\mu}_c}{s\sqrt{\frac{1}{n_c}+ \frac{1}{n_t}}}
$${#eq-two-sample-test}
where $s$ is a *pooled standard deviation*, assumed to be common to both groups. If the two groups have different standard deviations, the formula is slightly different. If $n_c$ and $n_t$ are large (e.g., larger than 40), then $T$ is approximately $\mathcal{N}(0,1)$.
- Compute the $p$-value: the probability of observing a test statistic at least as extreme as the observed one, *under the assumption that $H_0$ is true*. A *small* $p$-value provides evidence *against* $H_0$.
- Decide whether to reject $H_0$: if the $p$-value is small, we will want to *reject* $H_0$ (and therefore make a potential "discovery").
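As a minimal illustration of @eq-two-sample-test, the following sketch runs a two-sample $t$-test with `scipy.stats.ttest_ind` on simulated (made-up) data:
```
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
x_t = rng.normal(loc=0.5, scale=1, size=50)  # simulated treatment group
x_c = rng.normal(loc=0.0, scale=1, size=50)  # simulated control group

# two-sample t-test with a pooled standard deviation (equal_var=True)
result = ttest_ind(x_t, x_c, equal_var=True)
print(result.statistic, result.pvalue)
```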
## How Small Is Small for a $p$-value? Type I Error
| Decision \ Truth    | $H_0$ is true     | $H_a$ is true       |
|:--------------------|:-----------------:|:-------------------:|
| Reject $H_0$        | Type I Error (FP) | Correct (TP)        |
| Do Not Reject $H_0$ | Correct (TN)      | Type II Error (FN)  |

: {.striped .hover .bordered}
The *power* of a hypothesis test is the probability of not making a Type II error given that $H_a$ is correct, i.e., power = 1 − Type II error rate = true positive rate = recall = sensitivity.
- Type I error rate: the probability of making a Type I error: if we only reject $H_0$ when the $p$-value is less than $\alpha$, the type I error rate is at most $\alpha$.
- Type I error and Type II error move in opposite directions. We can make the Type I error rate small by rejecting $H_0$ only when we are quite sure it does not hold, that is, with a smaller threshold $\alpha$; but this results in an *increase* in the Type II error rate, because we fail to reject $H_0$ more often. Conversely, we can make the Type II error rate small by rejecting $H_0$ even with modest evidence against it, but this inflates the Type I error rate.
- In practice, we view a Type I error as more *serious* than a Type II error, as the former involves declaring a scientific finding that is not correct. Hence we typically require a Type I error rate of at most $\alpha$, and try to make the Type II error rate as small as possible.
## Multiple Testing
Suppose we wish to test $m$ null hypotheses $H_{01}, \cdots, H_{0m}$. Can we reject all null hypotheses for which the corresponding $p$-value falls below $\alpha$, say 0.01?
The answer is "No", because we are almost certain to get one very small $p$-value by chance.
For example, suppose we wish to test $H_0$: the coin is fair, and we flip 1024 *fair* coins ten times each. Since the probability that a given coin comes up all tails is $\left(\frac{1}{2}\right)^{10} = \frac{1}{1024}$, we would expect on average one coin to come up all tails. For that coin, the two-sided $p$-value is less than 0.002! We would conclude it is **not fair**, i.e., we would *reject $H_0$*, even though it is a fair coin.
More generally, when testing $m$ true null hypotheses at level $\alpha=0.01$, we expect to falsely reject about $m\alpha=0.01m$ of them.
### The Family-wise Error Rate (FWER)
FWER is the probability of making *at least one* Type I error when conducting $m$ hypothesis tests.
FWER = Pr($V\ge 1$), where $V$ is the number of false rejections, defined in the table below.
|                     | $H_0$ is True | $H_0$ is False | Total |
|:-------------------:|:-------------:|:--------------:|:-----:|
| Reject $H_0$        | $V$           | $S$            | $R$   |
| Do Not Reject $H_0$ | $U$           | $W$            | $m-R$ |
| Total               | $m_0$         | $m-m_0$        | $m$   |

: {.striped .hover .bordered}
Note that power $=\frac{S}{m-m_0}$ and Type I error rate = FPR $=\frac{V}{V+U}=\frac{V}{m_0}$.
$$
\begin{array}{l l l}
\text{FWER}
&=& 1-\text{Pr(do not falsely reject any null hypotheses)} \\
&= & 1- \text{Pr}\left( \cap_{j=1}^m \left\{\text{do not falsely reject } H_{0j} \right\} \right).
\end{array}
$$
If the tests are independent and all $H_{0j}$ are true then
$$
\text{FWER} = 1- \prod_{j=1}^m (1-\alpha) = 1- (1-\alpha)^m.
$$
From the above formula, when $\alpha =0.05$ and $m=50$, $\text{FWER} = 1-(0.95)^{50} \approx 0.92$.
When $m$ is large, controlling the FWER may force us to be extremely conservative (to very rarely reject).
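A quick numerical check of this formula, assuming independent tests:
```
import numpy as np

for alpha in [0.05, 0.01, 0.001]:
    for m in [1, 10, 50, 100, 500]:
        fwer = 1 - (1 - alpha)**m  # FWER for m independent level-alpha tests
        print(f"alpha={alpha}, m={m}: FWER={fwer:.3f}")
```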
### The Bonferroni Correction
Let $A_j$ be the event that the $j$-th null hypothesis is falsely rejected. Then
$$
\begin{array}{l l l}
\text{FWER} &=& \text{Pr(falsely reject at least one null hypothesis)} \\
&= & \text{Pr}\left( \cup_{j=1}^m A_j \right)\\
& \le & \sum_{j=1}^m \text{Pr}(A_j)
\end{array}
$$
If we only reject hypotheses when the $p$-value is less than $\alpha/m$, then
$$
\text{FWER} \le \sum_{j=1}^m \text{Pr}(A_j) \le \sum_{j=1}^m \frac{\alpha}{m} = \alpha.
$$
The Bonferroni correction ensures that not too many null hypotheses are rejected falsely, and hence yields a small Type I error rate, but it may be too conservative and lead to a higher Type II error rate.
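A minimal numpy sketch of the Bonferroni rule; the `p_values` here are hypothetical (every null is true), so we expect few or no rejections:
```
import numpy as np

rng = np.random.default_rng(0)
p_values = rng.uniform(size=100)  # hypothetical p-values; every H0 is true here
alpha, m = 0.05, len(p_values)

reject = p_values < alpha / m     # Bonferroni: compare each p-value to alpha/m
print(reject.sum(), "rejections")
```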
## Holm's Step-down Method for Controlling the FWER
Holm's method is more complex than Bonferroni but leads to more rejections while still controlling the FWER! It always rejects at least as many null hypotheses as Bonferroni, so it makes fewer Type II errors and hence has greater power. *Holm is a better choice*, as it is uniformly more powerful than the Bonferroni method (a numpy sketch follows the steps below).
1. Compute the $p$-values, $p_1, \cdots, p_m$ for the $m$-null hypotheses $H_{01}, \cdots, H_{0m}$.
2. Order the $m$ $p$-values so that $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
3. Define
$$
L=\min\left\{ j: p_{(j)}>\frac{\alpha}{m+1-j} \right\}
$$
4. Reject all null hypotheses $H_{0j}$ for which $p_j <p_{(L)}$. Note $p_{(L)}$ depends on the values of all $m$ $p$-values.
5. Holm's method controls the FWER at level $\alpha$.
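A sketch of these steps in numpy (hypothetical `p_values`); in practice, `multipletests(..., method="holm")` from `statsmodels`, used in the snippets below, implements the same procedure:
```
import numpy as np

def holm_reject(p_values, alpha=0.05):
    m = len(p_values)
    order = np.argsort(p_values)                  # indices that sort the p-values
    sorted_p = p_values[order]
    # 0-indexed position of the first p_(j) exceeding alpha / (m + 1 - j)
    exceeds = np.nonzero(sorted_p > alpha / (m + 1 - np.arange(1, m + 1)))[0]
    k = exceeds[0] if exceeds.size > 0 else m     # number of rejections
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                      # reject the k smallest p-values
    return reject
```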
## Other Methods
Bonferroni and Holm are general procedures that work in most settings. In certain special cases, the following methods may give better results: more rejections (higher power) while maintaining FWER control.
- Tukey's Method: for pairwise comparisons of the differences in expected means among a number of groups. Tukey's method makes at least as many rejections as Bonferroni's method, and hence has greater power.
- Scheffé's Method: for testing arbitrary linear combinations of a set of expected means, e.g.,
$$
H_0: \frac{1}{2}(\mu_1+\mu_3) =\frac{1}{3}(\mu_2+\mu_4+\mu_5)
$$
## Controlling the *False Discovery Rate* (FDR) (Contemporary Approach)
$$
\text{FDR} = E(V/R)=E\left(\frac{\text{number of false rejections}}{\text{total number of rejections}}\right)
$$
$\text{FDR} = 1-\frac{S}{V+S} = 1- \text{precision}$
FWER controls Pr(at least one falsely rejected $H_0$) and may be too conservative; FDR controls the fraction of false rejections among the rejected set. E.g., a scientist tests $m=20{,}000$ drug candidates and wants to identify a smaller set of promising candidates to investigate further. She wants reassurance that this smaller set is really "promising", i.e., that it does not contain too many falsely rejected $H_0$'s. She can control the FDR. This may be achieved by the *Benjamini–Hochberg procedure* (a numpy sketch follows the steps below).
1. Specify $q$, the level at which to control the FDR.
2. Compute the $p$-values, $p_1, \cdots, p_m$ for the $m$-null hypotheses $H_{01}, \cdots, H_{0m}$.
3. Order the $m$ $p$-values so that $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
4. Define
$$
L=\max\left\{ j: p_{(j)}<\frac{qj}{m} \right\}
$$
5. Reject all null hypotheses $H_{0j}$ for which $p_j \le p_{(L)}$.
Then $\text{FDR} \le q$.
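A numpy sketch of the Benjamini–Hochberg steps (hypothetical `p_values`); note that the `np.where` one-liner in the Numpy snippet at the end of the chapter computes the same comparison:
```
import numpy as np

def bh_reject(p_values, q=0.1):
    m = len(p_values)
    order = np.argsort(p_values)
    sorted_p = p_values[order]
    # positions j (0-indexed) with p_(j) < q * (j+1) / m
    below = np.nonzero(sorted_p < q * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        L = below[-1] + 1             # largest such j, as a count of rejections
        reject[order[:L]] = True      # reject the L smallest p-values
    return reject
```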
## Resampling Approach (Contemporary Approach)
When the *theoretical null distribution* is unknown or requires stringent assumptions, one can use a *resampling* or *permutation* approach to approximate it.
For example, suppose that $H_0: E(X)= E(Y)$ and $H_a: E(X)\ne E(Y)$, with $n_X$ and $n_Y$ independent observations of $X$ and $Y$. The $t$-statistic in @eq-two-sample-test can be rewritten as
$$
T=\frac{\hat{\mu}_X - \hat{\mu}_Y}{s\sqrt{\frac{1}{n_X}+ \frac{1}{n_Y}}}
$$
If $X$ and $Y$ are normally distributed, then $T$ follows a $t$-distribution with $n_X+n_Y-2$ degrees of freedom, and $T$ is approximately $\mathcal{N}(0,1)$ when $n_X$ and $n_Y$ are large. But when $n_X$ and $n_Y$ are small and the data are not normal, the theoretical null distribution is not necessarily known. In this case the $p$-value can be approximated using the following *resampling approach*. The idea is that under $H_0$: $E(X)=E(Y)$, if we make the stronger assumption that the distribution of $X$ is the same as that of $Y$, then the distribution of $T$ is **invariant** if we swap some observations of $X$ with those of $Y$.
1. Compute the two-sample $t$-statistic $T$ on the original data $x_1, \cdots, x_{n_X}$ and $y_1, \cdots, y_{n_Y}$.
2. For $b=1, \cdots, B$ (where $B$ is large, e.g., 1000):
2.1 Randomly shuffle (permute) the $n_X+n_Y$ observations.
2.2 Call the first $n_X$ shuffled observations $x_1^*, \cdots, x_{n_X}^*$ and call the remaining observations $y_1^*, \cdots, y_{n_Y}^*$.
2.3 Compute the two-sample $t$-statistic on the shuffled data, and call it $T^{*b}$.
3. The $T^{*b}$'s may be considered an approximation to the distribution of $T$ under $H_0$. The $p$-value is given by
$$
\frac{\sum_{b=1}^B 1_{|T^{*b}|\ge |T|}}{B}
$$
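A self-contained sketch of this permutation test on simulated data; the snippets at the end of the chapter apply the same idea to the Khan gene-expression data:
```
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
x = rng.standard_normal(15)         # simulated sample from X
y = rng.standard_normal(12) + 0.8   # simulated sample from Y (shifted mean)

T_obs = ttest_ind(x, y, equal_var=True).statistic
D_ = np.hstack([x, y])
B, n_x = 1000, len(x)
T_null = np.empty(B)
for b in range(B):
    rng.shuffle(D_)                 # permute the n_X + n_Y observations
    T_null[b] = ttest_ind(D_[:n_x], D_[n_x:], equal_var=True).statistic

print((np.abs(T_null) >= np.abs(T_obs)).mean())   # permutation p-value
```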
## Resampling Approach for FDR
First note
$$
\text{FDR} =E\left( \frac{V}{R} \right) \approx \frac{E(V)}{R}.
$$
Suppose we reject any null hypothesis for which the test statistic exceeds $c$ in absolute value; then
$R=\sum_{j=1}^m 1_{|T_j|\ge c}$. $E(V)$ is the expected number of false positives associated with this rejection rule. Since we do not know which $H_{0j}$, $j=1, \cdots, m$, are true, we do not know which rejected $H_{0j}$'s are false positives. The resampling approach simulates data under each $H_{0j}$, $j=1, \cdots, m$, and then computes the resulting test statistics. The number of resampled test statistics exceeding $c$, averaged over the resamples, provides an estimate of $E(V)$.
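A sketch of this plug-in estimate at a single cutoff $c$, assuming `T_vals` (the $m$ observed statistics) and `Tnull_vals` (an $m \times B$ array of permutation statistics, as in the resampling snippet near the end of the chapter) are available:
```
import numpy as np

def plugin_fdr(T_vals, Tnull_vals, c):
    B = Tnull_vals.shape[1]
    R = np.sum(np.abs(T_vals) >= c)              # number of rejections at cutoff c
    V_hat = np.sum(np.abs(Tnull_vals) >= c) / B  # estimate of E(V), averaged over permutations
    return V_hat / max(R, 1)                     # estimated FDR
```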
## Homework:
- Conceptual: 1--
- Applied: At least one.
## Code Snippet
### Numpy
```
# indices j (0-indexed) with p_(j) < q*(j+1)/m, for sorted p-values `sorted_`
np.where(sorted_ < q * np.linspace(1, m, m) / m)[0]
```
### Pandas
```
decision = pd.cut(p_values,
[0, 0.05, 1],
labels=['Reject H0',
'Do not reject H0'])
truth = pd.Categorical(true_mean == 0,
categories=[True, False], # specify order: True=0, False=1
ordered=True)
pd.crosstab(decision, #row
truth, # col
rownames=['Decision'],
colnames=['H0'])
pd.concat([Khan['xtrain'], Khan['xtest']]) # concat vertically along rows
```
### Graphics
```
[ax.plot(m,
1 - (1 - alpha)**m,
label=r'$\alpha=%s$' % str(alpha))
for alpha in [0.05, 0.01, 0.001]] #plot multiple curves
ax.set_xscale('log')
ax.axhline(0.05, c='k', ls='--');
ax.axline((0, 0), (1, q/m), c='k', ls='--', linewidth=3); # line through (0,0) and (1, q/m)
```
### ISLP and statsmodels
#### 1-sample $t$-test
```
from scipy.stats import \
(ttest_1samp,
ttest_rel,
ttest_ind,
t as t_dbn)
from statsmodels.stats.multicomp import \
pairwise_tukeyhsd
from statsmodels.stats.multitest import \
multipletests as mult_test
result = ttest_1samp(X[:,0], 0) # test H0: mean of the first column of data matrix X is 0
result.pvalue
```
#### Two-sample t-test
```
ttest_rel(fund_mini['Manager1'],
          fund_mini['Manager2']).pvalue # ttest_rel is a paired t-test; requires two vectors of the same length
observedT, pvalue = ttest_ind(D2[gene_11],
D4[gene_11],
equal_var=True) #ttest_ind allows two vectors with different lengths
ttest_ind(D2[gene_11],
D4[gene_11],
equal_var=True).statistic
```
#### Multiple test
```
# obtain individual p-values
fund_mini = Fund.iloc[:,:5]
fund_mini_pvals = np.empty(5)
for i in range(5):
fund_mini_pvals[i] = ttest_1samp(fund_mini.iloc[:,i], 0).pvalue
reject, bonf = mult_test(fund_mini_pvals, method = "bonferroni")[:2]
mult_test(fund_mini_pvals, method = "holm", alpha=0.05)[:2]
mult_test(fund_pvalues, method = "fdr_bh") # for controlling FDR; fund_pvalues: p-values for all funds, computed analogously
fund_qvalues = mult_test(fund_pvalues, method = "fdr_bh")[1]
(fund_qvalues <= 0.1).sum()
```
#### Tukey HSD Test
```
returns = np.hstack([fund_mini.iloc[:,i] for i in range(5)])
managers = np.hstack([[i+1]*50 for i in range(5)])
tukey = pairwise_tukeyhsd(returns, managers)
print(tukey.summary())
fig, ax = plt.subplots(figsize=(8,8))
tukey.plot_simultaneous(ax=ax);
```
#### Plug-in Resampling FDR
```
m, B = 100, 10000 # m genes tested, B permutations each
idx = rng.choice(Khan['xtest'].columns, m, replace=False) # rng: a numpy Generator, e.g. np.random.default_rng(1)
T_vals = np.empty(m)
Tnull_vals = np.empty((m, B))
for j in range(m):
    col = idx[j]
    T_vals[j] = ttest_ind(D2[col],
                          D4[col],
                          equal_var=True).statistic # observed t-statistic
    D_ = np.hstack([D2[col], D4[col]])
    D_null = D_.copy()
    n_ = D2[col].shape[0] # size of the first group
    for b in range(B):
        rng.shuffle(D_null)
        ttest_ = ttest_ind(D_null[:n_],
                           D_null[n_:],
                           equal_var=True)
        Tnull_vals[j,b] = ttest_.statistic
cutoffs = np.sort(np.abs(T_vals))
FDRs, Rs, Vs = np.empty((3, m))
for j in range(m):
    R = np.sum(np.abs(T_vals) >= cutoffs[j]) # number of rejections; note cutoffs is sorted and increasing
    V = np.sum(np.abs(Tnull_vals) >= cutoffs[j]) / B # estimated E(V): count over all of Tnull_vals, averaged over B
    Rs[j] = R
    Vs[j] = V
    FDRs[j] = V / R
fig, ax = plt.subplots()
ax.plot(Rs, FDRs, 'b', linewidth=3)
ax.set_xlabel("Number of Rejections")
ax.set_ylabel("False Discovery Rate");
```
### Useful code snippets
#### Fisher-Yates Shuffling Algorithm
```
Tnull = np.empty(B) # permutation null statistics
D_null = D_.copy()
for b in range(B):
    rng.shuffle(D_null) # rng.shuffle uses the Fisher-Yates algorithm
    ttest_ = ttest_ind(D_null[:n_],
                       D_null[n_:],
                       equal_var=True)
    Tnull[b] = ttest_.statistic # test statistic on the shuffled data
(np.abs(Tnull) > np.abs(observedT)).mean() # permutation p-value
```
#### Plot histogram and theoretical pdf
```
fig, ax = plt.subplots(figsize=(8,8))
ax.hist(Tnull,
bins=100,
density=True, # show probability density rather than the raw count
facecolor='y',
label='Null')
xval = np.linspace(-4.2, 4.2, 1001)
ax.plot(xval,
        t_dbn.pdf(xval, D_.shape[0]-2), # theoretical null distribution: t with n_X+n_Y-2 dof
c='r')
ax.axvline(observedT,
c='b',
label='Observed')
ax.legend()
ax.set_xlabel("Null Distribution of Test Statistic");
```