---
title: "Empirical estimation of FDR"
author: "Timothy Daley"
date: "12/27/2020"
output:
  html_document:
    fig_height: 7
    fig_width: 7
    toc: yes
    code_folding: hide
    toc_float: yes
---
# A brief introduction to the theory of local and global False Discovery Rates
## Introduction
A more detailed explanation is given in my previous post [Background of two groups model](https://timydaley.github.io/twoGroupsBackground.html).
An idea I put into the CRISPhieRmix package ([R package](https://github.com/timydaley/CRISPhieRmix), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1538-6)) is a way to estimate global false discovery rates (FDRs) from local fdrs when the global FDR is too difficult to compute directly.
The global FDR is the expected fraction of false discoveries among the tests that we call significant. The way we usually think about this is that we compute some one-dimensional statistic (e.g. a $p$-value or $z$-value) for tests/categories/hypotheses $i = 1, \ldots, n$, set a threshold $p^{*}$ (or $z^{*}$ if we work in terms of $z$-values), and then call significant those tests with $p_{i} < p^{*}$ (or $| z_{i} | > z^{*}$). The expected fraction of false discoveries among the set we called significant is then the global FDR, or
$$
\mathrm{E} \left( \sum_{i = 1}^{n} 1 \left( \text{test } i \text{ is null} \cap p_{i} < p^{*} \right) / \sum_{i=1}^{n} 1 \left(p_{i} < p^{*} \right) \right).
$$
Note that there is some difficulty in defining the FDR when no tests are called significant. To get around this difficulty we can define the marginal FDR (mFDR) as the expected number of nulls called significant divided by the expected number of tests called significant, with the understanding that $0/0 = 0$ (see [Storey, 2010](http://theta.edu.pl/wp-content/uploads/2012/10/Storey_FDR_2010.pdf)). I.e.
$$
\text{mFDR} = \frac{\mathrm{E} \big( \sum_{i = 1}^{n} 1 \left( \text{test } i \text{ is null} \cap p_{i} < p^{*} \right) \big)}{\mathrm{E} \big( \sum_{i = 1}^{n} 1(p_{i} < p^{*}) \big)}.
$$
Similarly, the local fdr (following [Efron's notation](https://pdfs.semanticscholar.org/93dd/1ad905da7b09568aaf7d04c3d325772d42fc.pdf) of keeping the local fdr in lower case) is the probability that an individual test is null, given its observed statistic. So for example, if test $i$ has $p$-value $p_{i}$, then the local fdr is
$$
\Pr(\text{test } i \text{ is null} | p_{i}).
$$
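As a quick numerical illustration (with made-up parameter values, not taken from any fit below), here is the local fdr under a two-groups normal model with 95% nulls from $\mathcal{N}(0, 1)$ and 5% non-nulls from $\mathcal{N}(3, 1)$:
```{r}
# hypothetical two-groups model: 95% null N(0, 1), 5% non-null N(3, 1)
p = 0.05           # prior probability that a test is non-null
z = c(1, 2, 3, 4)  # a few observed statistics
loc_fdr = (1 - p) * dnorm(z, 0, 1) /
  ((1 - p) * dnorm(z, 0, 1) + p * dnorm(z, 3, 1))
round(loc_fdr, 3)  # larger z => smaller probability of being null
```
Note how quickly the posterior probability of being null drops as the statistic moves into the tail.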
## Empirical estimation of the global FDR
The idea for empirical estimation comes from an analogy in [Efron, 2005](https://pdfs.semanticscholar.org/93dd/1ad905da7b09568aaf7d04c3d325772d42fc.pdf): the local false discovery rate is akin to the pdf and the global FDR is akin to the cdf. We can see this from the (rather loose) definitions given above. Indeed, the global FDR can formally be defined as the integral of the local false discovery rates over the tail area, and intuitively, the overall (global) FDR at a given threshold must equal the average of the local fdrs of the tests called significant at that threshold.
This intuition leads directly to an easy way to estimate global FDRs when we have the local fdrs: we take the average of all local fdrs at or below the one under consideration as the estimate of the global FDR. In fact, this is an unbiased estimate of the marginal false discovery rate.
We'll show this in the 1-dimensional case. For a measurement $x$, we assume the measurements follow the mixture distribution $x \sim f = (1 - p) f_{0} + p f_{1}$, where $p$ is the probability that a randomly selected test is non-null, $f_{0}$ is the distribution of the null cases, and $f_{1}$ is the distribution of the non-null cases.
$$
\begin{aligned}
\text{mFDR}(x) &\approx \frac{\mathrm{E} \big( \sum_{i = 1}^{n} 1 \left( x_{i} \geq x \cap \text{test } i \text{ is null} \right) \big)}{\mathrm{E} \big( \sum_{i = 1}^{n} 1 ( x_{i} \geq x ) \big)} \\
&= \frac{\int_{|z| \geq x} (1 - p) f_{0}(z) \, dz}{\int_{|z| \geq x} f(z) \, dz} \\
&= \frac{\int_{|z| \geq x} \frac{(1 - p) f_{0}(z)}{f(z)} f(z) \, dz}{\int_{|z| \geq x} f(z) \, dz} \\
&= \frac{\mathrm{E} \big( \text{fdr}(z) \cdot 1(|z| \geq x) \big)}{\mathrm{E} \big( 1(|z| \geq x) \big)} \\
&\approx \frac{\frac{1}{n} \sum_{i = 1}^{n} \text{fdr}(x_{i}) \, 1 \big( \text{fdr}(x_{i}) \leq \text{fdr}(x) \big)}{\frac{1}{n} \sum_{i = 1}^{n} 1 \big( \text{fdr}(x_{i}) \leq \text{fdr}(x) \big)}.
\end{aligned}
$$
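The last line is the estimator itself: the estimated global FDR at a given test is just the average of all local fdrs at or below that test's local fdr. A minimal sketch in R (the function name and toy input here are mine, for illustration):
```{r}
# empirical global FDR: for each test, average every local fdr
# that is at or below that test's own local fdr
empirical_FDR = function(loc_fdr) {
  sapply(loc_fdr, function(f) mean(loc_fdr[loc_fdr <= f]))
}
# toy check on a handful of hypothetical local fdrs
empirical_FDR(c(0.01, 0.05, 0.2, 0.9))
```
The resulting estimates are automatically monotone in the local fdr, which is what we'd want from a global FDR.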
The advantage of this method is that in some cases the local fdr is easy to compute but the global FDR is difficult. I encountered such a case in my hierarchical mixture modeling of pooled CRISPR screens: computing the global FDRs there involves integrating over the level sets of a mixture model, which can be difficult. Below is an example of the level sets when we have 2 guides (measurements) per gene; more guides make it even more complicated.
![Level sets of the mixture model for 2 guides per gene](levelSets2D.png)
# Simulations
## 1-D case
The simplest case is the one-dimensional case. We'll assume a normal mixture model,
$$
x \sim (1 - p) \mathcal{N}(0, \sigma_{0}^{2}) + p \mathcal{N}(\mu, \sigma_{1}^{2}).
$$
We'll compare two approaches: fitting the mixture model and computing the global FDR from the local fdrs with the empirical method, versus the exact method (which is easy to compute in this case).
```{r}
set.seed(12345)
mu0 = 0
mu1 = 3
sigma0 = 1
sigma1 = 1
p = 0.95   # proportion of null tests
N = 10000
# 1 = non-null, 0 = null
labels = rbinom(N, size = 1, prob = 1 - p)
mu_vec = ifelse(labels == 1, mu1, mu0)
sigma_vec = ifelse(labels == 1, sigma1, sigma0)
# sanity check
stopifnot(sum(mu_vec != 0)/length(mu_vec) == sum(labels)/length(labels))
x = rnorm(N, mean = mu_vec, sd = sigma_vec)
```
```{r message=F}
mixfit = mixtools::normalmixEM2comp(x, lambda = c(0.5, 0.5),
                                    mu = c(0, 1), sigsqrd = c(1, 1))
print("fitted proportions: ")
print(mixfit$lambda)
print("fitted means: ")
print(mixfit$mu)
print("fitted sd's: ")
print(mixfit$sigma)
# taken from https://tinyheero.github.io/2015/10/13/mixture-model.html
plot_mix_comps <- function(x, mu, sigma, lam) {
  lam * dnorm(x, mu, sigma)
}
library(ggplot2)
ggplot(data.frame(x = x)) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black",
                 fill = "white") +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[1], mixfit$sigma[1], lam = mixfit$lambda[1]),
                colour = "red", lwd = 1.5) +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[2], mixfit$sigma[2], lam = mixfit$lambda[2]),
                colour = "blue", lwd = 1.5) +
  ylab("Density") + theme_bw()
```
The local fdr here is easy to compute,
$$
\Pr(\text{test } i \text{ is null} \mid x_{i}) = \frac{(1 - p) f_{0}(x_{i} \mid \mu_{0}, \sigma_{0})}{(1 - p) f_{0}(x_{i} \mid \mu_{0}, \sigma_{0}) + p f_{1}(x_{i} \mid \mu_{1}, \sigma_{1})}.
$$
Similarly, the global mFDR is equal to
$$
\text{mFDR}(x_{i}) = \frac{\int_{x \geq x_{i}} (1 - p) f_{0}(x \mid \mu_{0}, \sigma_{0}) \, dx}{\int_{x \geq x_{i}} \big( (1 - p) f_{0}(x \mid \mu_{0}, \sigma_{0}) + p f_{1}(x \mid \mu_{1}, \sigma_{1}) \big) \, dx}.
$$
Of course, the global FDR (not marginal) takes the expectation of the ratio rather than the ratio of expectations, but the expression above is much easier to compute because the integrals can be calculated from the cdf of the normal distribution.
```{r}
normal_2group_loc_fdr <- function(x, p, mu, sigma){
  # the null group is assumed to be the first component
  stopifnot((length(p) == length(mu)) &
            (length(mu) == length(sigma)) &
            (length(sigma) == 2))
  p0 = p[1]*dnorm(x, mu[1], sigma[1])  # null mass at x
  p1 = p[2]*dnorm(x, mu[2], sigma[2])  # non-null mass at x
  return(p0/(p0 + p1))
}
normal_2group_mFDR <- function(x, p, mu, sigma){
  # the null group is assumed to be the first component
  stopifnot((length(p) == length(mu)) &
            (length(mu) == length(sigma)) &
            (length(sigma) == 2))
  p0 = p[1]*pnorm(x, mu[1], sigma[1], lower.tail = FALSE)  # null tail mass
  p1 = p[2]*pnorm(x, mu[2], sigma[2], lower.tail = FALSE)  # non-null tail mass
  return(p0/(p0 + p1))
}
```
Now let's compare the methods. For comparison, we'll also include the standard Benjamini-Hochberg-corrected $p$-value method of computing FDRs.
```{r}
BH_FDR <- function(x, mu0, sigma0){
  pvals = pnorm(x, mu0, sigma0, lower.tail = FALSE)
  return(p.adjust(pvals, method = "BH"))
}
compare_fdr <- function(x, p, mu, sigma, labels){
  loc_fdr = normal_2group_loc_fdr(x, p, mu, sigma)
  # empirical global FDR: average the local fdrs over the tail
  loc_mFDR = sapply(x, function(y) mean(loc_fdr[which(x >= y)]))
  mFDR = normal_2group_mFDR(x, p, mu, sigma)
  bh = BH_FDR(x, mu[1], sigma[1])
  true_fdr = sapply(seq_along(x),
                    function(i) sum(labels[which(x >= x[i])] == 0)/length(which(x >= x[i])))
  return(data.frame(x = x,
                    loc_fdr = loc_fdr,
                    loc_mFDR = loc_mFDR,
                    mFDR = mFDR,
                    BH_FDR = bh,
                    true_FDR = true_fdr))
}
normal_2_group_fdr_comparison = compare_fdr(x, mixfit$lambda, mixfit$mu, mixfit$sigma, labels)
```
normal_2_group_fdr_comparison = compare_fdr(x, mixfit$lambda, mixfit$mu, mixfit$sigma, labels)
```
```{r}
df = data.frame(estimated_fdr = c(normal_2_group_fdr_comparison$loc_mFDR,
                                  normal_2_group_fdr_comparison$mFDR,
                                  normal_2_group_fdr_comparison$BH_FDR),
                method = rep(c("loc_mFDR", "mFDR", "BH"), each = N),
                true_fdr = rep(normal_2_group_fdr_comparison$true_FDR, times = 3))
ggplot(df, aes(y = true_fdr, x = estimated_fdr, col = method)) +
  geom_line() + scale_colour_brewer(palette = "Set1") +
  theme_bw() + geom_abline(intercept = 0, slope = 1)
```
The empirical method is on par with the exact method, though the Benjamini-Hochberg FDRs show slightly better calibration against the true FDR.
## Multi-dimensional case
Let's now take a look at the hierarchical model. We'll assume that each test/gene has 5 measurements, and that the measurements for the non-null genes themselves follow a mixture distribution, e.g. $f_{1} = (1 - q) f_{0} + q f_{2}$. This creates more uncertainty about the true positives.
```{r message=F}
set.seed(12345)
n_genes = 10000
n_guides_per_gene = 5
genes = rep(1:n_genes, each = n_guides_per_gene)
x = rep(0, times = length(genes))
mu0 = 0
mu1 = 2
sigma0 = 1
sigma1 = 1
p_gene = 0.9    # proportion of null genes
p_guide = 0.5   # proportion of inactive guides within non-null genes
gene_labels = rbinom(n_genes, size = 1, prob = 1 - p_gene)
for(i in 1:n_genes){
  if(gene_labels[i] == 0){
    x[which(genes == i)] = rnorm(n_guides_per_gene, mean = mu0, sd = sigma0)
  }
  else{
    # gene label == 1: guides are a mix of active and inactive
    guide_labels = rbinom(n_guides_per_gene, size = 1, prob = 1 - p_guide)
    mu_vec = ifelse(guide_labels == 1, mu1, mu0)
    sigma_vec = ifelse(guide_labels == 1, sigma1, sigma0)
    x[which(genes == i)] = rnorm(n_guides_per_gene, mean = mu_vec, sd = sigma_vec)
  }
}
```
```{r message=F}
mixfit = mixtools::normalmixEM2comp(x, lambda = c(0.5, 0.5),
                                    mu = c(0, 1), sigsqrd = c(1, 1))
print("fitted proportions: ")
print(mixfit$lambda)
print("fitted means: ")
print(mixfit$mu)
print("fitted sd's: ")
print(mixfit$sigma)
```
```{r}
# reuse plot_mix_comps and ggplot2 from the 1-D section
ggplot(data.frame(x = x)) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black",
                 fill = "white") +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[1], mixfit$sigma[1], lam = mixfit$lambda[1]),
                colour = "red", lwd = 1.5) +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[2], mixfit$sigma[2], lam = mixfit$lambda[2]),
                colour = "blue", lwd = 1.5) +
  ylab("Density") + theme_bw()
```
The hierarchical mixture model has likelihood (see the Methods section of [the CRISPhieRmix paper](https://link.springer.com/article/10.1186/s13059-018-1538-6))
$$
\mathcal{L}(f_{0}, f_{1}, p, q \mid x_{1}, \ldots, x_{n}) = \prod_{g = 1}^{G} \bigg( (1 - p) \prod_{i: g_{i} = g} f_{0}(x_{i}) + p \prod_{i: g_{i} = g} \big( (1 - q) f_{0}(x_{i}) + q f_{1}(x_{i}) \big) \bigg).
$$
Because $p$ and $q$ are not simultaneously identifiable, we marginalized over $q$ (with a uniform prior) to obtain gene-level local false discovery rates. Writing $\hat{\tau}$ for the estimated overall fraction of non-null guides (so that the gene-level prior of being non-null is $\hat{\tau} / q$),
$$
\text{locfdr}(g) = \int_{0}^{1} \frac{(1 - \hat{\tau} / q) \prod_{i: g_{i} = g} f_{0}(x_{i})}{(\hat{\tau} / q) \prod_{i: g_{i} = g} \big( q f_{1}(x_{i}) + (1 - q) f_{0}(x_{i}) \big) + (1 - \hat{\tau} / q) \prod_{i: g_{i} = g} f_{0}(x_{i})} \, dq.
$$
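This integral has no closed form, so it has to be approximated numerically; in CRISPhieRmix the `nMesh` argument (used below) controls the size of the grid over $q$. Here is a rough sketch of the computation under the notation above. The function, and its restriction of the grid to $q \geq \hat{\tau}$ (so that $\hat{\tau}/q$ stays a valid probability), are my own illustration, not the package's exact implementation:
```{r}
# sketch: gene-level local fdr by marginalizing over q on a grid
#   x_g     : the measurements for a single gene
#   tau_hat : estimated overall fraction of non-null guides
#   f0, f1  : fitted null and non-null densities
gene_locfdr = function(x_g, tau_hat, f0, f1, nMesh = 100) {
  q_grid = seq(tau_hat, 1, length.out = nMesh)
  locfdr_q = sapply(q_grid, function(q) {
    null_mass = (1 - tau_hat/q) * prod(f0(x_g))
    alt_mass = (tau_hat/q) * prod(q * f1(x_g) + (1 - q) * f0(x_g))
    null_mass/(null_mass + alt_mass)
  })
  mean(locfdr_q)  # uniform weighting over the grid
}
# hypothetical usage with the simulated parameters from above:
# gene_locfdr(x[genes == 1], tau_hat = 0.05,
#             f0 = function(z) dnorm(z, 0, 1),
#             f1 = function(z) dnorm(z, 2, 1))
```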
We then used the empirical FDR estimator to compute estimates for the global false discovery rates.
```{r}
crisphiermixFit = CRISPhieRmix::CRISPhieRmix(x = x, geneIds = genes,
                                             pq = 0.1, mu = 4, sigma = 1,
                                             nMesh = 100, BIMODAL = FALSE,
                                             VERBOSE = TRUE, PLOT = TRUE,
                                             screenType = "GOF")
crisphiermixFit$mixFit$lambda
crisphiermixFit$mixFit$mu
crisphiermixFit$mixFit$sigma
```
```{r}
# gene-level t-tests against the fitted null mean, BH-corrected
null_pvals = sapply(1:n_genes,
                    function(g) t.test(x[which(genes == g)], alternative = "greater",
                                       mu = crisphiermixFit$mixFit$mu[1])$p.value)
bh_fdr = data.frame(estimated_FDR = p.adjust(null_pvals, method = "fdr"),
                    pvals = null_pvals)
bh_fdr['true_fdr'] = sapply(bh_fdr$estimated_FDR,
                            function(p) sum(gene_labels[which(bh_fdr$estimated_FDR <= p)] == 0)/
                              sum(bh_fdr$estimated_FDR <= p))
head(bh_fdr)
crisphiermixFDR = data.frame(estimated_FDR = crisphiermixFit$FDR)
crisphiermixFDR['true_fdr'] = sapply(crisphiermixFDR$estimated_FDR,
                                     function(p) sum(gene_labels[which(crisphiermixFDR$estimated_FDR <= p)] == 0)/
                                       sum(crisphiermixFDR$estimated_FDR <= p))
head(crisphiermixFDR)
df = data.frame(estimated_FDR = c(crisphiermixFDR$estimated_FDR, bh_fdr$estimated_FDR),
                true_FDR = c(crisphiermixFDR$true_fdr, bh_fdr$true_fdr),
                method = rep(c("empirical FDR", "BH FDR"), each = dim(crisphiermixFDR)[1]))
ggplot(df, aes(y = true_FDR, x = estimated_FDR, col = method)) +
  geom_line() + theme_bw() + scale_colour_brewer(palette = "Set1") +
  geom_abline(intercept = 0, slope = 1)
```
What's happening with the Benjamini-Hochberg FDRs? Why are so few of them lower than 0.3?
```{r}
head(bh_fdr[which(bh_fdr$estimated_FDR < 0.3), ])
```
We see that the Benjamini-Hochberg-corrected $t$-test is not powerful enough to give meaningful insight into the low false discovery rate region, though it is reasonably calibrated over the rest of the range. The empirical estimator of the FDR, on the other hand, is very well calibrated, at least in this case where the model is correctly specified. One could argue that the empirical estimator has the advantage of knowing the correct model, but if that were the whole story we would expect the BH FDRs to be poorly calibrated as well, and that's not what we see here.