---
title: "Empirical estimation of FDR"
author: "Timothy Daley"
date: "12/27/2020"
output:
  html_document:
    fig_height: 7
    fig_width: 7
    toc: yes
    code_folding: hide
    toc_float: yes
---
# A brief introduction to the theory of local and global False Discovery Rates
## Introduction
A more detailed explanation is given in my previous post [Background of two groups model](https://timydaley.github.io/twoGroupsBackground.html).
An idea I put into the CRISPhieRmix package ([R package](https://github.com/timydaley/CRISPhieRmix), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1538-6)) is a way to estimate global false discovery rates (FDRs) from local fdrs when the global FDR is too difficult to compute directly.
The global FDR is the expected fraction of false discoveries among the tests that we call significant. The way we usually think about this is that we compute some one-dimensional statistic (e.g. a $p$-value or $z$-value) for tests/categories/hypotheses $i = 1, \ldots, n$, set a threshold $p^{*}$ (or $z^{*}$ if we work in terms of $z$-values), and then call significant those tests with $p_{i} < p^{*}$ (or $| z_{i} | > z^{*}$). The expected fraction of false discoveries among the set we called significant is then the global FDR, or
$$
\mathrm{E} \left( \sum_{i = 1}^{n} 1 \left( \text{test } i \text{ is null} \cap p_{i} < p^{*} \right) / \sum_{i=1}^{n} 1 \left(p_{i} < p^{*} \right) \right).
$$
Note that there is some difficulty in defining the FDR when no tests are called significant. To get around this difficulty we can define the marginal FDR (mFDR) as the expected number of nulls called significant divided by the expected number of tests called significant, with the understanding that $0/0 = 0$ (see [Storey, 2010](http://theta.edu.pl/wp-content/uploads/2012/10/Storey_FDR_2010.pdf)). I.e.
$$
\text{mFDR} = \frac{\mathrm{E} \big( \sum_{i = 1}^{n} 1 \left( \text{test } i \text{ is null} \cap p_{i} < p^{*} \right) \big)}{\mathrm{E} \big( \sum_{i = 1}^{n} 1(p_{i} < p^{*}) \big)}.
$$
Similarly, the local fdr (following [Efron's notation](https://pdfs.semanticscholar.org/93dd/1ad905da7b09568aaf7d04c3d325772d42fc.pdf) of keeping the local fdr in lower case) is the probability that an individual test is null, given its observed statistic. So for example, if test $i$ has $p$-value $p_{i}$, then the local fdr is
$$
\Pr(\text{test } i \text{ is null} | p_{i}).
$$
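As a quick numerical illustration (with made-up parameter values, not taken from any fit below), here is the local fdr under a two-groups normal model with 95% nulls from $\mathcal{N}(0, 1)$ and 5% non-nulls from $\mathcal{N}(3, 1)$:
```{r}
# hypothetical two-groups model: 95% null N(0, 1), 5% non-null N(3, 1)
p = 0.05           # prior probability that a test is non-null
z = c(1, 2, 3, 4)  # a few observed statistics
loc_fdr = (1 - p) * dnorm(z, 0, 1) /
  ((1 - p) * dnorm(z, 0, 1) + p * dnorm(z, 3, 1))
round(loc_fdr, 3)  # larger z => smaller probability of being null
```
Note how quickly the posterior probability of being null drops as the statistic moves into the tail.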
## Empirical estimation of the global FDR
The idea for empirical estimation comes from an analogy in [Efron, 2005](https://pdfs.semanticscholar.org/93dd/1ad905da7b09568aaf7d04c3d325772d42fc.pdf): the local false discovery rate is akin to the pdf and the global FDR is akin to the cdf. We can see this from the (rather loose) definitions given above. Indeed, the global FDR can formally be defined as the integral of the local false discovery rates over the tail area, and intuitively, the overall (global) FDR at a given threshold must equal the average of the local fdrs of the tests called significant at that threshold.
This intuition leads directly to an easy way to estimate global FDRs when we have the local fdrs: we take the average of all local fdrs at or below the one under consideration as the estimate of the global FDR. In fact, this is an unbiased estimate of the marginal false discovery rate.
We'll show this in the 1-dimensional case. For a measurement $x$, we assume the measurements follow the mixture distribution $x \sim f = (1 - p) f_{0} + p f_{1}$, where $p$ is the probability that a randomly selected test is non-null, $f_{0}$ is the distribution of the null cases, and $f_{1}$ is the distribution of the non-null cases.
$$
\begin{aligned}
\text{mFDR}(x) &\approx \frac{\mathrm{E} \big( \sum_{i = 1}^{n} 1 \left( x_{i} \geq x \cap \text{test } i \text{ is null} \right) \big)}{\mathrm{E} \big( \sum_{i = 1}^{n} 1 ( x_{i} \geq x ) \big)} \\
&= \frac{\int_{|z| \geq x} (1 - p) f_{0}(z) \, dz}{\int_{|z| \geq x} f(z) \, dz} \\
&= \frac{\int_{|z| \geq x} \frac{(1 - p) f_{0}(z)}{f(z)} f(z) \, dz}{\int_{|z| \geq x} f(z) \, dz} \\
&= \frac{\mathrm{E} \big( \text{fdr}(z) \cdot 1(|z| \geq x) \big)}{\mathrm{E} \big( 1(|z| \geq x) \big)} \\
&\approx \frac{\frac{1}{n} \sum_{i = 1}^{n} \text{fdr}(x_{i}) \, 1 \big( \text{fdr}(x_{i}) \leq \text{fdr}(x) \big)}{\frac{1}{n} \sum_{i = 1}^{n} 1 \big( \text{fdr}(x_{i}) \leq \text{fdr}(x) \big)}.
\end{aligned}
$$
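The last line is the estimator itself: the estimated global FDR at a given test is just the average of all local fdrs at or below that test's local fdr. A minimal sketch in R (the function name and toy input here are mine, for illustration):
```{r}
# empirical global FDR: for each test, average every local fdr
# that is at or below that test's own local fdr
empirical_FDR = function(loc_fdr) {
  sapply(loc_fdr, function(f) mean(loc_fdr[loc_fdr <= f]))
}
# toy check on a handful of hypothetical local fdrs
empirical_FDR(c(0.01, 0.05, 0.2, 0.9))
```
The resulting estimates are automatically monotone in the local fdr, which is what we'd want from a global FDR.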
The advantage of this method is that in some cases the local fdr is easy to compute but the global FDR is difficult. I encountered such a case in my hierarchical mixture modeling of pooled CRISPR screens: computing the global FDRs there involves integrating over the level sets of a mixture model, which can be difficult. Below is an example of the level sets when we have 2 guides (measurements) per gene; more guides make it even more complicated.
![Level sets of the mixture model for 2 guides per gene](levelSets2D.png)
# Simulations
## 1-D case
The simplest case is the one-dimensional case. We'll assume a normal mixture model,
$$
x \sim (1 - p) \mathcal{N}(0, \sigma_{0}^{2}) + p \mathcal{N}(\mu, \sigma_{1}^{2}).
$$
We'll compare two approaches: fitting the mixture model and computing the global FDR from the local fdrs with the empirical method, versus the exact method (which is easy to compute in this case).
```{r}
set.seed(12345)
mu0 = 0
mu1 = 3
sigma0 = 1
sigma1 = 1
p = 0.95   # proportion of null tests
N = 10000
# 1 = non-null, 0 = null
labels = rbinom(N, size = 1, prob = 1 - p)
mu_vec = ifelse(labels == 1, mu1, mu0)
sigma_vec = ifelse(labels == 1, sigma1, sigma0)
# sanity check
stopifnot(sum(mu_vec != 0)/length(mu_vec) == sum(labels)/length(labels))
x = rnorm(N, mean = mu_vec, sd = sigma_vec)
```
```{r message=F}
mixfit = mixtools::normalmixEM2comp(x, lambda = c(0.5, 0.5),
                                    mu = c(0, 1), sigsqrd = c(1, 1))
print("fitted proportions: ")
print(mixfit$lambda)
print("fitted means: ")
print(mixfit$mu)
print("fitted sd's: ")
print(mixfit$sigma)
# taken from https://tinyheero.github.io/2015/10/13/mixture-model.html
plot_mix_comps <- function(x, mu, sigma, lam) {
  lam * dnorm(x, mu, sigma)
}
library(ggplot2)
ggplot(data.frame(x = x)) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black",
                 fill = "white") +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[1], mixfit$sigma[1], lam = mixfit$lambda[1]),
                colour = "red", lwd = 1.5) +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[2], mixfit$sigma[2], lam = mixfit$lambda[2]),
                colour = "blue", lwd = 1.5) +
  ylab("Density") + theme_bw()
```
The local fdr here is easy to compute,
$$
\Pr(\text{test } i \text{ is null} \mid x_{i}) = \frac{(1 - p) f_{0}(x_{i} \mid \mu_{0}, \sigma_{0})}{(1 - p) f_{0}(x_{i} \mid \mu_{0}, \sigma_{0}) + p f_{1}(x_{i} \mid \mu_{1}, \sigma_{1})}.
$$
Similarly, the global mFDR is equal to
$$
\text{mFDR}(x_{i}) = \frac{\int_{x \geq x_{i}} (1 - p) f_{0}(x \mid \mu_{0}, \sigma_{0}) \, dx}{\int_{x \geq x_{i}} \big( (1 - p) f_{0}(x \mid \mu_{0}, \sigma_{0}) + p f_{1}(x \mid \mu_{1}, \sigma_{1}) \big) \, dx}.
$$
Of course, the global FDR (not marginal) takes the expectation of the ratio rather than the ratio of expectations, but the expression above is much easier to compute because the integrals can be calculated from the cdf of the normal distribution.
```{r}
normal_2group_loc_fdr <- function(x, p, mu, sigma){
  # the null group is assumed to be the first component
  stopifnot((length(p) == length(mu)) &
            (length(mu) == length(sigma)) &
            (length(sigma) == 2))
  p0 = p[1]*dnorm(x, mu[1], sigma[1])  # null mass at x
  p1 = p[2]*dnorm(x, mu[2], sigma[2])  # non-null mass at x
  return(p0/(p0 + p1))
}
normal_2group_mFDR <- function(x, p, mu, sigma){
  # the null group is assumed to be the first component
  stopifnot((length(p) == length(mu)) &
            (length(mu) == length(sigma)) &
            (length(sigma) == 2))
  p0 = p[1]*pnorm(x, mu[1], sigma[1], lower.tail = FALSE)  # null tail mass
  p1 = p[2]*pnorm(x, mu[2], sigma[2], lower.tail = FALSE)  # non-null tail mass
  return(p0/(p0 + p1))
}
```
Now let's compare the methods. For comparison, we'll also include the standard Benjamini-Hochberg-corrected $p$-value method of computing FDRs.
```{r}
BH_FDR <- function(x, mu0, sigma0){
  pvals = pnorm(x, mu0, sigma0, lower.tail = FALSE)
  return(p.adjust(pvals, method = "BH"))
}
compare_fdr <- function(x, p, mu, sigma, labels){
  loc_fdr = normal_2group_loc_fdr(x, p, mu, sigma)
  # empirical global FDR: average the local fdrs over the tail
  loc_mFDR = sapply(x, function(y) mean(loc_fdr[which(x >= y)]))
  mFDR = normal_2group_mFDR(x, p, mu, sigma)
  bh = BH_FDR(x, mu[1], sigma[1])
  true_fdr = sapply(seq_along(x),
                    function(i) sum(labels[which(x >= x[i])] == 0)/length(which(x >= x[i])))
  return(data.frame(x = x,
                    loc_fdr = loc_fdr,
                    loc_mFDR = loc_mFDR,
                    mFDR = mFDR,
                    BH_FDR = bh,
                    true_FDR = true_fdr))
}
normal_2_group_fdr_comparison = compare_fdr(x, mixfit$lambda, mixfit$mu, mixfit$sigma, labels)
```
normal_2_group_fdr_comparison = compare_fdr(x, mixfit$lambda, mixfit$mu, mixfit$sigma, labels)
```
```{r}
df = data.frame(estimated_fdr = c(normal_2_group_fdr_comparison$loc_mFDR,
                                  normal_2_group_fdr_comparison$mFDR,
                                  normal_2_group_fdr_comparison$BH_FDR),
                method = rep(c("loc_mFDR", "mFDR", "BH"), each = N),
                true_fdr = rep(normal_2_group_fdr_comparison$true_FDR, times = 3))
ggplot(df, aes(y = true_fdr, x = estimated_fdr, col = method)) +
  geom_line() + scale_colour_brewer(palette = "Set1") +
  theme_bw() + geom_abline(intercept = 0, slope = 1)
```
The empirical method is on par with the exact method, though the Benjamini-Hochberg FDRs show slightly better calibration against the true FDR.
## Multi-dimensional case
Let's now take a look at the hierarchical model. We'll assume that each test/gene has 5 measurements, and that the measurements for the non-null genes themselves follow a mixture distribution, e.g. $f_{1} = (1 - q) f_{0} + q f_{2}$. This creates more uncertainty about the true positives.
```{r message=F}
set.seed(12345)
n_genes = 10000
n_guides_per_gene = 5
genes = rep(1:n_genes, each = n_guides_per_gene)
x = rep(0, times = length(genes))
mu0 = 0
mu1 = 2
sigma0 = 1
sigma1 = 1
p_gene = 0.9    # proportion of null genes
p_guide = 0.5   # proportion of inactive guides within non-null genes
gene_labels = rbinom(n_genes, size = 1, prob = 1 - p_gene)
for(i in 1:n_genes){
  if(gene_labels[i] == 0){
    x[which(genes == i)] = rnorm(n_guides_per_gene, mean = mu0, sd = sigma0)
  }
  else{
    # gene label == 1: guides are a mix of active and inactive
    guide_labels = rbinom(n_guides_per_gene, size = 1, prob = 1 - p_guide)
    mu_vec = ifelse(guide_labels == 1, mu1, mu0)
    sigma_vec = ifelse(guide_labels == 1, sigma1, sigma0)
    x[which(genes == i)] = rnorm(n_guides_per_gene, mean = mu_vec, sd = sigma_vec)
  }
}
```
```{r message=F}
mixfit = mixtools::normalmixEM2comp(x, lambda = c(0.5, 0.5),
                                    mu = c(0, 1), sigsqrd = c(1, 1))
print("fitted proportions: ")
print(mixfit$lambda)
print("fitted means: ")
print(mixfit$mu)
print("fitted sd's: ")
print(mixfit$sigma)
```
```{r}
# reuse plot_mix_comps and ggplot2 from the 1-D section
ggplot(data.frame(x = x)) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black",
                 fill = "white") +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[1], mixfit$sigma[1], lam = mixfit$lambda[1]),
                colour = "red", lwd = 1.5) +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(mixfit$mu[2], mixfit$sigma[2], lam = mixfit$lambda[2]),
                colour = "blue", lwd = 1.5) +
  ylab("Density") + theme_bw()
```
The hierarchical mixture model has likelihood (see the Methods section of [the CRISPhieRmix paper](https://link.springer.com/article/10.1186/s13059-018-1538-6))
$$
\mathcal{L}(f_{0}, f_{1}, p, q \mid x_{1}, \ldots, x_{n}) = \prod_{g = 1}^{G} \bigg( (1 - p) \prod_{i: g_{i} = g} f_{0}(x_{i}) + p \prod_{i: g_{i} = g} \big( (1 - q) f_{0}(x_{i}) + q f_{1}(x_{i}) \big) \bigg).
$$
Because $p$ and $q$ are not simultaneously identifiable, we marginalized over $q$ (with a uniform prior) to obtain gene-level local false discovery rates. Writing $\hat{\tau}$ for the estimated overall fraction of non-null guides (so that the gene-level prior of being non-null is $\hat{\tau} / q$),
$$
\text{locfdr}(g) = \int_{0}^{1} \frac{(1 - \hat{\tau} / q) \prod_{i: g_{i} = g} f_{0}(x_{i})}{(\hat{\tau} / q) \prod_{i: g_{i} = g} \big( q f_{1}(x_{i}) + (1 - q) f_{0}(x_{i}) \big) + (1 - \hat{\tau} / q) \prod_{i: g_{i} = g} f_{0}(x_{i})} \, dq.
$$
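This integral has no closed form, so it has to be approximated numerically; in CRISPhieRmix the `nMesh` argument (used below) controls the size of the grid over $q$. Here is a rough sketch of the computation under the notation above. The function, and its restriction of the grid to $q \geq \hat{\tau}$ (so that $\hat{\tau}/q$ stays a valid probability), are my own illustration, not the package's exact implementation:
```{r}
# sketch: gene-level local fdr by marginalizing over q on a grid
#   x_g     : the measurements for a single gene
#   tau_hat : estimated overall fraction of non-null guides
#   f0, f1  : fitted null and non-null densities
gene_locfdr = function(x_g, tau_hat, f0, f1, nMesh = 100) {
  q_grid = seq(tau_hat, 1, length.out = nMesh)
  locfdr_q = sapply(q_grid, function(q) {
    null_mass = (1 - tau_hat/q) * prod(f0(x_g))
    alt_mass = (tau_hat/q) * prod(q * f1(x_g) + (1 - q) * f0(x_g))
    null_mass/(null_mass + alt_mass)
  })
  mean(locfdr_q)  # uniform weighting over the grid
}
# hypothetical usage with the simulated parameters from above:
# gene_locfdr(x[genes == 1], tau_hat = 0.05,
#             f0 = function(z) dnorm(z, 0, 1),
#             f1 = function(z) dnorm(z, 2, 1))
```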
We then used the empirical FDR estimator to compute estimates for the global false discovery rates.
```{r}
crisphiermixFit = CRISPhieRmix::CRISPhieRmix(x = x, geneIds = genes,
                                             pq = 0.1, mu = 4, sigma = 1,
                                             nMesh = 100, BIMODAL = FALSE,
                                             VERBOSE = TRUE, PLOT = TRUE,
                                             screenType = "GOF")
crisphiermixFit$mixFit$lambda
crisphiermixFit$mixFit$mu
crisphiermixFit$mixFit$sigma
```
```{r}
# gene-level t-tests against the fitted null mean, BH-corrected
null_pvals = sapply(1:n_genes,
                    function(g) t.test(x[which(genes == g)], alternative = "greater",
                                       mu = crisphiermixFit$mixFit$mu[1])$p.value)
bh_fdr = data.frame(estimated_FDR = p.adjust(null_pvals, method = "fdr"),
                    pvals = null_pvals)
bh_fdr['true_fdr'] = sapply(bh_fdr$estimated_FDR,
                            function(p) sum(gene_labels[which(bh_fdr$estimated_FDR <= p)] == 0)/
                              sum(bh_fdr$estimated_FDR <= p))
head(bh_fdr)
crisphiermixFDR = data.frame(estimated_FDR = crisphiermixFit$FDR)
crisphiermixFDR['true_fdr'] = sapply(crisphiermixFDR$estimated_FDR,
                                     function(p) sum(gene_labels[which(crisphiermixFDR$estimated_FDR <= p)] == 0)/
                                       sum(crisphiermixFDR$estimated_FDR <= p))
head(crisphiermixFDR)
df = data.frame(estimated_FDR = c(crisphiermixFDR$estimated_FDR, bh_fdr$estimated_FDR),
                true_FDR = c(crisphiermixFDR$true_fdr, bh_fdr$true_fdr),
                method = rep(c("empirical FDR", "BH FDR"), each = dim(crisphiermixFDR)[1]))
ggplot(df, aes(y = true_FDR, x = estimated_FDR, col = method)) +
  geom_line() + theme_bw() + scale_colour_brewer(palette = "Set1") +
  geom_abline(intercept = 0, slope = 1)
```
What's happening with the Benjamini-Hochberg FDRs? Why are so few of them lower than 0.3?
```{r}
head(bh_fdr[which(bh_fdr$estimated_FDR < 0.3), ])
```
We see that the Benjamini-Hochberg-corrected $t$-test is not powerful enough to give meaningful insight into the low false discovery rate region, though it is reasonably calibrated over the rest of the range. The empirical estimator of the FDR, on the other hand, is very well calibrated, at least in this case where the model is correctly specified. One could argue that the empirical estimator has the advantage of knowing the correct model, but if that were the whole story we would expect the BH FDRs to be poorly calibrated as well, and that's not what we see here.