---
title: "Base Ranker: Bayesian Confidence Propagation Neural Network"
author:
  - name: Nan Xiao
    url: https://nanx.me/
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
  - name: Soner Koc
    url: https://github.com/skoc
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
  - name: Kaushik Ghose
    url: https://kaushikghose.wordpress.com/
    affiliation: Seven Bridges
    affiliation_url: https://www.sevenbridges.com/
date: "`r Sys.Date()`"
output: distill::distill_article
bibliography: rankv.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, cache = TRUE)
```

# Data Model

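The BCPNN method uses the information component (IC) to measure the strength of association between a vaccine and a symptom. The IC is a pointwise mutual information measure: it compares the observed joint reporting probability with the probability expected if the vaccine and the symptom were reported independently.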

Let $p_i$ be the probability that exposure to the target vaccine $i$ is reported, $p_j$ be the probability that the target symptom $j$ is reported, and $p_{ij}$ be the joint probability that a report mentions both the target vaccine $i$ and the target symptom $j$. Bate et al. [@bate1998] define the metric $\text{IC}_{ij}$ as

$$
\text{IC}_{ij} = \log_2\frac{p_{ij}}{p_i p_j}.
$$
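
As a quick numerical illustration of the definition, the sketch below evaluates the IC directly from probabilities. The helper name `ic_raw()` and the example probabilities are made up for illustration and are not part of the analysis.

```r
# Hypothetical helper: IC from known probabilities (no Bayesian shrinkage)
ic_raw <- function(p_ij, p_i, p_j) {
  log2(p_ij / (p_i * p_j))
}

# Joint reporting is 4 times more likely than expected under independence,
# so the IC is approximately 2
ic_assoc <- ic_raw(p_ij = 0.02, p_i = 0.1, p_j = 0.05)

# Under exact independence (p_ij = p_i * p_j) the IC is 0
ic_indep <- ic_raw(p_ij = 0.1 * 0.05, p_i = 0.1, p_j = 0.05)

c(associated = ic_assoc, independent = ic_indep)
```

A positive IC means the pair is reported together more often than independence would predict; a negative IC means less often.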

Recall the contingency table for target vaccine $i$ and target symptom $j$:

| Target vaccine | Target symptom | All other symptoms       | Total     |
| :------------- | :------------- | :----------------------- | :-------- |
| Yes            | $n_{ij}$       | $n_i - n_{ij}$           | $n_i$     |
| No             | $n_j - n_{ij}$ | $n - n_i - n_j + n_{ij}$ | $n - n_i$ |
| Total          | $n_j$          | $n - n_j$                | $n$       |
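
The table can be rebuilt from the four counts $n_{ij}$, $n_i$, $n_j$, and $n$. The sketch below does this for made-up counts; `make_table()` is a hypothetical helper used only for illustration.

```r
# Hypothetical helper: 2 x 2 contingency table from the four summary counts
make_table <- function(n_ij, n_i, n_j, n) {
  tab <- matrix(
    c(n_ij,       n_i - n_ij,
      n_j - n_ij, n - n_i - n_j + n_ij),
    nrow = 2, byrow = TRUE,
    dimnames = list(vaccine = c("yes", "no"), symptom = c("target", "other"))
  )
  # All four cells must be non-negative for the counts to be consistent
  stopifnot(all(tab >= 0))
  tab
}

# Illustrative counts (not taken from VAERS)
tab <- make_table(n_ij = 20, n_i = 100, n_j = 50, n = 1000)
tab
rowSums(tab)  # recovers n_i and n - n_i
colSums(tab)  # recovers n_j and n - n_j
```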

Let $n_{ij}$ be the number of reports for the vaccine-symptom pair $(i, j)$ and $n$ be the total number of reports. The BCPNN data model assumes

$$
n_{ij} | p_{ij} \sim \text{Binomial}(n, p_{ij}),\\
p_{ij} \sim \text{Beta}(\alpha_{ij}, \beta_{ij})
$$

where

$$
\alpha_{ij} = 1,\\
\beta_{ij} = \frac{1}{E(p_i | n_i) \, E(p_j | n_j)} - 1.
$$

Under the assumption of independence between vaccine and symptom, the row and column marginal counts of the contingency table, $n_i$ and $n_j$, are modeled as

$$
n_i | p_i \sim \text{Binomial}(n, p_i),\\
n_j | p_j \sim \text{Binomial}(n, p_j)
$$

where

$$
p_i \sim \text{Beta}(1, 1),\\
p_j \sim \text{Beta}(1, 1).
$$
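
For reference, the posterior means of $p_i$ and $p_j$ follow from Beta-Binomial conjugacy. The short derivation below is a standard result, written out here only to make the later formulas easier to read:

$$
p_i | n_i \sim \text{Beta}(n_i + 1, n - n_i + 1), \qquad
E(p_i | n_i) = \frac{n_i + 1}{n + 2},
$$

and likewise $E(p_j | n_j) = (n_j + 1) / (n + 2)$, so that

$$
E(p_i | n_i) \, E(p_j | n_j) = \frac{(n_i + 1)(n_j + 1)}{(n + 2)^2}.
$$

The reciprocal of this product is the quantity $\gamma$ that appears in the variance estimate.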

The IC estimate is

$$
\widehat{\text{IC}}_{ij} = \log_2 \frac{(n_{ij} + 1) (n + 2)^2}{(n + 2)^2 + n (n_i + 1) (n_j + 1)}.
$$

The variance estimate is

$$
\hat{\sigma}_{ij}^2 = \frac{1}{(\log 2)^2} \left( \frac{n - n_{ij} + \gamma - 1}{(n_{ij} + 1)(n + \gamma + 1)} + \frac{n - n_i + 1}{(n_i + 1)(n + 3)} + \frac{n - n_j + 1}{(n_j + 1)(n + 3)} \right)
$$

where

$$
\gamma = \frac{(n+2)^2}{(n_i + 1)(n_j + 1)}.
$$
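
The closed-form estimates can be checked by hand for a single pair. The sketch below is a minimal illustration, assuming the prior for $p_{ij}$ is $\text{Beta}(1, \gamma - 1)$, which is what the $\gamma$ terms in the variance formula imply. Under that assumption the IC estimate is the log ratio of posterior means, $E(p_{ij} | n_{ij}) / (E(p_i | n_i) E(p_j | n_j))$. The function `bcpnn_ic()` and the counts are hypothetical and are not part of PhViD.

```r
# Hypothetical helper: closed-form BCPNN estimates for vaccine-symptom pairs.
# Works on scalars or on vectors of equal length.
bcpnn_ic <- function(n_ij, n_i, n_j, n) {
  # gam is the reciprocal of the product of the posterior means of p_i and p_j
  gam <- (n + 2)^2 / ((n_i + 1) * (n_j + 1))

  # Posterior mean of p_ij, assuming a Beta(1, gam - 1) prior
  e_pij <- (n_ij + 1) / (n + gam)

  # IC estimate: log2 of E(p_ij) / (E(p_i) * E(p_j)), where 1 / gam = E(p_i) * E(p_j)
  ic <- log2(e_pij * gam)

  # Approximate variance of the IC, term by term as in the formula above
  v <- (1 / log(2)^2) * (
    (n - n_ij + gam - 1) / ((n_ij + 1) * (n + gam + 1)) +
      (n - n_i + 1) / ((n_i + 1) * (n + 3)) +
      (n - n_j + 1) / ((n_j + 1) * (n + 3))
  )

  data.frame(ic = ic, var = v, sd = sqrt(v))
}

# Illustrative counts (not taken from VAERS)
res <- bcpnn_ic(n_ij = 20, n_i = 100, n_j = 50, n = 1000)
res
```

For these counts the raw ratio of observed frequencies gives an IC of 2, and the estimate returned here is somewhat smaller because the prior shrinks it toward 0.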

# Computation

Load the packages for BCPNN-based signal detection and ranking:

```{r}
suppressMessages(library("PhViD"))
library("kableExtra")
```

Load the preprocessed VAERS data, keep the first three columns, and convert the result to the format that PhViD expects:

```{r}
df_p <- readRDS("data-processed/df_p.rds")
df_p <- df_p[, 1:3]
df_v <- as.PhViD(df_p, MARGIN.THRES = 10)
```
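
To make the expected input concrete, the sketch below builds a toy three-column data frame (vaccine label, symptom label, report count) and computes the marginal counts in base R. The data and column names are invented, and the filtering step only illustrates the idea behind a marginal threshold. It is not PhViD's implementation, which may differ in detail (for example, in whether margins are recomputed after filtering).

```r
# Toy input in the three-column layout: label, label, count (made-up data)
toy <- data.frame(
  vaccine = c("A", "A", "B", "B", "C"),
  symptom = c("fever", "rash", "fever", "rash", "fever"),
  n11 = c(20, 5, 10, 15, 2)
)

# Marginal counts: n_i per vaccine, n_j per symptom, and the grand total n
toy$n_i <- ave(toy$n11, toy$vaccine, FUN = sum)
toy$n_j <- ave(toy$n11, toy$symptom, FUN = sum)
n <- sum(toy$n11)

# Illustrative marginal filter: keep pairs whose margins both reach 10
margin_thres <- 10
toy_kept <- toy[toy$n_i >= margin_thres & toy$n_j >= margin_thres, ]
toy_kept
```

In this toy example, vaccine C has only 2 reports in total, so its single pair is dropped.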

Compute the information component from the Bayesian neural network model [@bate1998; @noren2006] and rank the pairs by the 2.5% quantile of the posterior distribution of the IC:

```{r}
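# MIN.n11 = 10 keeps pairs with at least 10 reports; DECISION = 3 with
# RANKSTAT = 2 bases signals on the 2.5% posterior quantile of the IC
lst_bcpnn <- BCPNN(df_v, MIN.n11 = 10, DECISION = 3, RANKSTAT = 2)
# Sort by the ranking statistic (largest first) and keep the first five columns
ord <- order(lst_bcpnn$SIGNALS$`Q_0.025(log(IC))`, decreasing = TRUE)
df_bcpnn <- lst_bcpnn$SIGNALS[ord, 1:5]
row.names(df_bcpnn) <- NULL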
```

View the top-ranked vaccine-adverse event pairs:

```{r}
head(df_bcpnn) %>% kable() %>% kable_styling()
```

```{r,echo=FALSE}
saveRDS(df_bcpnn, file = "data-processed/df_bcpnn.rds")
```