Biased (preferential) testing means that infected individuals have higher odds of being tested than non-infected individuals. This can lead to an overestimate of the infected population and hence an underestimate of the Infection Fatality Rate (IFR). Using a Bayesian prior on the testing bias, this project attempts to illustrate this effect based on current data from selected countries.
Capstone project for the course Bayesian Statistics: Techniques and Models by the University of California, Santa Cruz, hosted on Coursera.
Infected individuals are more likely to be tested than non-infected ones. This phenomenon is described as biased (or preferential) testing, which leads to biased estimates of the number of infected people in a population, which in turn leads to a biased estimate of the Infection Fatality Rate (IFR). This project attempts to illustrate this effect using Bayesian statistics and data from selected countries. The result is an estimate of the uncertainty introduced by the testing bias.
A commonly used but misleading metric for describing the risk of dying from a COVID-19 infection is the Case Fatality Rate (CFR), which is given by the ratio of deaths to confirmed cases.
The data used in this project comes from Our World in Data [1]. Specifically, the data on all confirmed cases and deaths described in [2] is used to obtain values for all selected countries and observed variables. The countries were selected based on prior knowledge of the reliability of the reported data and of the degree of preferential testing. The Python script owid.py downloads the latest data and performs the following steps:
- Only columns relevant to the stated problem are retained
- Only selected countries for which data is available are retained
- For each country, only the most recent data is retained
This leaves us with the following data set:
## # A tibble: 5 × 6
## iso_code location population total_tests total_cases total_deaths
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ESP Spain 46745211 72847345 6294745 89405
## 2 GBR United Kingdom 68207114 368855703 12774849 148534
## 3 DEU Germany 83900471 93337884 7208790 112161
## 4 NLD Netherlands 17173094 22433872 3229270 21104
## 5 ITA Italy 60367471 145453492 6975465 138474
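The filtering steps performed by owid.py can be sketched roughly as follows. This is not the project's script: it is a minimal, standard-library-only illustration that runs on a tiny made-up CSV instead of downloading the OWID file. The column names are taken from the table above; the `date` column, the helper name `latest_complete_rows`, and the reading of "most current" as "latest row with no missing values" are assumptions.

```python
import csv
import io

# Made-up stand-in for the downloaded CSV. The numbers are NOT real data;
# the "date" and "new_cases" columns are assumed for illustration.
SAMPLE_CSV = """iso_code,location,date,population,total_tests,total_cases,total_deaths,new_cases
ESP,Spain,2021-12-30,100,50,10,1,2
ESP,Spain,2021-12-31,100,60,12,1,2
FRA,France,2021-12-31,200,,30,3,5
NLD,Netherlands,2021-12-31,80,40,8,,1
NLD,Netherlands,2021-12-29,80,35,7,1,1
"""

KEEP_COLUMNS = ["iso_code", "location", "population",
                "total_tests", "total_cases", "total_deaths"]
SELECTED_COUNTRIES = {"ESP", "NLD"}


def latest_complete_rows(csv_text, countries, columns):
    """Return {iso_code: row} with the latest complete row per selected country."""
    latest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["iso_code"] not in countries:
            continue  # step 2: keep selected countries only
        if any(row[col] == "" for col in columns):
            continue  # skip rows where a needed value is missing
        best = latest.get(row["iso_code"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if best is None or row["date"] > best["date"]:
            latest[row["iso_code"]] = row  # step 3: keep the latest row
    # step 1: keep relevant columns only
    return {iso: {col: row[col] for col in columns}
            for iso, row in latest.items()}


result = latest_complete_rows(SAMPLE_CSV, SELECTED_COUNTRIES, KEEP_COLUMNS)
for iso_code, row in sorted(result.items()):
    print(iso_code, row)
```

In this toy input, the latest Netherlands row lacks a death count, so the sketch falls back to the previous complete row; whether owid.py handles missing values the same way is not stated in this report.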
For demonstration purposes, let's assume unbiased testing for this section. MLE point estimates for both the IR and IFR are then straightforward to obtain. To be clear, assuming unbiased testing means that we interpret the tests as a perfectly random and representative sample of the overall population, without bias toward positive or negative cases. Since this is very likely not the case in reality [3][4][5], these estimates are not very accurate. Using
The basic framework of the model follows the paper Bayesian adjustment for preferential testing in estimating infection fatality rates, as motivated by the COVID-19 pandemic [3], which assumes binomial distributions for both deaths
The main idea behind the testing bias is that it models the odds of a positive case being tested against a negative case being tested. This was directly inspired by the article Upper-Bounds and Testing Biases For the Number of SARS-COVID-19 Infections [4]. Using Bayes' theorem, we can express the relationship between the bias, the odds of being infected, and the odds of testing positive. For example: If positive cases are twice as likely to be tested as negative cases the bias is
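As a small worked illustration of this odds relationship: if the bias is taken to be the ratio P(tested | infected) / P(tested | not infected), Bayes' theorem in odds form gives odds(infected | tested) = bias × odds(infected). The sketch below assumes exactly this definition and, for simplicity, a perfectly accurate test; the function names and the 10% infection rate are hypothetical and not taken from the project's code.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def prob(o):
    """Convert odds back into a probability."""
    return o / (1.0 + o)


def positive_rate_among_tested(infection_rate, bias):
    """Share of positive tests implied by a true infection rate and a bias."""
    return prob(bias * odds(infection_rate))


def infection_rate_from_tests(positive_rate, bias):
    """Invert the relationship: infection rate implied by the test results."""
    return prob(odds(positive_rate) / bias)


# Hypothetical numbers: 10% of the population is infected and infected
# people are twice as likely to be tested as non-infected people.
rate = positive_rate_among_tested(0.10, 2.0)
print(round(rate, 4))  # 0.1818

# Undoing the bias recovers the assumed infection rate.
recovered = infection_rate_from_tests(rate, 2.0)
print(round(recovered, 4))  # 0.1
```

Reading the share of positive tests (about 18%) as the infection rate would overstate the assumed 10%, which is the overestimation of the infected population described above.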
Choices for the prior probabilities for
Due to the decision to model the testing bias as a redundant intercept in
The results are easiest to interpret when compared to the MLE estimates without adjustments for testing biases. The expectation is that the posterior mean IFR for each country
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
## # A tibble: 5 × 5
## location IFR_MLE_unbiased IFR_mean IFR_lower IFR_upper
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Spain 0.0221 0.0293 0.0222 0.0385
## 2 United Kingdom 0.0629 0.0751 0.0628 0.0952
## 3 Germany 0.0173 0.0231 0.0173 0.0303
## 4 Netherlands 0.00854 0.0117 0.00856 0.0155
## 5 Italy 0.0478 0.0588 0.0477 0.0763
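The IFR_MLE_unbiased column can be reproduced from the data set shown earlier. The sketch below assumes that the unbiased estimates are IR = total_cases / total_tests and IFR = total_deaths / (IR × population); this formula is inferred from the reported values rather than quoted from the project's code, and the variable names are illustrative.

```python
# (population, total_tests, total_cases, total_deaths) per country,
# copied from the data set table in this report.
DATA = {
    "Spain": (46745211, 72847345, 6294745, 89405),
    "United Kingdom": (68207114, 368855703, 12774849, 148534),
    "Germany": (83900471, 93337884, 7208790, 112161),
    "Netherlands": (17173094, 22433872, 3229270, 21104),
    "Italy": (60367471, 145453492, 6975465, 138474),
}


def ifr_mle_unbiased(population, total_tests, total_cases, total_deaths):
    """IFR point estimate when tests are treated as a random sample."""
    infection_rate = total_cases / total_tests      # share of positive tests
    estimated_infected = infection_rate * population
    return total_deaths / estimated_infected


ifr = {country: ifr_mle_unbiased(*values) for country, values in DATA.items()}
for country, value in ifr.items():
    print(f"{country:15s} {value:.3g}")
```

Under these assumptions the printed values agree with the IFR_MLE_unbiased column to three significant digits.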
Estimating the true Infection Fatality Rate (IFR) is an ambiguous task for a plethora of reasons and cannot be the goal of this project. This project merely tries to shine a light on the effect of preferential testing and the resulting underestimation of the IFR and its confidence bounds. The next time an IFR figure appears in the media or in scientific work, this may help with interpreting how the data was obtained and what adjustments were made to account for preferential testing. The uncertainty due to the testing bias not only makes estimating the IFR much harder, but may also have serious implications for policy makers. This once again emphasizes the value of reliable data.
The prior choices for
Identifiability and choice of prior
The model is not identifiable because the bias is modeled as a redundant intercept, which makes the results depend strongly on the choice of prior. The resulting posteriors, with heavy tails toward higher values, are a direct result of the chosen half-normal distribution for the bias. Any other choice of prior leads to drastically different results (!).
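The identifiability problem can be made concrete with a toy example. The sketch below assumes that the bias enters as an additive term, log(bias), on the logit scale of the probability that a test is positive; this is an illustrative reading of "redundant intercept", not the project's Stan model. The test counts are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def binom_loglik(k, n, p):
    """Log-probability of k successes in n trials with success probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


def positive_rate(infection_rate, bias):
    """Probability that a test is positive, with the bias as a logit offset."""
    return expit(logit(infection_rate) + math.log(bias))


tests, positives = 1000, 150  # hypothetical counts

# Three different (infection rate, bias) pairs chosen so that
# logit(infection rate) + log(bias) is the same in each case.
target = logit(positives / tests)
pairs = [(expit(target - math.log(bias)), bias) for bias in (1.0, 2.0, 5.0)]

logliks = []
for infection_rate, bias in pairs:
    ll = binom_loglik(positives, tests, positive_rate(infection_rate, bias))
    logliks.append(ll)
    print(f"IR={infection_rate:.4f}  bias={bias:.1f}  log-likelihood={ll:.6f}")
```

All three pairs produce the same log-likelihood even though the implied infection rates differ, so the test data alone cannot separate the infection rate from the bias; in such a model only the prior can do so.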
Better informed priors
If reliable data were available on the degree of preferential testing, a more data-driven approach to quantifying the bias,
Covariates
Estimating the IFR, or even the CFR, is not meaningful without including covariates. Most importantly, it is well known that the IFR changes drastically with the age and prior health record of a subject. Information about a country's health care system and even the current dynamics of the disease itself (time) are also crucial factors [1][3].
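[1] Our World in Data
[2] Data on COVID-19 by Our World in Data
[3] Bayesian adjustment for preferential testing in estimating infection fatality rates, as motivated by the COVID-19 pandemic
[4] Upper-Bounds and Testing Biases For the Number of SARS-COVID-19 Infections
[5] Stan User Guide: Problematic Posteriors due to redundant intercepts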