Most workflows for scaling normalization of ADT data use the geometric mean as the size factor, based on the CLR method used by Stoeckius et al. (2017). This is a simple and pragmatic solution to the problem of composition biases introduced by a minority of high-abundance tags.
Consider a cell $k$ with counts $y_{kt}$ for each tag $t = 1, \ldots, n$. The size factor based on the geometric mean is

$$\sqrt[n]{\prod_{t=1}^{n} y_{kt}} .$$
An obvious issue with the geometric mean is that it is equal to zero when one or more values are zero. As such, we usually add a pseudo-count (typically 1) to ensure that some information is preserved from the non-zero counts. (Alternatively, I suppose we could directly replace zeros with a value of 1, though this is rarely done as it discards the distinction between 0 and 1 in the original data.) This workaround introduces its own bias in the form of a fold-change between the near-zero expected value of the tag and its pseudo-count-based replacement, effectively overestimating the size factor.
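To see the zero-handling problem and the resulting bias numerically, here is a small sketch (illustrative Python/NumPy; the values are arbitrary):

```python
import numpy as np

# A sparse cell: one tag with a count of 8, nine tags with zero counts.
counts = np.array([8.0] + [0.0] * 9)

# The plain geometric mean collapses to zero because of the zeros.
plain_gm = np.prod(counts) ** (1 / counts.size)

# Adding a pseudo-count of 1 preserves information from the non-zero count...
gm_plus1 = np.prod(counts + 1) ** (1 / counts.size)

# ...but the result now reflects the pseudo-count as much as the data,
# inflating the size factor for this mostly-empty cell.
print(plain_gm, gm_plus1)
```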
The "standard" CLR formula for the size factor for cell $k$ is

$$\sqrt[n]{\prod_{t=1}^{n} (y_{kt} + 1)} .$$

In CLRm1, we simply subtract the pseudo-count from this geometric mean to cancel out the bias that it introduces:

$$\sqrt[n]{\prod_{t=1}^{n} (y_{kt} + 1)} - 1 .$$
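As a concrete sketch of the calculation for a tags-by-cells count matrix (illustrative Python/NumPy; `clrm1_size_factors` is a name invented here, not the reference implementation):

```python
import numpy as np

def clrm1_size_factors(counts):
    """CLRm1 size factors for a tags-by-cells count matrix.

    Computes the geometric mean of (y + 1) for each cell on the log
    scale (to avoid overflow in the product), then subtracts the
    pseudo-count from the result.
    """
    counts = np.asarray(counts, dtype=float)
    log_gm = np.log1p(counts).mean(axis=0)  # log geometric mean of y + 1
    return np.expm1(log_gm)                 # exp(log_gm) - 1

# A cell with constant counts c should get a size factor of exactly c.
sf = clrm1_size_factors(np.full((6, 2), [2.0, 10.0]))
print(sf)  # approximately [2., 10.]
```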
Despite its rather ad hoc derivation, the CLRm1 approach works surprisingly well. In a simulation with all-background counts, the CLRm1 size factors accurately reflect the true biases, with deviation comparable to the optimal estimate (i.e., the sum of Poisson-distributed counts).
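A minimal version of such a simulation might look like the following (illustrative Python/NumPy; the parameter ranges are arbitrary choices for this sketch, not those of the original benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tags, n_cells = 50, 200

# True per-cell biases and ambient per-tag abundances (arbitrary ranges).
true_bias = rng.uniform(0.5, 2.0, n_cells)
tag_means = rng.uniform(1.0, 20.0, n_tags)

# All-background counts: Poisson noise around bias * ambient abundance.
counts = rng.poisson(np.outer(tag_means, true_bias))

# CLRm1 size factors: geometric mean of (y + 1) per cell, minus 1.
sf = np.expm1(np.log1p(counts).mean(axis=0))

# The estimated factors should track the true biases closely.
corr = np.corrcoef(sf, true_bias)[0, 1]
print(round(corr, 3))
```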
Let's randomly choose a single tag for each cell and increase its abundance 100-fold to introduce some composition biases. CLRm1's performance advantage over the standard approach is still present:
We add even more composition biases by randomly choosing 0, 1 or 2 tags for each cell and increasing their abundance 100-fold. We continue to see an improvement for CLRm1 over the standard method, albeit reduced:
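The spike-in step described above can be sketched as follows (again illustrative Python/NumPy with arbitrary parameters, not the original benchmark code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tags, n_cells = 50, 200
true_bias = rng.uniform(0.5, 2.0, n_cells)
tag_means = rng.uniform(1.0, 20.0, n_tags)
means = np.outer(tag_means, true_bias)

# Upregulate 0, 1 or 2 randomly chosen tags per cell by 100-fold
# to introduce composition biases.
for cell in range(n_cells):
    n_up = rng.integers(0, 3)  # draws 0, 1 or 2
    chosen = rng.choice(n_tags, size=n_up, replace=False)
    means[chosen, cell] *= 100

counts = rng.poisson(means)

# Size factors under the standard CLR (geometric mean of y + 1)
# and under CLRm1 (the same, minus the pseudo-count).
log_gm = np.log1p(counts).mean(axis=0)
standard = np.exp(log_gm)
clrm1 = np.expm1(log_gm)
print(np.corrcoef(clrm1, true_bias)[0, 1])
```

Comparing each set of factors against `true_bias` (e.g., after scaling both to unit mean) quantifies how much each method deviates under composition bias.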
Let us consider two cells $k$ and $k'$, where the counts of $k'$ are those of $k$ scaled by some factor $a$, i.e., $y_{k't} = ay_{kt}$ for all $t$. Ideally, the ratio of the size factor for $k'$ to that for $k$ would be equal to $a$. This holds for CLRm1 in several scenarios:
- In the case where $y_{kt}$ is equal to some constant $c_k$ for all $t$, the CLRm1 size factor simplifies to $(c_k + 1) - 1$ for $k$ and $(ac_k + 1) - 1$ for $k'$. The ratio of the size factors will be equal to $a$.
- A generalization of the previous point involves approximating the geometric mean with the arithmetic mean. This approximation is satisfactory if the variance of $y_{kt}$ is low relative to the mean (see Equation 31 and the related discussion in Rodin, 2014). Doing so simplifies the CLRm1 factor to $n^{-1}\sum_t(y_{kt} + 1) - 1$ for cell $k$ and $n^{-1}\sum_t(ay_{kt} + 1) - 1$ for cell $k'$, again yielding a ratio of $a$.
- If all $y_{kt}$ and $ay_{kt}$ are much greater than 1, the addition and subtraction of the pseudo-count can be ignored entirely. The pseudo-counts cancel out in the ratio of the size factors, leaving us with $a$.
- In the rare case that all $y_{kt}$ are much less than 1, we can approximate $\prod_t (1 + y_{kt}) \approx 1 + \sum_t y_{kt}$. We can further approximate $\sqrt[n]{1 + z} \approx 1 + zn^{-1}$ when $z$ is close to zero. This allows us to obtain a size factor of $(1 + n^{-1}\sum_t y_{kt}) - 1$ for $k$ and $(1 + an^{-1}\sum_t y_{kt}) - 1$ for $k'$, and the ratio again simplifies to $a$.
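The first case above is easy to verify numerically (plain Python; $n$, $c_k$ and $a$ are arbitrary choices):

```python
import math

def clrm1(counts):
    # Geometric mean of (y + 1) across tags, minus the pseudo-count.
    log_gm = sum(math.log1p(y) for y in counts) / len(counts)
    return math.expm1(log_gm)

n, c, a = 10, 5.0, 3.0       # arbitrary choices
sf_k = clrm1([c] * n)        # simplifies to (c + 1) - 1 = c
sf_kp = clrm1([a * c] * n)   # simplifies to (a*c + 1) - 1 = a*c
print(sf_kp / sf_k)          # approximately a
```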
This analysis suggests that our approach will deteriorate when the variance of $y_{kt}$ across tags is high, as none of the approximations above can then be relied upon to cancel out the pseudo-count.
Our hope is that this does not happen too frequently in real data, as there should not be large fluctuations in the ambient concentrations of different tags. At the very least, users can eliminate unnecessary variability by removing uninformative all-zero rows from the count matrix before using CLRm1. Perhaps even more protection could be gained by trimming away the tags with the most extreme average abundances across all cells, though this must be weighed against the loss of precision of the size factor estimates when the number of tags is decreased.
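Removing all-zero rows is a one-liner in most frameworks; for example (illustrative Python/NumPy):

```python
import numpy as np

# Toy tags-by-cells count matrix with one uninformative all-zero row.
counts = np.array([
    [3, 0, 5],
    [0, 0, 0],  # all-zero tag: contributes only the pseudo-count
    [1, 2, 0],
])

# Keep only tags observed in at least one cell before running CLRm1.
filtered = counts[counts.sum(axis=1) > 0, :]
print(filtered.shape)  # (2, 3)
```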
Of course, differentially abundant tags will also introduce variation in $y_{kt}$ across tags, which explains the reduced improvement of CLRm1 in the simulations with stronger composition biases.
This result is not surprising, as some loss of accuracy is to be expected from composition bias. Indeed, it demonstrates a fundamental weakness of geometric mean-based methods: the composition bias is only dampened, with no attempt to explicitly ignore or remove the offending tags à la robust ratio-based methods like DESeq normalization or edgeR's TMM. Unfortunately, the latter are unreliable on sparse data, and the workarounds to reduce sparsity are tedious, e.g., pre-clustering or deconvolution (Lun et al., 2016).
CLRm1 is appealing as it can be easily calculated with good performance in the presence of minor composition biases. For datasets with strong composition biases... well, at least CLRm1 isn't any worse than the standard method.
The CLRm1 procedure itself is so simple that it is barely worth providing a reference implementation. Nonetheless, we provide some code for easy vendoring into other applications:
- Base R.
- R with `DelayedArray` objects to avoid copies.
- C++ using tatami.
Stoeckius M, Hafemeister C, Stephenson W, et al. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14, 865-868.
Rodin B (2014). Variance and the Inequality of Arithmetic and Geometric Means. arXiv doi:10.48550/arXiv.1409.0162.
Lun ATL, Bach K, Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology 17, 75.