From 62bfc064152fac1b549c80bf956ba386b1e2c20c Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Mon, 26 Aug 2024 11:47:47 -0400 Subject: [PATCH 01/18] starting to update methods, part 1 --- reproducibility/manuscript/manuscript.qmd | 88 ++++++++++++++++++----- 1 file changed, 71 insertions(+), 17 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 6002915f0..685530f48 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -497,35 +497,89 @@ fate potency are reported in the titles. We assume the dynamical gene expression is determined by the RNA splicing process, and infer the unspliced and spliced gene expression level from the -differential equations proposed in velocyto [@La_Manno2018-lj] and scVelo +ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] and scVelo [@Bergen2020-pj] \begin{align} -\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right), - \label{eq-dudt}\\ -\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right) - -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt} +\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\ + \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt} +\end{align} +where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows: +\begin{align} +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}, \label{eq-dsdt}. \end{align} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, -$\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)$ +$\left( u_{cg}(t), s_{cg}(t) \right)$ are the unspliced and spliced expression functions given the -change of time per cell and gene. $\tau_{cg}$ represents the displacement of time per +change of time per cell and gene. +We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown activation state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times $\tau_{cg}$, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. +Importantly, we recognize that the initial conditions are in fact unknown. + +We propose and study two models: Model 1 assumes that spliced and unspliced concentrations are both 0 at time 0; Model 2 considers these initial conditions as unknowns with a log-Normal prior distribution. In general, the solution space of ODEs becomes much richer when considered over a domain of initial conditions (as opposed to a single point); indeed, this affords Model 2 much greater expressivity. For clarity, we first present the generative framework for both models, then provide further interpretation and intuition. + +First, we introduce the generative model that describes the various unobserved times: +\begin{align} + % unit lognormal t_c + t_c &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_0 + t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\ + % switching time + \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_1 + t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\ + %cell-gene-specific activation state + k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\ + % cell-gene-specific latent time + \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}). +\end{align} +Here, $\tau_{cg}$ represents the displacement of time per cell and gene with \begin{align} - \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\ - & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), + textrm{softplus}(t) := \log( 1 + e^t). \end{align} -in which $t_c$ is the shared time per cell, ${t_{0}^{(kcg)}}_g$ is the +Recall that $t_c$ is the shared time per cell, $t^{(k_{cg})}_{0,g}$ is the gene-specific switching time. Each cell and gene combination has its transcriptional state $k_{cg} \in \{ 0, 1 \}$, where $0$ indicates the activation state and $1$ indicates the expression state. Each gene has two -switching times for representing activation and repression: ${t_{0}^{(0)}}_g$ is +switching times for representing activation and repression: $t^{(0)}_{0,g}$ is the first switching time corresponding to when the gene expression starts to be -activated, ${t_0^{(1)}}_g$ is the second switching time corresponding to when -the gene expression starts to be repressed. We note that $\alpha^{(1)}$ is -shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for -each gene. +activated, $t^{(1)}_{0,g}$ is the second switching time corresponding to when +the gene expression starts to be repressed. + +Next we introduce the priors for the splicing parameters (where the activation rate $\alpha$ depends on the activation state $k_{cg}$ from above): +\begin{align} + \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\ + \beta_g &\sim \text{LogNormal}(0, 1) \\ + \gamma_g &\sim \text{LogNormal}(0, 1) \\ + \alpha_{cg} = \begin{cases} + \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ + 0 & \text{if } k_{cg} = 1 + \end{cases} +\end{align} +\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for +each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?} + +Now, we describe the priors for the initial conditions, noting that this is the only difference between Model 1 and Model 2: +\begin{align} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases} + (0, 0) & \text{Model 1} \\ + (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} + \end{cases} + + u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ + \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 + \end{cases} +\end{align} + +Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts: +\begin{align} + \hat{u}_{cg}, \hat{s}_{cg} &= \textrm{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\ + u_{cg} &\sim \text{Poisson}(\hat{u}_{cg}) \\ + s_{cg} &\sim \text{Poisson}(\hat{s}_{cg}) +\end{align} + + The analytic solution of the differential equations to predict spliced and unspliced gene expression given their parameters is derived by the From 63c188ba21d384ed75d489573fff4c1fbf9a1cfa Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Mon, 26 Aug 2024 14:36:37 -0400 Subject: [PATCH 02/18] starting to update methods, part 2 --- reproducibility/manuscript/manuscript.qmd | 108 ++++------------------ 1 file changed, 19 insertions(+), 89 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 685530f48..fe7141724 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -535,7 +535,7 @@ First, we introduce the generative model that describes the various unobserved t Here, $\tau_{cg}$ represents the displacement of time per cell and gene with \begin{align} - textrm{softplus}(t) := \log( 1 + e^t). + \text{softplus}(t) := \log( 1 + e^t). \end{align} Recall that $t_c$ is the shared time per cell, $t^{(k_{cg})}_{0,g}$ is the gene-specific switching time. Each cell and gene combination has its @@ -551,7 +551,7 @@ Next we introduce the priors for the splicing parameters (where the activation r \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\ \beta_g &\sim \text{LogNormal}(0, 1) \\ \gamma_g &\sim \text{LogNormal}(0, 1) \\ - \alpha_{cg} = \begin{cases} + \alpha_{cg} &= \begin{cases} \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ 0 & \text{if } k_{cg} = 1 \end{cases} @@ -574,13 +574,25 @@ Now, we describe the priors for the initial conditions, noting that this is the Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts: \begin{align} - \hat{u}_{cg}, \hat{s}_{cg} &= \textrm{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\ - u_{cg} &\sim \text{Poisson}(\hat{u}_{cg}) \\ - s_{cg} &\sim \text{Poisson}(\hat{s}_{cg}) + \hat{u}_{cg}, \hat{s}_{cg} &= \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\ + \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\ + \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\ + \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\ + \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\ + \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\ + \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\ + \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\ + \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\ + \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\ + \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) \end{align} +Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts +for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to +denote our generative model's prediction of these unspliced and spliced expression levels. +The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_cg, \hat{s}_cg$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts. +\textbf{Improve descriptions of how noise is modeled in the observation model.} - - +\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.} The analytic solution of the differential equations to predict spliced and unspliced gene expression given their parameters is derived by the authors of scVelo and a theoretical RNA velocity study @@ -650,88 +662,6 @@ s\left(\tau^{(1)}\right)=s_0^{(1)}{ }_g e^{-\gamma_g \tau^{(1)}} +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}. \end{align} -We use these solutions to formulate an end-to-end probabilistic generative model -that relates prior distributions on kinetic parameters to a distribution on pairs -of observed unspliced and spliced read count matrices - -\begin{align} -\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\ -\beta_g &\sim \operatorname{LogNormal}(0,1), \\ -\gamma_g &\sim \operatorname{LogNormal}(0,1), \\ -&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\ -t_0^{\left(k_{c g}\right)} &= \left\{ - \begin{array}{l} - t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\ - t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\ - \quad k_{c g}=1 - \end{array}\right. \\ -t_c &\sim \operatorname{LogNormal}(0,1), \\ -k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\ -\tau^{\left(k_{c g}\right)} - &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\ -u_{c g} - &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; - u_{c g}^{obs}\right), \\ -s_{c g} - &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; - s_{c g}^{obs}\right). -\end{align} -$u\left(\tau^{\left(k_{c g}\right)}\right)$ and $s\left(\tau^{(k_{c g})}\right)$ are -are called the denoised gene expression calculated from the velocity analytic -solution input with the kinetics random variables. $u_{cg}$ and $s_{cg}$ are the spliced -and unspliced read count sampled from the Poisson models. $u_{cg}^{obs}$ and $s_{cg}^{obs}$ are -the observed spliced and unspliced read count tables. The generative process - -$\text{Measurement}(\cdot)$ for observed unspliced read counts given denoised unspliced -gene transcript expression level $u\left(\tau^{(k_{cg})}\right)$ -(and identical for observed spliced read counts) models the expected number of observed -reads for a given gene in a given cell as the number of transcripts times the ratio of -the cell's total reads to total transcripts -\begin{align} -u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\ -\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\ -\eta_c^{(u)} &\sim \operatorname{Normal}\left( - u_c^{\hat{obs}_c}, - \operatorname{std} \left(u_c^{\hat{obs}}\right) - \right), \\ -\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right) - +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\ -u_{c g}^{obs} &\sim - \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right). -\end{align} - -For the first Pyro-Velocity model (Model 1), we constrain the shared time to be strictly -larger than $t_{0}^{(0)}$ by introducing auxiliary random variables -$$ -\text{t\_constraint}_{cg} - \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right), -$$ -and setting their values to $1$, and we set the initial condition per gene to be -\begin{align} -\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{ - \begin{array}{l} - (0,0), k_{c g}=0 \\ - \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\ - \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\ - \quad k_{c g}=1 - \end{array}\right. -\end{align} -For the extended Pyro -Velocity model (Model 2), we remove the shared -time constraint $\text{t\_constraint}_{cg}$, thus allowing a time lag per gene -that might be caused by delayed gene activation and set the initial condition -per gene as random variables that are strictly positive $\left({u_{0}^{(0)}}_g, -{s_{0}^{(0)}}_g\right)$, which allow genes having a basal expression level before gene -activation. Then, we compute the gene expression at the second switching time as -\begin{align} -({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= - \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\ -& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg), -\end{align} -which shares the same initial condition $\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)$ where -\begin{align} -{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\ -{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1). -\end{align} ## Variational inference {#sec-methods-inference} From f327453a6a570cd59c2a0144c029cb061adcc9e8 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:04:50 -0400 Subject: [PATCH 03/18] fix(manuscript): remove paragraph break in align env Signed-off-by: Cameron Smith --- reproducibility/manuscript/manuscript.qmd | 1 - 1 file changed, 1 deletion(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index fe7141724..ebf76594d 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -565,7 +565,6 @@ Now, we describe the priors for the initial conditions, noting that this is the (0, 0) & \text{Model 1} \\ (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} \end{cases} - u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 From 1d6b1e21766ea8880921eaa63042e9abdb6bf224 Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Mon, 26 Aug 2024 15:21:36 -0400 Subject: [PATCH 04/18] latex fixes and updates to methods. ready to be read. --- reproducibility/manuscript/manuscript.qmd | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index ebf76594d..b40104d70 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -506,7 +506,7 @@ ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] an where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows: \begin{align} \frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\ - \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}, \label{eq-dsdt}. + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}. \end{align} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, $\left( u_{cg}(t), s_{cg}(t) \right)$ @@ -543,8 +543,12 @@ transcriptional state $k_{cg} \in \{ 0, 1 \}$, where $0$ indicates the activation state and $1$ indicates the expression state. Each gene has two switching times for representing activation and repression: $t^{(0)}_{0,g}$ is the first switching time corresponding to when the gene expression starts to be -activated, $t^{(1)}_{0,g}$ is the second switching time corresponding to when -the gene expression starts to be repressed. +activated, $t^{(1)}_{0,g}$ is the second switching time corresponding to when +the gene expression starts to be repressed, and is determined by the first +switching time and the gene-specific switching time $\Delta \text{switching}_g$. +The cell-gene-specific activation state $k_{cg}$ is a Bernoulli random variable +with logits equal to the difference between the cell's shared time $t_c$ and the time $t^{(1)}_{0,g}$ when the gene expression starts to be repressed. + Next we introduce the priors for the splicing parameters (where the activation rate $\alpha$ depends on the activation state $k_{cg}$ from above): \begin{align} @@ -564,16 +568,20 @@ Now, we describe the priors for the initial conditions, noting that this is the \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases} (0, 0) & \text{Model 1} \\ (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} - \end{cases} + \end{cases} \\ u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 \end{cases} \end{align} -Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts: +We define the ODE solution at time $\tau_{cg}$ as: +\begin{equation} + \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big). +\end{equation} + +Next, we define the observation model that gives rise to the observed counts as: \begin{align} - \hat{u}_{cg}, \hat{s}_{cg} &= \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\ \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\ \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\ \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\ @@ -588,7 +596,7 @@ Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation m Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to denote our generative model's prediction of these unspliced and spliced expression levels. -The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_cg, \hat{s}_cg$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts. +The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_{cg}, \hat{s}_{cg}$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts. \textbf{Improve descriptions of how noise is modeled in the observation model.} \textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.} From b85530633440ca67a0267795eb949586a2c8abc8 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:28:46 -0400 Subject: [PATCH 05/18] chore(manuscript): update v2 diff file --- reproducibility/manuscript/v2.tex | 228 ++++++++++++++++-------------- 1 file changed, 122 insertions(+), 106 deletions(-) diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex index b69d93566..914342c02 100644 --- a/reproducibility/manuscript/v2.tex +++ b/reproducibility/manuscript/v2.tex @@ -661,34 +661,133 @@ \subsection{Model formulation}\label{sec-methods-model} We assume the dynamical gene expression is determined by the RNA splicing process, and infer the unspliced and spliced gene expression -level from the differential equations proposed in velocyto -\citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} \begin{align} -\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right), - \label{eq-dudt}\\ -\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right) - -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt} +level from the ordinary differential equation (ODEs) proposed in +velocyto \citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} +\begin{align} +\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\ + \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt} +\end{align} where \(u(t), s(t)\) are the unspliced and spliced +expression levels of a gene at time \(t\) under a transcription rate +\(\alpha(t)\) with possible temporal dependence, splicing rate +\(\beta\), and degradation rate \(\gamma\). We specify this model to a +setting that depends on cell \(c\) and gene \(g\) as follows: +\begin{align} +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}. \end{align} In the equation, the subscript \(c\) is the cell dimension, -\(g\) is the gene dimension, -\(\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)\) -are the unspliced and spliced expression functions given the change of -time per cell and gene. \(\tau_{cg}\) represents the displacement of -time per cell and gene with \begin{align} - \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\ - & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), -\end{align} in which \(t_c\) is the shared time per cell, -\({t_{0}^{(kcg)}}_g\) is the gene-specific switching time. Each cell and -gene combination has its transcriptional state +\(g\) is the gene dimension, \(\left( u_{cg}(t), s_{cg}(t) \right)\) are +the unspliced and spliced expression functions given the change of time +per cell and gene. We restrict attention to piecewise-constant +\(\alpha_{cg}(t)\) to capture gene-specific activation and repression. +We take special care to model a gene- and cell-specific switching time +that marks a single transition from activation to repression by +introducing a Bernoulli variable \(k_{cg}\) to model unknown activation +state. We assume our cell-by-gene data-matrix arrive as observations of +Poisson-counts related to the solution of the above ODEs at unknown +times \(\tau_{cg}\), which is modeled as a relationship between an +unknown latent time shared across each cell, \(t_c\), and unknown +gene-specific time-offsets \(t_{0,g}\) where all read counts for a +single cell occurred at an unknown, but shared latent time \(t_c\). +These relative times are also used to parametrize the Bernoulli process +for \(k_{cg}\). Importantly, we recognize that the initial conditions +are in fact unknown. + +We propose and study two models: Model 1 assumes that spliced and +unspliced concentrations are both 0 at time 0; Model 2 considers these +initial conditions as unknowns with a log-Normal prior distribution. In +general, the solution space of ODEs becomes much richer when considered +over a domain of initial conditions (as opposed to a single point); +indeed, this affords Model 2 much greater expressivity. For clarity, we +first present the generative framework for both models, then provide +further interpretation and intuition. + +First, we introduce the generative model that describes the various +unobserved times: \begin{align} + % unit lognormal t_c + t_c &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_0 + t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\ + % switching time + \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_1 + t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\ + %cell-gene-specific activation state + k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\ + % cell-gene-specific latent time + \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}). +\end{align} Here, \(\tau_{cg}\) represents the displacement of time per +cell and gene with \begin{align} + \text{softplus}(t) := \log( 1 + e^t). +\end{align} Recall that \(t_c\) is the shared time per cell, +\(t^{(k_{cg})}_{0,g}\) is the gene-specific switching time. Each cell +and gene combination has its transcriptional state \(k_{cg} \in \{ 0, 1 \}\), where \(0\) indicates the activation state and \(1\) indicates the expression state. Each gene has two switching -times for representing activation and repression: \({t_{0}^{(0)}}_g\) is +times for representing activation and repression: \(t^{(0)}_{0,g}\) is the first switching time corresponding to when the gene expression -starts to be activated, \({t_0^{(1)}}_g\) is the second switching time -corresponding to when the gene expression starts to be repressed. We -note that \(\alpha^{(1)}\) is shared for all the genes, while -\({\alpha^{(0)}}_g\) is learned independently for each gene. +starts to be activated, \(t^{(1)}_{0,g}\) is the second switching time +corresponding to when the gene expression starts to be repressed, and is +determined by the first switching time and the gene-specific switching +time \(\Delta \text{switching}_g\). The cell-gene-specific activation +state \(k_{cg}\) is a Bernoulli random variable with logits equal to the +difference between the cell's shared time \(t_c\) and the time +\(t^{(1)}_{0,g}\) when the gene expression starts to be repressed. + +Next we introduce the priors for the splicing parameters (where the +activation rate \(\alpha\) depends on the activation state \(k_{cg}\) +from above): \begin{align} + \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\ + \beta_g &\sim \text{LogNormal}(0, 1) \\ + \gamma_g &\sim \text{LogNormal}(0, 1) \\ + \alpha_{cg} &= \begin{cases} + \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ + 0 & \text{if } k_{cg} = 1 + \end{cases} +\end{align} +\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for +each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?} + +Now, we describe the priors for the initial conditions, noting that this +is the only difference between Model 1 and Model 2: \begin{align} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases} + (0, 0) & \text{Model 1} \\ + (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} + \end{cases} \\ + u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ + \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 + \end{cases} +\end{align} +We define the ODE solution at time \(\tau_{cg}\) as: \begin{equation} + \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big). +\end{equation} + +Next, we define the observation model that gives rise to the observed +counts as: \begin{align} + \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\ + \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\ + \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\ + \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\ + \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\ + \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\ + \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\ + \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\ + \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\ + \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) +\end{align} Here, we use +\({u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}\) to denote the +observed unspliced and spliced counts for cell \(c\) and gene \(g\). We +use \(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}\) to +denote our generative model's prediction of these unspliced and spliced +expression levels. The generative process for modeling these observed +read counts given denoised gene transcript expression level +\(\hat{u}_{cg}, \hat{s}_{cg}\) considers the expected number of observed +reads for a given gene in a given cell as the number of transcripts +times the ratio of the cell's total reads to total transcripts. +\textbf{Improve descriptions of how noise is modeled in the observation model.} + +\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.} The analytic solution of the differential equations to predict spliced and unspliced gene expression given their parameters is derived by the authors of scVelo and a theoretical RNA velocity study @@ -753,89 +852,6 @@ \subsection{Model formulation}\label{sec-methods-model} +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}. \end{align} -We use these solutions to formulate an end-to-end probabilistic -generative model that relates prior distributions on kinetic parameters -to a distribution on pairs of observed unspliced and spliced read count -matrices - -\begin{align} -\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\ -\beta_g &\sim \operatorname{LogNormal}(0,1), \\ -\gamma_g &\sim \operatorname{LogNormal}(0,1), \\ -&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\ -t_0^{\left(k_{c g}\right)} &= \left\{ - \begin{array}{l} - t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\ - t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\ - \quad k_{c g}=1 - \end{array}\right. \\ -t_c &\sim \operatorname{LogNormal}(0,1), \\ -k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\ -\tau^{\left(k_{c g}\right)} - &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\ -u_{c g} - &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; - u_{c g}^{obs}\right), \\ -s_{c g} - &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; - s_{c g}^{obs}\right). -\end{align} \(u\left(\tau^{\left(k_{c g}\right)}\right)\) and -\(s\left(\tau^{(k_{c g})}\right)\) are are called the denoised gene -expression calculated from the velocity analytic solution input with the -kinetics random variables. \(u_{cg}\) and \(s_{cg}\) are the spliced and -unspliced read count sampled from the Poisson models. \(u_{cg}^{obs}\) -and \(s_{cg}^{obs}\) are the observed spliced and unspliced read count -tables. The generative process - -\(\text{Measurement}(\cdot)\) for observed unspliced read counts given -denoised unspliced gene transcript expression level -\(u\left(\tau^{(k_{cg})}\right)\) (and identical for observed spliced -read counts) models the expected number of observed reads for a given -gene in a given cell as the number of transcripts times the ratio of the -cell's total reads to total transcripts \begin{align} -u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\ -\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\ -\eta_c^{(u)} &\sim \operatorname{Normal}\left( - u_c^{\hat{obs}_c}, - \operatorname{std} \left(u_c^{\hat{obs}}\right) - \right), \\ -\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right) - +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\ -u_{c g}^{obs} &\sim - \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right). -\end{align} - -For the first Pyro-Velocity model (Model 1), we constrain the shared -time to be strictly larger than \(t_{0}^{(0)}\) by introducing auxiliary -random variables \[ -\text{t\_constraint}_{cg} - \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right), -\] and setting their values to \(1\), and we set the initial condition -per gene to be \begin{align} -\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{ - \begin{array}{l} - (0,0), k_{c g}=0 \\ - \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\ - \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\ - \quad k_{c g}=1 - \end{array}\right. -\end{align} For the extended Pyro -Velocity model (Model 2), we remove -the shared time constraint \(\text{t\_constraint}_{cg}\), thus allowing -a time lag per gene that might be caused by delayed gene activation and -set the initial condition per gene as random variables that are strictly -positive \(\left({u_{0}^{(0)}}_g, -{s_{0}^{(0)}}_g\right)\), which allow genes having a basal expression -level before gene activation. Then, we compute the gene expression at -the second switching time as \begin{align} -({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= - \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\ -& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg), -\end{align} which shares the same initial condition -\(\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)\) where \begin{align} -{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\ -{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1). -\end{align} - \subsection{Variational inference}\label{sec-methods-inference} Given observations From bedcc1e26ded0f2bdf7a0331b192a14072317c49 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:39:12 -0400 Subject: [PATCH 06/18] fix(ci): include log files in manuscript artifacts Signed-off-by: Cameron Smith --- .github/workflows/manuscript.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml index 3f5b1be87..6bd88a198 100644 --- a/.github/workflows/manuscript.yaml +++ b/.github/workflows/manuscript.yaml @@ -62,3 +62,4 @@ jobs: reproducibility/manuscript/v2*.* reproducibility/manuscript/*.bib reproducibility/manuscript/*.dvc + reproducibility/manuscript/*.log From 767d18931f65a8e9d37526cea271f9df623db715 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:44:10 -0400 Subject: [PATCH 07/18] test(manuscript): disable latexdiff bibtex diff Signed-off-by: Cameron Smith --- reproducibility/manuscript/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile index 2b75a9b59..21c3c29cb 100644 --- a/reproducibility/manuscript/Makefile +++ b/reproducibility/manuscript/Makefile @@ -108,12 +108,12 @@ cache-tex: copy-tex # run as make latexdiff COMPARISON_SHA1="commit-hash" # to diff with a different commit # see: https://gitlab.com/git-latexdiff/git-latexdiff +# --bibtex latexdiff: git-latexdiff \ --ignore-latex-errors \ --latexmk \ --xelatex \ - --bibtex \ --no-view \ --ignore-makefile \ --ln-untracked \ From a835cd29ce8b368b00fec5455debba0ff09f343e Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:58:47 -0400 Subject: [PATCH 08/18] fix(ci): include tmp log files in manuscript artifacts Signed-off-by: Cameron Smith --- .github/workflows/manuscript.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml index 6bd88a198..f9614b382 100644 --- a/.github/workflows/manuscript.yaml +++ b/.github/workflows/manuscript.yaml @@ -63,3 +63,4 @@ jobs: reproducibility/manuscript/*.bib reproducibility/manuscript/*.dvc reproducibility/manuscript/*.log + /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log From 10528ca171677ac699a353eac1053a1bd1a52490 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 15:59:07 -0400 Subject: [PATCH 09/18] test(manuscript): disable latexdiff tmp cleanup to save log file Signed-off-by: Cameron Smith --- reproducibility/manuscript/Makefile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile index 21c3c29cb..6e0ed8929 100644 --- a/reproducibility/manuscript/Makefile +++ b/reproducibility/manuscript/Makefile @@ -108,15 +108,16 @@ cache-tex: copy-tex # run as make latexdiff COMPARISON_SHA1="commit-hash" # to diff with a different commit # see: https://gitlab.com/git-latexdiff/git-latexdiff -# --bibtex latexdiff: git-latexdiff \ --ignore-latex-errors \ --latexmk \ --xelatex \ + --bibtex \ --no-view \ --ignore-makefile \ --ln-untracked \ + --cleanup none \ --output $(TEX_FILE)-$(COMPARISON_SHA1)-diff.pdf \ --main $(TEX_FILE).tex $(COMPARISON_SHA1) HEAD From fa824bb394d152636df1792745cecfee913d1949 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 16:10:42 -0400 Subject: [PATCH 10/18] chore(manuscript): update v2 diff file --- reproducibility/manuscript/v2.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex index 914342c02..7c08cd37e 100644 --- a/reproducibility/manuscript/v2.tex +++ b/reproducibility/manuscript/v2.tex @@ -156,6 +156,7 @@ \raggedbottom \usepackage{nameref} \usepackage{placeins} +\usepackage[utf8]{inputenc} \usepackage[labelfont=bf]{caption} % \usepackage{showframe} % \usepackage{layout} From 86cbd30e7727d9308ee41cd6247019d614ee817d Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 16:14:24 -0400 Subject: [PATCH 11/18] chore(manuscript): update v2 diff file --- reproducibility/manuscript/v2.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex index 7c08cd37e..45996fbdb 100644 --- a/reproducibility/manuscript/v2.tex +++ b/reproducibility/manuscript/v2.tex @@ -157,6 +157,7 @@ \usepackage{nameref} \usepackage{placeins} \usepackage[utf8]{inputenc} +\usepackage{fontspec} \usepackage[labelfont=bf]{caption} % \usepackage{showframe} % \usepackage{layout} From 7e0d2bfd70ad45a009b1b6d79b812b1da79909da Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 16:15:05 -0400 Subject: [PATCH 12/18] fix(manuscript): support unicode thin space in tex Signed-off-by: Cameron Smith --- reproducibility/manuscript/header.tex | 2 ++ 1 file changed, 2 insertions(+) diff --git a/reproducibility/manuscript/header.tex b/reproducibility/manuscript/header.tex index 0fa6ebe3f..04d22655a 100644 --- a/reproducibility/manuscript/header.tex +++ b/reproducibility/manuscript/header.tex @@ -1,5 +1,7 @@ \usepackage{nameref} \usepackage{placeins} +\usepackage[utf8]{inputenc} +\usepackage{fontspec} \usepackage[labelfont=bf]{caption} % \usepackage{showframe} % \usepackage{layout} From 53f08f24adaf0d64cab278e6349295e89f1c9a5f Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 16:34:31 -0400 Subject: [PATCH 13/18] fix(ci): note artifact path to include latexdiff log Signed-off-by: Cameron Smith --- .github/workflows/manuscript.yaml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml index f9614b382..ec2bd9d37 100644 --- a/.github/workflows/manuscript.yaml +++ b/.github/workflows/manuscript.yaml @@ -57,10 +57,12 @@ jobs: uses: actions/upload-artifact@834a144ee995460fba8ed112a2fc961b36a5ec5a # v4 with: name: manuscript-${{ github.ref_name }}-${{ github.sha }} + # Include + # /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log + # to capture latexdiff log file path: | reproducibility/manuscript/manuscript.* reproducibility/manuscript/v2*.* reproducibility/manuscript/*.bib reproducibility/manuscript/*.dvc reproducibility/manuscript/*.log - /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log From bbbca8a65b014f484fa227a5f8ebd154c4cf6412 Mon Sep 17 00:00:00 2001 From: Cameron Smith Date: Mon, 26 Aug 2024 16:35:08 -0400 Subject: [PATCH 14/18] fix(manuscript): note how to retain latexdiff log Signed-off-by: Cameron Smith --- reproducibility/manuscript/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile index 6e0ed8929..8eb6fde05 100644 --- a/reproducibility/manuscript/Makefile +++ b/reproducibility/manuscript/Makefile @@ -108,6 +108,7 @@ cache-tex: copy-tex # run as make latexdiff COMPARISON_SHA1="commit-hash" # to diff with a different commit # see: https://gitlab.com/git-latexdiff/git-latexdiff +# Set `--cleanup none` to keep the tmp files including the latex log latexdiff: git-latexdiff \ --ignore-latex-errors \ @@ -117,7 +118,6 @@ latexdiff: --no-view \ --ignore-makefile \ --ln-untracked \ - --cleanup none \ --output $(TEX_FILE)-$(COMPARISON_SHA1)-diff.pdf \ --main $(TEX_FILE).tex $(COMPARISON_SHA1) HEAD From d2492f481c8b06276de107869ee57625b583febe Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Thu, 29 Aug 2024 12:54:33 -0400 Subject: [PATCH 15/18] updated and tersened analytic solutions methods sub-section --- reproducibility/manuscript/manuscript.qmd | 104 ++++++---------------- 1 file changed, 26 insertions(+), 78 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index b40104d70..438c393b5 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -499,21 +499,27 @@ We assume the dynamical gene expression is determined by the RNA splicing process, and infer the unspliced and spliced gene expression level from the ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] and scVelo [@Bergen2020-pj] +\begin{equation} + \label{eq-rna} \begin{align} -\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\ - \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt} +\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \\ + \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \end{align} +\end{equation} where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows: +\begin{equation} + \label{eq-rna-cg} \begin{align} -\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\ - \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}. +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}. \end{align} +\end{equation} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, $\left( u_{cg}(t), s_{cg}(t) \right)$ are the unspliced and spliced expression functions given the change of time per cell and gene. -We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown activation state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times $\tau_{cg}$, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. -Importantly, we recognize that the initial conditions are in fact unknown. +We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown transcription state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. +Importantly, we recognize that the initial conditions are, in fact, unknown. We propose and study two models: Model 1 assumes that spliced and unspliced concentrations are both 0 at time 0; Model 2 considers these initial conditions as unknowns with a log-Normal prior distribution. In general, the solution space of ODEs becomes much richer when considered over a domain of initial conditions (as opposed to a single point); indeed, this affords Model 2 much greater expressivity. For clarity, we first present the generative framework for both models, then provide further interpretation and intuition. @@ -557,11 +563,9 @@ Next we introduce the priors for the splicing parameters (where the activation r \gamma_g &\sim \text{LogNormal}(0, 1) \\ \alpha_{cg} &= \begin{cases} \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ - 0 & \text{if } k_{cg} = 1 + 0 & \text{if } k_{cg} = 1. \end{cases} \end{align} -\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for -each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?} Now, we describe the priors for the initial conditions, noting that this is the only difference between Model 1 and Model 2: \begin{align} @@ -597,78 +601,22 @@ Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the ob for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to denote our generative model's prediction of these unspliced and spliced expression levels. The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_{cg}, \hat{s}_{cg}$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts. -\textbf{Improve descriptions of how noise is modeled in the observation model.} - -\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.} -The analytic solution of the differential equations to predict -spliced and unspliced gene expression given their parameters is derived by the -authors of scVelo and a theoretical RNA velocity study -[@Bergen2020-pj;@Li2021-qa] and given in -Eqs. \ref{eq-solution-u}-\ref{eq-solution-s2}. -\begin{align} -u\left(\tau^{\left(k_{c g}\right)}\right) - &= u_0^{\left(k_{c g}\right)}{ }_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} - \nonumber \\ -&\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}} - {\beta_g}\left(1-e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right) - \label{eq-solution-u}\\ -s\left(\tau^{\left(k_{c g}\right)}\right) - &= s_0^{\left(k_{c g}\right)} e^{-\gamma_g \tau^{\left(k_{c g}\right)}} - \nonumber \\ - &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}}{\gamma_g} - \left(1-e^{-\gamma_g \tau^{\left(k_{c g}\right)}}\right) - \nonumber\\ - &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}-\beta_g u_0^{\left(k_{c g}\right)}} - {\gamma_g-\beta_g}\left(e^{-\gamma_g \tau^{\left(k_{c g}\right)}} - -e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right), - \nonumber \\ - &\qquad \beta \neq \gamma \label{eq-solution-s} \\ -s\left(\tau^{\left(k_{c g}\right)}\right) - &= {s_0^{\left(k_{c g}\right)}}_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} \nonumber \\ - &\hskip -24pt +\frac{\alpha^{\left(k_{c g}\right)}}{\beta_g} - \left(1-e^{-\beta_g \tau^{(k c g)}}\right) - \nonumber \\ - &\hskip -24pt -\left(\alpha^{\left(k_{c g}\right)} - -\beta_g u_0^{\left(k_{c g}\right)}{ }_g\right) \tau^{\left(k_{c g}\right)} - e^{-\beta_g \tau^{\left(k_{c g}\right)}}, \nonumber \\ - &\qquad \beta = \gamma \label{eq-solution-s2} -\end{align} -To simplify these equations, consider the case when $k_{cg} = 0$ and -$\beta_g \neq \gamma_g$. Then, -\begin{align} -u\left(\tau^{(0)}\right) &= {u_0^{(0)}}_g e^{-\beta_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt + \frac{{\alpha^{(0)}}_g}{\beta_g}\left(1-e^{-\beta_g \tau^{(0)}}\right), - \label{eq-sol-usimp} \\ -s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g}\left(1-e^{-\gamma_g \tau^{(0)}}\right) - \nonumber\\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g}{\gamma_g-\beta_g} - \left(e^{-\gamma_g \tau^{(0)}}-e^{-\beta_g \tau^{(0)}}\right). \label{eq-sol-ssimp} -\end{align} -When $k_{cg} = 0$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(0)}\right)$ -has the same solution, and $s\left(\tau^{(0)}\right)$ becomes -\begin{align} -s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g} - \left(1-e^{-\gamma_g \tau^{(0)}}\right) \nonumber\\ - & \hskip -24pt - \left( {\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g \right) - \tau^{(0)} e^{-\beta_g \tau^{(0)}}. \label{eq-sol-ssimp2} -\end{align} -When $k_{cg} = 1$ and $\beta_g \neq \gamma_g$, then -\begin{align} -u\left(\tau^{(1)}\right) &=u_0^{(1)} g^{e^{-\beta_g \tau^{(1)}}}, \\ -s\left(\tau^{(1)}\right) &=s_0^{(1)} e^{-\gamma_g \tau^{(1)}} \nonumber \\ - & \hskip -24pt +\frac{-\beta_g u_0^{(1)}}{\gamma_g-\beta_g} - \left(e^{-\gamma_g \tau^{(1)}}-e^{-\beta_g \tau^{(1)}}\right). -\end{align} -When $k_{cg} = 1$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(1)}\right)$ -has the same solution, and $s\left(\tau^{(1)}\right)$ becomes +The linear ODE in Eq. \ref{eq-rna} has an analytic solution under the assumption that the transcription rate $\alpha(t)$ is piece-wise constant, and is derived by the authors of scVelo and a theoretical RNA velocity study +[@Bergen2020-pj;@Li2021-qa]. +For brevity, we present the solution under constant transcription rate $\alpha(t) = \alpha_0$, and note that this can be combined to assemble the solution in the piece-wise constant case. +First, let $C = u_0 - \frac{\alpha}{\beta}$. Then, the solution to the ODE in Eq. \ref{eq-rna} is: +\begin{equation} +\label{eq-rna-solution} \begin{align} -s\left(\tau^{(1)}\right)=s_0^{(1)}{ }_g e^{-\gamma_g \tau^{(1)}} - +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}. + u(t) &= \frac{\alpha}{\beta} + C e^{-\beta t} \\ + s(t) &= \begin{cases} + \frac{\alpha}{\gamma} + \frac{\beta}{\gamma - \beta} C e^{-\beta t} + \Big(s_0 - \frac{\alpha}{\gamma} - \frac{\beta}{\gamma - \beta} C\Big) e^{-\gamma t} & \text{if } \beta \neq \gamma \\ + \frac{\alpha}{\beta} + \Big(s_0 - \frac{\alpha}{\beta} + \beta C t\Big) e^{-\beta t} & \text{if } \beta = \gamma. + \end{cases} \end{align} - +\end{equation} +These solutions are the essential building block used to compute all cell-gene-specific ODE solutions for the unspliced and spliced expression levels in the generative model. ## Variational inference {#sec-methods-inference} From 6cf2faa9a0e6c9c7ab00bd50c7bae34ab05fae42 Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Thu, 29 Aug 2024 12:59:00 -0400 Subject: [PATCH 16/18] bugfix with align vs aligned --- reproducibility/manuscript/manuscript.qmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 438c393b5..634c4299d 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -501,18 +501,18 @@ ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] an [@Bergen2020-pj] \begin{equation} \label{eq-rna} -\begin{align} +\begin{aligned} \frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \\ \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, -\end{align} +\end{aligned} \end{equation} where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows: \begin{equation} \label{eq-rna-cg} -\begin{align} +\begin{aligned} \frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\ \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}. -\end{align} +\end{aligned} \end{equation} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, $\left( u_{cg}(t), s_{cg}(t) \right)$ @@ -608,13 +608,13 @@ For brevity, we present the solution under constant transcription rate $\alpha(t First, let $C = u_0 - \frac{\alpha}{\beta}$. Then, the solution to the ODE in Eq. \ref{eq-rna} is: \begin{equation} \label{eq-rna-solution} -\begin{align} +\begin{aligned} u(t) &= \frac{\alpha}{\beta} + C e^{-\beta t} \\ s(t) &= \begin{cases} \frac{\alpha}{\gamma} + \frac{\beta}{\gamma - \beta} C e^{-\beta t} + \Big(s_0 - \frac{\alpha}{\gamma} - \frac{\beta}{\gamma - \beta} C\Big) e^{-\gamma t} & \text{if } \beta \neq \gamma \\ \frac{\alpha}{\beta} + \Big(s_0 - \frac{\alpha}{\beta} + \beta C t\Big) e^{-\beta t} & \text{if } \beta = \gamma. \end{cases} -\end{align} +\end{aligned} \end{equation} These solutions are the essential building block used to compute all cell-gene-specific ODE solutions for the unspliced and spliced expression levels in the generative model. From 95d4db156915223cc3bc15db209ce05a82360990 Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Thu, 29 Aug 2024 15:20:24 -0400 Subject: [PATCH 17/18] updating later methods section to have consistent notation --- reproducibility/manuscript/manuscript.qmd | 60 ++++++++++++----------- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 634c4299d..73b8ecffd 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -510,8 +510,8 @@ where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at \begin{equation} \label{eq-rna-cg} \begin{aligned} -\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\ - \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}. +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-du-cg} \\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-ds-cg} . \end{aligned} \end{equation} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, @@ -581,6 +581,7 @@ Now, we describe the priors for the initial conditions, noting that this is the We define the ODE solution at time $\tau_{cg}$ as: \begin{equation} + \label{eq-ODE-solve} \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big). \end{equation} @@ -594,8 +595,8 @@ Next, we define the observation model that gives rise to the observed counts as: \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\ \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\ \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\ - \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\ - \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) + \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \label{eq-u-hat-obs} \\ + \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) \label{eq-s-hat-obs} \end{align} Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to @@ -620,12 +621,12 @@ These solutions are the essential building block used to compute all cell-gene-s ## Variational inference {#sec-methods-inference} -Given observations $\tilde{X}_{cg} = \left( u_{cg}^{obs}, s_{cg}^{obs} \right)$, -we would like to compute the posterior distribution over the random variables +Given observations $\tilde{X}_{cg} = \left( u_{cg}^{\text{(obs)}}, s_{cg}^{\text{(obs)}} \right)$ over cells $1 \dots C$ and genes $1 \dots G$, +we would like to compute the posterior distribution over the random variables $\theta := \left( \theta_1, \ \dots, \theta_C \right)$ and $\psi := \left( \psi_1, \ \dots, \psi_G \right)$, where \begin{align} -\theta &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\ -\phi &= \left( {t_{0}^{(0)}}_g, \Delta \text{switching}_{g}, - {\alpha^{(0)}}_{g}, \beta_{g}, \gamma_{g} \right), +\theta_c &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\ +\phi_g &= \left( {t_{0,g}^{(0)}}, \Delta \text{switching}_{g}, + \alpha^{(0)}_{g}, \beta_{g}, \gamma_{g} \right), \end{align} but exact Bayesian inference is intractable in this model. We use Pyro to automatically integrate out the local discrete latent variables $k$, which is @@ -686,18 +687,20 @@ machine with an NVIDIA A100 GPU and the CentOS 7 operating system. ## Posterior prediction {#sec-methods-posterior-prediction} To benchmark Pyro -Velocity performance in predicting cell fate, we -generated the posterior samples measurement -$x_{cg}=\left(u\left(\tau^{(k_{cg})}\right), -s\left(\tau^{(k_{cg})}\right)\right)$ or $x_{cg}=\left(u_{cg}, s_{cg} \right)$, -$t_c$, $\beta_g$, $\gamma_g$ from the same single cell RNA-seq using $N=30$ -Monte Carlo samples from the posterior predictive distribution following -$p(x_{cg} \vert \theta, \psi) p(\theta, \psi) \approx p(x_{cg} \vert \theta, -\psi) q_{\phi}(\theta, \psi)$. $x_{cg}$ can be posterior samples of either -denoised gene expression (used for phase portraits and vector field-based -trajectory inference) or raw read counts (used for gene selection). Then we -calculated the posterior samples of RNA velocity as -$v_{cg}=\frac{ds(\tau^{(k_{cg})})}{d \tau^{(k{cg})}}$ based on posterior samples -measurement of denoised gene expression $x_{cg}$ and $\beta_{g}$, $\gamma_{g}$. +generated posterior samples of quantitites of interest in the model. +In particular, we focus on posterior samples of $i)$ spliced and unspliced gene expression across all cells and genes, defined in Eq. \ref{eq-ODE-solve} and denoted by +$X_{cg}=\left(u_{cg}, s_{cg} \right)$ (used for phase portraits and vector-field-based +trajectory inference), +$ii)$ read counts, defined in Eqs. \ref{eq-u-hat-obs} and \ref{eq-s-hat-obs} and denoted by +$\hat{X}^{\text{(obs)}}_{cg}=\left(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg} \right)$ +(used for gene selection), +and +$iii)$ RNA velocity, defined in Eq. \ref{eq-ds-cg} as the time derivative of the spliced expression, denoted by +$v_{cg}=\frac{ds_{cg}}{d t}(X_{cg})$. +We generate 30 Monte Carlo samples from the posterior predictive distribution +$X, \hat{X}^{(\text{obs})}, v \sim p(X, \hat{X}^{(\text{obs})}, v \vert \theta, \psi) p(\theta, \psi) \approx q_{\phi}(\theta, \psi)$ +by sampling the posterior distribution of $\theta$, $\psi$ using the trained model. +We use these samples to compute statistics on the posterior predictive distribution of our quantities of interest; these statistics are used to evaluate the model's performance and uncertainty in predicting cell fate. ## Prioritization of cell fate markers {#sec-methods-fate-markers} @@ -706,9 +709,7 @@ correlation between each gene's posterior mean of the denoised spliced expression and posterior mean of the shared time; Second, the negative mean absolute errors of each gene's observed spliced, and unspliced read counts with posterior predictive samples of spliced and unspliced raw read counts, i.e. -$\frac{1}{N n_c} \sum_i \sum_c (x_{icg}-\tilde{X}_{cg})$, where $N$ is the -posterior sample number that is set to $30$, $i$ is the posterior sample index, -and $n_c$ is the cell number. We first select the top $300$ genes with the +$\mathcal{E}_g = \frac{1}{NC} \sum_{i=1}^N \sum_{c=1}^{C} \vert \tilde{X}_{cg} - \hat{X}_{cg}^{(\text{obs}), (i)} \vert$, where we use $N=30$ Monte Carlo samples. We first select the top $300$ genes with the highest negative mean absolute errors and then rank the $300$ genes based on the most positively correlated genes and the least negatively correlated genes. We use the same strategy for scVelo to rank the markers by the model likelihood to @@ -847,8 +848,9 @@ is state uncertainty which evaluates the mean Euclidean distance between posterior mean and posterior samples of the raw read count prediction for each cell \begin{align*} -\frac{1}{N} \sum_{i} \sqrt{\sum_{g} \left(x_{icg} - \frac{1}{N}\sum_i x_{icg}\right)^2}. +\mathcal{U}_c = \frac{1}{N} \sum_{i=1}^N \sqrt{\sum_{g=1}^G \left(\hat{X}^{\text{(obs)},(i)}_{cg} - \frac{1}{N}\sum_{i=1}^N \hat{X}^{\text{(obs)},(i)}_{cg}\right)^2} \end{align*} +with $N=30$ Monte Carlo samples. ## Cospar {#sec-methods-cospar} @@ -895,11 +897,11 @@ to use for a given dataset. Namely, the mean absolute errors of posterior samples prediction of spliced and unspliced raw read counts $x_{icg}$ from its observation $\tilde{X}_{cg}$ per gene and cell combinations \begin{align*} -\frac{1}{N_i N_c N_g} \sum_{i=1}^{N_i} \sum_{c=1}^{N_c} \sum_{g=1}^{N_g} - \vert x_{icg} - \tilde{X}_{cg} \vert, +\frac{1}{N C G} \sum_{i=1}^{N} \sum_{c=1}^{C} \sum_{g=1}^{G} + \vert X^{\text{(obs), (i)}}_{cg} - \tilde{X}_{cg} \vert, \end{align*} -where $i$ is the posterior sample index, $N_i$ is the posterior sample number, -$N_c$ is the cell number, $N_g$ is the gene number. Then it is possible to +where $i$ is the posterior sample index, $N$ is the posterior sample number, +$C$ is the cell number, $G$ is the gene number. Then it is possible to evaluate the cell fate inference performance of the models based on two metrics: 1. the consistency between the velocity vector field and the *clonal progression From b7183f287ab904502250bf9d5605b8614159da75 Mon Sep 17 00:00:00 2001 From: mattlevine22 Date: Tue, 3 Sep 2024 00:23:57 -0400 Subject: [PATCH 18/18] latex label bugfix --- reproducibility/manuscript/manuscript.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 73b8ecffd..c7223eebc 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -510,8 +510,8 @@ where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at \begin{equation} \label{eq-rna-cg} \begin{aligned} -\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-du-cg} \\ - \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-ds-cg} . +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} . \end{aligned} \end{equation} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,