From 62bfc064152fac1b549c80bf956ba386b1e2c20c Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Mon, 26 Aug 2024 11:47:47 -0400
Subject: [PATCH 01/18] starting to update methods, part 1

---
 reproducibility/manuscript/manuscript.qmd | 88 ++++++++++++++++++-----
 1 file changed, 71 insertions(+), 17 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index 6002915f0..685530f48 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -497,35 +497,89 @@ fate potency are reported in the titles.
 
 We assume the dynamical gene expression is determined by the RNA splicing
 process, and infer the unspliced and spliced gene expression level from the
-differential equations proposed in velocyto [@La_Manno2018-lj] and scVelo
+ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] and scVelo
 [@Bergen2020-pj]
 \begin{align}
-\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
-  &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right),
-   \label{eq-dudt}\\
-\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
-  &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right)
-  -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt}
+\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\
+   \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt}
+\end{align}
+where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows:
+\begin{align}
+\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}, \label{eq-dsdt}.
 \end{align}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,
-$\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)$ 
+$\left( u_{cg}(t), s_{cg}(t) \right)$ 
 are the unspliced and spliced expression functions given the
-change of time per cell and gene. $\tau_{cg}$ represents the displacement of time per
+change of time per cell and gene.
+We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown activation state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times $\tau_{cg}$, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. 
+Importantly, we recognize that the initial conditions are in fact unknown. 
+
+We propose and study two models: Model 1 assumes that spliced and unspliced concentrations are both 0 at time 0; Model 2 considers these initial conditions as unknowns with a log-Normal prior distribution. In general, the solution space of ODEs becomes much richer when considered over a domain of initial conditions (as opposed to a single point); indeed, this affords Model 2 much greater expressivity. For clarity, we first present the generative framework for both models, then provide further interpretation and intuition. 
+
+First, we introduce the generative model that describes the various unobserved times:
+\begin{align}
+  % unit lognormal t_c
+  t_c &\sim \text{LogNormal}(0, 1) \\
+  % gene-specific t_0
+  t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\
+  % switching time
+  \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\
+  % gene-specific t_1
+  t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\
+  %cell-gene-specific activation state
+  k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\
+  % cell-gene-specific latent time
+  \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}).
+\end{align}
+Here, $\tau_{cg}$ represents the displacement of time per
 cell and gene with 
 \begin{align}
- \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\
- & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), 
+ textrm{softplus}(t) :=  \log( 1 + e^t).
 \end{align}
-in which $t_c$ is the shared time per cell, ${t_{0}^{(kcg)}}_g$ is the
+Recall that $t_c$ is the shared time per cell, $t^{(k_{cg})}_{0,g}$ is the
 gene-specific switching time. Each cell and gene combination has its
 transcriptional state $k_{cg} \in \{ 0, 1 \}$, where $0$ indicates the
 activation state and $1$ indicates the expression state. Each gene has two
-switching times for representing activation and repression: ${t_{0}^{(0)}}_g$ is
+switching times for representing activation and repression: $t^{(0)}_{0,g}$ is
 the first switching time corresponding to when the gene expression starts to be
-activated, ${t_0^{(1)}}_g$  is the second switching time corresponding to when
-the gene expression starts to be repressed. We note that $\alpha^{(1)}$ is
-shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for
-each gene. 
+activated, $t^{(1)}_{0,g}$  is the second switching time corresponding to when
+the gene expression starts to be repressed.
+
+Next we introduce the priors for the splicing parameters (where the activation rate $\alpha$ depends on the activation state $k_{cg}$ from above):
+\begin{align}
+  \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\
+  \beta_g &\sim \text{LogNormal}(0, 1) \\
+  \gamma_g &\sim \text{LogNormal}(0, 1) \\
+  \alpha_{cg} = \begin{cases}
+    \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\
+    0 & \text{if } k_{cg} = 1
+  \end{cases}
+\end{align}
+\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for
+each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?}
+
+Now, we describe the priors for the initial conditions, noting that this is the only difference between Model 1 and Model 2:
+\begin{align}
+  \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases}
+    (0, 0) & \text{Model 1} \\
+    (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2}
+  \end{cases}
+
+  u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases}
+    \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\
+    \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1
+  \end{cases}
+\end{align}
+
+Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts:
+\begin{align}
+  \hat{u}_{cg}, \hat{s}_{cg} &= \textrm{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\
+  u_{cg} &\sim \text{Poisson}(\hat{u}_{cg}) \\
+  s_{cg} &\sim \text{Poisson}(\hat{s}_{cg})
+\end{align}
+
+
 
 The analytic solution of the differential equations to predict
 spliced and unspliced gene expression given their parameters is derived by the

From 63c188ba21d384ed75d489573fff4c1fbf9a1cfa Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Mon, 26 Aug 2024 14:36:37 -0400
Subject: [PATCH 02/18] starting to update methods, part 2

---
 reproducibility/manuscript/manuscript.qmd | 108 ++++------------------
 1 file changed, 19 insertions(+), 89 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index 685530f48..fe7141724 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -535,7 +535,7 @@ First, we introduce the generative model that describes the various unobserved t
 Here, $\tau_{cg}$ represents the displacement of time per
 cell and gene with 
 \begin{align}
- textrm{softplus}(t) :=  \log( 1 + e^t).
+ \text{softplus}(t) :=  \log( 1 + e^t).
 \end{align}
 Recall that $t_c$ is the shared time per cell, $t^{(k_{cg})}_{0,g}$ is the
 gene-specific switching time. Each cell and gene combination has its
@@ -551,7 +551,7 @@ Next we introduce the priors for the splicing parameters (where the activation r
   \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\
   \beta_g &\sim \text{LogNormal}(0, 1) \\
   \gamma_g &\sim \text{LogNormal}(0, 1) \\
-  \alpha_{cg} = \begin{cases}
+  \alpha_{cg} &= \begin{cases}
     \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\
     0 & \text{if } k_{cg} = 1
   \end{cases}
@@ -574,13 +574,25 @@ Now, we describe the priors for the initial conditions, noting that this is the
 
 Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts:
 \begin{align}
-  \hat{u}_{cg}, \hat{s}_{cg} &= \textrm{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\
-  u_{cg} &\sim \text{Poisson}(\hat{u}_{cg}) \\
-  s_{cg} &\sim \text{Poisson}(\hat{s}_{cg})
+  \hat{u}_{cg}, \hat{s}_{cg} &= \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\
+  \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\
+  \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\
+  \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\
+  \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\
+  \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\
+  \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\
+  \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\
+  \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\
+  \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\
+  \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big)
 \end{align}
+Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts
+for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to
+denote our generative model's prediction of these unspliced and spliced expression levels.
+The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_cg, \hat{s}_cg$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts.
+\textbf{Improve descriptions of how noise is modeled in the observation model.}
 
-
-
+\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.}
 The analytic solution of the differential equations to predict
 spliced and unspliced gene expression given their parameters is derived by the
 authors of scVelo and a theoretical RNA velocity study
@@ -650,88 +662,6 @@ s\left(\tau^{(1)}\right)=s_0^{(1)}{ }_g e^{-\gamma_g \tau^{(1)}}
   +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}.
 \end{align}
 
-We use these solutions to formulate an end-to-end probabilistic generative model 
-that relates prior distributions on kinetic parameters to a distribution on pairs 
-of observed unspliced and spliced read count matrices
-
-\begin{align}
-\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\
-\beta_g &\sim \operatorname{LogNormal}(0,1), \\
-\gamma_g &\sim \operatorname{LogNormal}(0,1), \\
-&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\
-t_0^{\left(k_{c g}\right)} &= \left\{
-  \begin{array}{l}
-    t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\
-    t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\
-    \quad k_{c g}=1
-  \end{array}\right. \\
-t_c &\sim \operatorname{LogNormal}(0,1), \\
-k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\
-\tau^{\left(k_{c g}\right)} 
-  &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\
-u_{c g} 
- &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; 
-                                   u_{c g}^{obs}\right), \\
-s_{c g} 
-  &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; 
-                                    s_{c g}^{obs}\right).
-\end{align}
-$u\left(\tau^{\left(k_{c g}\right)}\right)$ and $s\left(\tau^{(k_{c g})}\right)$ are
-are called the denoised gene expression calculated from the velocity analytic
-solution input with the kinetics random variables. $u_{cg}$ and $s_{cg}$ are the spliced
-and unspliced read count sampled from the Poisson models. $u_{cg}^{obs}$ and $s_{cg}^{obs}$ are
-the observed spliced and unspliced read count tables. The generative process
-
-$\text{Measurement}(\cdot)$ for observed unspliced read counts given denoised unspliced
-gene transcript expression level $u\left(\tau^{(k_{cg})}\right)$ 
-(and identical for observed spliced read counts) models the expected number of observed 
-reads for a given gene in a given cell as the number of transcripts times the ratio of 
-the cell's total reads to total transcripts
-\begin{align}
-u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\
-\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\
-\eta_c^{(u)} &\sim \operatorname{Normal}\left(
-    u_c^{\hat{obs}_c}, 
-    \operatorname{std} \left(u_c^{\hat{obs}}\right)
-  \right), \\
-\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right)
-  +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\
-u_{c g}^{obs} &\sim 
-  \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right).
-\end{align} 
-
-For the first Pyro-Velocity model (Model 1), we constrain the shared time to be strictly
-larger than $t_{0}^{(0)}$ by introducing auxiliary random variables 
-$$
-\text{t\_constraint}_{cg} 
-  \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right),
-$$
-and setting their values to $1$, and we set the initial condition per gene to be
-\begin{align}
-\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{
-  \begin{array}{l}
-    (0,0), k_{c g}=0 \\
-    \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\
-           \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\
-           \quad k_{c g}=1
-  \end{array}\right.
-\end{align}
-For the extended Pyro&thinsp;-Velocity model (Model 2), we remove the shared
-time constraint $\text{t\_constraint}_{cg}$, thus allowing a time lag per gene
-that might be caused by delayed gene activation and set the initial condition
-per gene as random variables that are strictly positive $\left({u_{0}^{(0)}}_g,
-{s_{0}^{(0)}}_g\right)$, which allow genes having a basal expression level before gene
-activation. Then, we compute the gene expression at the second switching time as
-\begin{align}
-({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= 
-  \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\
-& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg),
-\end{align}
-which shares the same initial condition $\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)$ where
-\begin{align}
-{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\
-{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1).
-\end{align}
 
 ## Variational inference {#sec-methods-inference}
 

From f327453a6a570cd59c2a0144c029cb061adcc9e8 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:04:50 -0400
Subject: [PATCH 03/18] fix(manuscript): remove paragraph break in align env

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 reproducibility/manuscript/manuscript.qmd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index fe7141724..ebf76594d 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -565,7 +565,6 @@ Now, we describe the priors for the initial conditions, noting that this is the
     (0, 0) & \text{Model 1} \\
     (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2}
   \end{cases}
-
   u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases}
     \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\
     \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1

From 1d6b1e21766ea8880921eaa63042e9abdb6bf224 Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Mon, 26 Aug 2024 15:21:36 -0400
Subject: [PATCH 04/18] latex fixes and updates to methods. ready to be read.

---
 reproducibility/manuscript/manuscript.qmd | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index ebf76594d..b40104d70 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -506,7 +506,7 @@ ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] an
 where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows:
 \begin{align}
 \frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\
-   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}, \label{eq-dsdt}.
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}.
 \end{align}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,
 $\left( u_{cg}(t), s_{cg}(t) \right)$ 
@@ -543,8 +543,12 @@ transcriptional state $k_{cg} \in \{ 0, 1 \}$, where $0$ indicates the
 activation state and $1$ indicates the expression state. Each gene has two
 switching times for representing activation and repression: $t^{(0)}_{0,g}$ is
 the first switching time corresponding to when the gene expression starts to be
-activated, $t^{(1)}_{0,g}$  is the second switching time corresponding to when
-the gene expression starts to be repressed.
+activated, $t^{(1)}_{0,g}$ is the second switching time corresponding to when
+the gene expression starts to be repressed, and is determined by the first
+switching time and the gene-specific switching time $\Delta \text{switching}_g$.
+The cell-gene-specific activation state $k_{cg}$ is a Bernoulli random variable
+with logits equal to the difference between the cell's shared time $t_c$ and the time $t^{(1)}_{0,g}$ when the gene expression starts to be repressed.
+
 
 Next we introduce the priors for the splicing parameters (where the activation rate $\alpha$ depends on the activation state $k_{cg}$ from above):
 \begin{align}
@@ -564,16 +568,20 @@ Now, we describe the priors for the initial conditions, noting that this is the
   \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases}
     (0, 0) & \text{Model 1} \\
     (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2}
-  \end{cases}
+  \end{cases} \\
   u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases}
     \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\
     \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1
   \end{cases}
 \end{align}
 
-Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation model at those times that give rise to the observed counts:
+We define the ODE solution at time $\tau_{cg}$ as:
+\begin{equation}
+    \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big).
+\end{equation}
+
+Next, we define the observation model that gives rise to the observed counts as:
 \begin{align}
-  \hat{u}_{cg}, \hat{s}_{cg} &= \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big) \\
   \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\
   \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\
   \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\
@@ -588,7 +596,7 @@ Finally, we describe the ODE solution at time $\tau_{cg}$, and the observation m
 Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts
 for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to
 denote our generative model's prediction of these unspliced and spliced expression levels.
-The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_cg, \hat{s}_cg$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts.
+The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_{cg}, \hat{s}_{cg}$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts.
 \textbf{Improve descriptions of how noise is modeled in the observation model.}
 
 \textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.}

From b85530633440ca67a0267795eb949586a2c8abc8 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:28:46 -0400
Subject: [PATCH 05/18] chore(manuscript): update v2 diff file

---
 reproducibility/manuscript/v2.tex | 228 ++++++++++++++++--------------
 1 file changed, 122 insertions(+), 106 deletions(-)

diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex
index b69d93566..914342c02 100644
--- a/reproducibility/manuscript/v2.tex
+++ b/reproducibility/manuscript/v2.tex
@@ -661,34 +661,133 @@ \subsection{Model formulation}\label{sec-methods-model}
 
 We assume the dynamical gene expression is determined by the RNA
 splicing process, and infer the unspliced and spliced gene expression
-level from the differential equations proposed in velocyto
-\citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} \begin{align}
-\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
-  &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right),
-   \label{eq-dudt}\\
-\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
-  &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right)
-  -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt}
+level from the ordinary differential equation (ODEs) proposed in
+velocyto \citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj}
+\begin{align}
+\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\
+   \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt}
+\end{align} where \(u(t), s(t)\) are the unspliced and spliced
+expression levels of a gene at time \(t\) under a transcription rate
+\(\alpha(t)\) with possible temporal dependence, splicing rate
+\(\beta\), and degradation rate \(\gamma\). We specify this model to a
+setting that depends on cell \(c\) and gene \(g\) as follows:
+\begin{align}
+\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}.
 \end{align} In the equation, the subscript \(c\) is the cell dimension,
-\(g\) is the gene dimension,
-\(\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)\)
-are the unspliced and spliced expression functions given the change of
-time per cell and gene. \(\tau_{cg}\) represents the displacement of
-time per cell and gene with \begin{align}
- \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\
- & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), 
-\end{align} in which \(t_c\) is the shared time per cell,
-\({t_{0}^{(kcg)}}_g\) is the gene-specific switching time. Each cell and
-gene combination has its transcriptional state
+\(g\) is the gene dimension, \(\left( u_{cg}(t), s_{cg}(t) \right)\) are
+the unspliced and spliced expression functions given the change of time
+per cell and gene. We restrict attention to piecewise-constant
+\(\alpha_{cg}(t)\) to capture gene-specific activation and repression.
+We take special care to model a gene- and cell-specific switching time
+that marks a single transition from activation to repression by
+introducing a Bernoulli variable \(k_{cg}\) to model unknown activation
+state. We assume our cell-by-gene data-matrix arrive as observations of
+Poisson-counts related to the solution of the above ODEs at unknown
+times \(\tau_{cg}\), which is modeled as a relationship between an
+unknown latent time shared across each cell, \(t_c\), and unknown
+gene-specific time-offsets \(t_{0,g}\) where all read counts for a
+single cell occurred at an unknown, but shared latent time \(t_c\).
+These relative times are also used to parametrize the Bernoulli process
+for \(k_{cg}\). Importantly, we recognize that the initial conditions
+are in fact unknown.
+
+We propose and study two models: Model 1 assumes that spliced and
+unspliced concentrations are both 0 at time 0; Model 2 considers these
+initial conditions as unknowns with a log-Normal prior distribution. In
+general, the solution space of ODEs becomes much richer when considered
+over a domain of initial conditions (as opposed to a single point);
+indeed, this affords Model 2 much greater expressivity. For clarity, we
+first present the generative framework for both models, then provide
+further interpretation and intuition.
+
+First, we introduce the generative model that describes the various
+unobserved times: \begin{align}
+  % unit lognormal t_c
+  t_c &\sim \text{LogNormal}(0, 1) \\
+  % gene-specific t_0
+  t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\
+  % switching time
+  \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\
+  % gene-specific t_1
+  t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\
+  %cell-gene-specific activation state
+  k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\
+  % cell-gene-specific latent time
+  \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}).
+\end{align} Here, \(\tau_{cg}\) represents the displacement of time per
+cell and gene with \begin{align}
+ \text{softplus}(t) :=  \log( 1 + e^t).
+\end{align} Recall that \(t_c\) is the shared time per cell,
+\(t^{(k_{cg})}_{0,g}\) is the gene-specific switching time. Each cell
+and gene combination has its transcriptional state
 \(k_{cg} \in \{ 0, 1 \}\), where \(0\) indicates the activation state
 and \(1\) indicates the expression state. Each gene has two switching
-times for representing activation and repression: \({t_{0}^{(0)}}_g\) is
+times for representing activation and repression: \(t^{(0)}_{0,g}\) is
 the first switching time corresponding to when the gene expression
-starts to be activated, \({t_0^{(1)}}_g\) is the second switching time
-corresponding to when the gene expression starts to be repressed. We
-note that \(\alpha^{(1)}\) is shared for all the genes, while
-\({\alpha^{(0)}}_g\) is learned independently for each gene.
+starts to be activated, \(t^{(1)}_{0,g}\) is the second switching time
+corresponding to when the gene expression starts to be repressed, and is
+determined by the first switching time and the gene-specific switching
+time \(\Delta \text{switching}_g\). The cell-gene-specific activation
+state \(k_{cg}\) is a Bernoulli random variable with logits equal to the
+difference between the cell's shared time \(t_c\) and the time
+\(t^{(1)}_{0,g}\) when the gene expression starts to be repressed.
+
+Next we introduce the priors for the splicing parameters (where the
+activation rate \(\alpha\) depends on the activation state \(k_{cg}\)
+from above): \begin{align}
+  \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\
+  \beta_g &\sim \text{LogNormal}(0, 1) \\
+  \gamma_g &\sim \text{LogNormal}(0, 1) \\
+  \alpha_{cg} &= \begin{cases}
+    \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\
+    0 & \text{if } k_{cg} = 1
+  \end{cases}
+\end{align}
+\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for
+each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?}
+
+Now, we describe the priors for the initial conditions, noting that this
+is the only difference between Model 1 and Model 2: \begin{align}
+  \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases}
+    (0, 0) & \text{Model 1} \\
+    (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2}
+  \end{cases} \\
+  u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases}
+    \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\
+    \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1
+  \end{cases}
+\end{align}
 
+We define the ODE solution at time \(\tau_{cg}\) as: \begin{equation}
+    \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big).
+\end{equation}
+
+Next, we define the observation model that gives rise to the observed
+counts as: \begin{align}
+  \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\
+  \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\
+  \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\
+  \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\
+  \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\
+  \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\
+  \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\
+  \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\
+  \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\
+  \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big)
+\end{align} Here, we use
+\({u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}\) to denote the
+observed unspliced and spliced counts for cell \(c\) and gene \(g\). We
+use \(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}\) to
+denote our generative model's prediction of these unspliced and spliced
+expression levels. The generative process for modeling these observed
+read counts given denoised gene transcript expression level
+\(\hat{u}_{cg}, \hat{s}_{cg}\) considers the expected number of observed
+reads for a given gene in a given cell as the number of transcripts
+times the ratio of the cell's total reads to total transcripts.
+\textbf{Improve descriptions of how noise is modeled in the observation model.}
+
+\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.}
 The analytic solution of the differential equations to predict spliced
 and unspliced gene expression given their parameters is derived by the
 authors of scVelo and a theoretical RNA velocity study
@@ -753,89 +852,6 @@ \subsection{Model formulation}\label{sec-methods-model}
   +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}.
 \end{align}
 
-We use these solutions to formulate an end-to-end probabilistic
-generative model that relates prior distributions on kinetic parameters
-to a distribution on pairs of observed unspliced and spliced read count
-matrices
-
-\begin{align}
-\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\
-\beta_g &\sim \operatorname{LogNormal}(0,1), \\
-\gamma_g &\sim \operatorname{LogNormal}(0,1), \\
-&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\
-t_0^{\left(k_{c g}\right)} &= \left\{
-  \begin{array}{l}
-    t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\
-    t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\
-    \quad k_{c g}=1
-  \end{array}\right. \\
-t_c &\sim \operatorname{LogNormal}(0,1), \\
-k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\
-\tau^{\left(k_{c g}\right)} 
-  &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\
-u_{c g} 
- &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; 
-                                   u_{c g}^{obs}\right), \\
-s_{c g} 
-  &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; 
-                                    s_{c g}^{obs}\right).
-\end{align} \(u\left(\tau^{\left(k_{c g}\right)}\right)\) and
-\(s\left(\tau^{(k_{c g})}\right)\) are are called the denoised gene
-expression calculated from the velocity analytic solution input with the
-kinetics random variables. \(u_{cg}\) and \(s_{cg}\) are the spliced and
-unspliced read count sampled from the Poisson models. \(u_{cg}^{obs}\)
-and \(s_{cg}^{obs}\) are the observed spliced and unspliced read count
-tables. The generative process
-
-\(\text{Measurement}(\cdot)\) for observed unspliced read counts given
-denoised unspliced gene transcript expression level
-\(u\left(\tau^{(k_{cg})}\right)\) (and identical for observed spliced
-read counts) models the expected number of observed reads for a given
-gene in a given cell as the number of transcripts times the ratio of the
-cell's total reads to total transcripts \begin{align}
-u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\
-\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\
-\eta_c^{(u)} &\sim \operatorname{Normal}\left(
-    u_c^{\hat{obs}_c}, 
-    \operatorname{std} \left(u_c^{\hat{obs}}\right)
-  \right), \\
-\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right)
-  +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\
-u_{c g}^{obs} &\sim 
-  \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right).
-\end{align}
-
-For the first Pyro-Velocity model (Model 1), we constrain the shared
-time to be strictly larger than \(t_{0}^{(0)}\) by introducing auxiliary
-random variables \[
-\text{t\_constraint}_{cg} 
-  \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right),
-\] and setting their values to \(1\), and we set the initial condition
-per gene to be \begin{align}
-\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{
-  \begin{array}{l}
-    (0,0), k_{c g}=0 \\
-    \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\
-           \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\
-           \quad k_{c g}=1
-  \end{array}\right.
-\end{align} For the extended Pyro -Velocity model (Model 2), we remove
-the shared time constraint \(\text{t\_constraint}_{cg}\), thus allowing
-a time lag per gene that might be caused by delayed gene activation and
-set the initial condition per gene as random variables that are strictly
-positive \(\left({u_{0}^{(0)}}_g,
-{s_{0}^{(0)}}_g\right)\), which allow genes having a basal expression
-level before gene activation. Then, we compute the gene expression at
-the second switching time as \begin{align}
-({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= 
-  \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\
-& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg),
-\end{align} which shares the same initial condition
-\(\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)\) where \begin{align}
-{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\
-{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1).
-\end{align}
-
 \subsection{Variational inference}\label{sec-methods-inference}
 
 Given observations

From bedcc1e26ded0f2bdf7a0331b192a14072317c49 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:39:12 -0400
Subject: [PATCH 06/18] fix(ci): include log files in manuscript artifacts

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 .github/workflows/manuscript.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml
index 3f5b1be87..6bd88a198 100644
--- a/.github/workflows/manuscript.yaml
+++ b/.github/workflows/manuscript.yaml
@@ -62,3 +62,4 @@ jobs:
             reproducibility/manuscript/v2*.*
             reproducibility/manuscript/*.bib
             reproducibility/manuscript/*.dvc
+            reproducibility/manuscript/*.log

From 767d18931f65a8e9d37526cea271f9df623db715 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:44:10 -0400
Subject: [PATCH 07/18] test(manuscript): disable latexdiff bibtex diff

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 reproducibility/manuscript/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile
index 2b75a9b59..21c3c29cb 100644
--- a/reproducibility/manuscript/Makefile
+++ b/reproducibility/manuscript/Makefile
@@ -108,12 +108,12 @@ cache-tex: copy-tex
 # run as make latexdiff COMPARISON_SHA1="commit-hash"
 # to diff with a different commit
 # see: https://gitlab.com/git-latexdiff/git-latexdiff
+#	--bibtex
 latexdiff:
 	git-latexdiff \
 	--ignore-latex-errors \
 	--latexmk \
 	--xelatex \
-	--bibtex \
 	--no-view \
 	--ignore-makefile \
 	--ln-untracked \

From a835cd29ce8b368b00fec5455debba0ff09f343e Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:58:47 -0400
Subject: [PATCH 08/18] fix(ci): include tmp log files in manuscript artifacts

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 .github/workflows/manuscript.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml
index 6bd88a198..f9614b382 100644
--- a/.github/workflows/manuscript.yaml
+++ b/.github/workflows/manuscript.yaml
@@ -63,3 +63,4 @@ jobs:
             reproducibility/manuscript/*.bib
             reproducibility/manuscript/*.dvc
             reproducibility/manuscript/*.log
+            /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log

From 10528ca171677ac699a353eac1053a1bd1a52490 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 15:59:07 -0400
Subject: [PATCH 09/18] test(manuscript): disable latexdiff tmp cleanup to save
 log file

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 reproducibility/manuscript/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile
index 21c3c29cb..6e0ed8929 100644
--- a/reproducibility/manuscript/Makefile
+++ b/reproducibility/manuscript/Makefile
@@ -108,15 +108,16 @@ cache-tex: copy-tex
 # run as make latexdiff COMPARISON_SHA1="commit-hash"
 # to diff with a different commit
 # see: https://gitlab.com/git-latexdiff/git-latexdiff
-#	--bibtex
 latexdiff:
 	git-latexdiff \
 	--ignore-latex-errors \
 	--latexmk \
 	--xelatex \
+	--bibtex \
 	--no-view \
 	--ignore-makefile \
 	--ln-untracked \
+	--cleanup none \
 	--output $(TEX_FILE)-$(COMPARISON_SHA1)-diff.pdf \
 	--main $(TEX_FILE).tex $(COMPARISON_SHA1) HEAD
 

From fa824bb394d152636df1792745cecfee913d1949 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 16:10:42 -0400
Subject: [PATCH 10/18] chore(manuscript): update v2 diff file

---
 reproducibility/manuscript/v2.tex | 1 +
 1 file changed, 1 insertion(+)

diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex
index 914342c02..7c08cd37e 100644
--- a/reproducibility/manuscript/v2.tex
+++ b/reproducibility/manuscript/v2.tex
@@ -156,6 +156,7 @@
 \raggedbottom
 \usepackage{nameref}
 \usepackage{placeins}
+\usepackage[utf8]{inputenc}
 \usepackage[labelfont=bf]{caption}
 % \usepackage{showframe}
 % \usepackage{layout}

From 86cbd30e7727d9308ee41cd6247019d614ee817d Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 16:14:24 -0400
Subject: [PATCH 11/18] chore(manuscript): update v2 diff file

---
 reproducibility/manuscript/v2.tex | 1 +
 1 file changed, 1 insertion(+)

diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex
index 7c08cd37e..45996fbdb 100644
--- a/reproducibility/manuscript/v2.tex
+++ b/reproducibility/manuscript/v2.tex
@@ -157,6 +157,7 @@
 \usepackage{nameref}
 \usepackage{placeins}
 \usepackage[utf8]{inputenc}
+\usepackage{fontspec}
 \usepackage[labelfont=bf]{caption}
 % \usepackage{showframe}
 % \usepackage{layout}

From 7e0d2bfd70ad45a009b1b6d79b812b1da79909da Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 16:15:05 -0400
Subject: [PATCH 12/18] fix(manuscript): support unicode thin space in tex

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 reproducibility/manuscript/header.tex | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/reproducibility/manuscript/header.tex b/reproducibility/manuscript/header.tex
index 0fa6ebe3f..04d22655a 100644
--- a/reproducibility/manuscript/header.tex
+++ b/reproducibility/manuscript/header.tex
@@ -1,5 +1,7 @@
 \usepackage{nameref}
 \usepackage{placeins}
+\usepackage[utf8]{inputenc}
+\usepackage{fontspec}
 \usepackage[labelfont=bf]{caption}
 % \usepackage{showframe}
 % \usepackage{layout}

From 53f08f24adaf0d64cab278e6349295e89f1c9a5f Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 16:34:31 -0400
Subject: [PATCH 13/18] fix(ci): note artifact path to include latexdiff log

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 .github/workflows/manuscript.yaml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml
index f9614b382..ec2bd9d37 100644
--- a/.github/workflows/manuscript.yaml
+++ b/.github/workflows/manuscript.yaml
@@ -57,10 +57,12 @@ jobs:
         uses: actions/upload-artifact@834a144ee995460fba8ed112a2fc961b36a5ec5a # v4
         with:
           name: manuscript-${{ github.ref_name }}-${{ github.sha }}
+          # Include
+          # /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log
+          # to capture latexdiff log file
           path: |
             reproducibility/manuscript/manuscript.*
             reproducibility/manuscript/v2*.*
             reproducibility/manuscript/*.bib
             reproducibility/manuscript/*.dvc
             reproducibility/manuscript/*.log
-            /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log

From bbbca8a65b014f484fa227a5f8ebd154c4cf6412 Mon Sep 17 00:00:00 2001
From: Cameron Smith <cameron.ray.smith@gmail.com>
Date: Mon, 26 Aug 2024 16:35:08 -0400
Subject: [PATCH 14/18] fix(manuscript): note how to retain latexdiff log

Signed-off-by: Cameron Smith <cameron.ray.smith@gmail.com>
---
 reproducibility/manuscript/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile
index 6e0ed8929..8eb6fde05 100644
--- a/reproducibility/manuscript/Makefile
+++ b/reproducibility/manuscript/Makefile
@@ -108,6 +108,7 @@ cache-tex: copy-tex
 # run as make latexdiff COMPARISON_SHA1="commit-hash"
 # to diff with a different commit
 # see: https://gitlab.com/git-latexdiff/git-latexdiff
+# Set `--cleanup none` to keep the tmp files including the latex log
 latexdiff:
 	git-latexdiff \
 	--ignore-latex-errors \
@@ -117,7 +118,6 @@ latexdiff:
 	--no-view \
 	--ignore-makefile \
 	--ln-untracked \
-	--cleanup none \
 	--output $(TEX_FILE)-$(COMPARISON_SHA1)-diff.pdf \
 	--main $(TEX_FILE).tex $(COMPARISON_SHA1) HEAD
 

From d2492f481c8b06276de107869ee57625b583febe Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Thu, 29 Aug 2024 12:54:33 -0400
Subject: [PATCH 15/18] updated and tersened analytic solutions methods
 sub-section

---
 reproducibility/manuscript/manuscript.qmd | 104 ++++++----------------
 1 file changed, 26 insertions(+), 78 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index b40104d70..438c393b5 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -499,21 +499,27 @@ We assume the dynamical gene expression is determined by the RNA splicing
 process, and infer the unspliced and spliced gene expression level from the
 ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] and scVelo
 [@Bergen2020-pj]
+\begin{equation}
+  \label{eq-rna}
 \begin{align}
-\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\
-   \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt}
+\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \\
+   \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0,
 \end{align}
+\end{equation}
 where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows:
+\begin{equation}
+    \label{eq-rna-cg}
 \begin{align}
-\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\
-   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}.
+\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}.
 \end{align}
+\end{equation}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,
 $\left( u_{cg}(t), s_{cg}(t) \right)$ 
 are the unspliced and spliced expression functions given the
 change of time per cell and gene.
-We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown activation state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times $\tau_{cg}$, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. 
-Importantly, we recognize that the initial conditions are in fact unknown. 
+We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model unknown transcription state. We assume our cell-by-gene data-matrix arrive as observations of Poisson-counts related to the solution of the above ODEs at unknown times, which is modeled as a relationship between an unknown latent time shared across each cell, $t_c$, and unknown gene-specific time-offsets $t_{0,g}$ where all read counts for a single cell occurred at an unknown, but shared latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. 
+Importantly, we recognize that the initial conditions are, in fact, unknown.
 
 We propose and study two models: Model 1 assumes that spliced and unspliced concentrations are both 0 at time 0; Model 2 considers these initial conditions as unknowns with a log-Normal prior distribution. In general, the solution space of ODEs becomes much richer when considered over a domain of initial conditions (as opposed to a single point); indeed, this affords Model 2 much greater expressivity. For clarity, we first present the generative framework for both models, then provide further interpretation and intuition. 
 
@@ -557,11 +563,9 @@ Next we introduce the priors for the splicing parameters (where the activation r
   \gamma_g &\sim \text{LogNormal}(0, 1) \\
   \alpha_{cg} &= \begin{cases}
     \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\
-    0 & \text{if } k_{cg} = 1
+    0 & \text{if } k_{cg} = 1.
   \end{cases}
 \end{align}
-\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for
-each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?}
 
 Now, we describe the priors for the initial conditions, noting that this is the only difference between Model 1 and Model 2:
 \begin{align}
@@ -597,78 +601,22 @@ Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the ob
 for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to
 denote our generative model's prediction of these unspliced and spliced expression levels.
 The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_{cg}, \hat{s}_{cg}$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts.
-\textbf{Improve descriptions of how noise is modeled in the observation model.}
-
-\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.}
-The analytic solution of the differential equations to predict
-spliced and unspliced gene expression given their parameters is derived by the
-authors of scVelo and a theoretical RNA velocity study
-[@Bergen2020-pj;@Li2021-qa] and given in
-Eqs. \ref{eq-solution-u}-\ref{eq-solution-s2}.
-\begin{align}
-u\left(\tau^{\left(k_{c g}\right)}\right) 
-  &= u_0^{\left(k_{c g}\right)}{ }_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} 
-  \nonumber \\
-&\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}}
-  {\beta_g}\left(1-e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right) 
-  \label{eq-solution-u}\\
-s\left(\tau^{\left(k_{c g}\right)}\right) 
-  &= s_0^{\left(k_{c g}\right)} e^{-\gamma_g \tau^{\left(k_{c g}\right)}} 
-  \nonumber \\
-  &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}}{\gamma_g}
-    \left(1-e^{-\gamma_g \tau^{\left(k_{c g}\right)}}\right) 
-    \nonumber\\
-  &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}-\beta_g u_0^{\left(k_{c g}\right)}}
-    {\gamma_g-\beta_g}\left(e^{-\gamma_g \tau^{\left(k_{c g}\right)}}
-    -e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right), 
-    \nonumber \\
-  &\qquad \beta \neq \gamma \label{eq-solution-s} \\
-s\left(\tau^{\left(k_{c g}\right)}\right) 
-  &= {s_0^{\left(k_{c g}\right)}}_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} \nonumber \\
-  &\hskip -24pt +\frac{\alpha^{\left(k_{c g}\right)}}{\beta_g}
-  \left(1-e^{-\beta_g \tau^{(k c g)}}\right) 
-  \nonumber \\
-  &\hskip -24pt -\left(\alpha^{\left(k_{c g}\right)}
-  -\beta_g u_0^{\left(k_{c g}\right)}{ }_g\right) \tau^{\left(k_{c g}\right)} 
-  e^{-\beta_g \tau^{\left(k_{c g}\right)}}, \nonumber \\
-  &\qquad \beta = \gamma \label{eq-solution-s2}
-\end{align}
 
-To simplify these equations, consider the case when $k_{cg} = 0$ and 
-$\beta_g \neq \gamma_g$. Then, 
-\begin{align}
-u\left(\tau^{(0)}\right) &= {u_0^{(0)}}_g e^{-\beta_g \tau^{(0)}} \nonumber \\
-  & \hskip -24pt + \frac{{\alpha^{(0)}}_g}{\beta_g}\left(1-e^{-\beta_g \tau^{(0)}}\right), 
-  \label{eq-sol-usimp} \\
-s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\
-  & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g}\left(1-e^{-\gamma_g \tau^{(0)}}\right) 
-  \nonumber\\
-  & \hskip -24pt +\frac{{\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g}{\gamma_g-\beta_g}
-  \left(e^{-\gamma_g \tau^{(0)}}-e^{-\beta_g \tau^{(0)}}\right). \label{eq-sol-ssimp}
-\end{align}
-When $k_{cg} = 0$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(0)}\right)$ 
-has the same solution, and $s\left(\tau^{(0)}\right)$ becomes
-\begin{align}
-s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\
-  & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g}
-    \left(1-e^{-\gamma_g \tau^{(0)}}\right) \nonumber\\
-  & \hskip -24pt - \left( {\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g \right) 
-    \tau^{(0)} e^{-\beta_g \tau^{(0)}}. \label{eq-sol-ssimp2}
-\end{align}
-When $k_{cg} = 1$ and $\beta_g \neq \gamma_g$, then
-\begin{align}
-u\left(\tau^{(1)}\right) &=u_0^{(1)} g^{e^{-\beta_g \tau^{(1)}}}, \\
-s\left(\tau^{(1)}\right) &=s_0^{(1)} e^{-\gamma_g \tau^{(1)}} \nonumber \\
-  & \hskip -24pt +\frac{-\beta_g u_0^{(1)}}{\gamma_g-\beta_g}
-  \left(e^{-\gamma_g \tau^{(1)}}-e^{-\beta_g \tau^{(1)}}\right).
-\end{align}
-When $k_{cg} = 1$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(1)}\right)$ 
-has the same solution, and $s\left(\tau^{(1)}\right)$ becomes
+The linear ODE in Eq. \ref{eq-rna} has an analytic solution under the assumption that the transcription rate $\alpha(t)$ is piece-wise constant, and is derived by the authors of scVelo and a theoretical RNA velocity study
+[@Bergen2020-pj;@Li2021-qa].
+For brevity, we present the solution under constant transcription rate $\alpha(t) = \alpha_0$, and note that this can be combined to assemble the solution in the piece-wise constant case.
+First, let $C = u_0 - \frac{\alpha}{\beta}$. Then, the solution to the ODE in Eq. \ref{eq-rna} is:
+\begin{equation}
+\label{eq-rna-solution}
 \begin{align}
-s\left(\tau^{(1)}\right)=s_0^{(1)}{ }_g e^{-\gamma_g \tau^{(1)}}
-  +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}.
+  u(t) &= \frac{\alpha}{\beta} + C e^{-\beta t} \\
+  s(t) &= \begin{cases}
+    \frac{\alpha}{\gamma} + \frac{\beta}{\gamma - \beta} C e^{-\beta t} + \Big(s_0 - \frac{\alpha}{\gamma} - \frac{\beta}{\gamma - \beta} C\Big) e^{-\gamma t} & \text{if } \beta \neq \gamma \\
+    \frac{\alpha}{\beta} + \Big(s_0 - \frac{\alpha}{\beta} + \beta C t\Big) e^{-\beta t} & \text{if } \beta = \gamma.
+  \end{cases}
 \end{align}
-
+\end{equation}
+These solutions are the essential building block used to compute all cell-gene-specific ODE solutions for the unspliced and spliced expression levels in the generative model.
 
 ## Variational inference {#sec-methods-inference}
 

From 6cf2faa9a0e6c9c7ab00bd50c7bae34ab05fae42 Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Thu, 29 Aug 2024 12:59:00 -0400
Subject: [PATCH 16/18] bugfix with align vs aligned

---
 reproducibility/manuscript/manuscript.qmd | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index 438c393b5..634c4299d 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -501,18 +501,18 @@ ordinary differential equation (ODEs) proposed in velocyto [@La_Manno2018-lj] an
 [@Bergen2020-pj]
 \begin{equation}
   \label{eq-rna}
-\begin{align}
+\begin{aligned}
 \frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \\
    \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0,
-\end{align}
+\end{aligned}
 \end{equation}
 where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specify this model to a setting that depends on cell $c$ and gene $g$ as follows:
 \begin{equation}
     \label{eq-rna-cg}
-\begin{align}
+\begin{aligned}
 \frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\
    \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}.
-\end{align}
+\end{aligned}
 \end{equation}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,
 $\left( u_{cg}(t), s_{cg}(t) \right)$ 
@@ -608,13 +608,13 @@ For brevity, we present the solution under constant transcription rate $\alpha(t
 First, let $C = u_0 - \frac{\alpha}{\beta}$. Then, the solution to the ODE in Eq. \ref{eq-rna} is:
 \begin{equation}
 \label{eq-rna-solution}
-\begin{align}
+\begin{aligned}
   u(t) &= \frac{\alpha}{\beta} + C e^{-\beta t} \\
   s(t) &= \begin{cases}
     \frac{\alpha}{\gamma} + \frac{\beta}{\gamma - \beta} C e^{-\beta t} + \Big(s_0 - \frac{\alpha}{\gamma} - \frac{\beta}{\gamma - \beta} C\Big) e^{-\gamma t} & \text{if } \beta \neq \gamma \\
     \frac{\alpha}{\beta} + \Big(s_0 - \frac{\alpha}{\beta} + \beta C t\Big) e^{-\beta t} & \text{if } \beta = \gamma.
   \end{cases}
-\end{align}
+\end{aligned}
 \end{equation}
 These solutions are the essential building block used to compute all cell-gene-specific ODE solutions for the unspliced and spliced expression levels in the generative model.
 

From 95d4db156915223cc3bc15db209ce05a82360990 Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Thu, 29 Aug 2024 15:20:24 -0400
Subject: [PATCH 17/18] updating later methods section to have consistent
 notation

---
 reproducibility/manuscript/manuscript.qmd | 60 ++++++++++++-----------
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index 634c4299d..73b8ecffd 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -510,8 +510,8 @@ where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at
 \begin{equation}
     \label{eq-rna-cg}
 \begin{aligned}
-\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\
-   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)}.
+\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-du-cg} \\
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-ds-cg} .
 \end{aligned}
 \end{equation}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,
@@ -581,6 +581,7 @@ Now, we describe the priors for the initial conditions, noting that this is the
 
 We define the ODE solution at time $\tau_{cg}$ as:
 \begin{equation}
+  \label{eq-ODE-solve}
     \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big).
 \end{equation}
 
@@ -594,8 +595,8 @@ Next, we define the observation model that gives rise to the observed counts as:
   \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\
   \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\
   \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\
-  \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\
-  \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big)
+  \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big)   \label{eq-u-hat-obs} \\
+  \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) \label{eq-s-hat-obs}
 \end{align}
 Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts
 for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to
@@ -620,12 +621,12 @@ These solutions are the essential building block used to compute all cell-gene-s
 
 ## Variational inference {#sec-methods-inference}
 
-Given observations $\tilde{X}_{cg} = \left( u_{cg}^{obs}, s_{cg}^{obs} \right)$,
-we would like to compute the posterior distribution over the random variables
+Given observations $\tilde{X}_{cg} = \left( u_{cg}^{\text{(obs)}}, s_{cg}^{\text{(obs)}} \right)$ over cells $1 \dots C$ and genes $1 \dots G$,
+we would like to compute the posterior distribution over the random variables $\theta := \left( \theta_1, \ \dots, \theta_C \right)$ and $\psi := \left( \psi_1, \ \dots, \psi_G \right)$, where
 \begin{align}
-\theta &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\
-\phi &= \left( {t_{0}^{(0)}}_g, \Delta \text{switching}_{g}, 
-          {\alpha^{(0)}}_{g}, \beta_{g}, \gamma_{g} \right),
+\theta_c &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\
+\phi_g &= \left( {t_{0,g}^{(0)}}, \Delta \text{switching}_{g}, 
+          \alpha^{(0)}_{g}, \beta_{g}, \gamma_{g} \right),
 \end{align}
 but exact Bayesian inference is intractable in this model. We use Pyro to
 automatically integrate out the local discrete latent variables $k$, which is
@@ -686,18 +687,20 @@ machine with an NVIDIA A100 GPU and the CentOS 7 operating system.
 ## Posterior prediction {#sec-methods-posterior-prediction}
 
 To benchmark Pyro&thinsp;-Velocity performance in predicting cell fate, we
-generated the posterior samples measurement
-$x_{cg}=\left(u\left(\tau^{(k_{cg})}\right),
-s\left(\tau^{(k_{cg})}\right)\right)$ or $x_{cg}=\left(u_{cg}, s_{cg} \right)$,
-$t_c$, $\beta_g$, $\gamma_g$ from the same single cell RNA-seq using $N=30$
-Monte Carlo samples from the posterior predictive distribution following
-$p(x_{cg} \vert \theta, \psi) p(\theta, \psi) \approx p(x_{cg} \vert \theta,
-\psi) q_{\phi}(\theta, \psi)$. $x_{cg}$ can be posterior samples of either
-denoised gene expression (used for phase portraits and vector field-based
-trajectory inference) or raw read counts (used for gene selection). Then we
-calculated the posterior samples of RNA velocity as
-$v_{cg}=\frac{ds(\tau^{(k_{cg})})}{d \tau^{(k{cg})}}$ based on posterior samples
-measurement of denoised gene expression $x_{cg}$ and $\beta_{g}$, $\gamma_{g}$.
+generated posterior samples of quantitites of interest in the model.
+In particular, we focus on posterior samples of $i)$ spliced and unspliced gene expression across all cells and genes, defined in Eq. \ref{eq-ODE-solve} and denoted by
+$X_{cg}=\left(u_{cg}, s_{cg} \right)$ (used for phase portraits and vector-field-based
+trajectory inference),
+$ii)$ read counts, defined in Eqs. \ref{eq-u-hat-obs} and \ref{eq-s-hat-obs} and denoted by
+$\hat{X}^{\text{(obs)}}_{cg}=\left(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg} \right)$
+(used for gene selection),
+and
+$iii)$ RNA velocity, defined in Eq. \ref{eq-ds-cg} as the time derivative of the spliced expression, denoted by
+$v_{cg}=\frac{ds_{cg}}{d t}(X_{cg})$.
+We generate 30 Monte Carlo samples from the posterior predictive distribution
+$X, \hat{X}^{(\text{obs})}, v \sim p(X, \hat{X}^{(\text{obs})}, v \vert \theta, \psi) p(\theta, \psi) \approx q_{\phi}(\theta, \psi)$ 
+by sampling the posterior distribution of $\theta$, $\psi$ using the trained model.
+We use these samples to compute statistics on the posterior predictive distribution of our quantities of interest; these statistics are used to evaluate the model's performance and uncertainty in predicting cell fate.
 
 ## Prioritization of cell fate markers {#sec-methods-fate-markers}
 
@@ -706,9 +709,7 @@ correlation between each gene's posterior mean of the denoised spliced
 expression and posterior mean of the shared time; Second, the negative mean
 absolute errors of each gene's observed spliced, and unspliced read counts with
 posterior predictive samples of spliced and unspliced raw read counts, i.e.
-$\frac{1}{N n_c} \sum_i \sum_c (x_{icg}-\tilde{X}_{cg})$, where $N$ is the
-posterior sample number that is set to $30$, $i$ is the posterior sample index,
-and $n_c$ is the cell number. We first select the top $300$ genes with the
+$\mathcal{E}_g = \frac{1}{NC} \sum_{i=1}^N \sum_{c=1}^{C} \vert \tilde{X}_{cg} - \hat{X}_{cg}^{(\text{obs}), (i)} \vert$, where we use $N=30$ Monte Carlo samples. We first select the top $300$ genes with the
 highest negative mean absolute errors and then rank the $300$ genes based on the
 most positively correlated genes and the least negatively correlated genes. We
 use the same strategy for scVelo to rank the markers by the model likelihood to
@@ -847,8 +848,9 @@ is state uncertainty which evaluates the mean Euclidean distance between
 posterior mean and posterior samples of the raw read count prediction for each
 cell
 \begin{align*}
-\frac{1}{N} \sum_{i} \sqrt{\sum_{g} \left(x_{icg} - \frac{1}{N}\sum_i x_{icg}\right)^2}.
+\mathcal{U}_c = \frac{1}{N} \sum_{i=1}^N \sqrt{\sum_{g=1}^G \left(\hat{X}^{\text{(obs)},(i)}_{cg} - \frac{1}{N}\sum_{i=1}^N \hat{X}^{\text{(obs)},(i)}_{cg}\right)^2}
 \end{align*}
+with $N=30$ Monte Carlo samples.
 
 ## Cospar {#sec-methods-cospar}
 
@@ -895,11 +897,11 @@ to use for a given dataset. Namely, the mean absolute errors of posterior
 samples prediction of spliced and unspliced raw read counts $x_{icg}$ from its
 observation $\tilde{X}_{cg}$ per gene and cell combinations
 \begin{align*}
-\frac{1}{N_i N_c N_g} \sum_{i=1}^{N_i} \sum_{c=1}^{N_c} \sum_{g=1}^{N_g} 
-  \vert x_{icg} - \tilde{X}_{cg} \vert,
+\frac{1}{N C G} \sum_{i=1}^{N} \sum_{c=1}^{C} \sum_{g=1}^{G} 
+  \vert X^{\text{(obs), (i)}}_{cg} - \tilde{X}_{cg} \vert,
 \end{align*}
-where $i$ is the posterior sample index, $N_i$ is the posterior sample number,
-$N_c$ is the cell number, $N_g$ is the gene number. Then it is possible to
+where $i$ is the posterior sample index, $N$ is the posterior sample number,
+$C$ is the cell number, $G$ is the gene number. Then it is possible to
 evaluate the cell fate inference performance of the models based on two metrics:
 
 1. the consistency between the velocity vector field and the *clonal progression

From b7183f287ab904502250bf9d5605b8614159da75 Mon Sep 17 00:00:00 2001
From: mattlevine22 <mattlevine22@gmail.com>
Date: Tue, 3 Sep 2024 00:23:57 -0400
Subject: [PATCH 18/18] latex label bugfix

---
 reproducibility/manuscript/manuscript.qmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd
index 73b8ecffd..c7223eebc 100644
--- a/reproducibility/manuscript/manuscript.qmd
+++ b/reproducibility/manuscript/manuscript.qmd
@@ -510,8 +510,8 @@ where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at
 \begin{equation}
     \label{eq-rna-cg}
 \begin{aligned}
-\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-du-cg} \\
-   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-ds-cg} .
+\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\
+   \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} .
 \end{aligned}
 \end{equation}
 In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension,