graph_diffusion.qmd

---
title: "Diffusion distances"
---

#### Load example data

Before getting into the topic, we load the usual example data and performing standard preprocessing

```{r}
suppressPackageStartupMessages({
  library( tidyverse )
  library( Matrix )
  library( sparseMatrixStats )
  library( Seurat ) })

ReadMtx( "~/Downloads/ifnagrko/ifnagrko_raw_counts.mtx.gz",
    "~/Downloads/ifnagrko/ifnagrko_obs.csv",
    "~/Downloads/ifnagrko/ifnagrko_var.csv",
    cell.sep=",", feature.sep=",", skip.cell=1, skip.feature=1, 
    mtx.transpose=TRUE) -> count_matrix
```

```{r}
count_matrix %>%
CreateSeuratObject() %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA( npcs=20 ) %>%
FindNeighbors( dims=1:20 ) %>%
FindClusters( resolution=0.5 ) %>%
RunUMAP( dims=1:20 ) -> seu
```

```{r}
UMAPPlot( seu, label=TRUE ) + coord_equal()
```

#### Nearest neighbors

In this lesson, we will mainly work with the nearest neighbor data. Seurat has already calculated this but we do this again here:

```{r}
FNN::get.knn( Embeddings( seu, "pca" ), k=15 ) -> nn

head( nn$nn.index)
```

#### Adjacency matrix

We next construct the adjacency matrix of the undirected nearest neighbor graph. To connect each vertex to its $i$th nearest neighbor, we need the following edges:

```{r}
ncells <- ncol(seu)

i <- 3
cbind( vertex_A=1:ncells, vertex_B=nn$nn.index[,i] ) %>% head()
```

We can use this as indices of the matrix cells we want to set to one and thus construct a sparse matrix

```{r}
sparseMatrix(  i=1:ncells, j=nn$nn.index[,i], x=1, dims=c(ncells,ncells) ) %>% summary() %>% head()
```


Adding up on such matrix for i running from 1 to 15 gives us the adjacency matrix

```{r}
adjm <- sparseMatrix( i=integer(), j=integer(), x=numeric(), dims=c(ncells,ncells) ) # zero matrix
for( i in 1:ncol(nn$nn.index) ) {
   adjm <- adjm + sparseMatrix(  i=1:ncells, j=nn$nn.index[,i], x=1, dims=c(ncells,ncells) ) }
summary(adjm) %>% head()
```

We make the matrix symmetric

```{r}
adjm <-  adjm + t(adjm)
```

Now, some matrix entries have become 2 rather than 1. We set everything back to 1:

```{r}
adjm@x[] <- 1
```

Now, we have an adjacency matrix for our nearest neighbor graph.

#### Random walk

We now define a random walk on our graph as follows: A "walker" (or: "token")
starts at a vertex $i$. In each time step, it choses one of the vertex's
neighbors at random and moves there. What is the probability of the walker 
being on vertex $j$ after $\ell$ steps?

We represent the walker being at vertex $i$ with the unit vector in direction $i$,
i.e., the vector $\vec e_i$, with a 1 at component $i$ and zero elsewhere.
The transition matrix $T$ with elements $T_{ij}$ tells us the probability of the walker
moving to vertex $j$ in a step if it was before at vertex $i$:

$$ T_{ij} = A_{ij} \Big/ \sum_{j'}{A_{ij'}}. $$
The division normalizes the probabiliities by dividing by the number of neighbors
that the walker can chose from.

```{r}
trm <- adjm / rowSums(adjm)
```

Check normalization:

```{r}
rowSums(trm) %>% head()
```


To try this out, we pick a cell close to the point (-6,6) in the UMAP:

```{r}
cell <- which.min( ( Embeddings(seu,"umap")[,1] + 6 )^2 + ( Embeddings(seu,"umap")[,2] - 6 )^2 )
cell
```

Here's a UMAP plot of this cell and it's neighbors:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( w = case_when(
  row_number() == cell ~ "cell",
  adjm[cell,] == 1 ~ "neighbor",
  TRUE ~ "other"
)) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=w ), size=.3 ) + coord_equal() +
  scale_color_manual( values=c("darkgreen","magenta","#00000006"))
```

Let's perform 10 steps. We start with a sparse vector with a single 1 at the chosen cell's
index and 0 elsewhere, then multiply this 10 times with $T$:

```{r}
u <- sparseVector( i=cell, x=1, length=ncells )
for( i in 1:10 )
  u <- u %*% trm
```

We first check whether `u` is still normalized:

```{r}
sum(u)
```

Here's a plot of $\vec{u}=\vec{e}_i^\top T^{10}$.

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( u = as.vector(u) ) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=u ), size=.3 ) + coord_equal() +
  scale_color_viridis_c(direction=-1)
```

#### Exponantiating the transition matrix

Calculating $\ell$ steps by repeated multiplication is wasteful. We should use
a matrix exponential. 

As preparation for this, we define the diagonal "degree matrix" $D$, that contains the vertex degrees:
$$ D_{ij} = \delta_{ij} \sum_{j'} A_{ij'}. $$

Now, we have:
$$ T = D^{-1}A.$$

$T$ is a row-stochastic matrix, i.e., its values are all non-negative and its 
rows sum to 1.

The probability mass vector for a walker starting at vertex $i$ after one step is
$\vec{e}_i^\top T$, and after $\ell$ steps, $\vec{e}_i^\top T^\ell$.

To calculate $T^\ell$, we will need the eigendecomposition of the symmetrized transition matrix
$\tilde T = D^{-1/2} T D^{-1/2}$:
$$ \tilde T = U\Lambda U^\top,$$
with the columns of $U$ containing the eigenvectors of $\tilde T$ and the diagonal matrix $\Lambda$
containing the eigenvalues.

With this, we get
$$ \begin{align}
T^\ell &= \left(D^{-1} A\right)^\ell \\
&= D^{-1/2} \left(D^{-1/2} A D^{-1/2}\right)^\ell D^{1/2} \\ 
&= D^{-1/2} \left(U \Lambda U^T\right)^\ell D^{1/2} \\ 
&= \underbrace{D^{-1/2} U \Lambda^\ell}_{=X_\ell} U^T D^{1/2}.
\end{align} $$

We construct $D$

```{r}
degdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=rowSums(adjm) )
```

We also write down $D^{-1}$, $D^{1/2}$ and $D^{-1/2}$:

```{r}
invdegdiag     <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/rowSums(adjm) )
sqrtdegdiag    <- sparseMatrix( i=1:ncells, j=1:ncells, x=sqrt(rowSums(adjm)) )
invsqrtdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/sqrt(rowSums(adjm)) )
```

Now we get the eigensystem of $\tilde T$, requesting the 100 eigenvalues
that are largest by magnitude:

```{r}
eigtrm <- RSpectra::eigs_sym( invsqrtdegdiag %*% adjm %*% invsqrtdegdiag, k=100 )
```

Now, we can calculate $\vec e_i^\top T^\ell$ as follows:

We calculate first $X_\ell = D^{-1/2} U \Lambda^\ell$:

```{r}
invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^10 ) %>% as.matrix() -> x10

dim(x10)
```  

Note that the matrix $X_\ell$ has been trimmed to only the first 50 columns. That is ok
because the factor $\Lambda^\ell$ gets small quickly and therefore, the columns stay close
to zero once we get past the first few on the left:

```{r}
plot( colSums(x10^2) )
```

Of course, this only works for $\ell \gg 1$ and with $\ell=10$ we may have only just enough steps for the
approximation becoming valid.

We take the row of $X_\ell$ that corresponds to our cell and multiply this with $U^\top D^{1/2}$ to get 
$e_i^\top X_\ell U^\top D^{1/2}=e_i^\top T^\ell$:

```{r}
x10[cell,] %*% t(eigtrm$vectors) %*% sqrtdegdiag %>% as.vector() -> u10
```

Here is a plot of that vector:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( u = u10 ) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=u ), size=.3 ) + coord_equal() +
  scale_color_viridis_c(direction=-1)
```
This looks quite similar as the plot we made before. 

Here's a comparison of the two results

```{r}
plot( u, u10, asp=1, cex=.2, xlab="exact calculation", ylab="using top 100 eigenvectors" )
abline( 0, 1, col="#00000020" )
abline( h=0, v=0, col="#00000020" )
```

#### Diffusion distances

We now define a new distance metrix for our cells. Intuitively: To quantify the
distance between two cells $i$ and $j$, we start random walks at both cells,
evolve them for $\ell$ steps, and then ask about the overlap between the resulting
probability vectors. 

We might therefore use
$$ \left\| e_i^\top T^\ell - e_j
^\top T^\ell \right\|_2, $$ 
i.e., the Euclidean distance between rows $i$ and $j$ of $T^\ell$ as the distance between cells $i$ and $j$. 

Note that the number of steps, $\ell$, selects a "length scale" at which the 
that distance is informative. 

There is a practical difficulty in using this definition, though: When multiplying 
the small $n\times k$ matrix $X_\ell$ with the transpose of the $n\times k$ eigenvector matrix $U$
(with $n$ being the number of cells and $k$ being the number of eigenvectors that have been calculated)
our data blows up to an unwieldy $n\times n$ matrix.

We can avoid this by using the rows of $X_\ell$ instead of the rows of $T^\ell$, 
and therefore define: 

The *$\ell$-steps diffusion distance* between cells $i$ and $j$ is
$$ d_{\ell,ij} = \left\| e_i^\top X_\ell - e_j^\top X_\ell \right\|_2, $$ 

Note that this is also 
$$ d_{\ell,ij} = \left\| \left( e_i^\top T^\ell  - e_i^\top T^\ell\right) D^{-1/2} \right\|_2, $$ 
i.e., the components of the resulting probability vectors get reweighted by $D^{-1/2}$ before calculating the
norm of the difference. If we accept this (somewhat unmotivated) reweighting, because it does not change much (as the vertex
degrees do not differ that much from each other), we have a computationally efficient way of calculating $d_{\ell,ij}$:
All we need is $X_\ell$.

By calculating the Euclidean distance of every row of $X_\ell$ to the row for our selected cell,
we get the $\ell$-step diffusion distance of that cell to all other cells:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( d = sqrt( rowSums( t( t(x10) - x10[cell,] )^2 ) ) ) %>% 
arrange(-d) %>%
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=d ), size=.3 ) + coord_equal() +
  scale_color_viridis_c(direction=-1)

```

Below, this plot is repeated for several values of $\ell$. (But remember that our
approximation error is larger for $\ell \lesssim 10$.)

```{r}
for( l in c(3, 10, 30, 100, 300, 1000 ) ) {
  invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^l ) %>% as.matrix() -> xm
  print(
    Embeddings(seu,"umap") %>%
    as_tibble() %>%
    mutate( d = sqrt( rowSums( t( t(xm) - xm[cell,] )^2 ) ) ) %>% 
    arrange(-d) %>%
    ggplot +
      geom_point( aes( x=umap_1, y=umap_2, col=d ), size=.3 ) + coord_equal() +
      scale_color_viridis_c(direction=-1) + ggtitle( sprintf( "%d steps", l ) ) )
}
```

#### Pseudotime

We now want to define a pseudotime along the lineage, from cluster 0 to cluster 2 and on to 7.

We define a start cell and an end cell, by taking our cell (which is arguably somewhere in the
middle) and find the cell with the largest distance to it within cluster 0 and 7, respectivelty.

We will work with $\ell=300$:

```{r}
l <- 300
xm <- as.matrix( invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^l ) )
# distances to "intermediate" cell:
d <- sqrt( rowSums( t( t(xm) - xm[cell,] )^2 ) )
```

Here is the distance of the cells in cluster 0 (the astrocyte / neuronal stem cell cluster, where the lineage starts) to our intermediate cell:

```{r}
hist( d[seu$seurat_clusters==0], 100 )
abline(v=0.0013, col="orange")
```


It seems reasonable to assume that the cells at the steep cliff are the actual start of the
lineage and the cells further out or some outliers (perhaps cells that started into
another direction)

```{r}
start_cell <- which.min( ( d * (seu$seurat_clusters==0) - 0.0013 )^2 )
```

Let's also find a reasonable "end cell" in the neuron cluster 7:

```{r}
hist( d[seu$seurat_clusters==7], 100 )
```
Here' let's simply take the last cell:

```{r}
end_cell <- which.max( d * (seu$seurat_clusters==7) )
```

Now, get distances to these two cells:

```{r}
dist_to_start <- sqrt( rowSums( t( t(xm) - xm[start_cell,] )^2 ) )
dist_to_end <- sqrt( rowSums( t( t(xm) - xm[end_cell,] )^2 ) )
```

```{r}
tibble( 
  dist_to_start, dist_to_end, 
  cluster=seu$seurat_clusters ) %>%
mutate(
  type = case_when(
    cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
    cluster %in% c( 6, 11 )  ~ "lineage_cycle",
    TRUE ~ "other" ) ) %>%
ggplot +
  geom_point( aes( x=dist_to_end, y=dist_to_start, col=type ), size=.1 ) +
  coord_equal()
```
Let's rotate this plot by 45°:

```{r}
tibble( 
  dist_to_start, dist_to_end, 
  cluster=seu$seurat_clusters ) %>%
mutate(
  type = case_when(
    cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
    cluster %in% c( 6, 11 )  ~ "lineage_cycle",
    TRUE ~ "other" ) ) %>%
ggplot +
  geom_point( aes( 
    x = dist_to_start - dist_to_end, 
    y = dist_to_start + dist_to_end, col=type ), size=.1 )
```
Our new x-axis should be suitable as pseudotime:
```{r}
pseudotime <- dist_to_start - dist_to_end
```

Let's plot this into the UMAP:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( pseudotime ) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=pseudotime ), size=.3 ) + coord_equal() +
  scale_color_gradientn( colours=rje::cubeHelix(100,r=4) ) 
```

Before we dive into this, let's also plot the `y axis of `dist+to+start+dfist+to_end`:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( distsum = dist_to_start + dist_to_end ) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=distsum ), size=.3 ) + coord_equal() +
  scale_color_gradient2( midpoint = .01, limits = c(0,.02), oob=scales::oob_squish )
```
Comparing with the plot above, our pseudotime is valid for the red, white and 
perhaps the very faintly blue regions.

### Comaprison to principal curve

Let's recreate the principal curve that we used before for pseudotime. 

As before, we calculate the principal curve using the cells in the lineage as
input, without the cells from the two cycling cluster, and use smoothing splines
with 10 degrees of freedom.

```{r}
princurve::principal_curve(
  Embeddings(seu,"pca")[ seu$seurat_clusters %in% c( 0, 3, 5, 1, 2, 7 ), ],
  df = 10, approx_points=1000 ) -> prc
```

We again project the remaining cells onto outo the cells on the curve and assign to them
the same pseudotime as the one of the nearest point on the curve:

```{r}
FNN::get.knnx( prc$s, Embeddings(seu,"pca"), k=1 ) -> prc_nn
pseudotime_prc <- prc$lambda[ prc_nn$nn.index ]
```

Here is the principal-curve-based pseudotime:

```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( pseudotime_prc ) %>% 
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=pseudotime_prc ), size=.3 ) + coord_equal() +
  scale_color_gradientn( colours=rje::cubeHelix(100,r=4) ) 
```

We can also compare the two:

```{r}
tibble(
  pt_diffusion = pseudotime,
  pt_prc = -pseudotime_prc,
  cluster = seu$seurat_clusters
) %>%
mutate(
  type = case_when(
    cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
    cluster %in% c( 6, 11 )  ~ "lineage_cycle",
    TRUE ~ "other" ) ) %>%
ggplot +
  geom_point( aes( x=pt_diffusion, y=pt_prc, col=type ), size=.2 )
```

Here, teh same, but coloured for cluster (using only the lineage cells). See below for a UMAP
coloured in the same manner.

```{r}
tibble(
  pt_diffusion = pseudotime,
  pt_prc = -pseudotime_prc,
  cluster = seu$seurat_clusters
) %>%
filter(
  cluster %in% c( 0, 3, 5, 1, 2, 7, 6, 11 ) )  %>%
ggplot +
  geom_point( aes( x=pt_diffusion, y=pt_prc, col=cluster ), size=.2 )
```

### Diffusion space

The space of $X_\ell$ can also be interpreted as an alternative to the feature space,
called "diffusion space". Using the first few components provides a dimension redution,
the "diffusion map". 

For this to work well, our neighborhood graph should be connected.

Ours has two connection components, however, as is evident from the fact that our
transition matrix has two unit eigenvalues:

```{r}
head( eigtrm$values )
```

To make our live easier, let's reduce the data to only the lineage cells

```{r}
in_lineage <- seu$seurat_clusters %in% c( 0, 3, 5, 1, 2, 7, 6, 11 )
```

Subset the adjacency matrix to these and recalculate the transition matrix and
its spectrum

```{r}
adjml <- adjm[ in_lineage, ][ , in_lineage ]
ncells <- nrow(adjml)

invsqrtdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/sqrt(rowSums(adjml)) )
eigtrm <- RSpectra::eigs_sym( invsqrtdegdiag %*% adjml %*% invsqrtdegdiag, k=10 )

x300 <- as.matrix( invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^300 ) )
```

```{r}
as_tibble( x300[,2:3 ] ) %>%
add_column( cluster = seu$seurat_clusters[in_lineage] ) %>%
ggplot +
  geom_point( aes( x=V1, y=V2, col=cluster ), size=.1 ) + coord_equal()
```
For comparison, the UMAP with the same c

```{r}
as_tibble( Embeddings(seu,"umap")[in_lineage,] ) %>%
add_column( cluster = seu$seurat_clusters[in_lineage] ) %>%
ggplot +
  geom_point( aes( x=umap_1, y=umap_2, col=cluster ), size=.1 ) + coord_equal()
```

### References

The idea of diffusion distances and diffusion maps has been introduced in this paper:

- Coifman and Lafon (2006): *Diffusion Maps*. Applied and Computational Harmonic Analysis, Vol. 21, Pages 5-30. [doi:10.1016/j.acha.2006.04.006](https://doi.org/10.1016/j.acha.2006.04.006)

Applying these ideas to single-cell data is explored in

- Haghverdi, Büttner, Theis (2015): *Diffusion maps for high-dimensional single-cell analysis of differentiation data*
  Bioinformatics, Vol. 31, Pages 2989–2998, [doi:10.1093/bioinformatics/btv325](https://doi.org/10.1093/bioinformatics/btv325)
  
- Haghverdi, Büttner, Wolf, Buettner, Theis (2016): *Diffusion pseudotime robustly reconstructs lineage branching*
  Nature Methods, Vol. 13, Pages 845–848
  
- Angerer, Haghverdi, Büttner, Theis, Marr, Buettner: *destiny: diffusion maps for large-scale single-cell data in R*,
Bioinformatic, Vol. 32, Pages 1241-1243, [doi:10.1093/bioinformatics/btv715](https://doi.org/10.1093/bioinformatics/btv715)