Merged the two vignettes and updated tutorial content and code descri…

…ptions.
SydneyBioX · Sep 19, 2024 · 5440290 · 5440290
1 parent 10a673b
commit 5440290
Show file tree

Hide file tree

Showing 3 changed files with 397 additions and 400 deletions.
diff --git a/vignettes/clustSIGNAL.Rmd b/vignettes/clustSIGNAL.Rmd
@@ -0,0 +1,397 @@
+---
+title: "clustSIGNAL tutorial"
+author:
+  - Pratibha Panwar, Boyi Guo, Haowen Zhou, Stephanie Hicks, Shila Ghazanfar
+date: "`r Sys.Date()`"
+output:
+  BiocStyle::html_document:
+    toc_float: true
+  BiocStyle::pdf_document: default
+package: clustSIGNAL
+vignette: |
+    %\VignetteEngine{knitr::rmarkdown}
+    %\VignetteIndexEntry{clustSIGNAL tutorial}
+    %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = FALSE
+)
+```
+
+# Overview
+
+In this vignette, we will demonstrate how to perform spatially-resolved clustering with clusSIGNAL on a dataset containing only one sample. Following this, we will explore the clusters using pre-defined metrics like adjusted rand index (ARI), normalised mutual information (NMI), average silhouette width, and spatial plots. We will also display the use of entropy measures generated as a by-product of clustSIGNAL process in understanding the tissue structure of the sample. Furthermore, the adaptively-smoothed gene expression data generated by clustSIGNAL could be useful for other downstream analyses and will be accessible to the user if they choose to output the final SpatialExperiment object.
+
+```{r setup, message = FALSE, warning = FALSE}
+# load required packages
+library(clustSIGNAL)
+library(distances)
+library(cluster)
+library(aricode)
+library(dplyr)
+library(ggplot2)
+library(patchwork)
+library(scattermore)
+```
+
+# Single sample data analysis 
+
+Here, we use the SeqFISH mouse embryo dataset from [Lohoff et al, 2021](https://www.nature.com/articles/s41587-021-01006-2), which contains spatial transcriptomics data from 3 mouse embryos, with 351 genes and a total of 57536 cells. For this vignette, we subset the data by randomly selecting 5000 cells from Embryo 2, excluding cells that were manually annotated as 'Low quality'.
+
+We begin by creating a SpatialExperiment object from the gene expression and cell information in the data subset, ensuring that the spatial coordinates are stored in spatialCoords within the SpatialExperiment object.
+
+```{r}
+data(mEmbryo2)
+spe <- SpatialExperiment(assays = list(logcounts = me_expr),
+                        colData = me_data, spatialCoordsNames = c("X", "Y"))
+spe
+```
+
+For running clustSIGNAL, we need to know the column names in colData of the SpatialExperiment object that contain the sample and cell labels. Here, the sample labels are in the 'sample_id' column, and the cell labels are in the 'uniqueID' column.
+
+```{r}
+colnames(colData(spe))
+```
+
+
+# Running clustSIGNAL on one sample
+
+Next, we run clustSIGNAL using the sample and cell labels we identified earlier. The simplest clustSIGNAL run requires a SpatialExperiment object, two variables holding colData column names containing sample and cell labels, and the type of output the user would like to see. Other parameters that can be modified include dimRed to specify the low dimension data to use, batch to perform batch correction, NN to specify the neighbourhood size, kernel for weight distribution to use, spread for distribution spread value, sort to sort the neighbourhood, threads to specify the number of cpus to use in parallel runs, and ... for additional parameters for clustering steps. 
+
+```{r}
+set.seed(100)
+samples <- "sample_id"
+cells <- "uniqueID"
+res_emb <- clustSIGNAL(spe, samples, cells, outputs = "a")
+```
+
+This returns a list that can contain a dataframe of cluster names, a matrix of cell labels from each region's neighbourhood, a final SpatialExperiment object, or a combination of these, depending on the choice of 'outputs' selected. Here, the output contains all three data types.
+
+```{r}
+names(res_emb)
+```
+
+The cluster dataframe contains cell labels and their cluster numbers allotted by clustSIGNAL.
+
+```{r}
+head(res_emb$clusters, n = 3)
+```
+
+The final SpatialExperiment object contains the adaptively smoothed gene expression data as an additional assay, as well initial clusters, entropy values, and clustSIGNAL clusters. 
+
+```{r}
+spe <- res_emb$spe_final
+spe
+```
+
+# Analysing clustSIGNAL results
+
+In this section, we analyse the results from clustSIGNAL through spatial plots and clustering metrics. 
+
+## Visualising clustSIGNAL clusters
+
+```{r}
+colors <- c("#635547", "#8EC792", "#9e6762", "#FACB12", "#3F84AA", "#0F4A9C", 
+           "#ff891c", "#EF5A9D", "#C594BF", "#DFCDE4", "#139992", "#65A83E", 
+           "#8DB5CE", "#005579", "#C9EBFB", "#B51D8D", "#532C8A", "#8870ad", 
+           "#cc7818", "#FBBE92", "#EF4E22", "#f9decf", "#c9a997", "#C72228", 
+           "#f79083", "#F397C0", "#DABE99", "#c19f70", "#354E23", "#C3C388",
+           "#647a4f", "#CDE088", "#f7f79e", "#F6BFCB", "#7F6874", "#989898", 
+           "#1A1A1A", "#FFFFFF", "#e6e6e6", "#77441B", "#F90026", "#A10037", 
+           "#DA5921", "#E1C239", "#9DD84A")
+```
+
+We use spatial coordinates of cells and their cluster labels and entropy values to visualize the clustering output.
+
+```{r}
+df_ent <- as.data.frame(colData(spe))
+
+# spatial plot
+spt_clust <- df_ent %>%
+  ggplot(aes(x = spatialCoords(spe)[, 1],
+             y = -spatialCoords(spe)[, 2])) +
+  geom_scattermore(pointsize = 3, aes(colour = reCluster)) +
+  scale_color_manual(values = colors) +
+  ggtitle("A") +
+  labs(x = "x-coordinate", y = "y-coordinate") +
+  guides(color = guide_legend(title = "Clusters",
+                              override.aes = list(size = 3))) +
+  theme_classic() +
+  theme(text = element_text(size = 12))
+
+# calculating median entropy of each cluster
+celltype_ent <- df_ent %>%
+  group_by(as.character(reCluster)) %>%
+  summarise(meanEntropy = median(entropy))
+# reordering clusters by their median entropy
+# low to high median entropy
+cellOrder <- celltype_ent$meanEntropy
+names(cellOrder) <- celltype_ent$`as.character(reCluster)`
+cellOrder <- sort(cellOrder)
+df_ent$reCluster <- factor(df_ent$reCluster, levels = names(cellOrder))
+# box plot of cluster entropy
+colors_ent <- colors[as.numeric(names(cellOrder))]
+box_clust <- df_ent %>%
+  ggplot(aes(x = reCluster, y = entropy, fill = reCluster)) +
+  geom_boxplot() +
+  scale_fill_manual(values = colors_ent) +
+  ggtitle("B") +
+  labs(x = "clustSIGNAL clusters", y = "Entropy") +
+  theme_classic() +
+  theme(legend.position = "none",
+        text = element_text(size = 12),
+        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
+
+spt_clust + box_clust + patchwork::plot_layout(guides = "collect", widths = c(2, 3))
+```
+
+The spatial location (A) and entropy distribution (B) of the clusters provide spatial context of the cells and their neighbourhoods, as well as the compositions of the neighbourhoods. For example, the low entropy of cluster 4 indicates that the cells in this cluster are generally found in more homogeneous space, whereas the high entropy of cluster 7 cells indicates that they belong to regions with more cell diversity. The spatial plot (A) concurs with this entropy-based observation. 
+
+## Cluster metrics
+
+We assess the clustering efficiency of clustSIGNAL using the commonly used clustering metrics ARI, NMI, and silhouette width. ARI and NMI are usable only when prior cell annotation information is available, and assume that this cell annotation is ground truth. Here, ARI and NMI measure the similarity or agreement (respectively) between cluster labels obtained from clustSIGNAL and manual cell annotation labels. On the contrary, silhouette width is reference-free and evaluates how well a cell fits within its assigned cluster compared to other clusters.
+
+```{r}
+# average silhouette width
+clusts <- as.numeric(as.character(spe$reCluster))
+cXg_mat <- t(as.matrix(logcounts(spe)))
+distMat <- distances(cXg_mat)
+silCluster <- as.matrix(silhouette(clusts, distMat))
+spe$rcSil <- silCluster[, 3]
+
+# ARI and NMI
+as.data.frame(colData(spe)) %>%
+  summarise(ARI = aricode::ARI(celltype_mapped_refined, reCluster),
+            NMI = aricode::NMI(celltype_mapped_refined, reCluster),
+            ASW = mean(rcSil))
+```
+
+## Entropy spread and distribution
+
+The entropy values generated through clustSIGNAL process can be useful in analyzing the sample structure. The entropy range can indicate whether the tissue sample contains any homogeneous domain-like structures. For example, here the minimum entropy value is 0, which means some cells are placed in completely homogeneous space when looking at neighbourhood sizes of 30 cells (since NN = 30 used for generating entropy data). Moreover, the mean entropy value is low, which can be interpreted as the tissue having at least some domain-like structures containing 30 cells.
+
+```{r}
+# Data assessment - Overall entropy
+as.data.frame(colData(spe)) %>%
+  summarise(min_Entropy = min(entropy),
+            max_Entropy = max(entropy),
+            mean_Entropy = mean(entropy))
+```
+
+```{r}
+# Histogram of entropy spread
+hst_ent <- as.data.frame(colData(spe)) %>%
+  ggplot(aes(entropy)) +
+  geom_histogram(binwidth = 0.05) +
+  ggtitle("A") +
+  labs(x = "Entropy", y = "Number of regions") +
+  theme_grey() +
+  theme(text = element_text(size = 12))
+
+# Spatial plot showing sample entropy distribution
+spt_ent <- as.data.frame(colData(spe)) %>%
+  ggplot(aes(x = spatialCoords(spe)[, 1],
+             y = -spatialCoords(spe)[, 2])) +
+  geom_scattermore(pointsize = 3,
+             aes(colour = entropy)) +
+  scale_colour_gradient2("Entropy", low = "grey", high = "blue") +
+  scale_size_continuous(range = c(0, max(spe$entropy))) +
+  ggtitle("B") +
+  labs(x = "x-coordinate", y = "y-coordinate") +
+  theme_classic() +
+  theme(text = element_text(size = 12))
+
+hst_ent + spt_ent
+```
+
+The spread (A) and spatial distribution (B) of region entropy measures can be very useful in assessing the tissue composition of samples - low entropy regions are more homogeneous with domain-like structure, whereas high entropy regions are heterogeneous with more uniform distribution of cells.
+
+# Generating entropy data only
+
+To evaluate tissue structure using entropy values, we can run clustSIGNAL up to the entropy measurement step, without running the complete method. The entropy values will be added to the SpatialExperiment object.
+
+```{r}
+data(mEmbryo2)
+spe <- SpatialExperiment(assays = list(logcounts = me_expr),
+                        colData = me_data, spatialCoordsNames = c("X", "Y"))
+
+set.seed(100)
+spe <- scater::runPCA(spe)
+spe <- clustSIGNAL::nsClustering(spe, samples = "sample_id", 
+                                 dimRed = "PCA", reclust = FALSE)
+outReg <- clustSIGNAL::neighbourDetect(spe, samples = "sample_id", NN = 30, 
+                                       cells = "uniqueID", sort = TRUE)
+spe <- entropyMeasure(spe, cells = "uniqueID", outReg$regXclust)
+head(spe$entropy)
+```
+
+# Multisample analysis with clustSIGNAL
+
+Here, we use the MERFISH mouse hypothalamic preoptic region dataset from [Moffitt et al, 2018](https://www.science.org/doi/10.1126/science.aau5324), which contains spatial transcriptomics data from 181 samples, with 155 genes and a total of 1,027,080 cells. For this vignette, we subset the data by selecting a total of 6000 random cells from only 3 samples - Animal 1 Bregma -0.09 (2080 cells), Animal 7 Bregma 0.16 (1936 cells), and Animal 7 Bregma -0.09 (1984 cells), excluding cells that were manually annotated as 'ambiguous' and 20 genes that were assessed using a different technology.
+
+We start the analysis by creating a SpatialExperiment object from the gene expression and cell information in the data subset, ensuring that the spatial coordinates are stored in spatialCoords within the SpatialExperiment object.
+
+```{r}
+data(mHypothal)
+spe2 <- SpatialExperiment(assays = list(logcounts = mh_expr),
+                        colData = mh_data, spatialCoordsNames = c("X", "Y"))
+spe2
+```
+
+Here, the cell labels are in the column 'Cell_ID' and sample labels are in 'samples' column in the SpatialExperiment object.
+
+```{r}
+colnames(colData(spe2))
+```
+
+## clustSIGNAL run
+
+One of the important concepts to take into account when running multisample analysis is batch effects. For samples gathered from different sources or through different technologies/procedures, some technical batch effects might be observed. We run clustSIGNAL in batch correction mode simply by setting batch = TRUE. The method then uses [harmony](https://portals.broadinstitute.org/harmony/) internally for batch correction.
+
+```{r}
+set.seed(101)
+samples <- "samples"
+cells <- "Cell_ID"
+res_hyp <- clustSIGNAL(spe2, samples, cells, batch = TRUE, threads = 4, outputs = "a")
+```
+
+```{r}
+spe2 <- res_hyp$spe_final
+spe2
+```
+
+## Clustering metrics
+
+Clustering and entropy results can be calculated and visualized for each sample. clustSIGNAL works well with samples that have more uniform distribution of cells.
+
+```{r}
+samplesList <- levels(spe2[[samples]])
+samplesList
+```
+
+```{r}
+# calculating silhouette width per sample
+silWidthRC <- matrix(nrow = 0, ncol = 3)
+for (s in samplesList) {
+  speX <- spe2[, spe2[[samples]] == s]
+  clust_sub <- as.numeric(as.character(speX$reCluster))
+  cXg <- t(as.matrix(logcounts(speX)))
+  distMat <- distances(cXg)
+  silCluster <- as.matrix(silhouette(clust_sub, distMat))
+  silWidthRC <- rbind(silWidthRC, silCluster)
+}
+spe2$rcSil <- silWidthRC[, 3]
+
+as.data.frame(colData(spe2)) %>%
+  group_by(samples) %>%
+  summarise(ARI = aricode::ARI(Cell_class, reCluster),
+            NMI = aricode::NMI(Cell_class, reCluster),
+            ASW = mean(rcSil),
+            min_Entropy = min(entropy),
+            max_Entropy = max(entropy),
+            mean_Entropy = mean(entropy))
+```
+
+## Visualizing clustSIGNAL clusters
+
+clustSIGNAL performs clustering on all cells in the dataset in one setting, thereby generating the same clusters across multiple samples. The user does not need to map cluster labels between samples. For example, cluster 1 represents the same cell type in all three samples, without needing explicit mapping between samples.
+
+```{r}
+df_ent <- as.data.frame(colData(spe2))
+
+# spatial plot
+spt_clust <- df_ent %>%
+    ggplot(aes(x = spatialCoords(spe2)[, 1],
+               y = -spatialCoords(spe2)[, 2])) +
+    geom_scattermore(pointsize = 3, aes(colour = reCluster)) +
+    scale_color_manual(values = colors) +
+    facet_wrap(vars(samples), scales = "free", nrow = 1) +
+    labs(x = "x-coordinate", y = "y-coordinate") +
+    guides(color = guide_legend(title = "Clusters",
+                                override.aes = list(size = 3))) +
+    theme_classic() +
+    theme(text = element_text(size = 12),
+          axis.text.x = element_text(angle = 90, vjust = 0.5))
+
+box_clust <- list()
+for (s in samplesList) {
+  df_ent_sub <- as.data.frame(colData(spe2)[spe2[[samples]] == s, ])
+  # calculating median entropy of each cluster in a sample
+  celltype_ent <- df_ent_sub %>%
+    group_by(as.character(reCluster)) %>%
+    summarise(meanEntropy = median(entropy))
+  # reordering clusters by their median entropy
+  # low to high median entropy
+  cellOrder <- celltype_ent$meanEntropy
+  names(cellOrder) <- celltype_ent$`as.character(reCluster)`
+  cellOrder = sort(cellOrder)
+  df_ent_sub$reCluster <- factor(df_ent_sub$reCluster, levels = names(cellOrder))
+
+  # box plot of cluster entropy
+  colors_ent <- colors[as.numeric(names(cellOrder))]
+  box_clust[[s]] <- df_ent_sub %>%
+    ggplot(aes(x = reCluster, y = entropy, fill = reCluster)) +
+    geom_boxplot() +
+    scale_fill_manual(values = colors_ent) +
+    facet_wrap(vars(samples), nrow = 1) +
+    labs(x = "clustSIGNAL clusters", y = "Entropy") +
+    ylim(0, NA) +
+    theme_classic() +
+    theme(strip.text = element_blank(),
+          legend.position = "none",
+          text = element_text(size = 12),
+          axis.text.x = element_text(angle = 90, vjust = 0.5))
+}
+
+spt_clust / (patchwork::wrap_plots(box_clust[1:3], nrow = 1) +
+               plot_layout(axes = "collect")) +
+  plot_layout(guides = "collect", heights = c(5, 3)) +
+  plot_annotation(title = "Spatial (top) and entropy (bottom) distributions of clusters")
+```
+
+The spatial location (top) and entropy distribution (bottom) of the clusters can be compared in a multisample analysis, providing spatial context of the cluster cells and their neighbourhood compositions in the different samples.
+
+## Visualising entropy spread and distribution
+
+In multisample analysis, the spread (A) and spatial distribution (B) of region entropy measures can be useful in assessing and comparing the tissue structure in the samples.
+
+```{r}
+# Histogram of entropy spread
+hst_ent <- as.data.frame(colData(spe2)) %>%
+  ggplot(aes(entropy)) +
+  geom_histogram(binwidth = 0.05) +
+  facet_wrap(vars(samples), nrow = 1) +
+  labs(x = "Entropy", y = "Number of regions") +
+  theme_classic() +
+  theme(text = element_text(size = 12))
+
+# Spatial plot showing sample entropy distribution
+spt_ent <- as.data.frame(colData(spe2)) %>%
+  ggplot(aes(x = spatialCoords(spe2)[, 1],
+             y = -spatialCoords(spe2)[, 2])) +
+  geom_scattermore(pointsize = 3,
+             aes(colour = entropy)) +
+  scale_colour_gradient2("Entropy", low = "grey", high = "blue") +
+  scale_size_continuous(range = c(0, max(spe2$entropy))) +
+  facet_wrap(vars(samples), scales = "free", nrow = 1) +
+  labs(x = "x-coordinate", y = "y-coordinate") +
+  theme_classic() +
+  theme(strip.text = element_blank(),
+        text = element_text(size = 12),
+        axis.text.x = element_text(angle = 90, vjust = 0.5))
+hst_ent / spt_ent + plot_layout(heights = c(3,5)) +
+    plot_annotation(title = "Entropy spread (top) and spatial distribution (bottom)")
+```
+
+
+<details>
+
+<summary>**Session Information**</summary>
+
+```{r}
+sessionInfo()
+```
+
+</details>