From c54ad2c0b5772d1cb3d7f159dbd31d1c34b2757a Mon Sep 17 00:00:00 2001 From: alexq Date: Thu, 19 Dec 2024 14:31:12 +1100 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- 01-processing.html | 43 ++++++++++++++++++++-------------------- 02-quality_control.html | 2 +- 03a-cell_annotation.html | 16 +++++++++++++-- search.json | 12 +++++------ 5 files changed, 44 insertions(+), 31 deletions(-) diff --git a/.nojekyll b/.nojekyll index 3266745..ce56218 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -52c18861 \ No newline at end of file +4bd82853 \ No newline at end of file diff --git a/01-processing.html b/01-processing.html index eb35340..5168a34 100644 --- a/01-processing.html +++ b/01-processing.html @@ -404,7 +404,7 @@

# assign metadata columns mcols(images) <- S4Vectors::DataFrame(imageID = names(images)) -

Time for this code chunk to run with 40 cores: 73.99 seconds

+

Time for this code chunk to run with 40 cores: 69.52 seconds

When reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the CytoImageList object using the channelNames function.

@@ -459,7 +459,7 @@

tissue = c("panCK", "CD45", "HH3"), cores = nCores )

-

Time for this code chunk to run with 40 cores: 44.22 seconds

+

Time for this code chunk to run with 40 cores: 44.47 seconds

4.2.1 Visualise separation

@@ -637,7 +637,7 @@

spatialCoords names(2) : x y imgData names(1): sample_id -

Time for this code chunk to run with 40 cores: 73.95 seconds

+

Time for this code chunk to run with 40 cores: 74.32 seconds

So far, we have obtained our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a SpatialExperiment object. We can now move on to quality control, data transformation, and normalisation to address batch effects.

@@ -696,24 +696,25 @@

[43] R6_2.5.1 RColorBrewer_1.1-3 spatstat.data_3.1-4 [46] spatstat.univar_3.1-1 Rcpp_1.0.13-1 knitr_1.49 [49] httpuv_1.6.15 Matrix_1.7-1 nnls_1.6 - [52] tidyselect_1.2.1 yaml_2.3.10 abind_1.4-8 - [55] viridis_0.6.5 codetools_0.2-20 curl_6.0.1 - [58] lattice_0.22-6 tibble_3.2.1 KEGGREST_1.46.0 - [61] shiny_1.10.0 withr_3.0.2 evaluate_1.0.1 - [64] polyclip_1.10-7 Biostrings_2.74.0 filelock_1.0.3 - [67] BiocManager_1.30.25 pillar_1.9.0 generics_0.1.3 - [70] sp_2.1-4 RCurl_1.98-1.16 BiocVersion_3.20.0 - [73] munsell_0.5.1 scales_1.3.0 xtable_1.8-4 - [76] glue_1.8.0 tools_4.4.1 locfit_1.5-9.10 - [79] rhdf5_2.50.1 grid_4.4.1 AnnotationDbi_1.68.0 - [82] colorspace_2.1-1 GenomeInfoDbData_1.2.13 raster_3.6-30 - [85] beeswarm_0.4.0 HDF5Array_1.34.0 vipor_0.4.7 - [88] cli_3.6.3 rappdirs_0.3.3 fansi_1.0.6 - [91] S4Arrays_1.6.0 viridisLite_0.4.2 svglite_2.1.3 - [94] dplyr_1.1.4 gtable_0.3.6 digest_0.6.37 - [97] SparseArray_1.6.0 rjson_0.2.23 htmlwidgets_1.6.4 -[100] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4 -[103] httr_1.4.7 mime_0.12 bit64_4.5.2 + [52] tidyselect_1.2.1 yaml_2.3.10 rstudioapi_0.17.1 + [55] abind_1.4-8 viridis_0.6.5 codetools_0.2-20 + [58] curl_6.0.1 lattice_0.22-6 tibble_3.2.1 + [61] KEGGREST_1.46.0 shiny_1.10.0 withr_3.0.2 + [64] evaluate_1.0.1 polyclip_1.10-7 Biostrings_2.74.0 + [67] filelock_1.0.3 BiocManager_1.30.25 pillar_1.9.0 + [70] generics_0.1.3 sp_2.1-4 RCurl_1.98-1.16 + [73] BiocVersion_3.20.0 munsell_0.5.1 scales_1.3.0 + [76] xtable_1.8-4 glue_1.8.0 tools_4.4.1 + [79] locfit_1.5-9.10 rhdf5_2.50.1 grid_4.4.1 + [82] AnnotationDbi_1.68.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13 + [85] raster_3.6-30 beeswarm_0.4.0 HDF5Array_1.34.0 + [88] vipor_0.4.7 cli_3.6.3 rappdirs_0.3.3 + [91] fansi_1.0.6 S4Arrays_1.6.0 viridisLite_0.4.2 + [94] svglite_2.1.3 dplyr_1.1.4 gtable_0.3.6 + [97] digest_0.6.37 SparseArray_1.6.0 rjson_0.2.23 +[100] htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.8.1 +[103] lifecycle_1.0.4 httr_1.4.7 mime_0.12 +[106] bit64_4.5.2 diff --git a/02-quality_control.html b/02-quality_control.html index 33884fb..ffbf5e5 100644 --- a/02-quality_control.html +++ b/02-quality_control.html @@ -411,7 +411,7 @@

What we’re looking for

    -
  1. Do the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If these
  2. +
  3. Do the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If we see 3 or more peaks where we don’t expect, this might be an indicator that further normalization is required.
  4. Are our CD3+ and CD3- peaks consistent across our images? We want to make sure that our density plots for CD3 are largely the same across images so that a CD3+ cell in 1 image is equivalent to a CD3+ cell in another image.
diff --git a/03a-cell_annotation.html b/03a-cell_annotation.html index 07b8dd5..2d140aa 100644 --- a/03a-cell_annotation.html +++ b/03a-cell_annotation.html @@ -445,7 +445,11 @@

-

How do I identify imperfect clustering? 1. Do our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g. CD4 2. If we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalization issue.

+

How do I identify imperfect clustering?

+
    +
  1. Do our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g. CD4
  2. +
  3. If we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalization issue.
  4. +
@@ -458,7 +462,15 @@

-

Imperfect clustering can stem from many issues: the 3 most common are outlined below: 1. Imperfect segmentation - excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimizing the upstream segmentation algorithm. 2. Imperfect normalization - excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect. 3. Imperfect clustering - choosing a k that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an ideal k based on an elbow plot described below.

+

Imperfect clustering can stem from many issues: the 3 most common are outlined below:

+
    +
  1. +Imperfect segmentation - excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimizing the upstream segmentation algorithm.
  2. +
  3. +Imperfect normalization - excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect.
  4. +
  5. +Imperfect clustering - choosing a k that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an ideal k based on an elbow plot described below.
  6. +

diff --git a/search.json b/search.json index 6cb42c0..49f5bc3 100644 --- a/search.json +++ b/search.json @@ -202,7 +202,7 @@ "href": "01-processing.html", "title": "\n4  Cell segmentation and pre-processing\n", "section": "", - "text": "4.1 Reading in data with cytomapper\nThe typical first step in analyzing spatial data is cell segmentation, which involves identifying and isolating individual cells in each image. This step is essential for assigning cell type labels and evaluating cellular relationships within the tissue.\nIn this section, we will outline the process of reading and pre-processing images obtained from various imaging technologies to prepare them for downstream analysis. Additionally, we will demonstrate how our simpleSeg package facilitates efficient and straightforward cell segmentation, as well as how to evaluate the quality of the segmentation results.\nWe recommend setting the number of cores to enable running code in parallel. Please choose a number that is appropriate for your resources. A minimum of 2 cores is suggested since running this workflow can be computationally intensive.\nWe will be using the Ferguson 2022 dataset to demonstrate how to perform pre-processing and cell segmentation. This dataset can be accessed through the SpatialDatasets package. The loadImages function form the cytomapper package can be used to load all the TIFF images into a CytoImageList object and store the images as an h5 file on-disk in a temporary directory using the h5FilesPath = HDF5Array::getHDF5DumpDir() parameter.\nWe will also assign the metadata columns of the CytoImageList object using the mcols function.\npathToZip <- SpatialDatasets::Ferguson_Images()\n\nsee ?SpatialDatasets and browseVignettes('SpatialDatasets') for documentation\n\n\nloading from cache\n\npathToImages <- \"data/processing/images\"\nunzip(pathToZip, exdir = \"data/processing/images\")\n\n# store images in a CytoImageList on_disk as h5 files to save memory\nimages <- cytomapper::loadImages(\n pathToImages,\n single_channel = TRUE,\n on_disk = TRUE,\n h5FilesPath = HDF5Array::getHDF5DumpDir(),\n BPPARAM = BPPARAM\n)\n\n# assign metadata columns\nmcols(images) <- S4Vectors::DataFrame(imageID = names(images))\nTime for this code chunk to run with 40 cores: 73.99 seconds\nWhen reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the CytoImageList object using the channelNames function.\nchannelNames(images) <- channelNames(images) |>\n # remove preceding letters\n sub(pattern = \".*_\", replacement = \"\", x = _) |> \n # remove the .ome\n sub(pattern = \".ome\", replacement = \"\", x = _)\nSimilarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.\n# cleaning image names to obtain image IDs\nsplit_names <- function(x) {\n sapply(strsplit(x, \"_\"), `[`, 3)\n}\n\nnames(images) <- names(images) |> split_names()\n\nmcols(images) <- S4Vectors::DataFrame(imageID = names(images))", + "text": "4.1 Reading in data with cytomapper\nThe typical first step in analyzing spatial data is cell segmentation, which involves identifying and isolating individual cells in each image. This step is essential for assigning cell type labels and evaluating cellular relationships within the tissue.\nIn this section, we will outline the process of reading and pre-processing images obtained from various imaging technologies to prepare them for downstream analysis. Additionally, we will demonstrate how our simpleSeg package facilitates efficient and straightforward cell segmentation, as well as how to evaluate the quality of the segmentation results.\nWe recommend setting the number of cores to enable running code in parallel. Please choose a number that is appropriate for your resources. A minimum of 2 cores is suggested since running this workflow can be computationally intensive.\nWe will be using the Ferguson 2022 dataset to demonstrate how to perform pre-processing and cell segmentation. This dataset can be accessed through the SpatialDatasets package. The loadImages function form the cytomapper package can be used to load all the TIFF images into a CytoImageList object and store the images as an h5 file on-disk in a temporary directory using the h5FilesPath = HDF5Array::getHDF5DumpDir() parameter.\nWe will also assign the metadata columns of the CytoImageList object using the mcols function.\npathToZip <- SpatialDatasets::Ferguson_Images()\n\nsee ?SpatialDatasets and browseVignettes('SpatialDatasets') for documentation\n\n\nloading from cache\n\npathToImages <- \"data/processing/images\"\nunzip(pathToZip, exdir = \"data/processing/images\")\n\n# store images in a CytoImageList on_disk as h5 files to save memory\nimages <- cytomapper::loadImages(\n pathToImages,\n single_channel = TRUE,\n on_disk = TRUE,\n h5FilesPath = HDF5Array::getHDF5DumpDir(),\n BPPARAM = BPPARAM\n)\n\n# assign metadata columns\nmcols(images) <- S4Vectors::DataFrame(imageID = names(images))\nTime for this code chunk to run with 40 cores: 69.52 seconds\nWhen reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the CytoImageList object using the channelNames function.\nchannelNames(images) <- channelNames(images) |>\n # remove preceding letters\n sub(pattern = \".*_\", replacement = \"\", x = _) |> \n # remove the .ome\n sub(pattern = \".ome\", replacement = \"\", x = _)\nSimilarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.\n# cleaning image names to obtain image IDs\nsplit_names <- function(x) {\n sapply(strsplit(x, \"_\"), `[`, 3)\n}\n\nnames(images) <- names(images) |> split_names()\n\nmcols(images) <- S4Vectors::DataFrame(imageID = names(images))", "crumbs": [ "Processing", "4  Cell segmentation and pre-processing" @@ -213,7 +213,7 @@ "href": "01-processing.html#cell-segmentation-with-simpleseg", "title": "\n4  Cell segmentation and pre-processing\n", "section": "\n4.2 Cell segmentation with simpleSeg", - "text": "4.2 Cell segmentation with simpleSeg\nFor the sake of simplicity and efficiency, we will be subsetting our images down to 10 images.\n\nimages <- images[1:10]\n\nNext, we can perform our segmentation. The simpleSeg package provides functionality for user friendly, watershed based segmentation on multiplexed cellular images based on the intensity of user specified protein marker channels. The main function, simpleSeg, can be used to perform a simple cell segmentation process that traces out the nuclei using a specified channel.\nIn the particular example below, we have asked simpleSeg to do the following:\n\n\nnucleus = c(\"HH3\"): trace out the nuclei signal in the images using the HH3 channel.\n\npca = TRUE: segment out the nuclei mask using a principal component analysis of all channels and using the principal components most aligned with the nuclei channel, in this case, HH3.\n\ncellBody = \"dilate\": use a dilation strategy of segmentation, expanding out from the nucleus by a specified discSize. In this case, discSize = 3, which means simpleSeg dilates out from the nucleus by 3 pixels.\n\nsizeSelection = 20: ensure that only cells with a size greater than 20 pixels will be used.\n\ntransform = \"sqrt\": perform square root transformation on each of the channels prior to segmentation.\n\ntissue = c(\"panCK\", \"CD45\", \"HH3\"): use the specified tissue mask to filter out all background noise outside the tissue mask. This allows us to ignore background noise which happens outside of the tumour core.\n\nThere are many other parameters that can be specified in simpleSeg (smooth, watershed, tolerance, and ext), and we encourage the user to select the best parameters which suit their biological context.\n\nmasks <- simpleSeg(images,\n nucleus = c(\"HH3\"),\n pca = TRUE,\n cellBody = \"dilate\",\n discSize = 3,\n sizeSelection = 20,\n transform = \"sqrt\",\n tissue = c(\"panCK\", \"CD45\", \"HH3\"),\n cores = nCores\n )\nTime for this code chunk to run with 40 cores: 44.22 seconds\n\n\n4.2.1 Visualise separation\nWe can examine the performance of the cell segmentation using the display and colorLabels functions from the EBImage package. If used in an interactive session, display allows you to zoom in and out of the image.\n\nEBImage::display(colorLabels(masks[[1]]))\n\n\n\n\n\n\n\n\n4.2.2 Visualise outlines\nThe plotPixels function from cytomapper makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed.\n\nplotPixels(image = images[\"F3\"], \n mask = masks[\"F3\"],\n img_id = \"imageID\", \n colour_by = c(\"HH3\"), \n display = \"single\",\n colour = list(HH3 = c(\"black\",\"blue\")),\n legend = NULL,\n bcg = list(\n HH3 = c(1, 1, 2)\n ))\n\n\n\n\n\n\n\nHere, we can see that the segmentation appears to be performing reasonably.\nIf you see over or under-segmentation of your images, discSize is a key parameter in the simpleSeg function for optimising the size of the dilation disc after segmenting out the nuclei.\nWe can also visualise multiple markers at once instead of just the HH3 marker to see how the segmentation mask performs.\n\nplotPixels(image = images[\"F3\"], \n mask = masks[\"F3\"],\n img_id = \"imageID\", \n colour_by = c(\"HH3\", \"CD31\", \"FX111A\"), \n display = \"single\",\n colour = list(HH3 = c(\"black\",\"blue\"),\n CD31 = c(\"black\", \"red\"),\n FX111A = c(\"black\", \"green\")),\n legend = NULL,\n bcg = list(\n HH3 = c(1, 1, 2),\n CD31 = c(0, 1, 2),\n FX111A = c(0, 1, 1.5)\n ))\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhat to look for and change to obtain an ideal segmentation\n\n\n\nCritical thinking\n\nDoes the segmentation capture the full nucleus? If not, perhaps you need to try a different transform to improve the thresholding of the nuclei marker. You could also try using pca = TRUE which will borrow information across the markers to help find the nuclei.\nHow much of the cell body is the segmentation missing? Try increasing the dilation around the nucleus by setting discSize = 7.\nAre the segmentations capturing neighbouring cells? Try decreasing the dilation to limit lateral spillover of marker signal by setting discSize = 2.\n\n\n\nHere, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This suggests that we might need to increase the discSize or other parameters of simpleSeg.\nIn particular, the cellBody and watershed parameters can strongly influence the way cells are segmented using simpleSeg. We have provided further details on how the user may specify cell body identification and watershedding in the tables below.\nAs simpleSeg is a nuclei-based dilation method, it suffers from tissues where cells might be multi-nucleated, or where cells have non-circular or elliptical morphologies. For tissues where you might expect these cells, it may be preferable to choose a different segmentation method.\n\n4.2.2.1 cellBody Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“distance”\nPerforms watershedding on a distance map of the thresholded nuclei signal. With a pixels distance being defined as the distance from the closest background signal.\n\n\n\n“intensity”\nPerforms watershedding using the intensity of the nuclei marker.\n\n\n\n“combine”\nCombines the previous two methods by multiplying the distance map by the nuclei marker intensity.\n\n\n\n\n4.2.2.2 watershed Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“dilation”\nDilates the nuclei by an amount defined by the user. The size of the dilatation in pixels may be specified with the discDize argument.\n\n\n\n“discModel”\nUses all the markers to predict the presence of dilated ‘discs’ around the nuclei. The model therefore learns which markers are typically present in the cell cytoplasm and generates a mask based on this.\n\n\n\n“marker”\nThe user may specify one or multiple dedicated cytoplasm markers to predict the cytoplasm. This can be done using cellBody = \"marker name\"/\"index\"\n\n\n\n“None”\nThe nuclei mask is returned directly.", + "text": "4.2 Cell segmentation with simpleSeg\nFor the sake of simplicity and efficiency, we will be subsetting our images down to 10 images.\n\nimages <- images[1:10]\n\nNext, we can perform our segmentation. The simpleSeg package provides functionality for user friendly, watershed based segmentation on multiplexed cellular images based on the intensity of user specified protein marker channels. The main function, simpleSeg, can be used to perform a simple cell segmentation process that traces out the nuclei using a specified channel.\nIn the particular example below, we have asked simpleSeg to do the following:\n\n\nnucleus = c(\"HH3\"): trace out the nuclei signal in the images using the HH3 channel.\n\npca = TRUE: segment out the nuclei mask using a principal component analysis of all channels and using the principal components most aligned with the nuclei channel, in this case, HH3.\n\ncellBody = \"dilate\": use a dilation strategy of segmentation, expanding out from the nucleus by a specified discSize. In this case, discSize = 3, which means simpleSeg dilates out from the nucleus by 3 pixels.\n\nsizeSelection = 20: ensure that only cells with a size greater than 20 pixels will be used.\n\ntransform = \"sqrt\": perform square root transformation on each of the channels prior to segmentation.\n\ntissue = c(\"panCK\", \"CD45\", \"HH3\"): use the specified tissue mask to filter out all background noise outside the tissue mask. This allows us to ignore background noise which happens outside of the tumour core.\n\nThere are many other parameters that can be specified in simpleSeg (smooth, watershed, tolerance, and ext), and we encourage the user to select the best parameters which suit their biological context.\n\nmasks <- simpleSeg(images,\n nucleus = c(\"HH3\"),\n pca = TRUE,\n cellBody = \"dilate\",\n discSize = 3,\n sizeSelection = 20,\n transform = \"sqrt\",\n tissue = c(\"panCK\", \"CD45\", \"HH3\"),\n cores = nCores\n )\nTime for this code chunk to run with 40 cores: 44.47 seconds\n\n\n4.2.1 Visualise separation\nWe can examine the performance of the cell segmentation using the display and colorLabels functions from the EBImage package. If used in an interactive session, display allows you to zoom in and out of the image.\n\nEBImage::display(colorLabels(masks[[1]]))\n\n\n\n\n\n\n\n\n4.2.2 Visualise outlines\nThe plotPixels function from cytomapper makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed.\n\nplotPixels(image = images[\"F3\"], \n mask = masks[\"F3\"],\n img_id = \"imageID\", \n colour_by = c(\"HH3\"), \n display = \"single\",\n colour = list(HH3 = c(\"black\",\"blue\")),\n legend = NULL,\n bcg = list(\n HH3 = c(1, 1, 2)\n ))\n\n\n\n\n\n\n\nHere, we can see that the segmentation appears to be performing reasonably.\nIf you see over or under-segmentation of your images, discSize is a key parameter in the simpleSeg function for optimising the size of the dilation disc after segmenting out the nuclei.\nWe can also visualise multiple markers at once instead of just the HH3 marker to see how the segmentation mask performs.\n\nplotPixels(image = images[\"F3\"], \n mask = masks[\"F3\"],\n img_id = \"imageID\", \n colour_by = c(\"HH3\", \"CD31\", \"FX111A\"), \n display = \"single\",\n colour = list(HH3 = c(\"black\",\"blue\"),\n CD31 = c(\"black\", \"red\"),\n FX111A = c(\"black\", \"green\")),\n legend = NULL,\n bcg = list(\n HH3 = c(1, 1, 2),\n CD31 = c(0, 1, 2),\n FX111A = c(0, 1, 1.5)\n ))\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhat to look for and change to obtain an ideal segmentation\n\n\n\nCritical thinking\n\nDoes the segmentation capture the full nucleus? If not, perhaps you need to try a different transform to improve the thresholding of the nuclei marker. You could also try using pca = TRUE which will borrow information across the markers to help find the nuclei.\nHow much of the cell body is the segmentation missing? Try increasing the dilation around the nucleus by setting discSize = 7.\nAre the segmentations capturing neighbouring cells? Try decreasing the dilation to limit lateral spillover of marker signal by setting discSize = 2.\n\n\n\nHere, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This suggests that we might need to increase the discSize or other parameters of simpleSeg.\nIn particular, the cellBody and watershed parameters can strongly influence the way cells are segmented using simpleSeg. We have provided further details on how the user may specify cell body identification and watershedding in the tables below.\nAs simpleSeg is a nuclei-based dilation method, it suffers from tissues where cells might be multi-nucleated, or where cells have non-circular or elliptical morphologies. For tissues where you might expect these cells, it may be preferable to choose a different segmentation method.\n\n4.2.2.1 cellBody Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“distance”\nPerforms watershedding on a distance map of the thresholded nuclei signal. With a pixels distance being defined as the distance from the closest background signal.\n\n\n\n“intensity”\nPerforms watershedding using the intensity of the nuclei marker.\n\n\n\n“combine”\nCombines the previous two methods by multiplying the distance map by the nuclei marker intensity.\n\n\n\n\n4.2.2.2 watershed Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“dilation”\nDilates the nuclei by an amount defined by the user. The size of the dilatation in pixels may be specified with the discDize argument.\n\n\n\n“discModel”\nUses all the markers to predict the presence of dilated ‘discs’ around the nuclei. The model therefore learns which markers are typically present in the cell cytoplasm and generates a mask based on this.\n\n\n\n“marker”\nThe user may specify one or multiple dedicated cytoplasm markers to predict the cytoplasm. This can be done using cellBody = \"marker name\"/\"index\"\n\n\n\n“None”\nThe nuclei mask is returned directly.", "crumbs": [ "Processing", "4  Cell segmentation and pre-processing" @@ -224,7 +224,7 @@ "href": "01-processing.html#summarise-cell-features", "title": "\n4  Cell segmentation and pre-processing\n", "section": "\n4.3 Summarise cell features", - "text": "4.3 Summarise cell features\nWe can use the measureObjects from cytomapper to calculate the average intensity of each channel within each cell, as well as other morphological features. By default, measureObjects will return a SingleCellExperiment object, where the channel intensities are stored in the counts assay and the spatial location of each cell is stored in colData in the m.cx and m.cy columns.\nHowever, you can also specify measureObjects to return a SpatialExperiment object by specifying return_as = \"spe\". In a SpatialExperiment object, the spatial location of each cell is stored in the spatialCoords slot as m.cx and m.cy, which simplifies plotting. In this demonstration, we will return a SpatialExperiment object.\n\n# summarise the expression of each marker in each cell\ncells <- cytomapper::measureObjects(masks,\n images,\n img_id = \"imageID\",\n return_as = \"spe\",\n BPPARAM = BPPARAM\n )\n\nspatialCoordsNames(cells) <- c(\"x\", \"y\")\n\ncells\n\nclass: SpatialExperiment \ndim: 36 37941 \nmetadata(0):\nassays(1): counts\nrownames(36): panCK CD20 ... DNA1 DNA2\nrowData names(0):\ncolnames: NULL\ncolData names(8): imageID object_id ... sample_id objectNum\nreducedDimNames(0):\nmainExpName: NULL\naltExpNames(0):\nspatialCoords names(2) : x y\nimgData names(1): sample_id\n\nTime for this code chunk to run with 40 cores: 73.95 seconds\n\nSo far, we have obtained our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a SpatialExperiment object. We can now move on to quality control, data transformation, and normalisation to address batch effects.", + "text": "4.3 Summarise cell features\nWe can use the measureObjects from cytomapper to calculate the average intensity of each channel within each cell, as well as other morphological features. By default, measureObjects will return a SingleCellExperiment object, where the channel intensities are stored in the counts assay and the spatial location of each cell is stored in colData in the m.cx and m.cy columns.\nHowever, you can also specify measureObjects to return a SpatialExperiment object by specifying return_as = \"spe\". In a SpatialExperiment object, the spatial location of each cell is stored in the spatialCoords slot as m.cx and m.cy, which simplifies plotting. In this demonstration, we will return a SpatialExperiment object.\n\n# summarise the expression of each marker in each cell\ncells <- cytomapper::measureObjects(masks,\n images,\n img_id = \"imageID\",\n return_as = \"spe\",\n BPPARAM = BPPARAM\n )\n\nspatialCoordsNames(cells) <- c(\"x\", \"y\")\n\ncells\n\nclass: SpatialExperiment \ndim: 36 37941 \nmetadata(0):\nassays(1): counts\nrownames(36): panCK CD20 ... DNA1 DNA2\nrowData names(0):\ncolnames: NULL\ncolData names(8): imageID object_id ... sample_id objectNum\nreducedDimNames(0):\nmainExpName: NULL\naltExpNames(0):\nspatialCoords names(2) : x y\nimgData names(1): sample_id\n\nTime for this code chunk to run with 40 cores: 74.32 seconds\n\nSo far, we have obtained our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a SpatialExperiment object. We can now move on to quality control, data transformation, and normalisation to address batch effects.", "crumbs": [ "Processing", "4  Cell segmentation and pre-processing" @@ -235,7 +235,7 @@ "href": "01-processing.html#sessioninfo", "title": "\n4  Cell segmentation and pre-processing\n", "section": "\n4.4 sessionInfo", - "text": "4.4 sessionInfo\n\nsessionInfo()\n\nR version 4.4.1 (2024-06-14)\nPlatform: x86_64-pc-linux-gnu\nRunning under: Debian GNU/Linux 12 (bookworm)\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0\n\nlocale:\n [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 \n [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 \n [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C \n[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Australia/Sydney\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats4 stats graphics grDevices utils datasets methods \n[8] base \n\nother attached packages:\n [1] SpatialDatasets_1.4.0 SpatialExperiment_1.16.0 \n [3] ExperimentHub_2.14.0 AnnotationHub_3.14.0 \n [5] BiocFileCache_2.14.0 dbplyr_2.5.0 \n [7] simpleSeg_1.8.0 ggplot2_3.5.1 \n [9] cytomapper_1.18.0 SingleCellExperiment_1.28.1\n[11] SummarizedExperiment_1.36.0 Biobase_2.66.0 \n[13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 \n[15] IRanges_2.40.1 S4Vectors_0.44.0 \n[17] BiocGenerics_0.52.0 MatrixGenerics_1.18.0 \n[19] matrixStats_1.4.1 EBImage_4.48.0 \n\nloaded via a namespace (and not attached):\n [1] DBI_1.2.3 bitops_1.0-9 deldir_2.0-4 \n [4] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3 \n [7] svgPanZoom_0.3.4 shinydashboard_0.7.2 RSQLite_2.3.9 \n [10] compiler_4.4.1 spatstat.geom_3.3-4 png_0.1-8 \n [13] systemfonts_1.1.0 fftwtools_0.9-11 vctrs_0.6.5 \n [16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0 \n [19] magick_2.8.5 XVector_0.46.0 utf8_1.2.4 \n [22] promises_1.3.2 rmarkdown_2.29 UCSC.utils_1.2.0 \n [25] ggbeeswarm_0.7.2 purrr_1.0.2 bit_4.5.0.1 \n [28] xfun_0.49 cachem_1.1.0 zlibbioc_1.52.0 \n [31] jsonlite_1.8.9 blob_1.2.4 later_1.4.1 \n [34] rhdf5filters_1.18.0 DelayedArray_0.32.0 spatstat.utils_3.1-1 \n [37] Rhdf5lib_1.28.0 BiocParallel_1.40.0 jpeg_0.1-10 \n [40] tiff_0.1-12 terra_1.8-5 parallel_4.4.1 \n [43] R6_2.5.1 RColorBrewer_1.1-3 spatstat.data_3.1-4 \n [46] spatstat.univar_3.1-1 Rcpp_1.0.13-1 knitr_1.49 \n [49] httpuv_1.6.15 Matrix_1.7-1 nnls_1.6 \n [52] tidyselect_1.2.1 yaml_2.3.10 abind_1.4-8 \n [55] viridis_0.6.5 codetools_0.2-20 curl_6.0.1 \n [58] lattice_0.22-6 tibble_3.2.1 KEGGREST_1.46.0 \n [61] shiny_1.10.0 withr_3.0.2 evaluate_1.0.1 \n [64] polyclip_1.10-7 Biostrings_2.74.0 filelock_1.0.3 \n [67] BiocManager_1.30.25 pillar_1.9.0 generics_0.1.3 \n [70] sp_2.1-4 RCurl_1.98-1.16 BiocVersion_3.20.0 \n [73] munsell_0.5.1 scales_1.3.0 xtable_1.8-4 \n [76] glue_1.8.0 tools_4.4.1 locfit_1.5-9.10 \n [79] rhdf5_2.50.1 grid_4.4.1 AnnotationDbi_1.68.0 \n [82] colorspace_2.1-1 GenomeInfoDbData_1.2.13 raster_3.6-30 \n [85] beeswarm_0.4.0 HDF5Array_1.34.0 vipor_0.4.7 \n [88] cli_3.6.3 rappdirs_0.3.3 fansi_1.0.6 \n [91] S4Arrays_1.6.0 viridisLite_0.4.2 svglite_2.1.3 \n [94] dplyr_1.1.4 gtable_0.3.6 digest_0.6.37 \n [97] SparseArray_1.6.0 rjson_0.2.23 htmlwidgets_1.6.4 \n[100] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4 \n[103] httr_1.4.7 mime_0.12 bit64_4.5.2", + "text": "4.4 sessionInfo\n\nsessionInfo()\n\nR version 4.4.1 (2024-06-14)\nPlatform: x86_64-pc-linux-gnu\nRunning under: Debian GNU/Linux 12 (bookworm)\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0\n\nlocale:\n [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 \n [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 \n [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C \n[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Australia/Sydney\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats4 stats graphics grDevices utils datasets methods \n[8] base \n\nother attached packages:\n [1] SpatialDatasets_1.4.0 SpatialExperiment_1.16.0 \n [3] ExperimentHub_2.14.0 AnnotationHub_3.14.0 \n [5] BiocFileCache_2.14.0 dbplyr_2.5.0 \n [7] simpleSeg_1.8.0 ggplot2_3.5.1 \n [9] cytomapper_1.18.0 SingleCellExperiment_1.28.1\n[11] SummarizedExperiment_1.36.0 Biobase_2.66.0 \n[13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 \n[15] IRanges_2.40.1 S4Vectors_0.44.0 \n[17] BiocGenerics_0.52.0 MatrixGenerics_1.18.0 \n[19] matrixStats_1.4.1 EBImage_4.48.0 \n\nloaded via a namespace (and not attached):\n [1] DBI_1.2.3 bitops_1.0-9 deldir_2.0-4 \n [4] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3 \n [7] svgPanZoom_0.3.4 shinydashboard_0.7.2 RSQLite_2.3.9 \n [10] compiler_4.4.1 spatstat.geom_3.3-4 png_0.1-8 \n [13] systemfonts_1.1.0 fftwtools_0.9-11 vctrs_0.6.5 \n [16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0 \n [19] magick_2.8.5 XVector_0.46.0 utf8_1.2.4 \n [22] promises_1.3.2 rmarkdown_2.29 UCSC.utils_1.2.0 \n [25] ggbeeswarm_0.7.2 purrr_1.0.2 bit_4.5.0.1 \n [28] xfun_0.49 cachem_1.1.0 zlibbioc_1.52.0 \n [31] jsonlite_1.8.9 blob_1.2.4 later_1.4.1 \n [34] rhdf5filters_1.18.0 DelayedArray_0.32.0 spatstat.utils_3.1-1 \n [37] Rhdf5lib_1.28.0 BiocParallel_1.40.0 jpeg_0.1-10 \n [40] tiff_0.1-12 terra_1.8-5 parallel_4.4.1 \n [43] R6_2.5.1 RColorBrewer_1.1-3 spatstat.data_3.1-4 \n [46] spatstat.univar_3.1-1 Rcpp_1.0.13-1 knitr_1.49 \n [49] httpuv_1.6.15 Matrix_1.7-1 nnls_1.6 \n [52] tidyselect_1.2.1 yaml_2.3.10 rstudioapi_0.17.1 \n [55] abind_1.4-8 viridis_0.6.5 codetools_0.2-20 \n [58] curl_6.0.1 lattice_0.22-6 tibble_3.2.1 \n [61] KEGGREST_1.46.0 shiny_1.10.0 withr_3.0.2 \n [64] evaluate_1.0.1 polyclip_1.10-7 Biostrings_2.74.0 \n [67] filelock_1.0.3 BiocManager_1.30.25 pillar_1.9.0 \n [70] generics_0.1.3 sp_2.1-4 RCurl_1.98-1.16 \n [73] BiocVersion_3.20.0 munsell_0.5.1 scales_1.3.0 \n [76] xtable_1.8-4 glue_1.8.0 tools_4.4.1 \n [79] locfit_1.5-9.10 rhdf5_2.50.1 grid_4.4.1 \n [82] AnnotationDbi_1.68.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13\n [85] raster_3.6-30 beeswarm_0.4.0 HDF5Array_1.34.0 \n [88] vipor_0.4.7 cli_3.6.3 rappdirs_0.3.3 \n [91] fansi_1.0.6 S4Arrays_1.6.0 viridisLite_0.4.2 \n [94] svglite_2.1.3 dplyr_1.1.4 gtable_0.3.6 \n [97] digest_0.6.37 SparseArray_1.6.0 rjson_0.2.23 \n[100] htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.8.1 \n[103] lifecycle_1.0.4 httr_1.4.7 mime_0.12 \n[106] bit64_4.5.2", "crumbs": [ "Processing", "4  Cell segmentation and pre-processing" @@ -257,7 +257,7 @@ "href": "02-quality_control.html#simpleseg-do-my-images-have-a-batch-effect", "title": "\n5  Quality Control\n", "section": "", - "text": "The intensities of images are often highly skewed, preventing any meaningful downstream analysis.\n\nThe intensities across different images are often different, meaning that what is considered “positive” can be different across images.\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nWhat we’re looking for\n\nDo the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If these\nAre our CD3+ and CD3- peaks consistent across our images? We want to make sure that our density plots for CD3 are largely the same across images so that a CD3+ cell in 1 image is equivalent to a CD3+ cell in another image.\n\n\n\n\n\n\n\n\n\n\ntransformation is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the simpleSeg function.\n\nmethod = c(\"trim99\", \"mean\", PC1\") is an optional argument which specifies the normalisation method(s) to be performed. A comprehensive table of methods is provided below.\n\nassayIn = \"counts\" is a required argument which specifies the name of the assay that contains our intensity data. In our context, this is called counts.\n\n\n\n\n5.1.0.1 method Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“mean”\nDivides the marker cellular marker intensities by their mean.\n\n\n\n“minMax”\nSubtracts the minimum value and scales markers between 0 and 1.\n\n\n\n“trim99”\nSets the highest 1% of values to the value of the 99th percentile.`\n\n\n\n“PC1”\nRemoves the 1st principal component) can be performed with one call of the function, in the order specified by the user.\n\n\n\n\nWe can then plot the same density curve for the CD3 marker using the normalised data.\n\n# plot densities of CD3 for each image\nfergusonSPE |> \n join_features(features = rownames(fergusonSPE), shape = \"wide\", assay = \"norm\") |> \n ggplot(aes(x = CD3, colour = imageID)) + \n geom_density() + \n theme(legend.position = \"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nQuestions revisited\n\nDo the CD3+ and CD3- peaks clearly separate out in the density plot? If not, we can try optimizing the transform if the distribution looks heavily skewed.\nAre our CD3+ and CD3- peaks consistent across our images? We can try to be more stringent in our normalization, such as by removing the 1st PC (method = c(..., \"PC1\")) or scaling the values for all images between 0 and 1 (method = c(..., \"minMax\")).\n\n\n\nHere, we can see that the normalised data appears more bimodal, and we can clearly observe a CD3+ peak at 5.00, and a CD3- peak at around 3.00. Image-level batch effects also appear to have been mitigated.\nWe can also visualise the effect of normalisation on the UMAP, which shows that the images now overlap with each other to a much greater extent.\n\nset.seed(51773)\n\n# perform dimension reduction using UMAP\nfergusonSPE <- scater::runUMAP(\n fergusonSPE,\n subset_row = ct_markers,\n exprs_values = \"norm\",\n name = \"normUMAP\"\n)\n\nsomeImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]\n\n# UMAP by imageID\nscater::plotReducedDim(\n fergusonSPE[, fergusonSPE$imageID %in% someImages],\n dimred = \"normUMAP\",\n colour_by = \"imageID\"\n)", + "text": "The intensities of images are often highly skewed, preventing any meaningful downstream analysis.\n\nThe intensities across different images are often different, meaning that what is considered “positive” can be different across images.\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nWhat we’re looking for\n\nDo the CD3+ and CD3- peaks clearly separate out in the density plot? To ensure that downstream clustering goes smoothly, we want our cell type specific markers to show 2 distinct peaks representing our CD3+ and CD3- cells. If we see 3 or more peaks where we don’t expect, this might be an indicator that further normalization is required.\nAre our CD3+ and CD3- peaks consistent across our images? We want to make sure that our density plots for CD3 are largely the same across images so that a CD3+ cell in 1 image is equivalent to a CD3+ cell in another image.\n\n\n\n\n\n\n\n\n\n\ntransformation is an optional argument which specifies the function to be applied to the data. We do not apply an arcsinh transformation here, as we already apply a square root transform in the simpleSeg function.\n\nmethod = c(\"trim99\", \"mean\", PC1\") is an optional argument which specifies the normalisation method(s) to be performed. A comprehensive table of methods is provided below.\n\nassayIn = \"counts\" is a required argument which specifies the name of the assay that contains our intensity data. In our context, this is called counts.\n\n\n\n\n5.1.0.1 method Parameters\n\n\n\n\n\n\n\nMethod\nDescription\n\n\n\n\n“mean”\nDivides the marker cellular marker intensities by their mean.\n\n\n\n“minMax”\nSubtracts the minimum value and scales markers between 0 and 1.\n\n\n\n“trim99”\nSets the highest 1% of values to the value of the 99th percentile.`\n\n\n\n“PC1”\nRemoves the 1st principal component) can be performed with one call of the function, in the order specified by the user.\n\n\n\n\nWe can then plot the same density curve for the CD3 marker using the normalised data.\n\n# plot densities of CD3 for each image\nfergusonSPE |> \n join_features(features = rownames(fergusonSPE), shape = \"wide\", assay = \"norm\") |> \n ggplot(aes(x = CD3, colour = imageID)) + \n geom_density() + \n theme(legend.position = \"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nQuestions revisited\n\nDo the CD3+ and CD3- peaks clearly separate out in the density plot? If not, we can try optimizing the transform if the distribution looks heavily skewed.\nAre our CD3+ and CD3- peaks consistent across our images? We can try to be more stringent in our normalization, such as by removing the 1st PC (method = c(..., \"PC1\")) or scaling the values for all images between 0 and 1 (method = c(..., \"minMax\")).\n\n\n\nHere, we can see that the normalised data appears more bimodal, and we can clearly observe a CD3+ peak at 5.00, and a CD3- peak at around 3.00. Image-level batch effects also appear to have been mitigated.\nWe can also visualise the effect of normalisation on the UMAP, which shows that the images now overlap with each other to a much greater extent.\n\nset.seed(51773)\n\n# perform dimension reduction using UMAP\nfergusonSPE <- scater::runUMAP(\n fergusonSPE,\n subset_row = ct_markers,\n exprs_values = \"norm\",\n name = \"normUMAP\"\n)\n\nsomeImages <- unique(fergusonSPE$imageID)[c(1, 5, 10, 20, 30, 40)]\n\n# UMAP by imageID\nscater::plotReducedDim(\n fergusonSPE[, fergusonSPE$imageID %in% someImages],\n dimred = \"normUMAP\",\n colour_by = \"imageID\"\n)", "crumbs": [ "Processing", "5  Quality Control" @@ -290,7 +290,7 @@ "href": "03a-cell_annotation.html#clustering-with-fusesom", "title": "\n6  Unsupervised clustering for cell annotation\n", "section": "\n6.2 Clustering with FuseSOM", - "text": "6.2 Clustering with FuseSOM\nFuseSOM is an unsupervised clustering tool for highly multiplexed in situ imaging cytometry assays. It combines a Self Organiszing Map architecture and a MultiView integration of correlation-based metrics for robustness and high accuracy. It has been streamlined to accept multiple data structures including SingleCellExperiment objects, SpatialExperiment objects, and DataFrames.\n\n6.2.1 FuseSOM Matrix Input\nTo demonstrate the functionality of FuseSOM, we will use the Risom 2022 dataset, which profiles the spatial landscape of ductal carcinoma in situ (DCIS). We will be using the markers used in the original study to perform clustering.\n\n# load in the data\ndata(\"risom_dat\")\n\n# define the markers of interest\nrisomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',\n 'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',\n 'CD36','CD11c','HLADRDPDQ','P63','CD44')\n\n# we will be using the manual_gating_phenotype as the true cell type to gauge \n# performance\nnames(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'\n\nNow that we have loaded the data and defined the markers of interest, we can run the FuseSOM algorithm using the runFuseSOM() function. We specify the number of clusters to be 23 based on prior domain knowledge. The output contains the cluster labels as well as the Self Organizing Map model.\n\nrisomRes <- runFuseSOM(data = risom_dat, markers = risomMarkers, \n numClusters = 23)\n\nLets look at the distribution of the clusters.\n\n# get the distribution of the clusters\ntable(risomRes$clusters)/sum(table(risomRes$clusters))\n\n\n cluster_1 cluster_10 cluster_11 cluster_12 cluster_13 cluster_14 \n0.323602021 0.035968538 0.005439775 0.021443334 0.061100586 0.026596050 \n cluster_15 cluster_16 cluster_17 cluster_18 cluster_19 cluster_2 \n0.020582156 0.032624297 0.024931106 0.076128143 0.015802618 0.014927087 \n cluster_20 cluster_21 cluster_22 cluster_23 cluster_3 cluster_4 \n0.049962682 0.009185900 0.051771156 0.066913538 0.004923068 0.014108968 \n cluster_5 cluster_6 cluster_7 cluster_8 cluster_9 \n0.040776783 0.064444827 0.020854863 0.010032725 0.007879780 \n\n\nIt appears that 32% of cells have been assigned to cluster_1. Next, lets generate a heatmap of the marker expression for each cluster.\n\nrisomHeat <- FuseSOM::markerHeatmap(data = risom_dat, markers = risomMarkers,\n clusters = risomRes$clusters, clusterMarkers = TRUE)\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommon problems with clustering\n\n\n\nHow do I identify imperfect clustering? 1. Do our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g. CD4 2. If we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalization issue.\n\n\n\n\n\n\n\n\nRemedying imperfect clustering\n\n\n\nImperfect clustering can stem from many issues: the 3 most common are outlined below: 1. Imperfect segmentation - excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimizing the upstream segmentation algorithm. 2. Imperfect normalization - excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect. 3. Imperfect clustering - choosing a k that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an ideal k based on an elbow plot described below.\n\n\n\n6.2.2 Using FuseSOM to estimate the number of clusters\nWhen the number of expected cell typess or clusters is not known beforehand, the estimateNumCluster() function can be used to estimate the number of clusters. Two methods have been developed to calculate the number of clusters:\n\nDiscriminant based method:\n\nA method developed in house based on discriminant based maximum clusterability projection pursuit\n\n\nDistance based methods which includes:\n\nThe Gap Statistic\nThe Jump Statistic\nThe Slope Statistic\nThe Within Cluster Dissimilarity Statistic\nThe Silhouette Statistic\n\n\n\nWe run estimateNumCluster() and specify method = c(\"Discriminant\", \"Distance\") to use both approaches.\n\n# lets estimate the number of clusters using all the methods\n# original clustering has 23 clusters so we will set kseq from 2:25\n# we pass it the SOM model generated in the previous step\nrisomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25, \n method = c(\"Discriminant\", \"Distance\"))\n\nWe can then use this result to determine the best number of clusters for this dataset based on the different metrics.\n\n# what is the best number of clusters determined by the discriminant method?\nrisomKest$Discriminant \n\n[1] 7\n\n\nAccording to the Discriminant method, the optimal number of clusters is 7.\nWe can use the optiPlot() function to generate an elbow plot with the optimal value for the number of clusters for the distance based methods.\n\n# we can plot the results using the optiplot function\npSlope <- optiPlot(risomKest, method = 'slope')\npSlope\n\n\n\n\n\n\npJump <- optiPlot(risomKest, method = 'jump')\npJump\n\n\n\n\n\n\npWcd <- optiPlot(risomKest, method = 'wcd')\npWcd\n\n\n\n\n\n\npGap <- optiPlot(risomKest, method = 'gap')\npGap\n\n\n\n\n\n\npSil <- optiPlot(risomKest, method = 'silhouette')\npSil\n\n\n\n\n\n\n\nFrom the plots, we see that the Jump statistic almost perfectly captures the correct number of clusters. The Gap statistic is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.\n\n6.2.3 FuseSOM with Single Cell Experiment object as input\nThe FuseSOM algorithm is also equipped to take in a SingleCellExperiment object as input. The results of the pipeline will be written to either the metada or the colData fields.\nFirst, we create a SingleCellExperiment object using the Risom 2022 data.\n\nlibrary(SingleCellExperiment)\n\n# create an SCE object using Risom 2022 data\ncolDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]\nsce <- SingleCellExperiment(assays = list(counts = t(risom_dat[, names(risom_dat) != \"CellType\"])),\n colData = colDat)\n\nsce\n\nclass: SingleCellExperiment \ndim: 22 69672 \nmetadata(0):\nassays(1): counts\nrownames(22): CD45 SMA ... P63 CD44\nrowData names(0):\ncolnames: NULL\ncolData names(1): X\nreducedDimNames(0):\nmainExpName: NULL\naltExpNames(0):\n\n\nNext, we pass it to the runFuseSOM() function. Here, we can provide the assay in which the data is stored (counts) and specify the column to store the clusters in using clusterCol = \"clusters\". The Self Organizing Map that is generated will be stored in the metadata field.\n\nrisomRessce <- runFuseSOM(sce, markers = risomMarkers, clusterCol = \"clusters\",\n assay = 'counts', numClusters = 23, verbose = FALSE)\n\ncolnames(colData(risomRessce))\n\n[1] \"X\" \"clusters\"\n\nnames(metadata(risomRessce))\n\n[1] \"SOM\"\n\n\nNotice how the there is now a clusters column in the colData and a SOM field in the metadata.\nIf necessary, you can run runFuseSOM() with a new cluster number and specify a new clusterCol. If clusterCol contains a new name, the new clusters will be stored in the new column. Otherwise, it will overwrite the the current clusters column. Running FuseSOM on the same object will overwrite the SOM field in the metadata.\nJust like before, we can plot a heatmap of the resulting clusters across all markers.\n\ndata <- risom_dat[, risomMarkers] # get the original data used\nclusters <- colData(risomRessce)$clusters # extract the clusters from the SCE object\n\n# generate the heatmap\nrisomHeatsce <- markerHeatmap(data = risom_dat, markers = risomMarkers,\n clusters = clusters, clusterMarkers = TRUE)\n\n\n\n\n\n\n\nOr we can directly plot from the SCE using the scater package.\n\n# Visualise marker expression in each cluster.\nscater::plotGroupedHeatmap(\n risomRessce,\n features = risomMarkers,\n group = \"clusters\",\n exprs_values = \"counts\",\n center = TRUE,\n scale = TRUE,\n zlim = c(-3, 3),\n cluster_rows = FALSE,\n block = \"clusters\"\n)\n\n\n\n\n\n\n\n\n6.2.4 Using FuseSOM to estimate the number of clusters for SingleCellExperiment objects\nJust like before, we will use estimateNumCluster() on our Risom SingleCellExperiment object.\n\n# lets estimate the number of clusters using all the methods\n# original clustering has 23 clusters so we will set kseq from 2:25\nrisomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25, \n method = c(\"Discriminant\", \"Distance\"))\n\nYou have provided a dataset of class: SingleCellExperiment\n\n\nNow Computing the Number of Clusters using Discriminant Analysis\n\n\nNow Computing The Number Of Clusters Using Distance Analysis\n\nnames(metadata(risomRessce))\n\n[1] \"SOM\" \"clusterEstimation\"\n\n\nThe metadata now contains a clusterEstimation field which holds the results from the estimateNumCluster() function.\nWe can assess the results of cluster estimation as below.\n\n# what is the best number of clusters determined by the discriminant method?\nmetadata(risomRessce)$clusterEstimation$Discriminant \n\n[1] 7\n\n\nAccording to the discrminant method, the optimal number of clusters is 10.\n\n# we can plot the results using the optiplot function\npSlope <- optiPlot(risomRessce, method = 'slope')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npSlope\n\n\n\n\n\n\npJump <- optiPlot(risomRessce, method = 'jump')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npJump\n\n\n\n\n\n\npWcd <- optiPlot(risomRessce, method = 'wcd')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npWcd\n\n\n\n\n\n\npGap <- optiPlot(risomRessce, method = 'gap')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npGap\n\n\n\n\n\n\npSil <- optiPlot(risomRessce, method = 'silhouette')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npSil\n\n\n\n\n\n\n\n\n\n\n\n\n\nFAQ\n\n\n\n\nHow do we choose our k? We’re generally looking for the k before the point of greatest inflection.\nIs there one best choice for k? There can be several options of k if there are several points of inflection. Choose the k which best reflects the number of clusters you expect to get from the tissue.\n\n\n\nAgain, we see that the Jump statistic almost perfectly captures the correct number of clusters with 24 clusters. The Gap method is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.", + "text": "6.2 Clustering with FuseSOM\nFuseSOM is an unsupervised clustering tool for highly multiplexed in situ imaging cytometry assays. It combines a Self Organiszing Map architecture and a MultiView integration of correlation-based metrics for robustness and high accuracy. It has been streamlined to accept multiple data structures including SingleCellExperiment objects, SpatialExperiment objects, and DataFrames.\n\n6.2.1 FuseSOM Matrix Input\nTo demonstrate the functionality of FuseSOM, we will use the Risom 2022 dataset, which profiles the spatial landscape of ductal carcinoma in situ (DCIS). We will be using the markers used in the original study to perform clustering.\n\n# load in the data\ndata(\"risom_dat\")\n\n# define the markers of interest\nrisomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',\n 'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',\n 'CD36','CD11c','HLADRDPDQ','P63','CD44')\n\n# we will be using the manual_gating_phenotype as the true cell type to gauge \n# performance\nnames(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'\n\nNow that we have loaded the data and defined the markers of interest, we can run the FuseSOM algorithm using the runFuseSOM() function. We specify the number of clusters to be 23 based on prior domain knowledge. The output contains the cluster labels as well as the Self Organizing Map model.\n\nrisomRes <- runFuseSOM(data = risom_dat, markers = risomMarkers, \n numClusters = 23)\n\nLets look at the distribution of the clusters.\n\n# get the distribution of the clusters\ntable(risomRes$clusters)/sum(table(risomRes$clusters))\n\n\n cluster_1 cluster_10 cluster_11 cluster_12 cluster_13 cluster_14 \n0.323602021 0.035968538 0.005439775 0.021443334 0.061100586 0.026596050 \n cluster_15 cluster_16 cluster_17 cluster_18 cluster_19 cluster_2 \n0.020582156 0.032624297 0.024931106 0.076128143 0.015802618 0.014927087 \n cluster_20 cluster_21 cluster_22 cluster_23 cluster_3 cluster_4 \n0.049962682 0.009185900 0.051771156 0.066913538 0.004923068 0.014108968 \n cluster_5 cluster_6 cluster_7 cluster_8 cluster_9 \n0.040776783 0.064444827 0.020854863 0.010032725 0.007879780 \n\n\nIt appears that 32% of cells have been assigned to cluster_1. Next, lets generate a heatmap of the marker expression for each cluster.\n\nrisomHeat <- FuseSOM::markerHeatmap(data = risom_dat, markers = risomMarkers,\n clusters = risomRes$clusters, clusterMarkers = TRUE)\n\n\n\n\n\n\n\n\n\n\n\n\n\nCommon problems with clustering\n\n\n\nHow do I identify imperfect clustering?\n\nDo our cell-type specific markers clearly separate out by cluster? We expect to see discrete expression of our markers in specific cell types, e.g. CD4\nIf we instead see “smearing” of our markers across clusters, where several clusters express high levels of a cell type specific marker such as CD4, it is likely a normalization issue.\n\n\n\n\n\n\n\n\n\nRemedying imperfect clustering\n\n\n\nImperfect clustering can stem from many issues: the 3 most common are outlined below:\n\n\nImperfect segmentation - excessive lateral marker spill over can severely impact downstream clustering, as cell type specific markers leak into nearby cells. This should largely be diagnosed in the segmentation step and will need to be fixed by optimizing the upstream segmentation algorithm.\n\nImperfect normalization - excessively variable intensities across images could cause issues in the normalization process. This can generally be diagnosed with density plots and box plots for specific markers across images and can be fixed by identifying the exact issue, e.g. extremely high values for a small subset of images, and choosing a normalization strategy to remove/reduce this effect.\n\nImperfect clustering - choosing a k that’s too low or too high could lead to imperfect clustering. This is usually diagnosed by clusters which either express too many markers very highly or express too few markers, and is usually remedied by choosing an ideal k based on an elbow plot described below.\n\n\n\n\n6.2.2 Using FuseSOM to estimate the number of clusters\nWhen the number of expected cell typess or clusters is not known beforehand, the estimateNumCluster() function can be used to estimate the number of clusters. Two methods have been developed to calculate the number of clusters:\n\nDiscriminant based method:\n\nA method developed in house based on discriminant based maximum clusterability projection pursuit\n\n\nDistance based methods which includes:\n\nThe Gap Statistic\nThe Jump Statistic\nThe Slope Statistic\nThe Within Cluster Dissimilarity Statistic\nThe Silhouette Statistic\n\n\n\nWe run estimateNumCluster() and specify method = c(\"Discriminant\", \"Distance\") to use both approaches.\n\n# lets estimate the number of clusters using all the methods\n# original clustering has 23 clusters so we will set kseq from 2:25\n# we pass it the SOM model generated in the previous step\nrisomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25, \n method = c(\"Discriminant\", \"Distance\"))\n\nWe can then use this result to determine the best number of clusters for this dataset based on the different metrics.\n\n# what is the best number of clusters determined by the discriminant method?\nrisomKest$Discriminant \n\n[1] 7\n\n\nAccording to the Discriminant method, the optimal number of clusters is 7.\nWe can use the optiPlot() function to generate an elbow plot with the optimal value for the number of clusters for the distance based methods.\n\n# we can plot the results using the optiplot function\npSlope <- optiPlot(risomKest, method = 'slope')\npSlope\n\n\n\n\n\n\npJump <- optiPlot(risomKest, method = 'jump')\npJump\n\n\n\n\n\n\npWcd <- optiPlot(risomKest, method = 'wcd')\npWcd\n\n\n\n\n\n\npGap <- optiPlot(risomKest, method = 'gap')\npGap\n\n\n\n\n\n\npSil <- optiPlot(risomKest, method = 'silhouette')\npSil\n\n\n\n\n\n\n\nFrom the plots, we see that the Jump statistic almost perfectly captures the correct number of clusters. The Gap statistic is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.\n\n6.2.3 FuseSOM with Single Cell Experiment object as input\nThe FuseSOM algorithm is also equipped to take in a SingleCellExperiment object as input. The results of the pipeline will be written to either the metada or the colData fields.\nFirst, we create a SingleCellExperiment object using the Risom 2022 data.\n\nlibrary(SingleCellExperiment)\n\n# create an SCE object using Risom 2022 data\ncolDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]\nsce <- SingleCellExperiment(assays = list(counts = t(risom_dat[, names(risom_dat) != \"CellType\"])),\n colData = colDat)\n\nsce\n\nclass: SingleCellExperiment \ndim: 22 69672 \nmetadata(0):\nassays(1): counts\nrownames(22): CD45 SMA ... P63 CD44\nrowData names(0):\ncolnames: NULL\ncolData names(1): X\nreducedDimNames(0):\nmainExpName: NULL\naltExpNames(0):\n\n\nNext, we pass it to the runFuseSOM() function. Here, we can provide the assay in which the data is stored (counts) and specify the column to store the clusters in using clusterCol = \"clusters\". The Self Organizing Map that is generated will be stored in the metadata field.\n\nrisomRessce <- runFuseSOM(sce, markers = risomMarkers, clusterCol = \"clusters\",\n assay = 'counts', numClusters = 23, verbose = FALSE)\n\ncolnames(colData(risomRessce))\n\n[1] \"X\" \"clusters\"\n\nnames(metadata(risomRessce))\n\n[1] \"SOM\"\n\n\nNotice how the there is now a clusters column in the colData and a SOM field in the metadata.\nIf necessary, you can run runFuseSOM() with a new cluster number and specify a new clusterCol. If clusterCol contains a new name, the new clusters will be stored in the new column. Otherwise, it will overwrite the the current clusters column. Running FuseSOM on the same object will overwrite the SOM field in the metadata.\nJust like before, we can plot a heatmap of the resulting clusters across all markers.\n\ndata <- risom_dat[, risomMarkers] # get the original data used\nclusters <- colData(risomRessce)$clusters # extract the clusters from the SCE object\n\n# generate the heatmap\nrisomHeatsce <- markerHeatmap(data = risom_dat, markers = risomMarkers,\n clusters = clusters, clusterMarkers = TRUE)\n\n\n\n\n\n\n\nOr we can directly plot from the SCE using the scater package.\n\n# Visualise marker expression in each cluster.\nscater::plotGroupedHeatmap(\n risomRessce,\n features = risomMarkers,\n group = \"clusters\",\n exprs_values = \"counts\",\n center = TRUE,\n scale = TRUE,\n zlim = c(-3, 3),\n cluster_rows = FALSE,\n block = \"clusters\"\n)\n\n\n\n\n\n\n\n\n6.2.4 Using FuseSOM to estimate the number of clusters for SingleCellExperiment objects\nJust like before, we will use estimateNumCluster() on our Risom SingleCellExperiment object.\n\n# lets estimate the number of clusters using all the methods\n# original clustering has 23 clusters so we will set kseq from 2:25\nrisomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25, \n method = c(\"Discriminant\", \"Distance\"))\n\nYou have provided a dataset of class: SingleCellExperiment\n\n\nNow Computing the Number of Clusters using Discriminant Analysis\n\n\nNow Computing The Number Of Clusters Using Distance Analysis\n\nnames(metadata(risomRessce))\n\n[1] \"SOM\" \"clusterEstimation\"\n\n\nThe metadata now contains a clusterEstimation field which holds the results from the estimateNumCluster() function.\nWe can assess the results of cluster estimation as below.\n\n# what is the best number of clusters determined by the discriminant method?\nmetadata(risomRessce)$clusterEstimation$Discriminant \n\n[1] 7\n\n\nAccording to the discrminant method, the optimal number of clusters is 10.\n\n# we can plot the results using the optiplot function\npSlope <- optiPlot(risomRessce, method = 'slope')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npSlope\n\n\n\n\n\n\npJump <- optiPlot(risomRessce, method = 'jump')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npJump\n\n\n\n\n\n\npWcd <- optiPlot(risomRessce, method = 'wcd')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npWcd\n\n\n\n\n\n\npGap <- optiPlot(risomRessce, method = 'gap')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npGap\n\n\n\n\n\n\npSil <- optiPlot(risomRessce, method = 'silhouette')\n\nYou have provided a dataset of class: SingleCellExperiment\n\npSil\n\n\n\n\n\n\n\n\n\n\n\n\n\nFAQ\n\n\n\n\nHow do we choose our k? We’re generally looking for the k before the point of greatest inflection.\nIs there one best choice for k? There can be several options of k if there are several points of inflection. Choose the k which best reflects the number of clusters you expect to get from the tissue.\n\n\n\nAgain, we see that the Jump statistic almost perfectly captures the correct number of clusters with 24 clusters. The Gap method is a close second with 15 clusters. All the other methods significantly underestimate the number of clusters.", "crumbs": [ "Cell annotation", "6  Unsupervised clustering for cell annotation"