01-processing.qmd

# Cell segmentation and pre-processing

```{r, include = FALSE}
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now, units = "secs")
      # return a character string to show the time
      paste("Time for this code chunk to run with ", nCores, " cores: ", round(res, 2), " seconds")
    }
  }
}))
```

The typical first step in analyzing spatial data is cell segmentation, which involves identifying and isolating individual cells in each image. This step is essential for assigning cell type labels and evaluating cellular relationships within the tissue.

In this section, we will outline the process of reading and pre-processing images obtained from various imaging technologies to prepare them for downstream analysis. Additionally, we will demonstrate how our simpleSeg package facilitates efficient and straightforward cell segmentation, as well as how to evaluate the quality of the segmentation results.

<!-- Steps: -->

<!-- 1.  Reading in data with cytomapper -->
<!-- 2.  Cell segmentation with simpleSeg -->
<!-- 3.  Cell segmentation with BIDCell -->
<!-- 4.  Reading in spot-based data with MoleculeExperiment -->

```{r 01-loadLibraries, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
  library(cytomapper)
  library(ggplot2)
  library(simpleSeg)
})
```

```{r, eval = FALSE}
# load required libraries
library(cytomapper)
library(ggplot2)
library(simpleSeg)
```

We recommend setting the number of cores to enable running code in parallel. Please choose a number that is appropriate for your resources. A minimum of 2 cores is suggested since running this workflow can be computationally intensive.

```{r 01-setParam}
# set parameters
set.seed(51773)
use_mc <- TRUE
if (use_mc) {
  nCores <- max(parallel::detectCores()/2, 1)
} else {
  nCores <- 1
}
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
```

## Reading in data with cytomapper

We will be using the Ferguson 2022 dataset to demonstrate how to perform pre-processing and cell segmentation. This dataset can be accessed through the `SpatialDatasets` package. The `loadImages` function form the `cytomapper` package can be used to load all the TIFF images into a `CytoImageList` object and store the images as an h5 file on-disk in a temporary directory using the `h5FilesPath = HDF5Array::getHDF5DumpDir()` parameter.

We will also assign the metadata columns of the `CytoImageList` object using the `mcols` function.

```{r 01-loadImages, time_it = TRUE}
pathToZip <- SpatialDatasets::Ferguson_Images()
pathToImages <- "data/processing/images"
unzip(pathToZip, exdir = "data/processing/images")

# store images in a CytoImageList on_disk as h5 files to save memory
images <- cytomapper::loadImages(
  pathToImages,
  single_channel = TRUE,
  on_disk = TRUE,
  h5FilesPath = HDF5Array::getHDF5DumpDir(),
  BPPARAM = BPPARAM
)

# assign metadata columns
mcols(images) <- S4Vectors::DataFrame(imageID = names(images))
```

When reading the image channels directly from the names of the TIFF images, they will often need to be cleaned for ease of downstream processing. The channel names can be accessed from the `CytoImageList` object using the `channelNames` function.

```{r 01-cleanChannelNames}
channelNames(images) <- channelNames(images) |>
                          # remove preceding letters
                          sub(pattern = ".*_", replacement = "", x = _) |> 
                          # remove the .ome
                          sub(pattern = ".ome", replacement = "", x = _)
```

Similarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.

```{r 01-cleanImageNames}
# cleaning image names to obtain image IDs
split_names <- function(x) {
  sapply(strsplit(x, "_"), `[`, 3)
}

names(images) <- names(images) |> split_names()

mcols(images) <- S4Vectors::DataFrame(imageID = names(images))
```

## Cell segmentation with simpleSeg

For the sake of simplicity and efficiency, we will be subsetting our images down to 10 images.

```{r subset}
images <- images[1:10]
```

Next, we can perform our segmentation. The [simpleSeg](https://www.bioconductor.org/packages/release/bioc/html/simpleSeg.html) package provides functionality for user friendly, watershed based segmentation on multiplexed cellular images based on the intensity of user specified protein marker channels. The main function, `simpleSeg`, can be used to perform a simple cell segmentation process that traces out the nuclei using a specified channel.

In the particular example below, we have asked `simpleSeg` to do the following:

-   `nucleus = c("HH3")`: trace out the nuclei signal in the images using the HH3 channel.
-   `pca = TRUE`: segment out the nuclei mask using a principal component analysis of all channels and using the principal components most aligned with the nuclei channel, in this case, HH3.
-   `cellBody = "dilate"`: use a dilation strategy of segmentation, expanding out from the nucleus by a specified `discSize`. In this case, `discSize = 3`, which means simpleSeg dilates out from the nucleus by 3 pixels.
-   `sizeSelection = 20`: ensure that only cells with a size greater than 20 pixels will be used.
-   `transform = "sqrt"`: perform square root transformation on each of the channels prior to segmentation.
-   `tissue = c("panCK", "CD45", "HH3")`: use the specified tissue mask to filter out all background noise outside the tissue mask. This allows us to ignore background noise which happens outside of the tumour core.

There are many other parameters that can be specified in simpleSeg (`smooth`, `watershed`, `tolerance`, and `ext`), and we encourage the user to select the best parameters which suit their biological context.

```{r 01-simpleSeg, time_it = TRUE}
masks <- simpleSeg(images,
                   nucleus = c("HH3"),
                   pca = TRUE,
                   cellBody = "dilate",
                   discSize = 3,
                   sizeSelection = 20,
                   transform = "sqrt",
                   tissue = c("panCK", "CD45", "HH3"),
                   cores = nCores
                   )
```

### Visualise separation

We can examine the performance of the cell segmentation using the `display` and `colorLabels` functions from the `EBImage` package. If used in an interactive session, `display` allows you to zoom in and out of the image.

```{r 01-visualiseSegmentation}
EBImage::display(colorLabels(masks[[1]]))
```

### Visualise outlines

The `plotPixels` function from `cytomapper` makes it easy to overlay the mask on top of the nucleus intensity marker to see how well our segmentation process has performed.

```{r 01-plotHH3}
plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3"), 
           display = "single",
           colour = list(HH3 = c("black","blue")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2)
           ))
```

Here, we can see that the segmentation appears to be performing reasonably.

If you see over or under-segmentation of your images, `discSize` is a key parameter in the `simpleSeg` function for optimising the size of the dilation disc after segmenting out the nuclei.

We can also visualise multiple markers at once instead of just the HH3 marker to see how the segmentation mask performs.

```{r 01-plotMarkers}
plotPixels(image = images["F3"], 
           mask = masks["F3"],
           img_id = "imageID", 
           colour_by = c("HH3", "CD31", "FX111A"), 
           display = "single",
           colour = list(HH3 = c("black","blue"),
                         CD31 = c("black", "red"),
                         FX111A = c("black", "green")),
           legend = NULL,
           bcg = list(
             HH3 = c(1, 1, 2),
             CD31 = c(0, 1, 2),
             FX111A = c(0, 1, 1.5)
           ))
```

::: {.callout-tip title="What to look for and change to obtain an ideal segmentation"}
**Critical thinking**

1.    Does the segmentation capture the full nucleus? If not, perhaps you need to try a different transform to improve the thresholding of the nuclei marker. You could also try using `pca = TRUE` which will borrow information across the markers to help find the nuclei.
2.    How much of the cell body is the segmentation missing? Try increasing the dilation around the nucleus by setting `discSize = 7`.
3.    Are the segmentations capturing neighbouring cells? Try decreasing the dilation to limit lateral spillover of marker signal by setting `discSize = 2`.
:::

Here, we can see that our segmentation mask has done a good job of capturing the CD31 signal, but perhaps not such a good job of capturing the FXIIIA signal, which often lies outside of our dilated nuclear mask. This suggests that we might need to increase the `discSize` or other parameters of `simpleSeg`.

In particular, the `cellBody` and `watershed` parameters can strongly influence the way cells are segmented using `simpleSeg`. We have provided further details on how the user may specify cell body identification and watershedding in the tables below.

As `simpleSeg` is a nuclei-based dilation method, it suffers from tissues where cells might be multi-nucleated, or where cells have non-circular or elliptical morphologies. For tissues where you might expect these cells, it may be preferable to choose a different segmentation method. 

#### `cellBody` Parameters

| Method | Description |   |
|---------------|:----------------------------------------:|:-------------:|
| ["distance"]{style="font-family: 'Courier New', monospace;"} | [Performs watershedding on a distance map of the thresholded nuclei signal. With a pixels distance being defined as the distance from the closest background signal.]{style="font-family: 'Courier New', monospace;"} |  |
| ["intensity"]{style="font-family: 'Courier New', monospace;"} | [Performs watershedding using the intensity of the nuclei marker.]{style="font-family: 'Courier New', monospace;"} |  |
| ["combine"]{style="font-family: 'Courier New', monospace;"} | [Combines the previous two methods by multiplying the distance map by the nuclei marker intensity.]{style="font-family: 'Courier New', monospace;"} |  |

#### `watershed` Parameters

| Method | Description |   |
|---------------|:-----------------------------------------:|:-------------:|
| ["dilation"]{style="font-family: 'Courier New', monospace;"} | [Dilates the nuclei by an amount defined by the user. The size of the dilatation in pixels may be specified with the `discDize` argument.]{style="font-family: 'Courier New', monospace;"} |  |
| ["discModel"]{style="font-family: 'Courier New', monospace;"} | [Uses all the markers to predict the presence of dilated 'discs' around the nuclei. The model therefore learns which markers are typically present in the cell cytoplasm and generates a mask based on this.]{style="font-family: 'Courier New', monospace;"} |  |
| ["marker"]{style="font-family: 'Courier New', monospace;"} | [The user may specify one or multiple dedicated cytoplasm markers to predict the cytoplasm. This can be done using `cellBody = "marker name"/"index"`]{style="font-family: 'Courier New', monospace;"} |  |
| ["None"]{style="font-family: 'Courier New', monospace;"} | [The nuclei mask is returned directly.]{style="font-family: 'Courier New', monospace;"} |  |

<!-- ## Cell segmentation with BIDCell -->

<!-- ## Visual comparison of segmentations -->

<!-- plotPixels can plot multiple images \<-- use this to visualise multiple images at once after you have BIDCell ready. -->

## Summarise cell features

We can use the `measureObjects` from `cytomapper` to calculate the average intensity of each channel within each cell, as well as other morphological features. By default, `measureObjects` will return a `SingleCellExperiment` object, where the channel intensities are stored in the `counts` assay and the spatial location of each cell is stored in `colData` in the `m.cx` and `m.cy` columns.

However, you can also specify `measureObjects` to return a `SpatialExperiment` object by specifying `return_as = "spe"`. In a `SpatialExperiment` object, the spatial location of each cell is stored in the `spatialCoords` slot as `m.cx` and `m.cy`, which simplifies plotting. In this demonstration, we will return a `SpatialExperiment` object.

```{r 01-measureObjects, time_it = TRUE}
# summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
                                    images,
                                    img_id = "imageID",
                                    return_as = "spe",
                                    BPPARAM = BPPARAM
                                    )

spatialCoordsNames(cells) <- c("x", "y")

cells
```

So far, we have obtained our raw TIFF images, performed cell segmentation to isolate individual cells, and then stored our data as a `SpatialExperiment` object. We can now move on to quality control, data transformation, and normalisation to address batch effects.

## sessionInfo

```{r}
sessionInfo()
```