From 09ad29d0b450b04b00ff14df2f80c1da0937e9f2 Mon Sep 17 00:00:00 2001
From: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>
Date: Wed, 28 Feb 2024 13:49:07 -0600
Subject: [PATCH] reorganize additional modalities section

---
 content/03.results.md | 115 ++++++++++++++++++++++++------------------
 1 file changed, 65 insertions(+), 50 deletions(-)

diff --git a/content/03.results.md b/content/03.results.md
index 1921d7d..164b1bd 100644
--- a/content/03.results.md
+++ b/content/03.results.md
@@ -48,13 +48,12 @@ When building `scpca-nf`, we sought a fast and memory-efficient tool for gene ex
 We expected many users of the Portal to have their own single-cell or single-nuclei data processed with Cell Ranger[@url:https://www.10xgenomics.com/support/software/cell-ranger/latest], due to its popularity.
 Thus, selecting a tool with comparable results to Cell Ranger was also desirable.
 In comparing `alevin-fry` [@doi:10.1038/s41592-022-01408-3] to Cell Ranger, we found `alevin-fry` had a lower run time and memory usage (Supplemental Figure 1A), while retaining comparable mean gene expression for all genes (Supplemental Figure 1B), total UMIs per cell (Supplemental Figure 1C), or total genes detected per cell (Supplemental Figure 1D).
-
 (All analyses comparing gene expression quantification tools are available in a public analysis repository[@url:https://github.com/AlexsLemonade/alsf-scpca].)
 Based on these results, we elected to use `salmon alevin` and `alevin-fry` [@doi:10.1038/s41592-022-01408-3] in `scpca-nf` to quantify gene expression data.
 
 `scpca-nf` takes FASTQ files as input (Figure 2A).
 Reads are aligned using the selective alignment option of `salmon alevin` to an index with transcripts corresponding to spliced cDNA and intronic regions, denoted by `alevin-fry` as a `splici` index.
-The output from `alevin-fry` includes a gene-by-cell count matrix for all barcodes identified, even those that may not contain true cells.
+The output from `alevin-fry` includes a gene by cell count matrix for all barcodes identified, even those that may not contain true cells.
 This unfiltered counts matrix is stored in a `SingleCellExperiment` object[@doi:10.1038/s41592-019-0654-x] and output from the workflow to a `.rds` file with the suffix `_unfiltered.rds`.
 
 `scpca-nf` performs filtering of empty droplets, removal of low-quality cells, normalization, dimensionality reduction, and cell type annotation (Figure 2A).
@@ -75,60 +74,76 @@ We include plots showing the `miQC` model and which cells are kept and removed a
 
 A UMAP plot with cells colored by the total number of genes detected and a faceted UMAP plot where cells are colored by the expression of a top highly variable gene are also available (Figure 2F-G).
 
-## Making samples with additional modalities available on the Portal
+## Processing samples with additional modalities
 
 Each sample on the Portal will have summarized gene expression data from either single-cell or single-nuclei RNA-seq.
 For some samples, submitters included data from additional sequencing modalities, including corresponding ADT or CITE-seq data [@doi:10.1038/nmeth.4380], multiplexing using cell hashing [@doi:10.1186/s13059-018-1603-1], spatial transcriptomics, or bulk RNA-seq.
-To make all received data available, we included additional modules in `scpca-nf` to uniformly process samples with these additional sequencing modalities.
-For a full summary of the libraries and samples available with additional modalities, see Supplemental Table 1.
-
-For all libraries with associated ADT or CITE-seq data, we provide both the summarized RNA and ADT gene expression data for download on the portal.
-To process these libraries, both FASTQ from single-cell or single-nuclei RNA-seq and CITE-seq were provided as input into `scpca-nf` and quantified using `salmon alevin` and `alevin-fry` (Supplemental Figure 2A).
-Along with the FASTQ files, we required a tsv file from each submitter that contained the labels for each ADT used and the associated barcode.
-This file is also input to `scpca-nf` and used to build the index used to quantify ADT expression and create the cell by ADT counts matrix.
-Unlike with RNA counts, we do not perform any filtering of cells due to low-quality ADT expression.
-However, we do include the results from running `DropletUtils::cleanTagCounts()` on the ADT counts matrix in both the filtered and processed objects output by `scpca-nf`.
-Similar to RNA counts, we normalize ADT data and provide the normalized counts matrix in the processed object.
-Although we do normalize the ADT counts, we do not provide any dimensionality reduction of ADT data; only the RNA counts data is used as input for dimensionality reduction.
-For all three `SingleCellExperiment` objects (unfiltered, filtered, and processed), the summarized ADT expression data can be found in the `altExp` slot.
-With `AnnData` objects, the ADT data is provided as a separate `_adt.hdf5` file for each of the three objects.
-
-If a library contains associated ADT data, an additional section will be included in the provided QC report.
-This section includes a summary of statistics, such as how many cells express each ADT.
-We also include a collection of additional plots specific to ADT data (Supplemental Figure 2B).
-As mentioned above, we include the results from `DropletUtils::cleanTagCounts()`, but do not filter any ADTs or cells from the object.
-Instead, we include a column in the `colData` of the processed `SingleCellExperiment` object (`.obs` in the `AnnData` object) that indicates if a cell is recommended to be removed due to low expression of ADTs.
-In the QC report, we summarize filtering low-quality cells based on both RNA and ADT counts.
-The first quadrant shows which cells would be kept if filtering on both RNA and ADT.
-Each of the other facets highlights which cells would be removed if filtering is only done using RNA counts, ADT counts, or both.
-The top 4 ADTs with the most variable expression are also identified and visualized using both UMAPs and density plots.
-The UMAPs shown were calculated using the RNA data but cells are colored by ADT expression, while the ridge plots show the normalized ADT expression across all cells.
-
-Similar to ADT data, if any libraries contain samples that have been multiplexed and have an associated cell hashing library, both the RNA and HTO FASTQ are provided as input to the workflow and quantified with `salmon alevin` and `alevin-fry` (Supplemental Figure 2C).
-`scpca-nf` also requires a tsv file with one row per sample included in the library, which tells the workflow which HTO was used for which sample when multiplexing the library.
-The HTO counts data will be included as an `altExp` in each of the `SingleCellExperiment` objects.
-Although we quantify the HTO data and include the cell by HTO counts matrix in all objects, we do not demultiplex the samples so that there is one sample per library.
-Instead, we apply multiple demultiplexing methods, including demultiplexing with `DropletUtils::hashedDrops()`, demultiplexing with `Seurat::HTODemux()`, and genetic demultiplexing when possible.
-The genetic demultiplexing used in `scpca-nf` uses the method described in Weber et al [@doi:[10.1093/gigascience/giab062](https://doi.org/10.1093/gigascience/giab062)], which takes bulk RNA-seq data from the same sample as a reference for the expected genotypes found in each sample.
-
-Variants among the samples within each pool are identified from both the mapped bulk RNA-seq data and pooled single-cell or single-nuclei RNA-seq.
-After variants from both bulk and single-cell or single-nuclei RNA-seq are genotyped, `vireo` is used to match genotypes and identify the most likely sample of origin [@doi:10.1186/s13059-019-1865-2].
-The results from these three demultiplexing methods are included in the filtered and processed objects on the Portal.
-If a sample does not have associated bulk RNA-seq data, then the objects will not contain genetic demultiplexing results.
-
-If a library has associated HTO data, an additional section is added to the QC report included on the Portal and output by `scpca-nf`.
-This section includes a summary of library statistics, such as how many cells express each HTO.
-We do not include any additional plots, but we do show a table summarizing how many cells belong to each sample included in the multiplexed library using each of the demultiplexing methods mentioned.
+`scpca-nf` includes additional modules to ensure uniform processing of samples with additional sequencing modalities.
+
+To process ADT libraries, the ADT FASTQ files are provided as input to `scpca-nf` and quantified using `salmon alevin` and `alevin-fry` (Supplemental Figure 2A).
+Along with the FASTQ files, `scpca-nf` takes a tab-separated values (tsv) file containing one row for each ADT used.
+This file contains the name used for the ADT and associated barcode and is required for building an ADT-specific index for quantifying ADT expression with `alevin-fry`.
+The output from `alevin-fry` is the unfiltered ADT by cell counts matrix.
+The ADT by cell counts matrix is read into R alongside the gene by cell counts matrix and saved as an alternative experiment (`altExp`) within the main `SingleCellExperiment` containing the unfiltered RNA counts.
+This `SingleCellExperiment` object containing both RNA and ADT counts is output from the workflow to a `.rds` file with the suffix `_unfiltered.rds`.
+
+`scpca-nf` does not filter any cells based on ADT expression.
+`DropletUtils::emptyDropsCellRanger()` is only applied to the unfiltered RNA counts matrix to remove empty droplets.
+Any cells removed after filtering empty droplets are also removed from the ADT counts matrix.
+`scpca-nf` also does not remove cells with low-quality ADT expression, but the workflow does calculate QC statistics for ADT counts using `DropletUtils::cleanTagCounts()`.
+These ADT QC statistics are stored alongside the ADT by cell counts matrix in the filtered `SingleCellExperiment` object.
+The `SingleCellExperiment` object containing both the filtered RNA and ADT counts matrix, along with associated ADT QC statistics, is saved to an `.rds` file with the suffix `_filtered.rds`.
+
+Similar to processing RNA gene expression data, the ADT by cell counts matrix is normalized.
+Normalized counts are not calculated for any cells that would be removed because of low-quality ADT expression, indicated by `DropletUtils::cleanTagCounts()`.
+Although `scpca-nf` normalizes ADT counts, the workflow does not perform any dimensionality reduction of ADT data; only the RNA counts data is used as input for dimensionality reduction.
+The normalized ADT data is saved as an `altExp` within the processed `SingleCellExperiment` containing the normalized RNA data and is output to a `.rds` file with the suffix `_processed.rds`.
+All `.rds` files containing `SingleCellExperiment` objects and associated `altExp` objects are converted to `AnnData` objects and exported as separate RNA (`_rna.hdf5`) and ADT (`_adt.hdf5`) AnnData objects.
+
+If a library contains associated ADT data, the QC report output by `scpca-nf` will include an additional section.
+This section includes a summary of ADT-related statistics, such as how many cells express each ADT.
+The report also includes a collection of additional plots specific to ADT data (Supplemental Figure 2B-D).
+As mentioned above, `scpca-nf` uses `DropletUtils::cleanTagCounts()` to calculate QC statistics for each cell using ADT expression but does not filter any cells from the object.
+A summary of removing low-quality cells based on both RNA and ADT counts is included in the QC report (Supplemental Figure 2B).
+The first quadrant indicates which cells would be kept if the object was filtered on both RNA and ADT.
+The other facets highlight which cells would be removed if filtering was done using only RNA counts, only ADT counts, or both.
+The top 4 ADTs with the most variable expression are also identified and visualized using UMAPs (Supplemental Figure 2C) and density plots (Supplemental Figure 2D).
+The UMAPs shown were calculated using the RNA data, and cells are colored by ADT expression, while the density plots show the normalized ADT expression across all cells.
+
+To process multiplexed libraries, the HTO FASTQ is input to `scpca-nf` and quantified using `salmon alevin` and `alevin-fry` (Supplemental Figure 2C).
+Along with the FASTQ files, `scpca-nf` requires two tsv files for processing multiplexed data.
+The first is similar to the barcode file required when quantifying ADT expression and contains the HTO name and associated barcode.
+This file is needed to build an HTO-specific index for quantifying HTO expression with `alevin-fry`.
+The second tsv file contains one row for each sample included in the multiplexed library and tells the workflow which HTO was used for which sample when multiplexing the library.
+The output from `alevin-fry` is the HTO by cell counts matrix.
+The HTO by cell counts matrix is read into R alongside the gene by cell counts matrix.
+The unfiltered HTO by cell matrix is saved as an alternative experiment (`altExp`) within the main `SingleCellExperiment` containing the unfiltered RNA counts.
+This `SingleCellExperiment` object containing both RNA and HTO counts is output from the workflow to a `.rds` file with the suffix `_unfiltered.rds`.
+
+As with ADT data, `scpca-nf` does not filter any cells based on HTO expression.
+`DropletUtils::emptyDropsCellRanger()` is only applied to the unfiltered RNA counts matrix to remove empty droplets, and any cells removed after filtering empty droplets are also removed from the HTO counts matrix before the object is saved to an `.rds` file with the `_filtered.rds` suffix.
+`scpca-nf` does not perform any additional filtering or processing of the HTO by cell counts matrix, so the same filtered matrix is saved to the processed `.rds` file with the `_processed.rds` suffix.
+
+Although `scpca-nf` quantifies the HTO data and includes an HTO by cell counts matrix in all objects, `scpca-nf` does not demultiplex the samples into one sample per library.
+Instead, `scpca-nf` applies multiple demultiplexing methods, including demultiplexing with `DropletUtils::hashedDrops()`, demultiplexing with `Seurat::HTODemux()`, and genetic demultiplexing, if possible.
+The genetic demultiplexing used in `scpca-nf` uses the method described in Weber et al. [@doi:10.1093/gigascience/giab062], which takes bulk RNA-seq data and single-cell RNA-seq data from the same sample.
+The bulk RNA-seq serves as a reference for the expected genotypes found in each sample.
+If a sample lacks associated bulk RNA-seq data, then no genetic demultiplexing is performed.
+The results from these three demultiplexing methods are saved in the filtered and processed `SingleCellExperiment` objects.
+
+If a library has associated HTO data, an additional section is included in the QC report output by `scpca-nf`.
+This section summarizes HTO-specific library statistics, such as how many cells express each HTO.
+No additional plots are produced, but a table summarizing the results from all three demultiplexing methods is included.
 
 For some samples, multiple libraries were collected, with the additional libraries being used for bulk RNA-seq and/or spatial transcriptomics.
-Both of these additional sequencing methods are supported by `scpca-nf`, and the output is available for download on the Portal.
-To process bulk RNA-seq data, reads are first trimmed using `fastp` and then aligned using `salmon` (Supplemental Figure 3A).
-The bulk module of `scpca-nf` outputs a single tsv file with the sample by gene counts matrix for all samples in a given ScPCA project.
+Both of these additional sequencing methods are supported by `scpca-nf`.
+`scpca-nf` takes FASTQ from bulk RNA-seq as input, trims reads using `fastp`, and then aligns reads with `salmon` (Supplemental Figure 3A).
+The output is a single tsv file with the sample by gene counts matrix for all samples in a given ScPCA project.
 This sample by gene matrix is included only with project downloads on the Portal.
-As there is not yet support for spatial transcriptomics with `alevin-fry`, `scpca-nf` uses Space Ranger to quantify all spatial transcriptomics data [@url:https://www.10xgenomics.com/support/software/space-ranger/latest] (Supplemental Figure 3B).
-The required input includes the RNA FASTQ files and slide image.
-The output includes the spot-by-gene matrix along with a summary report produced by Space Ranger.
-If a sample on the Portal has associated spatial transcriptomics data, users will be able to download the summarized gene expression data from single-cell or single-nuclei RNA-seq separately from the spatial output.
+
+To quantify spatial transcriptomics data, `scpca-nf` takes the RNA FASTQ and slide image as input (Supplemental Figure 3B).
+As there is not yet support for spatial transcriptomics with `alevin-fry`, `scpca-nf` uses Space Ranger to quantify all spatial transcriptomics data [@url:https://www.10xgenomics.com/support/software/space-ranger/latest].
+The output includes the spot by gene matrix along with a summary report produced by Space Ranger.
 
 ## Downloading projects from the ScPCA Portal