Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classification Error if Bioconductor Atlas Used as Reference #13

Open
DarioS opened this issue Mar 9, 2023 · 0 comments
Open

Classification Error if Bioconductor Atlas Used as Reference #13

DarioS opened this issue Mar 9, 2023 · 0 comments

Comments

@DarioS
Copy link
Member

DarioS commented Mar 9, 2023

I am recreating an analysis done by researchers from Regeneron Pharmaceuticals published in Immunostimulatory Cancer-Associated Fibroblast Subpopulations Can Predict Immunotherapy Response in Head and Neck Cancer, Clinical Cancer Research but using scClassify instead of SingleR and on in-house data. They used Human Primary Cell Atlas and BLUEPRINT Consortium atlas, as provided by the Biocondcutor package celldex, developed by the same people who developed Bioconductor package SingleR (i.e. Aaron Lun et al.). These atlases are comprised of either purified cell types or cell lines. Human Primary Cell Atlas is developed using 745 microarrays from 105 separate studies that are freely available for download from N.C.B.I. Gene Expression Omnibus.

A large and diverse set of human primary cell gene expression data was collected, with a particular emphasis on datasets that divided immune cells into sub-populations based upon surface markers.

library(celldex)
HPCAset <- HumanPrimaryCellAtlasData() # Downloads and caches the atlas.
> table(colData(HPCAset)$label.main)
           Astrocyte               B_cell                   BM           BM & Prog. 
                   2                   26                    7                    1 
        Chondrocytes                  CMP                   DC Embryonic_stem_cells 
                   8                    2                   88                   17 
   Endothelial_cells     Epithelial_cells         Erythroblast          Fibroblasts 
                  64                   16                    8                   10
                                   ...                 ... 

The other atlas has less samples but is all from one project.

259 RNA-seq samples of pure stroma and immune cells as generated and supplied by BLUEPRINT and ENCODE.

BLUEPRINTset <- BlueprintEncodeData()
> table(colData(BLUEPRINTset)$label.main)
       Adipocytes        Astrocytes           B-cells      CD4+ T-cells      CD8+ T-cells 
                9                 2                 8                14                 5 
     Chondrocytes                DC Endothelial cells       Eosinophils  Epithelial cells 
                2                 1                26                 1                18 
     Erythrocytes       Fibroblasts               HSC     Keratinocytes       Macrophages 
                7                20                38                 2                25
                                   ...                 ... 

Human Primary Cell Atlas is used as the example data set in the SingleR vignette and classification works. However, scClassify fails with an error message if either one of these cell type atlases are used.

library(Seurat)
headNeckCells <- readRDS("/dski/nobackup/biostat/datasets/singlecell/headneck/analysis/seurat_supervised_k10_18samples_0.01.rds")
RNAdata <- headNeckCells@assays$RNA@data

> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
Error in scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train
> HPCAclassify <- scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]], RNAdata)
Error in scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train

Could this error message be changed into a warning and the classification allowed to proceed? Or, is it important to prevent that? There are some cell types in the atlas with two samples belonging to them. For example, having one sample belonging to Eosinophils triggers the error, but removing singleton classes and even having Keratinocytes class with only two samples belonging to proceeds with classification. Then, another error / message is reached.

classCounts <- table(colData(HPCAset)[["label.main"]])
keepSamples <- colData(HPCAset)[["label.main"]] %in% names(classCounts)[classCounts > 1]
> table(keepSamples) # Two samples, each belonging to one class, will be dropped from the atlas.
keepSamples
FALSE  TRUE
    2   711
HPCAset <- HPCAset[, keepSamples] # All classes now have at least two samples.
> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
   There are only 0 selected genes in reference data expressed in query data.
> nrow(HPCAset)
[1] 19363
> length(intersect(rownames(HPCAset), rownames(RNAdata))) # Must be an abundance issue instead of gene name matching.
[1] 17589

It seems that the selected genes probably have moderate rather than high abundance and are not detectable by the poor limit-of-detection of single-cell sequencing. May scClassify output the list of selected genes in the error message to allow manual inspection? Also, could scClassify automatically select only useable genes and avoid the unuseable ones during feature selection to prevent ever selecting zero expressed genes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant