Duplicate Cell Barcodes Trigger Error if Multiple Batches of Data in Combined Data Set #16

Open
DarioS opened this issue Jan 19, 2024 · 0 comments


There are only a finite number of 10x Genomics cell barcodes: 737280. When data is collected over many months in different sequencing batches, some cell barcodes will recur, because every batch draws its cell labels from the same set of 737280 barcodes. In brief:

Each mapped read in a 10x Genomics Single Cell 3’ v2 Gene Expression Library can be annotated by four labels: (1) A sample barcode, (2) cell-barcode index, (3) Unique Molecular Identifier (UMI) (4) gene ID.
A 16 bp cell-barcode index is randomly selected out of a set containing 737280 possible combinations. In scRNA-seq data, a cell is identified by a unique cell barcode.

> set.seed(2024)
> January <- sample(737280, 5000) # Patient X
> June <- sample(737280, 5000) # Patient Y
> intersect(January, June) # Some simulated barcodes appear twice.
 [1] 378471 685968 517615 550588 543582 255006  83184  47276 697772 584851  17973 577898
[13] 533573    123 384119 518639 591930 295070 238711 401381 171660 184026 210186 708855
[25] 599121 311327 220013 458140 515898 180450 640358 174120 301284 631054
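As a rough sanity check on the simulation above, the expected overlap between two batches can be worked out directly. This sketch assumes, as the simulation does, that each batch is a uniform random draw without replacement from the full barcode set; the variable names are illustrative only.

```r
totalBarcodes <- 737280   # size of the barcode set
cellsPerBatch <- 5000     # cells captured in each batch, as in the simulation

# Each barcode in the first batch has probability cellsPerBatch / totalBarcodes
# of also being drawn in the second batch, so the expected number of shared
# barcodes is cellsPerBatch^2 / totalBarcodes.
expectedShared <- cellsPerBatch^2 / totalBarcodes
round(expectedShared, 1)  # 33.9
```

That is close to the 34 shared barcodes in the simulated output, so collisions of this size are what chance alone would predict.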

In a real multi-batch data set, each barcode appears between one and four times.

> allHuman
class: SingleCellExperiment 
dim: 21711 111671

> range(table(colData(allHuman)$Barcode))
[1] 1 4

> table(table(colData(allHuman)$Barcode))
     1      2      3      4 
108977   1275     12     27
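To see which barcodes recur and which cells share them, something like the following may help. It is a minimal sketch on a toy data frame standing in for colData(allHuman); the sample IDs and barcodes are made up for illustration.

```r
# Toy stand-in for colData(allHuman); values are invented for illustration.
cells <- data.frame(
  Sample  = c("P1", "P1", "P2", "P2", "P3"),
  Barcode = c("AAAC-1", "AAAG-1", "AAAC-1", "AAAT-1", "AAAC-1")
)

# Barcodes that occur more than once across the combined data.
barcodeCounts <- table(cells$Barcode)
repeated <- names(barcodeCounts)[barcodeCounts > 1]

# The cells that share a repeated barcode.
sharedCells <- cells[cells$Barcode %in% repeated, ]

# Barcodes should still be unique within a sample, so the (Sample, Barcode)
# pair is expected to have no duplicates.
pairsAreUnique <- anyDuplicated(cells[, c("Sample", "Barcode")]) == 0
```

Here repeated is "AAAC-1", sharedCells has three rows (one per sample), and pairsAreUnique is TRUE.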

scClassify does not accept a test matrix without column names, which is what read10xCounts produces by default (its col.names argument defaults to FALSE).

> allHuman <- logNormCounts(allHuman)
> SydneyLogCounts <- assay(allHuman, "logcounts")
> colnames(SydneyLogCounts)
NULL
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
  Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique

Because this is multi-batch data, setting the matrix column names to the barcodes also fails: the barcodes are not unique across batches.

> colnames(SydneyLogCounts) <- colData(allHuman)$Barcode
> head(colnames(SydneyLogCounts))
[1] "AAACCTGAGAAACCTA-1" "AAACCTGAGGACAGAA-1" "AAACCTGCAGACAAAT-1"
[4] "AAACCTGGTACCGAGA-1" "AAACCTGGTCGGGTCT-1" "AAACCTGGTCGTTGTA-1"
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
  Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique
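The same error message covers two different problems, so a small pre-flight check can say which one applies before calling scClassify. The helper below is hypothetical and not part of scClassify; its two conditions are inferred from the wording of the error message rather than from the package source.

```r
# Hypothetical helper (not part of scClassify): reports why the column names
# of a test matrix would be rejected, or "ok" if they look usable.
checkCellNames <- function(exprsMat) {
  cellNames <- colnames(exprsMat)
  if (is.null(cellNames)) {
    return("colnames are NULL")
  }
  nDuplicated <- sum(duplicated(cellNames))
  if (nDuplicated > 0) {
    return(sprintf("%d column name(s) are duplicated", nDuplicated))
  }
  "ok"
}

# Toy matrix with no column names, then with a repeated barcode.
toy <- matrix(0, nrow = 2, ncol = 3)
noNames <- checkCellNames(toy)
colnames(toy) <- c("AAAC-1", "AAAG-1", "AAAC-1")
repeatedNames <- checkCellNames(toy)
```

On the toy matrix, noNames is "colnames are NULL" and repeatedNames is "1 column name(s) are duplicated".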

The non-obvious workaround is to paste the sample ID onto each cell barcode so that the column names are unique.

> colnames(SydneyLogCounts) <- paste(colData(allHuman)$Sample, colData(allHuman)$Barcode, sep = '_')
> head(colnames(SydneyLogCounts))
[1] "DANFRO_CUL1HNP3_AAACCTGAGAAACCTA-1" "DANFRO_CUL1HNP3_AAACCTGAGGACAGAA-1"
[3] "DANFRO_CUL1HNP3_AAACCTGCAGACAAAT-1" "DANFRO_CUL1HNP3_AAACCTGGTACCGAGA-1"
[5] "DANFRO_CUL1HNP3_AAACCTGGTCGGGTCT-1" "DANFRO_CUL1HNP3_AAACCTGGTCGTTGTA-1"
> predicts <- scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts) # Success
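The workaround could be wrapped in a small function that also confirms the result really is unique. This is only a sketch: uniqueCellNames is a hypothetical name, and the toy vectors stand in for the Sample and Barcode columns of colData(allHuman).

```r
# Hypothetical helper: combine sample ID and barcode into one cell name and
# stop early if the combination is still not unique.
uniqueCellNames <- function(sampleIDs, barcodes, sep = "_") {
  stopifnot(length(sampleIDs) == length(barcodes))
  cellNames <- paste(sampleIDs, barcodes, sep = sep)
  if (anyDuplicated(cellNames) > 0) {
    stop("Sample ID plus barcode is still not unique; check for repeated sample IDs.")
  }
  cellNames
}

# Toy input: the barcode "AAAC-1" occurs in two different samples.
sampleIDs <- c("P1", "P1", "P2")
barcodes  <- c("AAAC-1", "AAAG-1", "AAAC-1")
cellNames <- uniqueCellNames(sampleIDs, barcodes)
cellNames  # "P1_AAAC-1" "P1_AAAG-1" "P2_AAAC-1"
```

With the objects from this issue, the call would presumably be colnames(SydneyLogCounts) <- uniqueCellNames(colData(allHuman)$Sample, colData(allHuman)$Barcode). Base R's make.unique(barcodes) would also give unique names, but the numeric suffixes it appends do not record which sample a cell came from.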

This could be much smoother for end-users.
