Is it appropriate to use RNAseq data to annotate flow cytometry data via SingleR? #246
Comments
Hi there,

This is an interesting question. Aside from the facts that 1) you are using an RNA reference for protein data, and we know these don't always correlate perfectly, and 2) your flow data likely has only a handful of markers compared to the thousands in a sequencing dataset, I think 3) it's also possible that flow data might break a primary assumption made in the SingleR algorithm: namely, that a cell with higher expression of 'markerA' than of 'markerB' will also have a higher measured signal for 'markerA' relative to 'markerB'.

We can assume this to be true in (properly normalized) scRNAseq or bulk RNAseq data, in that we expect a more highly expressed gene to have more sequencing read counts than a lowly expressed gene within a given cell or tissue sample. But flow cytometer tuning prioritizes signal separation within each marker individually while caring little (except in the case of heavy compensation issues) for relative leveling between markers. Thus, you might end up with very different value ranges for your different markers, and the assumption that higher expression means a higher measurement relative to a marker with lower expression may break down. (Said another way, the same measured value might correspond to high expression of markerA but only medium expression of markerB.) If so, the Spearman correlation metric at the heart of SingleR's scoring may fail to score test<->ref matches accurately.

Of course, this is just theoretical. I've never actually looked at how values scale between markers in any of the flow data I analyzed in the past, and am just making some hypothetical extensions from how I remember compensating and adjusting voltages before running my samples. So I am curious about how well you think SingleR performed for your flow data after you run it!
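To make the scaling concern concrete, here is a minimal sketch in plain Python. It is not SingleR's code; the five-marker profiles and the per-marker gains are made-up numbers chosen only to show how independent per-channel scaling can reorder the markers within a cell and change a Spearman correlation against a reference.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical reference profile over five markers (RNA-like scale).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical "true" expression in one cell, same ordering as the reference.
true_expression = [1.1, 2.2, 2.9, 4.1, 5.2]

# Hypothetical per-marker gains, standing in for independent channel tuning.
gains = [50, 1, 10, 0.5, 2]
measured = [v * g for v, g in zip(true_expression, gains)]

print(spearman(reference, true_expression))  # 1.0
print(spearman(reference, measured))         # -0.5
```

The cell's underlying ordering of markers matches the reference perfectly, yet the per-marker gains alone are enough to flip the sign of the correlation in this toy case.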
I too would be curious. In addition to the concerns raised by Dan, there is also the issue of the number of genes involved. Flow cytometry uses fewer features, even when highly multiplexed (10-20 nowadays, maybe?), and each cell type can probably be expected to be positive for one or two markers, with the rest being background noise. This doesn't give the Spearman correlation a lot to work with, especially as it's not allowed to consider the magnitude of the signal in the positive markers; a single strongly upregulated marker won't translate to a big effect in SingleR's scoring.
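A small illustration of the magnitude point, as a sketch in plain Python rather than anything taken from SingleR. The marker names and intensities are hypothetical. A rank transform keeps only the ordering of markers within a cell, so a barely positive and a very bright marker look identical to any rank-based score, while reshuffled background noise changes the ranks.

```python
def rank_vector(values):
    """1-based rank of each value within the cell (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical panel and channel values, for illustration only.
markers = ["CD3", "CD4", "CD8", "CD19", "CD56"]
dim_cell = [120, 135, 110, 150, 125]       # CD19 barely above background
bright_cell = [120, 135, 110, 9000, 125]   # CD19 strongly positive
noisy_cell = [130, 115, 128, 9000, 112]    # same bright CD19, different background noise

print(rank_vector(dim_cell))     # [2, 4, 1, 5, 3]
print(rank_vector(bright_cell))  # [2, 4, 1, 5, 3]
print(rank_vector(noisy_cell))   # [4, 2, 3, 5, 1]
```

The dim and bright cells get the same rank vector, so a Spearman-based score cannot tell them apart, whereas the noisy cell differs from the bright one only in the ordering of its background channels.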
There might be something in the If not, I would suggest just doing something very simple to begin with, e.g., nearest neighbor classification. Use BiocNeighbors to build an index with the average reference profile for each cell type, and then just search for the nearest neighbor for each cell in the test dataset. Some tricks may need to be applied, e.g., to use correlation-based distances and to account for differences in the number of reference profiles per cell type.

Modifying SingleR to do this is theoretically straightforward but practically difficult, as there are many places in SingleR's optimized C++ code where integer ranks are expected, under the assumption that Spearman's correlation is the way to go. I wouldn't undertake this modification without some expectation that it would work.
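The nearest-neighbor idea above can be sketched roughly as follows. This is a brute-force, plain-Python stand-in and does not use BiocNeighbors or any SingleR internals; the labels, marker panel, and values are all hypothetical. It averages the reference profiles per cell type, then uses one possible "trick" for correlation-based distances: after centering each vector and scaling it to unit length, the squared Euclidean distance equals 2 * (1 - Pearson r), so an ordinary Euclidean nearest-neighbor search ranks candidates by correlation.

```python
import math


def standardize(vector):
    """Center and scale to unit length (fails on constant vectors)."""
    mean = sum(vector) / len(vector)
    centered = [v - mean for v in vector]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def average_profiles(profiles_by_label):
    """One averaged profile per cell type, however many profiles it has."""
    return {
        label: [sum(column) / len(profiles) for column in zip(*profiles)]
        for label, profiles in profiles_by_label.items()
    }


def classify(cell, centroids):
    """Label of the centroid nearest to the cell in correlation distance."""
    z = standardize(cell)
    best_label, best_distance = None, float("inf")
    for label, centroid in centroids.items():
        distance = math.dist(z, standardize(centroid))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical reference profiles over the markers CD3, CD4, CD8, CD19.
reference = {
    "T_CD4": [[8, 7, 1, 0.5], [9, 8, 0.5, 1]],
    "T_CD8": [[8, 1, 7, 0.5]],
    "B": [[0.5, 1, 0.5, 9], [1, 0.5, 1, 8], [0.5, 0.5, 1, 9]],
}
centroids = average_profiles(reference)

# Hypothetical flow measurements on a very different scale from the reference.
test_cells = [[900, 850, 100, 90], [50, 60, 40, 700], [800, 120, 760, 80]]
labels = [classify(cell, centroids) for cell in test_cells]
print(labels)  # ['T_CD4', 'B', 'T_CD8']
```

Averaging per label is only one way to handle unequal numbers of reference profiles per cell type, and unlike Spearman this distance does use signal magnitude, so it remains sensitive to the per-marker scaling issue raised earlier in the thread.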
Hi, @LTLA,
Thanks so much for this great package.
I performed clustering of my flow cytometry data and have the object as `sce`. Would you mind giving me some insights on the appropriateness of using RNAseq as refs to annotate clusters of flow cytometry data?
Briefly, I compensated and bi-exponentially transformed my flow data in FlowJo, then exported the data as channel values so that I do not need to transform the data in R for clustering. Once I have the clusters of my data as a `sce` object, I apply `SingleR`:
Thank you again for your help.