Underestimation of doublets with data set that contains a lot of proliferative cells #49

Closed
lenaschneehas opened this issue Aug 26, 2020 · 3 comments

Comments

@lenaschneehas

Dear Solo team,

your tool is great and easy to use, thank you!
Still, I have a question: I am running several 10x samples to identify doublets, and I observed that if my data set consists mainly of cells that are in G2M or S phase (cell cycle classification according to Seurat), the number of detected doublets is severely underestimated. E.g., in a data set with 10^4 cells where more than 50% of the cells are classified as being in G2M/S phase, fewer than 100 cells are found to be doublets. With my other data, where less than 30% is in G2M/S phase, the number of detected doublets is similar to the expected number.

Do you have any idea how we could manage to find the doublets in data sets where cells are proliferating?

Best,

Lena

@njbernstein
Contributor

njbernstein commented Aug 26, 2020

Hi @lenaschneehas, great to know you are using Solo.

Continuous cell states are an interesting issue, and one we have not fully solved when it comes to identifying doublets computationally.

One option is to force Solo to call the expected number of doublets with the -e EXPECTED_NUMBER_OF_DOUBLETS parameter, where the expected number can be derived from the number of cells you loaded. This option is just a band-aid: it ranks cells by their doublet score from the Solo classifier and takes the top n cells as doublets, where n is the EXPECTED_NUMBER_OF_DOUBLETS provided to Solo.
This option has not been well tested. We know empirically that the adjustment Solo makes to the doublet probabilities typically gives the best trade-off between precision and recall, so changing the threshold with the -e parameter would typically decrease performance. The caveat is that all our tests have been done on cell states that are more discrete than cell cycle. See paper.
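
The ranking step described above can be sketched roughly as follows. This is only an illustration of "take the top n cells by doublet score", not Solo's actual code: the function name `call_top_n_doublets`, its arguments, and the example scores are all made up for this sketch.

```python
def call_top_n_doublets(doublet_scores, expected_number_of_doublets):
    """Flag the n highest-scoring cells as doublets.

    Hypothetical helper for illustration; not part of Solo.
    """
    scores = list(doublet_scores)
    # Clamp n to the valid range [0, number of cells].
    n = min(max(int(expected_number_of_doublets), 0), len(scores))
    # Cell indices ordered from highest to lowest doublet score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n])
    return [i in top for i in range(len(scores))]


# Made-up scores for five cells; force exactly two doublet calls.
example_scores = [0.10, 0.85, 0.40, 0.95, 0.20]
example_calls = call_top_n_doublets(example_scores, 2)
print(example_calls)
```

Note that a rule like this always returns n calls regardless of how low the n-th score is, which is why it behaves as a forced threshold rather than a calibrated one.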

Another consideration is that if Solo can't identify the doublets in your data, then they will perhaps not impact your biological inferences as much as doublets that can be identified transcriptionally. Personally, this is the route I would choose.

CCing @davek44 to see if you have any additional advice.

@lenaschneehas
Author

@njbernstein thank you for your fast response. Hmm, setting the expected number of doublets is something I really don't want to do. In the meantime, I ran the Solo analysis on a data set where I have the known doublet information from cell hashing, and this increased the number of doublets found up to 80% compared to the expected number (in this data set, only 50% or more are classified as G1). I'd be happy if you could give me an update if you investigate this any further! Best, Lena
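
For a hashed data set like the one mentioned above, one way to compare computational calls against the hashing-derived labels is to compute precision and recall. The sketch below assumes two per-cell boolean lists in the same cell order; the names `precision_recall`, `predicted`, and `truth`, and the example values, are hypothetical. Keep in mind that cell hashing only reveals doublets formed between differently tagged samples, so the "truth" here is itself incomplete.

```python
def precision_recall(predicted, truth):
    """Precision and recall of doublet calls against reference labels.

    Hypothetical helper: `predicted` and `truth` are equal-length
    sequences of booleans, one entry per cell.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)        # called and real
    fp = sum(1 for p, t in pairs if p and not t)    # called, not real
    fn = sum(1 for p, t in pairs if not p and t)    # real, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for five cells.
predicted = [True, True, False, False, True]
truth = [True, False, True, False, True]
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```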

@njbernstein
Contributor

We don't have anything planned, unfortunately, but I'll be sure to circle back if anything pertinent comes up. I'm going to close this issue.

Please let me know if you have any more issues or have new info regarding this topic!

Best
Nick
