
Newbie questions: warning message, model parameters, and outputs #73

Closed
lauren-fish opened this issue Mar 7, 2022 · 5 comments

@lauren-fish
Hi, I've used solo a few times now and am really appreciating how user-friendly it is. It seems to work really well on my dataset.

  1. Every time I run solo, I get the following warning:
    "UserWarning: Make sure the registered X field in anndata contains unnormalized count data."
    I want to confirm that this is a normal warning with scvi-tools and that I'm not screwing something up. Before running Solo, I remove ambient RNA and empty droplets from my dataset using CellBender, then do some subsetting in Seurat to remove droplets with aggressively low or high counts, but that's all.
    I found the warning in this vignette on the scvi-tools website: https://docs.scvi-tools.org/en/0.13.0/user_guide/notebooks/scarches_scvi_tools.html so my gut says it's fine, but I figured I'd check since I'm new at all this.

  2. This relates to the model parameters:
    On the README, the example parameters are:
    ```json
    {
      "n_hidden": 384,
      "n_latent": 64,
      "n_layers": 1,
      "cl_hidden": 128,
      "cl_layers": 1,
      "dropout_rate": 0.2,
      "learning_rate": 0.001,
      "valid_pct": 0.10
    }
    ```
    But the model.json file included with Solo has:
    ```json
    {
      "n_hidden": 128,
      "n_latent": 16,
      "cl_hidden": 64,
      "cl_layers": 1,
      "dropout_rate": 0.1,
      "learning_rate": 0.001,
      "valid_pct": 0.10
    }
    ```
    Which of these should I use for regular snRNA-seq data? Is one of these examples intended for use with demultiplexing/hashsolo?

  3. I wanted to make sure that I should go by the is_doublet.csv binary predictions, and not worry about the preds.npy files etc.
    My data are from muscle nuclei and when I create a FeaturePlot for a particular muscle marker, many of the non-muscle cells that express this marker are categorized as doublets in is_doublet.csv, so the results seem reasonable.
    I got some example code from a colleague for adding Solo calls to Seurat metadata; it uses Rcpp to import the preds.npy output into R, which doesn't work well for me. It seems to be something to do with my newer versions of either Rcpp or Solo: he gets consistent number strings for the binary values ("T" as one number, 0 for "F") when importing the preds.npy file into R, whereas I get more than two different number strings. When I open my preds.npy output in Python it IS binary, and matches what I see in is_doublet.csv. I'm happy to cut out the Rcpp middleman and just use is_doublet.csv, but wanted to make sure that this is correct.
    I just started using solo in Dec 2021/Jan 2022, so I suspect my colleague wrote his code for data generated with an older version of solo, given what you mentioned in Difference between is_doublet and preds #62.
    However, it's been several months since that issue was resolved so I wanted to make sure that I'm using the correct output files.
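For anyone else cross-checking the two outputs, here is a minimal Python sketch. The file names come from this thread, but the CSV layout is an assumption (a single header-less column of doublet calls, taken as the last column) — adjust the column index to whatever your version of Solo writes.

```python
import numpy as np
import pandas as pd


def outputs_agree(npy_path: str, csv_path: str) -> bool:
    """Return True if preds.npy and is_doublet.csv give the same doublet calls."""
    preds = np.load(npy_path).astype(bool).ravel()
    # Assumption: the doublet call is the last column of is_doublet.csv.
    calls = pd.read_csv(csv_path, header=None).iloc[:, -1].astype(bool).to_numpy()
    return np.array_equal(preds, calls)
```

If this returns True on your files, the Rcpp import step really is the only thing disagreeing, and using is_doublet.csv directly loses nothing.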

Thank you so much for your time!!

@lauren-fish lauren-fish changed the title Warning message, model parameters, and another general question Newbie questions: warning message, model parameters, and outputs Mar 7, 2022
@njbernstein
Contributor

Hi there, sorry for the very late response.

  1. You should be fine. I'd double-check that they are counts, but I get that warning even when I run it on counts.
  2. Either one of those will work. I'll update them to be the same, but default to the one in the model.json file.
  3. Yes, use the is_doublet.csv output.

@lauren-fish
Author

No worries! Thanks for your answers.

@lauren-fish
Author

Actually, I have another question: what kind of subsetting do you suggest before using solo?
I think I'm ok with removing the bottom 1% of cells by nFeature, or setting a threshold of >150 features, but is it ok/recommended to remove the top 1% of cells (by nFeature)?

The clustering-aware tools seem to work best when low-UMI cells are removed but all of the upper outliers are left in place, and this makes sense.

However, given the way solo was built to handle the data, it seems like this might not matter too much?
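For concreteness, the percentile-style upper cutoff being asked about can be computed like this. This is a sketch in Python rather than Seurat, and `n_features` (the per-cell feature counts) is assumed to be already computed:

```python
import numpy as np


def upper_cutoff_mask(n_features: np.ndarray, pct: float = 99.0) -> np.ndarray:
    """Boolean mask keeping cells at or below the given nFeature percentile."""
    cutoff = np.percentile(n_features, pct)
    return n_features <= cutoff
```

The mask can then be used to subset the object before running solo (e.g. `adata = adata[upper_cutoff_mask(n_features)]` in an AnnData workflow).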

@lauren-fish lauren-fish reopened this Apr 17, 2022
@davek44
Collaborator

davek44 commented Apr 24, 2022

If there is a threshold above which you are certain that a cell is a doublet, you should go ahead and pre-remove those before running Solo. Otherwise, the training phase will incorrectly treat those as singlets.
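As a sketch of that pre-filtering step (pure NumPy; the `40000` ceiling in the usage comment is a made-up placeholder — pick yours from your own counts distribution):

```python
import numpy as np


def drop_certain_doublets(counts_per_cell: np.ndarray, ceiling: float) -> np.ndarray:
    """Mask out cells whose total counts exceed a hard ceiling, so Solo's
    training phase never sees near-certain doublets labeled as singlets."""
    return counts_per_cell <= ceiling


# e.g. keep = drop_certain_doublets(counts_per_cell, 40000)  # ceiling is dataset-specific
```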

@lauren-fish
Author

Thanks Dave, that makes a lot of sense, given how Solo works compared to other methods.

@davek44 davek44 closed this as completed Apr 24, 2022