Solo encountering NaN values with 10x data #71
Comments
Hi, I also encountered the same error. For me, the UMI filter did not resolve this issue for all samples either. I work on an HPC cluster, and depending on the GPU node the error sometimes shows up and sometimes does not. Thanks for any further support on this.
Hi all, @symbiologist @Elhl93, I was on leave for a couple of weeks, so sorry about the slow response. Removing cells and genes with very high or very low counts is sometimes necessary for the model to converge. I will often use filters along the following lines to prevent this issue:
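(The exact filter snippet did not survive in this copy of the thread; the following is a minimal sketch of count-based filtering with Scanpy, with illustrative thresholds rather than the maintainer's exact values.)

```python
import scanpy as sc

# Illustrative thresholds only: the idea is to drop cells and genes whose
# counts are orders of magnitude away from the rest of the dataset.
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")  # hypothetical input file

sc.pp.filter_cells(adata, min_counts=1000)    # drop very low-count cells
sc.pp.filter_cells(adata, max_counts=50000)   # drop very high-count cells
sc.pp.filter_genes(adata, min_cells=3)        # drop genes detected in almost no cells

adata.write_h5ad("filtered_for_solo.h5ad")    # hypothetical output passed to solo
```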
I will add this suggestion to the README.
Thanks for the tip, and I appreciate the response! I'll echo @Elhl93's sentiments: I did some more testing and found that filtering on UMIs and genes didn't always determine whether solo would throw this error. I found that changing n_hidden (to 320 instead of 384) or dropout_rate (to 0.1 from 0.2) in the model parameters would resolve the issue on the same unfiltered dataset. Sometimes changing the seed value would resolve it too, though that was not as reliable as reducing n_hidden. I've also noticed that reducing n_hidden typically leads to more doublets being called. My knowledge of VAEs is limited: which parameters would be most appropriate to change (or not change) in this case? Should we try to keep the maximum n_hidden value and find a dropout_rate and seed value that let solo run, or is it more robust to simply reduce n_hidden? Thank you!
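(For reference, a small sketch of the kind of parameter change described above, editing the model json from the original post; the file names here are assumptions.)

```python
import json

# Lower the network width and dropout rate in the solo model json,
# as described in the comment above. "model_json.json" is an assumed name.
with open("model_json.json") as f:
    params = json.load(f)

params["n_hidden"] = 320      # reduced from 384
params["dropout_rate"] = 0.1  # reduced from 0.2

with open("model_json_reduced.json", "w") as f:
    json.dump(params, f, indent=2)
```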
Interesting. I'm always hesitant to change the defaults: deep learning models are tricky to get to converge on every dataset, and I've found the defaults to work in the majority of cases once I remove observations (cells) or features (genes) whose counts are orders of magnitude different from the rest. If you can share your data, I'm happy to take a look and see if I can understand why it's happening.
Hi, when I change the learning rate it runs stably. I have 15 samples; previously 6 of them errored, now none do. However, it takes more time. Just for your information, I also tested an older version of solo (0.3), and that version runs stably with the default settings. Since the learning rate should be flexible, I would suggest making it a parameter of the function: if I see it correctly, you currently can't change the learning rate for solo.train in the json file. I forked the repo and exchanged the hard-coded learning rate for a parameter with a different default (a rough sketch of the idea is below); happy to make a pull request if this is of interest.
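(A minimal sketch of the kind of change described, assuming an scvi-tools-style training call; the wrapper name and argument names are illustrative, not solo's actual code.)

```python
# Hypothetical sketch: expose a previously hard-coded learning rate as an
# argument with a default so the json parameters can override it.
def train_solo_model(model, max_epochs=400, learning_rate=1e-4):
    # The learning rate used to be fixed inside this function; it is now
    # passed through to the training plan.
    model.train(
        max_epochs=max_epochs,
        plan_kwargs={"lr": learning_rate},
    )
```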
Always happy to take a pull request! It's a good idea.
I made a pull request. I tested the branch on my machine and it worked fine.
I'm getting this error while running solo on one (but not all) of my samples:
```
ValueError: Expected parameter loc (Tensor of shape (128, 64)) of distribution Normal(loc: torch.Size([128, 64]), scale: torch.Size([128, 64])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)
```
Using this as my model json file, if helpful:
{ "n_hidden": 384, "n_latent": 64, "n_layers": 1, "cl_hidden": 128, "cl_layers": 1, "dropout_rate": 0.2, "learning_rate": 0.001, "valid_pct": 0.10 }
Interestingly, if I filter out cells with < 1000 UMI, this error goes away. Weirdly, I have several other samples where this does not appear to be a problem at all (regardless of filtering).
Any thoughts on how to resolve this without applying the UMI cutoff? Thanks!