RAMClustR fails if run on large feature table. #19
Comments
@hechth - this is an issue I never really tried to tackle, but would love to have a solution for. I was hitting issues for some time with .Machine$integer.max: since .Machine$integer.max^0.5 ≈ 46341, a square matrix built from a feature table with more features than this would be problematic. I was assuming this was the issue rather than ff specifically. I would certainly be open to any fix you might suggest. While in the past this wasn't terribly limiting for me, with instrument developments toward increased sensitivity, selectivity, dynamic range, and resolution, I can imagine this is going to become quite limiting. Thanks for helping to tackle this!
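For reference, the arithmetic behind that limit can be checked directly in R. This is only a small sketch; `max_square_n`, `cells`, and `fits` are names made up for illustration and are not part of RAMClustR or ff.

```r
# Largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

# Largest n for which an n x n matrix has at most int_max cells.
max_square_n <- floor(sqrt(int_max))

# Count cells as a double so the check itself cannot overflow.
cells <- function(n) as.numeric(n)^2

# TRUE if an n x n matrix stays within the integer-indexable range.
fits <- function(n) cells(n) <= int_max

fits(max_square_n)      # still within the limit
fits(max_square_n + 1)  # one more feature exceeds it
```

So a square matrix stops being addressable with a single integer index once the feature count goes past `floor(sqrt(.Machine$integer.max))`.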
Thanks for the quick response! The command As far as I can see, the matrix is used to store the correlations between the features, which will be a symmetric matrix. I think you already compute just the upper triangle and use a block-wise procedure for efficiency. I will have to take a more detailed look, but I think allocating the large matrix can eventually be circumvented since I don't think the We will need some time to implement some tests to ensure that the program still behaves the same, but we can come up with a fix afterwards. We can post it as a PR to this main repo to make those developments accessible for everyone, and we can discuss implementation details etc. in the PR to find a solution that works for everyone :)
I believe the issue occurs when coming out of ff and into the distance matrix format necessary for hierarchical clustering. If I recall, the distance matrix that was taken in forced a square matrix. My memory, however, is fallible, and I could be misremembering this. I think that I explored sparse-form distance matrices and didn't find a solution. At one point, @meowcat forked RAMClustR to try implementing a more memory-efficient approach. If I recall, https://github.com/meowcat/fastliclust was used instead of the native fastcluster algorithm. I can't remember where this ended though.
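As a point of comparison, base R's `dist` class already stores only one triangle of a symmetric dissimilarity matrix, and `stats::hclust()` accepts it directly. The sketch below uses toy data and plain `1 - cor` as a stand-in dissimilarity; that is an assumption for illustration, not RAMClustR's actual similarity score, and the square matrix is still built here just to show the format.

```r
# Toy data: 6 features (rows) x 10 samples (columns); values are made up.
set.seed(1)
x <- matrix(rnorm(60), nrow = 6)
n <- nrow(x)

# Stand-in dissimilarity between features: 1 - Pearson correlation.
d_square <- 1 - cor(t(x))

# as.dist() keeps only the lower triangle: n * (n - 1) / 2 values.
d <- as.dist(d_square)
length(d) == n * (n - 1) / 2

# hclust() consumes the condensed object without a square matrix.
tree <- hclust(d, method = "average")
```

For n features this format needs n * (n - 1) / 2 values instead of n^2, so roughly half the storage.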
If supplying a feature table with more than 55k entries, RAMClustR fails due to this issue in the ff package ref. I doubt this issue in ff will be fixed.
Since this allocated matrix is symmetric (I assume at least), and only the upper triangle is computed anyway, I think this computation could be optimized so that the full matrix never has to be stored in memory.
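One possible shape for such an optimization is sketched below: correlations are computed one block of features at a time and written straight into a condensed `dist` vector, so only a block x n slice exists at any moment. The function name `condensed_cor_dist`, the `block` argument, and the use of `1 - cor` are all assumptions for illustration; this is not the code in ramclustR.R, and it leaves out whatever other terms RAMClustR's similarity uses.

```r
# Hypothetical helper: build a condensed "dist" object block by block.
# x: numeric matrix with features in rows and samples in columns.
condensed_cor_dist <- function(x, block = 2L) {
  n <- nrow(x)
  out <- numeric(n * (n - 1) / 2)
  for (s in seq(1L, n, by = block)) {
    rows <- s:min(s + block - 1L, n)
    # Correlate this block of features against all features (block x n).
    r <- cor(t(x[rows, , drop = FALSE]), t(x))
    for (a in seq_along(rows)) {
      i <- rows[a]
      if (i == n) next
      j <- (i + 1L):n
      # Position of pair (i, j), i < j, in a "dist" vector (see ?dist).
      k <- n * (i - 1) - i * (i - 1) / 2 + j - i
      out[k] <- 1 - r[a, j]
    }
  }
  structure(out, Size = n, class = "dist")
}

# Toy check against the full square computation; values are made up.
set.seed(1)
x <- matrix(rnorm(60), nrow = 6)
d_block <- condensed_cor_dist(x, block = 2L)
d_full <- as.dist(1 - cor(t(x)))
all.equal(as.vector(d_block), as.vector(d_full))
```

With this layout the stored vector has n * (n - 1) / 2 entries, which stays below `.Machine$integer.max` up to n = 65536 (65536 * 65535 / 2 = 2147450880). Whether the downstream clustering step can consume such an object at that size would still need to be tested.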
@cbroeckl if you are currently busy and don't have the time to address this issue I'd be happy to support and we will come up with an implementation to solve this.
RAMClustR/R/ramclustR.R
Line 667 in 351243d