RAMClustR fails if run on a large feature table. #19

Closed

hechth opened this issue Jan 31, 2022 · 4 comments · Fixed by #20
Comments

@hechth
Collaborator

hechth commented Jan 31, 2022

If supplied with a feature table with more than 55k entries, RAMClustR fails due to this issue in the ff package (ref). I doubt that issue in ff will be fixed.

Since the allocated matrix is symmetric (I assume, at least) and only the upper triangle is computed anyway, this computation could perhaps be optimized so that the full matrix never has to be held in memory.

@cbroeckl, if you are currently busy and don't have the time to address this issue, I'd be happy to help, and we can come up with an implementation to solve it.

ffmat <- ff::ff(vmode = "double", dim = c(n, n), initdata = 0) ## reset to 1 if necessary
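For context, here is a minimal sketch of the failure mode (my illustration, assuming only the ff package; the exact error text may differ across ff versions). ff() addresses the matrix with 32-bit integer indices, so an n x n allocation fails once n * n exceeds .Machine$integer.max (2^31 - 1):

library(ff)

.Machine$integer.max             ## 2147483647
n <- 46341                       ## smallest n with n * n > .Machine$integer.max
n * n > .Machine$integer.max     ## TRUE
ffmat <- ff(vmode = "double", dim = c(n, n), initdata = 0)
## errors, because the requested length n * n cannot be represented
## as a 32-bit integer index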

@cbroeckl
Owner

@hechth - this is an issue I never really tried to tackle, but would love to have a solution for. I was hitting issues for some time with .Machine$integer.max: since .Machine$integer.max^0.5 = 46341, a feature matrix with more than that many features yields a problematic square matrix - I was assuming this was the issue rather than ff specifically. I would certainly be open to any fix you might suggest. While in the past this wasn't terribly limiting for me, with instrument developments toward increased sensitivity, selectivity, dynamic range, and resolution, I can imagine this is going to become quite limiting.

Thanks for helping to tackle this!

@hechth
Collaborator Author

hechth commented Jan 31, 2022

Thanks for the quick response! The command ffmat <- ff::ff(vmode = "double", dim = c(n, n), initdata = 0) quoted above is called with n being the number of features. The ff function is limited to .Machine$integer.max entries, which, as you say, becomes problematic with more than 46k features (it seems my estimation skills above aren't that great - 46k rather than 55k).

As far as I can see, the matrix is used to store the correlations between the features, so it will be symmetric - you already compute just the upper triangle, I think, and use a block-wise procedure for efficiency. I will have to take a more detailed look, but I think allocating the full matrix can be circumvented entirely, since I don't expect the ff package to be fixed. See the sketch below for the rough direction.
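A rough sketch of what I mean (illustrative only, not RAMClustR's actual code; blockwise_cor, block and min_cor are made-up names/values): correlations are computed one block pair at a time, only upper-triangle pairs above a threshold are kept, and results go into a long (i, j, cor) format, so the full n x n matrix is never allocated.

blockwise_cor <- function(x, block = 2000L, min_cor = 0.5) {
  n <- ncol(x)                             ## number of features
  starts <- seq(1L, n, by = block)
  out <- list()
  for (a in starts) {
    ia <- a:min(a + block - 1L, n)
    for (b in starts[starts >= a]) {       ## only blocks on/above the diagonal
      ib <- b:min(b + block - 1L, n)
      cc <- cor(x[, ia, drop = FALSE], x[, ib, drop = FALSE])
      idx <- which(cc >= min_cor, arr.ind = TRUE)
      gi <- ia[idx[, 1]]                   ## global feature indices (rows)
      gj <- ib[idx[, 2]]                   ## global feature indices (columns)
      keep <- gj > gi                      ## strict upper triangle only
      if (any(keep)) {
        out[[length(out) + 1L]] <-
          data.frame(i = gi[keep], j = gj[keep], cor = cc[idx][keep])
      }
    }
  }
  do.call(rbind, out)
}

With x being the samples-by-features intensity matrix, this only ever holds one block-by-block correlation matrix in memory, plus the retained pairs.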

We will need some time to implement tests to ensure that the program still behaves the same, but we can come up with a fix afterwards. We can post it as a PR to this main repo to make the developments accessible to everyone, and we can discuss implementation details etc. in the PR to find a solution that works for everyone :)

@cbroeckl
Owner

I believe the issue occurs when coming out of ff and into the distance matrix format necessary for hierarchical clustering. If I recall, the distance matrix it takes in forced a square matrix. My memory, however, is fallible, and I could be misremembering this. I think that I explored sparse-form distance matrices and didn't find a solution. At one point, @meowcat forked RAMClustR to try implementing a more memory-efficient approach; if I recall, https://github.com/meowcat/fastliclust was used instead of the native fastcluster algorithm. I can't remember where this ended, though.
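To put numbers on the clustering side (a back-of-the-envelope check, not code from RAMClustR): both stats::hclust() and fastcluster::hclust() consume a dist object of length n * (n - 1) / 2, so even the triangular representation grows quadratically with the feature count:

n <- 46341
n * (n - 1) / 2                  ## ~1.07e9 pairwise distances
8 * n * (n - 1) / 2 / 2^30       ## ~8 GiB of doubles for the dist object alone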

@hechth
Collaborator Author

hechth commented Jan 31, 2022

@cbroeckl I used a debugger in VS Code and traced the failure in my case back to the ff call - though it could also be that more problems come up afterwards. Thank you very much for the hints; maybe @meowcat has some more?
