k-means out of memory error on large data sets #179

jjlynch2 · 2019-10-30T03:13:52Z

I'm looking to switch to Julia for my k-means clustering needs. However, I'm regularly using k-means on three-dimensional data sets with 500,000 data points on average. Typically I use k-means to identify 10% or roughly 50,000 clusters. I am unable to run this as it receives an out of memory error on a machine with 64 gb of ram. Is there a way around this, or should I just develop my own k-means implementation in Julia for high performance?

wildart · 2019-11-12T17:59:56Z

You can try to convert your data to Float32 to reduce memory footprint.

davidbp · 2022-10-28T15:23:38Z

are there any plans to provide a minibatch version such as https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html ?

davidbp · 2023-04-12T09:47:17Z

@jjlynch2 the memory problem you mention happens because the implementation stores a 500,000x50,000 distance matrix when using pairwise. I am interested in making a PR to avoid this. For each datapoint of the 500,000 I think we only need its closest centroid at every iteration, there is no need to keep the distance from every datapoint to all centroids. Doing this would reduce from storing 50,000 x 500,000 to storing 50,000 x 1 numbers during learning.

Ideally it would be very useful to have the option to define a backend implementation when fitting the K-means so that users could opt to different implementations (maybe you care a lot about memory but not that much speed, maybe you want to maximize speed even if at a higher memory cost etc).

codetalker7 · 2024-08-12T13:50:11Z

Hi! I'm facing a similar issue; I'm trying to cluster a matrix of size 128×13694127 into about 65k clusters. And even on a server with a good amount of memory, it's still giving me OOM issues.

On a side note: are there are plans to implement faster k-means algorithm? Or any kind of support for parallelism or GPUs? Python's faiss library is able to do this really efficiently. I also came across the ParallelKMeans.jl package, but seems like it's not actively maintained.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

k-means out of memory error on large data sets #179

k-means out of memory error on large data sets #179

jjlynch2 commented Oct 30, 2019

wildart commented Nov 12, 2019

davidbp commented Oct 28, 2022

davidbp commented Apr 12, 2023 •

edited

Loading

codetalker7 commented Aug 12, 2024

k-means out of memory error on large data sets #179

k-means out of memory error on large data sets #179

Comments

jjlynch2 commented Oct 30, 2019

wildart commented Nov 12, 2019

davidbp commented Oct 28, 2022

davidbp commented Apr 12, 2023 • edited Loading

codetalker7 commented Aug 12, 2024

davidbp commented Apr 12, 2023 •

edited

Loading