
k-means out of memory error on large data sets #179

Open
jjlynch2 opened this issue Oct 30, 2019 · 4 comments

Comments

@jjlynch2

I'm looking to switch to Julia for my k-means clustering needs. However, I regularly run k-means on three-dimensional data sets with around 500,000 data points, and I typically ask for roughly 10% of that, i.e. about 50,000 clusters. I'm unable to run this because it hits an out-of-memory error on a machine with 64 GB of RAM. Is there a way around this, or should I just develop my own high-performance k-means implementation in Julia?

@wildart
Contributor

wildart commented Nov 12, 2019

You can try converting your data to Float32 to reduce the memory footprint.
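
For reference, a minimal sketch of that suggestion (sizes are shrunk here so the snippet actually runs; the real case in this issue is 3 × 500_000 with k = 50_000, and since the full point-to-centroid distance matrix is still built internally, Float32 only halves the footprint rather than removing the problem):

```julia
using Clustering

# Illustrative sizes only; the data in this issue is 3 × 500_000 with k = 50_000.
X = rand(Float64, 3, 10_000)

# Float32 halves the memory used for the data and for the distance
# matrix that kmeans builds internally.
X32 = Float32.(X)

result = kmeans(X32, 100; maxiter=100)
size(result.centers)          # (3, 100): one centroid per column
length(assignments(result))   # 10_000: one cluster index per point
```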

@davidbp

davidbp commented Oct 28, 2022

Are there any plans to provide a mini-batch version, such as https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html ?
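
For context, the update that a mini-batch variant performs is roughly the following (a sketch only, following the general mini-batch k-means idea behind scikit-learn's MiniBatchKMeans; none of these names exist in Clustering.jl):

```julia
using Random

# Hypothetical mini-batch k-means sketch: X is d×n (one point per column),
# k is the number of clusters. Only one mini-batch is touched per iteration.
function minibatch_kmeans(X::AbstractMatrix{T}, k::Int;
                          batchsize::Int = 1024, iters::Int = 100) where {T<:Real}
    d, n = size(X)
    centers = float.(X[:, randperm(n)[1:k]])   # initialize from random points
    counts = zeros(Int, k)                     # per-center update counts
    for _ in 1:iters
        batch = rand(1:n, batchsize)           # sample a mini-batch of column indices
        for j in batch
            x = @view X[:, j]
            # nearest center for this point only -- no n×k distance matrix
            best, bestd = 1, typemax(float(T))
            for c in 1:k
                dist = sum(abs2, x .- @view(centers[:, c]))
                if dist < bestd
                    best, bestd = c, dist
                end
            end
            counts[best] += 1
            eta = 1 / counts[best]             # per-center learning rate
            centers[:, best] .= (1 - eta) .* centers[:, best] .+ eta .* x
        end
    end
    return centers
end
```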

@davidbp

davidbp commented Apr 12, 2023

@jjlynch2 the memory problem you mention happens because the implementation materializes a 500,000×50,000 distance matrix when it computes pairwise distances between points and centroids. I am interested in making a PR to avoid this. For each of the 500,000 data points we only need its closest centroid at each iteration; there is no need to keep the distances from every data point to all centroids. Doing this would reduce the per-iteration storage from 500,000 × 50,000 distances to one nearest-centroid index (and distance) per data point.

Ideally it would be very useful to have the option to choose a backend implementation when fitting k-means, so that users could opt into different trade-offs (maybe you care a lot about memory but not so much about speed, maybe you want to maximize speed even at a higher memory cost, etc.); see the sketch below for the memory-lean direction.
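
For illustration, a minimal sketch of what such a memory-lean assignment step could look like, processing the points in blocks so that only a k × blocksize slice of distances exists at any time (the function name and signature are hypothetical, not current Clustering.jl code; it assumes Distances.jl's pairwise):

```julia
using Distances

# Hypothetical chunked assignment step: X is d×n, centers is d×k.
# Only a k × blocksize distance block is materialized at a time, instead of k × n.
function assign_chunked(X::AbstractMatrix, centers::AbstractMatrix; blocksize::Int = 4096)
    n = size(X, 2)
    assignments = Vector{Int}(undef, n)
    for start in 1:blocksize:n
        stop = min(start + blocksize - 1, n)
        block = @view X[:, start:stop]
        D = pairwise(SqEuclidean(), centers, block; dims=2)   # k × blocksize
        for (j, col) in enumerate(eachcol(D))
            assignments[start + j - 1] = argmin(col)
        end
    end
    return assignments
end
```

With n = 500_000, k = 50_000 and Float64, the full distance matrix takes 500_000 × 50_000 × 8 bytes ≈ 200 GB, which is why a 64 GB machine runs out of memory; the blocked version above only needs roughly 4096 × 50_000 × 8 bytes ≈ 1.6 GB at a time.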

@codetalker7

Hi! I'm facing a similar issue: I'm trying to cluster a 128×13694127 matrix into about 65k clusters, and even on a server with a good amount of memory it still runs out of memory.
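
A rough back-of-the-envelope estimate (assuming the full n × k distance matrix is materialized, as discussed above, and Float32 storage) shows why even a large server runs out of memory here:

```julia
n, k = 13_694_127, 65_536
n * k * sizeof(Float32) / 2^40    # ≈ 3.3 TiB for the distance matrix alone
```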

On a side note: are there any plans to implement a faster k-means algorithm, or any kind of support for parallelism or GPUs? Python's faiss library is able to do this really efficiently. I also came across the ParallelKMeans.jl package, but it seems like it's not actively maintained.
