Python k-means algorithm

Question

I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.

I did a similar implementation for images. You can use 2D arrays instead of RGB values. It's very naive, but it works for me: github.com/keremgocen/pattern-recog-notes.
– Kerem, Commented May 18, 2015 at 1:35

tom10 · Accepted Answer · 2020-10-11 20:55:59Z

57

Update: (Eleven years after this original answer, it's probably time for an update.)

First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, I'd suggest looking especially at the parameters that each method requires and deciding whether you can provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
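To make the point about the required parameter concrete, here is a minimal sketch of plain Lloyd's k-means in NumPy, run for several values of k on synthetic data. The function name lloyd_kmeans, the synthetic blobs, and the seeds are all assumptions made for illustration; this is not any library's implementation. Printing the within-cluster sum of squares for each k is one rough way to see where adding clusters stops helping.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (centers, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k data points chosen at random without replacement.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and within-cluster sum of squared distances.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(points)), labels].sum()
    return centers, labels, inertia

# Synthetic data: three tight blobs (made up for this example).
rng = np.random.default_rng(42)
blob_means = [(1, 1), (2, 4), (3, 2)]
points = np.vstack([rng.normal(m, 0.2, size=(20, 2)) for m in blob_means])

inertias = {k: lloyd_kmeans(points, k)[2] for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

If the printed values drop sharply up to some k and then flatten, that k is a reasonable first guess; if there is no clear bend, a method that does not need k may suit the data better.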

Here are some resources:

Old answer:

Scipy's clustering implementations work well, and they include a k-means implementation.
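As a rough illustration of the SciPy route, the sketch below calls scipy.cluster.vq.kmeans2 on synthetic 2D points. The data, the seed, and the choice of minit="points" (start from randomly chosen data points) are assumptions for the example, not recommendations from the answer.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic coordinates: three blobs of 20 points each (made up for this example).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(m, 0.2, size=(20, 2))
                    for m in [(1, 1), (2, 4), (3, 2)]])

# minit="points" picks k of the data points as the starting centroids.
centroids, labels = kmeans2(points, 3, minit="points")

print(centroids)  # one row per cluster
print(labels)     # cluster index for each input point
```

The result depends on the random starting points, so repeated runs can label the clusters differently or settle in different local optima.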

There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
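For a feel of what agglomerative clustering looks like in code, here is a small sketch using scipy.cluster.hierarchy as a stand-in (that module choice, the Ward linkage, and the distance threshold of 4.0 are assumptions for this example). The tree is cut at a distance threshold, so the number of clusters falls out of the data instead of being supplied up front.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic coordinates: three well-separated blobs (made up for this example).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(m, 0.2, size=(20, 2))
                    for m in [(1, 1), (2, 4), (3, 2)]])

# Build the merge tree bottom-up with Ward linkage.
tree = linkage(points, method="ward")

# Cut the tree at a distance threshold instead of fixing the cluster count.
labels = fcluster(tree, t=4.0, criterion="distance")

print(len(set(labels)))  # number of clusters found at this threshold
```

The threshold replaces k as the parameter you have to choose, so this trades one decision for another; a dendrogram of the tree can help pick it.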

edited Oct 11, 2020 at 20:55

answered Oct 9, 2009 at 22:10

tom10


1 Comment

shallow_water Over a year ago

Why is scipy preferred over sklearn for k-means? Having used both recently, I found I liked sklearn's implementation more.

Vebjorn Ljosa · Accepted Answer · 2010-02-09 03:31:12Z

29

SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.

For now, I would recommend using PyCluster instead. Example usage:

>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean, 
                                                            0.03 * numpy.diag([1,1]),
                                                            20) 
                           for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error   # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound  # Number of times this solution was found
1

edited Feb 9, 2010 at 3:31

answered Feb 8, 2010 at 20:03

Vebjorn Ljosa


4 Comments

Sid Over a year ago

It also seems that the scipy cluster kmeans function does not accept a distance method and always uses Euclidean. Another reason to use PyCluster?

monkut Over a year ago

just hit the error mentioned... I see in your example the cluster groupings, but can you get the cluster "center"?

Vebjorn Ljosa Over a year ago

@monkut, use numpy.vstack([points[labels == i].mean(0) for i in range(labels.max() + 1)]) to get the centers of the clusters.

forefinger Over a year ago

You can get rid of the error in kmeans2 by using the keyword argument minit='points'
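The one-liner for cluster centers given in the comments can be checked on a tiny hand-made example. The points and labels below are invented for illustration; in practice labels would come from whichever clustering call you used.

```python
import numpy as np

# Six hand-made points with known cluster labels (illustrative data).
points = np.array([[1.0, 1.0], [1.2, 0.8],
                   [2.0, 4.0], [2.2, 4.2],
                   [3.0, 2.0], [2.8, 2.2]])
labels = np.array([1, 1, 0, 0, 2, 2])

# Mean of the points in each cluster, stacked in label order.
centers = np.vstack([points[labels == i].mean(axis=0)
                     for i in range(labels.max() + 1)])
print(centers)
```

Row i of centers is the center of cluster i. Note that this assumes every label from 0 to labels.max() is used by at least one point; an unused label would produce a row of NaN values.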

Nathan · Accepted Answer · 2010-04-09 05:21:50Z

For continuous data, k-means is very easy.

You need a list of your means. For each data point, find the mean it's closest to and average the new data point into it. Your means will then represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to keep the old data to obtain the new average. Given the old average k, the next data point x, and a constant n which is the number of past data points to keep the average over, the new average is

k*(1-(1/n)) + x*(1/n)

Here is the full code in Python (param plays the role of 1/n):

from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01 # bigger numbers make the means change faster
# must be between 0 and 1

for x in data:
    closest_k = 0
    smallest_error = float("inf")
    # find the index of the mean nearest to this data point
    for i, mean in enumerate(means):
        error = abs(x - mean)
        if error < smallest_error:
            smallest_error = error
            closest_k = i
    # after the search, nudge only the nearest mean toward the data point
    means[closest_k] = means[closest_k]*(1-param) + x*param

You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!

This is a great online-learning k-means algorithm! One thing to watch: the update line, means[closest_k] = means[closest_k]*(1-param) + x*(param), must sit one indentation level outside the inner loop, so that it runs once per data point after the closest mean has been found.
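Since the question is about coordinates, the same running-average idea can be sketched for 2D points with NumPy. The function name online_kmeans, the rate of 0.05, the starting means, and the synthetic stream are assumptions for this example; the update rule itself is the one described in the answer.

```python
import numpy as np

def online_kmeans(stream, means, rate=0.01):
    """Move the nearest mean a fraction `rate` of the way toward each point."""
    means = np.array(means, dtype=float)  # copy so the caller's array is untouched
    for x in stream:
        # Index of the mean closest to this point (Euclidean distance).
        closest = np.linalg.norm(means - x, axis=1).argmin()
        means[closest] = means[closest] * (1 - rate) + x * rate
    return means

# Synthetic stream: two groups of points around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.1, size=(500, 2)) for m in [(0, 0), (5, 5)]])
rng.shuffle(data)  # interleave the two groups, as a live stream would

start = np.array([[1.0, 1.0], [4.0, 4.0]])
final_means = online_kmeans(data, start, rate=0.05)
print(final_means)
```

A larger rate makes the means track recent points more closely at the cost of more jitter; a smaller rate gives smoother means that adapt more slowly.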

Community · Accepted Answer · 2017-05-23 10:31:34Z

6

(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
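To show why the choice of metric matters, here is a small sketch of just the assignment step of k-means using scipy.spatial.distance.cdist. It is not the linked kmeans.py; the points and centers are invented so that the first point changes cluster depending on the metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hand-made points and centers (illustrative values only).
points = np.array([[0.0, 0.0], [3.0, 0.5], [2.0, 3.0]])
centers = np.array([[3.0, 0.0], [2.0, 2.0]])

assignments = {}
for metric in ("euclidean", "cityblock", "chebyshev"):
    # Distance from every point to every center, then pick the nearest center.
    assignments[metric] = cdist(points, centers, metric=metric).argmin(axis=1)
    print(metric, assignments[metric])
```

The point at the origin is 3 units from the first center along one axis and 2 units from the second center along each axis, so city-block distance assigns it to the first center while Euclidean and Chebyshev distance assign it to the second. Keep in mind that averaging points is the natural center update for Euclidean distance; with other metrics the mean is a heuristic choice.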

edited May 23, 2017 at 10:31


answered Jul 4, 2011 at 14:43

denis


Jacob · Accepted Answer · 2009-10-09 19:26:39Z

5

From Wikipedia: you could use SciPy's K-means clustering and vector quantization.

Or, you could use a Python wrapper for OpenCV, ctypes-opencv.

Or you could use OpenCV's new Python interface and their kmeans implementation.

edited Oct 9, 2009 at 19:26

answered Oct 9, 2009 at 19:21

Jacob


thedatastrategist · Accepted Answer · 2017-02-12 12:45:48Z

SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as: kmeans = KMeans(n_clusters=2, random_state=0).fit(X).

This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

(courtesy of SciKit Learn's documentation, linked above)
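Because the question also asks about caching, here is one possible sketch of storing the centroids and reusing them later without refitting. It uses only NumPy; the temporary file path and the helper name nearest_center are hypothetical, and the centers are copied by hand from the cluster_centers_ output shown above.

```python
import os
import tempfile

import numpy as np

# Centroids copied from the cluster_centers_ output above.
centers = np.array([[1.0, 2.0], [4.0, 2.0]])

# Hypothetical cache location; any writable path would do.
cache_path = os.path.join(tempfile.mkdtemp(), "centers.npy")
np.save(cache_path, centers)

# Later (for example in another process): load the cached centroids.
cached = np.load(cache_path)

def nearest_center(points, centers):
    """Index of the closest center for each point (squared Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

predicted = nearest_center([[0, 0], [4, 4]], cached)
print(predicted)  # [0 1]
```

The assignments agree with the kmeans.predict output shown above for the same two query points.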

George Silva · Accepted Answer · 2009-10-09 19:35:19Z

0

You can also use GDAL, which has many many functions to work with spatial data.

answered Oct 9, 2009 at 19:35

George Silva


Guest · Accepted Answer · 2014-09-14 20:52:51Z

0

Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.

edited Sep 14, 2014 at 20:52

answered Sep 14, 2014 at 20:47

Guest
