Clustering data with given cluster centers in Python

Question

I have a one-dimensional numerical dataset (though my question applies equally to an n-dimensional one) that I want to cluster, and I already know the values of the cluster centers. So I only want to map each data point to its associated cluster center (the one closest to the data point).

I could write an ad hoc function, but I would much prefer to use a scientific Python library optimised for pandas.Series or numpy arrays, such as SciPy, because my dataset is very big (hundreds of millions of data points).

How can I do this?

Thank you!

Can you provide some example data sets with example cluster points?
– Ffisegydd, Commented Aug 14, 2014 at 9:54
Sorry, I forgot to say that the dataset was numerical. So just take any 1-d array of floats, and suppose that there are five cluster centers which are also floats.
– sweeeeeet, Commented Aug 14, 2014 at 10:06

loopbackbee · Accepted Answer · 2014-08-14 12:36:12Z


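You are looking for SciPy's vq function (scipy.cluster.vq.vq).

The first argument is the data to cluster, and the second is the coordinates of the cluster centers. It returns two arrays. The first gives, for each data point, the index of its nearest cluster center (the label), which is what you want. The second gives the distance from each data point to that center: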

>>> from numpy import array
>>> from scipy.cluster.vq import vq
>>> vq(array([0, 5, 5]), array([1, 2, 3]))
(array([0, 2, 2]), array([ 1.,  2.,  2.]))
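For float data closer to what the question describes, the same call might be used as in the sketch below. The array values and the variable names (data, centers, nearest, points, centers_2d) are made up for illustration. It also shows one way to turn the labels back into center values, and the n-dimensional case, where each row is one observation.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical 1-D data and five known cluster centers (illustrative values).
data = np.array([0.1, 0.9, 2.2, 4.8, 7.5, 9.9])
centers = np.array([0.0, 2.0, 5.0, 8.0, 10.0])

# labels[i] is the index of the center nearest to data[i];
# distances[i] is the distance from data[i] to that center.
labels, distances = vq(data, centers)

# Map each data point to the value of its nearest center.
nearest = centers[labels]

print(labels)   # indices 0, 0, 1, 2, 3, 4
print(nearest)  # center values 0, 0, 2, 5, 8, 10

# n-dimensional case: one observation per row, one center per row.
points = np.array([[0.0, 0.0], [4.0, 5.0], [9.0, 1.0]])
centers_2d = np.array([[0.0, 1.0], [5.0, 5.0], [10.0, 0.0]])
labels_2d, distances_2d = vq(points, centers_2d)

print(labels_2d)  # indices 0, 1, 2
```

Indexing the centers array with the labels gives the "map each point to its center" result directly, without a Python-level loop.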

edited Aug 14, 2014 at 12:36

answered Aug 14, 2014 at 10:14
