I have a one-dimensional numerical dataset (though my question also applies to an n-dimensional numerical dataset) that I want to cluster, and I already know the values of the cluster centers. So I only need to map each data point to its associated cluster center (the one closest to the data point).

I could write an ad hoc function, but I would much prefer to use a scientific Python library optimised to work on pandas.Series or numpy arrays, such as SciPy, because my dataset is very big (hundreds of millions of data points).

How can I do this?

Thank you!

  • Can you provide some example data sets with example cluster points? Commented Aug 14, 2014 at 9:54
  • Sorry I forgot to say that the dataset was numerical. So just take any 1-d array of floats. And suppose that there are five cluster centers which are also floats. Commented Aug 14, 2014 at 10:06

1 Answer

You are looking for SciPy's vq function (scipy.cluster.vq.vq).

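The first argument is the data to cluster, and the second is the coordinates of the cluster centers. The function returns a pair of arrays: the first holds, for each data point, the index of its nearest cluster center (the label), which is what you want; the second holds the distance from each data point to that center: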

>>> from numpy import array
>>> from scipy.cluster.vq import vq
>>> vq(array([0, 5, 5]), array([1, 2, 3]))
(array([0, 2, 2]), array([ 1.,  2.,  2.]))
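
For the one-dimensional case with hundreds of millions of points, a NumPy-only alternative may be worth trying. The sketch below is not part of vq: it sorts the centers, takes the midpoints between neighbouring centers as decision boundaries, and bins each point with np.searchsorted. The function name nearest_center_1d is made up for illustration, and the sketch assumes the data fits in memory as a float array.

```python
import numpy as np

def nearest_center_1d(data, centers):
    """For each value in `data`, return the index of the closest value in `centers`.

    Hypothetical helper for illustration; 1-D inputs only.
    """
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # searchsorted needs sorted boundaries, so sort the centers first
    # and remember how to map back to the original center indices.
    order = np.argsort(centers)
    sorted_centers = centers[order]
    # Midpoints between neighbouring centers are the decision boundaries.
    boundaries = (sorted_centers[:-1] + sorted_centers[1:]) / 2.0
    # A point exactly on a boundary is assigned to the lower-valued center.
    return order[np.searchsorted(boundaries, data)]

labels = nearest_center_1d([0, 5, 5], [1, 2, 3])
centers_for_points = np.asarray([1, 2, 3])[labels]
```

Here labels holds the same indices as the first array returned by vq in the example above, and indexing the centers with labels gives the center value for each point. For n-dimensional data this midpoint trick does not apply; vq, or a nearest-neighbour query such as scipy.spatial.cKDTree(centers).query(data), is the more natural choice there.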