Using scipy's kmeans2 function in python

Question

I found this example of using the kmeans2 algorithm in Python. I don't understand the following part:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

The points are zip(xy[:,0],xy[:,1]), so what is the third value z doing here?

Also, what is whitening?

Any explanation is appreciated. Thanks.

askewchan · Accepted Answer · 2013-11-29 03:04:34Z

9

First:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

The weirdest thing about this is that it's equivalent to:

z = numpy.sin(0.8*xy[:, 1])

So I don't know why it's written that way. Maybe there's a typo?
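The equivalence is easy to check numerically. This is only a sketch: the `xy` array here is a hypothetical stand-in (30 random 2-D points), since the original example's data isn't shown.

```python
import numpy as np

# Hypothetical stand-in for the example's xy array: 30 random 2-D points.
rng = np.random.default_rng(0)
xy = rng.random((30, 2))

z_original = np.sin(xy[:, 1] - 0.2 * xy[:, 1])  # as written in the example
z_simplified = np.sin(0.8 * xy[:, 1])           # y - 0.2*y == 0.8*y

# True if the two arrays agree to within floating-point tolerance.
same = np.allclose(z_original, z_simplified)
print(same)
```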

Next,

# whiten them
z = whiten(z)

Whitening is simply normalizing the variance of the population: each value is divided by the standard deviation, so the result has unit variance. Here is a demo:

>>> import numpy as np
>>> from scipy.cluster import vq
>>> z = np.sin(.8*xy[:, 1])      # the original z
>>> zw = vq.whiten(z)            # save it under a different name
>>> zn = z / z.std()             # make another 'normalized' array
>>> list(map(np.std, [z, zw, zn]))   # standard deviations of the three arrays
[0.42645, 1.0, 1.0]
>>> np.allclose(zw, zn)          # whitened is the same as normalized
True

It's not obvious to me why it is whitened. Anyway, moving along:

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

Let's break that into two parts:

data = np.array(zip(xy[:, 0], xy[:, 1], z))

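which is a weird (and slow) way of writing the line below; in Python 3, where zip returns an iterator, it would also need list(zip(...)) to give a 2-D array at all: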

data = np.column_stack([xy, z])

In any case, you start with two arrays and merge them into one:

>>> xy.shape
(30, 2)
>>> z.shape
(30,)
>>> data.shape
(30, 3)
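To confirm that the two constructions give the same array, here is a small sketch. Again, `xy` is a hypothetical stand-in for the example's data, and `list(...)` is added around `zip` so it also runs on Python 3.

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.random((30, 2))      # hypothetical stand-in for the example's points
z = np.sin(0.8 * xy[:, 1])

# The zip-based version; list(...) is needed on Python 3,
# where zip returns an iterator rather than a list.
data_zip = np.array(list(zip(xy[:, 0], xy[:, 1], z)))

# The direct version: append z as a third column.
data_stack = np.column_stack([xy, z])

print(data_zip.shape, data_stack.shape)
print(np.array_equal(data_zip, data_stack))
```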

Then it's data that is passed to the kmeans algorithm:

res, idx = vq.kmeans2(data, 3)

So now you can see that it's 30 points in 3d space that are passed to the algorithm, and the confusing part is how the set of points was created.
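Putting it together, here is a rough end-to-end sketch. The data is made up (30 random points standing in for `xy`), and `minit="points"` is my own choice here, not something the example uses; it simply picks the initial centroids from the data points. The last call shows what clustering only the 2-D points would look like, for comparison.

```python
import numpy as np
from scipy.cluster import vq

rng = np.random.default_rng(0)
xy = rng.random((30, 2))            # hypothetical stand-in for the example's points

z = np.sin(0.8 * xy[:, 1])
z = z / z.std()                     # same scaling that whitening applies to z

data = np.column_stack([xy, z])     # 30 points in 3-D

# Assumption: minit="points" (the example relies on the default initialization).
centroids, labels = vq.kmeans2(data, 3, minit="points")

print(centroids.shape)              # one 3-D centroid per cluster
print(labels.shape)                 # one cluster index per input point

# For comparison: clustering only the 2-D points, ignoring z.
centroids_2d, labels_2d = vq.kmeans2(xy, 3, minit="points")
print(centroids_2d.shape)
```

The results depend on the random initialization, so the labels will generally differ from run to run.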


2 Comments

kamalbanga Over a year ago

That is exactly my doubt: what is the use of the third value z when all I have to do is apply k-means to 2D points? Thanks for explaining that whitening is normalizing the variance; I get that part now. Maybe z, the third coordinate, is used to divide the point set into well-defined clusters so that we see some meaningful clusters when plotted.

askewchan Over a year ago

@kamalbanga, I think the point of the example you linked to is to apply kmeans to 3d points.
