I found this example of using the kmeans2 algorithm in Python. I can't understand the following part:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

The points are zip(xy[:,0],xy[:,1]), so what is the third value z doing here?

Also what is whitening?

Any explanation is appreciated. Thanks.

1 Answer

First:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

The weirdest thing about this is that it's equivalent to:

z = numpy.sin(0.8*xy[:, 1])

So I don't know why it's written that way. Maybe there's a typo?
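The equivalence is easy to check numerically. This is only a quick sketch; the array y is a made-up stand-in for xy[:, 1], which is not shown in the question:

```python
import numpy as np

# hypothetical stand-in for xy[:, 1]; any float array works here
y = np.linspace(-5.0, 5.0, 101)

a = np.sin(y - 0.2 * y)   # the expression as written in the example
b = np.sin(0.8 * y)       # the simplified form

# compare up to floating-point rounding
same = np.allclose(a, b)
print(same)
```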

Next,

# whiten them
z = whiten(z)

Whitening simply normalizes the variance of the population: the values are divided by their standard deviation, so the result has a standard deviation of 1. See here for a demo:

>>> z = np.sin(.8*xy[:, 1])      # the original z
>>> zw = vq.whiten(z)            # save it under a different name
>>> zn = z / z.std()             # make another 'normalized' array
>>> list(map(np.std, [z, zw, zn]))  # standard deviations of the three arrays (list() is needed on Python 3)
[0.42645, 1.0, 1.0]
>>> np.allclose(zw, zn)          # whitened is the same as normalized
True
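For a 2-D array, whiten scales each column (feature) by its own standard deviation. Here is a minimal sketch of that, assuming made-up random data (the obs array and its scales are invented for illustration and are not the example's xy):

```python
import numpy as np
from scipy.cluster import vq

rng = np.random.default_rng(0)

# hypothetical data: 30 observations, 2 features on very different scales
obs = np.column_stack([rng.normal(0, 1, 30), rng.normal(0, 100, 30)])

white = vq.whiten(obs)            # scipy's whitening
manual = obs / obs.std(axis=0)    # divide each column by its own std

print(white.std(axis=0))          # per-column standard deviations after whitening
print(np.allclose(white, manual))
```

Per-feature scaling like this is usually done so that no single feature dominates the Euclidean distances k-means relies on. Note that the example in the question whitens only z, not xy.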

It's not obvious to me why it is whitened. Anyway, moving along:

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

Let's break that into two parts:

data = np.array(zip(xy[:, 0], xy[:, 1], z))

which is a weird (and slow) way of writing the line below; on Python 3 it also needs list(zip(...)), since zip returns an iterator there:

data = np.column_stack([xy, z])

In any case, you started with two arrays and merged them into one:

>>> xy.shape
(30, 2)
>>> z.shape
(30,)
>>> data.shape
(30, 3)
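To see that the two constructions give the same array, here is a small sketch. The xy used here is random stand-in data, since the original xy isn't shown:

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.random((30, 2))            # hypothetical stand-in for the example's xy
z = np.sin(0.8 * xy[:, 1])

# zip-based version; list(...) is required on Python 3
zipped = np.array(list(zip(xy[:, 0], xy[:, 1], z)))

# column_stack version
stacked = np.column_stack([xy, z])

print(zipped.shape, stacked.shape)
print(np.array_equal(zipped, stacked))
```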

Then it's data that is passed to the kmeans algorithm:

res, idx = vq.kmeans2(data, 3)

So now you can see that it's 30 points in 3d space that are passed to the algorithm, and the confusing part is how the set of points was created.
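Putting the pieces together, a self-contained sketch of the whole pipeline might look like this. The xy here is random stand-in data (an assumption, the original isn't shown), and z is reshaped to a column before whitening just to keep everything 2-D:

```python
import numpy as np
from scipy.cluster import vq

rng = np.random.default_rng(0)
xy = rng.random((30, 2)) * 10                 # hypothetical 2d points

# third coordinate, whitened as in the example
z = vq.whiten(np.sin(0.8 * xy[:, 1])[:, None])

# 30 points in 3d space: columns are x, y, z
data = np.column_stack([xy, z])

# k-means with k == 3; returns the centroids and a cluster label per point
centroids, labels = vq.kmeans2(data, 3)

print(data.shape)        # (30, 3)
print(centroids.shape)   # (3, 3): one 3d centroid per cluster
print(labels.shape)      # (30,): one label per point
```

The initial centroids are chosen randomly, so the labels can differ from run to run.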

2 Comments

That's exactly my doubt: what is the use of the third value z when all I have to do is apply k-means to 2D points? Thanks for explaining that whitening is normalizing the variance, though; I get that part now. Maybe z, the third coordinate, is used to divide the point set into well-defined clusters so that we see some meaningful clusters when plotted.
@kamalbanga, I think the point of the example you linked to is to apply kmeans to 3d points.
