
I have a document d1 consisting of lines of the form user_id tag_id, and another document d2 consisting of lines of the form tag_id tag_name. I need to generate clusters of users with similar tagging behaviour, and I want to try this with the k-means algorithm in Python. I am completely new to this and can't figure out how to start. Can anyone give any pointers?

Do I need to first create a separate document for each user from d1, containing his tag vocabulary, and then apply the k-means algorithm to these documents? There are about 1 million users in d1, so I am not sure I am thinking in the right direction. Creating 1 million files?

4 Answers


The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.

Anyway, if you want to give k-means a try, have a look at variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
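To illustrate the cosine-distance idea, here is a minimal sketch (the 4-user toy matrix is made up, and scikit-learn is an assumption on my part): L2-normalizing the rows and then running ordinary k-means approximates spherical k-means, because Euclidean distance between unit vectors is a monotone function of cosine distance.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Toy binary user-tag matrix (rows: users, columns: tags); real data is sparse.
X = np.array([[1, 0, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)

# L2-normalize each row; k-means on unit vectors then clusters by
# cosine similarity rather than raw Euclidean distance.
X_norm = normalize(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_norm)
print(km.labels_)
```

On this toy input, users 0 and 1 (who share tags) land in one cluster and users 2 and 3 in the other.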




As mentioned by @Jacob Eggers, you have to denormalize the data to form the user-tag matrix, which is indeed a sparse one. Use the SciPy package in Python for k-means. See

Scipy Kmeans

for examples and execution. Also check Kmeans in python (Stackoverflow) for more information on k-means clustering in Python.
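A minimal sketch of what that looks like with scipy.cluster.vq.kmeans2 (the tiny matrix is invented; rows are users, columns are tags):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical tiny user-tag matrix; real data would be much larger.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# kmeans2 returns the centroids and the cluster index for each row.
centroids, labels = kmeans2(X, k=2, minit='++', seed=42)
print(labels)
```

Note that SciPy's k-means works on dense arrays; for a million sparse users you would want scikit-learn's KMeans (or MiniBatchKMeans), which accepts sparse input.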



First you need to denormalize the data so that you have one file like this:

userid tag1 tag2 tag3 tag4 ....
0001   1    0    1    0    ....
0002   0    1    1    0    ....
0003   0    0    1    1    ....
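Rather than literally writing that file (let alone a million per-user files), you can build the same matrix in memory as a sparse matrix. A minimal sketch with scipy.sparse, assuming d1 has already been read into a list of "user_id tag_id" lines (the ids below are made up to match the table above):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical contents of d1: one "user_id tag_id" pair per line.
d1_lines = [
    "0001 1", "0001 3",
    "0002 2", "0002 3",
    "0003 3", "0003 4",
]

pairs = [line.split() for line in d1_lines]

# Map raw ids to consecutive row/column indices.
users = sorted({u for u, _ in pairs})
tags = sorted({t for _, t in pairs})
u_idx = {u: i for i, u in enumerate(users)}
t_idx = {t: j for j, t in enumerate(tags)}

rows = [u_idx[u] for u, t in pairs]
cols = [t_idx[t] for u, t in pairs]
data = np.ones(len(pairs))

# One sparse user-tag matrix instead of a million per-user files.
X = coo_matrix((data, (rows, cols)), shape=(len(users), len(tags))).tocsr()
print(X.toarray())
```

For a million users this stays cheap, since only the nonzero (user, tag) pairs are stored.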

Then you need to loop through the k-means algorithm. Here is MATLAB code from the ml-class:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index 
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
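Since the question asks for Python, the same loop can be sketched with NumPy. This is a minimal illustration, not a production implementation; the function names mirror the MATLAB ones above and are my own, and the toy matrix is made up:

```python
import numpy as np

def find_closest_centroids(X, centroids):
    # Cluster assignment step: index of the nearest centroid for each row.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def compute_means(X, idx, K):
    # Move-centroid step: mean of the points assigned to each centroid.
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])

def k_means(X, K, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points.
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iterations):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_means(X, idx, K)
    return idx, centroids

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1],
              [0, 1, 0, 1], [0, 1, 0, 0]], dtype=float)
idx, centroids = k_means(X, K=2)
print(idx)
```

In practice you would use scipy.cluster.vq.kmeans2 or scikit-learn's KMeans rather than hand-rolling the loop, but this shows the two alternating steps the MATLAB code refers to.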



For sparse k-means, see the examples under scikit-learn clustering.
About how many tag ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who's done clustering in that range (or else to a back-of-the-envelope "impossible").

MinHash may be better suited to your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets;
also SO questions/tagged/similarity+algorithm+python.
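The idea behind MinHash is that the fraction of hash functions on which two users' tag sets share a minimum estimates their Jaccard similarity, without ever comparing the full sets. A toy pure-Python sketch (the hash family and the two example tag sets are invented for illustration):

```python
import random

def make_hash_funcs(n, seed=0):
    # A simple (a*h(x) + b) mod p hash family; illustrative, not tuned.
    rng = random.Random(seed)
    prime = 2_147_483_647
    funcs = []
    for _ in range(n):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        funcs.append(lambda x, a=a, b=b: (a * hash(x) + b) % prime)
    return funcs

def minhash_signature(tag_set, hash_funcs):
    # Signature = the minimum hash value over the set, per hash function.
    return [min(h(t) for t in tag_set) for h in hash_funcs]

hash_funcs = make_hash_funcs(128)
u1 = {"python", "numpy", "scipy", "pandas"}
u2 = {"python", "numpy", "scipy", "matplotlib"}
s1 = minhash_signature(u1, hash_funcs)
s2 = minhash_signature(u2, hash_funcs)

# Fraction of matching signature slots estimates the Jaccard similarity;
# the true Jaccard here is 3/5 = 0.6.
est = sum(a == b for a, b in zip(s1, s2)) / len(s1)
print(round(est, 2))
```

With 128 hash functions each user is reduced to a short signature, so a million users can be compared (or bucketed with locality-sensitive hashing) far more cheaply than with pairwise set intersections.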

