
I have a document d1 consisting of lines of the form user_id tag_id, and another document d2 consisting of lines of the form tag_id tag_name. I need to generate clusters of users with similar tagging behaviour, and I want to try this with the k-means algorithm in Python. I am completely new to this and can't figure out how to start. Can anyone give any pointers?

Do I need to first create a separate document for each user from d1, containing his tag vocabulary, and then apply the k-means algorithm to these documents? There are about 1 million users in d1, so I am not sure I am thinking in the right direction. Creating 1 million files?

4 Answers


The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.

Anyway, if you want to give k-means a try, have a look at variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
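To illustrate the cosine-distance idea, here is a minimal sketch (the 4-user toy matrix is made up, and scikit-learn is an assumption on my part): L2-normalizing the rows and then running ordinary k-means approximates spherical k-means, because Euclidean distance between unit vectors is a monotone function of cosine distance.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Toy binary user-tag matrix (rows: users, columns: tags); real data is sparse.
X = np.array([[1, 0, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)

# L2-normalize each row; k-means on unit vectors then clusters by
# cosine similarity rather than raw Euclidean distance.
X_norm = normalize(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_norm)
print(km.labels_)
```

On this toy input, users 0 and 1 (who share tags) land in one cluster and users 2 and 3 in the other.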




As mentioned by @Jacob Eggers, you have to denormalize the data to form the user-tag matrix, which is indeed a sparse one. Use the SciPy package in Python for k-means. See

Scipy Kmeans

for examples and execution. Also check Kmeans in python (Stackoverflow) for more information on k-means clustering in Python.
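A minimal sketch of what that looks like with scipy.cluster.vq.kmeans2 (the tiny matrix is invented; rows are users, columns are tags):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical tiny user-tag matrix; real data would be much larger.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# kmeans2 returns the centroids and the cluster index for each row.
centroids, labels = kmeans2(X, k=2, minit='++', seed=42)
print(labels)
```

Note that SciPy's k-means works on dense arrays; for a million sparse users you would want scikit-learn's KMeans (or MiniBatchKMeans), which accepts sparse input.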



First you need to denormalize the data so that you have one file like this:

userid tag1 tag2 tag3 tag4 ....
0001   1    0    1    0    ....
0002   0    1    1    0    ....
0003   0    0    1    1    ....
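Rather than literally writing that file (let alone a million per-user files), you can build the same matrix in memory as a sparse matrix. A minimal sketch with scipy.sparse, assuming d1 has already been read into a list of "user_id tag_id" lines (the ids below are made up to match the table above):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical contents of d1: one "user_id tag_id" pair per line.
d1_lines = [
    "0001 1", "0001 3",
    "0002 2", "0002 3",
    "0003 3", "0003 4",
]

pairs = [line.split() for line in d1_lines]

# Map raw ids to consecutive row/column indices.
users = sorted({u for u, _ in pairs})
tags = sorted({t for _, t in pairs})
u_idx = {u: i for i, u in enumerate(users)}
t_idx = {t: j for j, t in enumerate(tags)}

rows = [u_idx[u] for u, t in pairs]
cols = [t_idx[t] for u, t in pairs]
data = np.ones(len(pairs))

# One sparse user-tag matrix instead of a million per-user files.
X = coo_matrix((data, (rows, cols)), shape=(len(users), len(tags))).tocsr()
print(X.toarray())
```

For a million users this stays cheap, since only the nonzero (user, tag) pairs are stored.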

Then you need to loop through the k-means algorithm. Here is MATLAB code from the ml-class:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index 
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
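Since the question asks for Python, the same loop can be sketched with NumPy. This is a minimal illustration, not a production implementation; the function names mirror the MATLAB ones above and are my own, and the toy matrix is made up:

```python
import numpy as np

def find_closest_centroids(X, centroids):
    # Cluster assignment step: index of the nearest centroid for each row.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def compute_means(X, idx, K):
    # Move-centroid step: mean of the points assigned to each centroid.
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])

def k_means(X, K, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points.
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iterations):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_means(X, idx, K)
    return idx, centroids

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1],
              [0, 1, 0, 1], [0, 1, 0, 0]], dtype=float)
idx, centroids = k_means(X, K=2)
print(idx)
```

In practice you would use scipy.cluster.vq.kmeans2 or scikit-learn's KMeans rather than hand-rolling the loop, but this shows the two alternating steps the MATLAB code refers to.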



For sparse k-means, see the examples under scikit-learn clustering.
About how many tag ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who's done clustering in that range (or else to a back-of-the-envelope "impossible").

MinHash may be better suited to your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets;
also SO questions/tagged/similarity+algorithm+python.
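The idea behind MinHash is that the fraction of hash functions on which two users' tag sets share a minimum estimates their Jaccard similarity, without ever comparing the full sets. A toy pure-Python sketch (the hash family and the two example tag sets are invented for illustration):

```python
import random

def make_hash_funcs(n, seed=0):
    # A simple (a*h(x) + b) mod p hash family; illustrative, not tuned.
    rng = random.Random(seed)
    prime = 2_147_483_647
    funcs = []
    for _ in range(n):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        funcs.append(lambda x, a=a, b=b: (a * hash(x) + b) % prime)
    return funcs

def minhash_signature(tag_set, hash_funcs):
    # Signature = the minimum hash value over the set, per hash function.
    return [min(h(t) for t in tag_set) for h in hash_funcs]

hash_funcs = make_hash_funcs(128)
u1 = {"python", "numpy", "scipy", "pandas"}
u2 = {"python", "numpy", "scipy", "matplotlib"}
s1 = minhash_signature(u1, hash_funcs)
s2 = minhash_signature(u2, hash_funcs)

# Fraction of matching signature slots estimates the Jaccard similarity;
# the true Jaccard here is 3/5 = 0.6.
est = sum(a == b for a, b in zip(s1, s2)) / len(s1)
print(round(est, 2))
```

With 128 hash functions each user is reduced to a short signature, so a million users can be compared (or bucketed with locality-sensitive hashing) far more cheaply than with pairwise set intersections.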

