Data clustering algorithm

Question

What is the most popular text clustering algorithm that copes with high-dimensional data and huge datasets, and is fast? I am getting confused after reading so many papers and so many approaches. I just want to know which one is used most, so that I have a good starting point for writing a clustering application for documents.

rano · Accepted Answer · 2010-12-03 08:47:04Z

2

To deal with the curse of dimensionality, you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.

This is essentially what Latent Semantic Indexing does: it applies a truncated SVD to the term-document matrix, and PCA can be shown to be an SVD of the mean-centered data : )

Remember that you can lose interpretability when you replace your dataset with its principal components or factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means can be viewed as a particular, constrained NNMF!) In NNMF the dataset is explained purely by additive, non-negative components.
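To make the NNMF idea concrete, here is a minimal pure-Python sketch using the standard multiplicative update rules for the squared-error objective. The helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the toy matrix, and the iteration count are all assumptions made for illustration; a real application would use a numerical library rather than nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (docs x terms) as w * h with w, h >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so the updates cannot get stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Toy document-term counts: two documents per "topic" (made-up data).
docs_by_terms = [
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
]
w, h = nmf(docs_by_terms, rank=2)
print(frobenius_error(docs_by_terms, w, h))
```

Each row of `w` says how strongly a document loads on each component, and each row of `h` says which terms make up that component. Because nothing is ever subtracted, the components stay readable as additive "topics".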


Sujith Surendranathan · Accepted Answer · 2010-12-02 02:43:57Z

1

There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly less computationally intensive).
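For reference, a bare-bones K-means (Lloyd's algorithm) fits in a few lines. This is only a sketch: the function names and the toy 2-D points are made up for illustration, and using the first k points as starting centroids is a simplification (real implementations usually seed randomly or with k-means++).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    # Simplistic seeding: take the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Each pass costs roughly (number of points) x k distance computations, which is where the "less computationally intensive" reputation comes from.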


1 Comment

user352951 Over a year ago

but how to deal with the curse of dimensionality?

Suratna · Accepted Answer · 2011-02-21 17:15:51Z

The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic; hierarchical clustering, however, is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of distinct words), where the component for each word is its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
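As a rough illustration of the representation described above, the sketch below builds sparse tf-idf vectors and compares them with cosine similarity. It assumes one common weighting variant (raw term count times log(N / document frequency)); other variants exist, and the function names and toy sentences are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "the stock market fell".split(),
]
vectors = tfidf_vectors(documents)
print(cosine_similarity(vectors[0], vectors[1]))  # share "cat"
print(cosine_similarity(vectors[0], vectors[2]))  # share only "the"
```

Note how "the", which occurs in every document, gets weight log(3/3) = 0 and so contributes nothing to any similarity.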

A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.

The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say, occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results), b) stopping: throw out all words in a stop list of common English words (lists can be found online), and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
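Steps a) and b) can be sketched as below. The function name, the tiny stop list, and the toy documents are assumptions for illustration only; the default thresholds mirror the 1% and 60% figures mentioned above. Stemming (step c) is left out here, since a Porter stemmer would normally come from an existing implementation and be applied to the tokens before counting.

```python
from collections import Counter

# Tiny illustrative stop list; real lists contain a few hundred words.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def prune_vocabulary(documents, min_df=0.01, max_df=0.60, stop_words=STOP_WORDS):
    """Drop stop words and terms whose document frequency is out of range.

    documents: list of token lists. min_df / max_df are fractions of documents.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    keep = {
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df and term not in stop_words
    }
    return [[term for term in doc if term in keep] for doc in documents]


documents = [
    ["the", "cluster", "model"],
    ["the", "cluster", "data"],
    ["the", "cluster", "graph"],
    ["the", "topic", "data"],
]
# With only four documents the thresholds have to be much coarser than 1%.
pruned = prune_vocabulary(documents, min_df=0.3, max_df=0.6)
print(pruned)
```

On this toy input, "the" and "cluster" are too frequent, the words that occur once are too rare, and only "data" survives.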

mariana soffer · Accepted Answer · 2010-12-02 04:32:37Z

-1

I would stick with k-medoids, since you can compute the distance from every point to every other point once, at the beginning of the algorithm. You only need to do this one time, and it saves you time, especially if there are many dimensions. This algorithm uses as the center of each cluster an actual data point (the medoid, i.e. the member closest to all the others in its cluster), not a centroid calculated by averaging the points belonging to that cluster. Therefore all the distance calculations the algorithm needs have already been done.
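A minimal sketch of that idea, assuming the simple alternating variant of k-medoids (assign points, then re-pick each medoid) rather than full PAM. The function names, the seeding with the first k points, and the 1-D toy data are all made up for illustration. The point is that the algorithm only ever reads the precomputed matrix `dist`.

```python
def assign(dist, medoids):
    """Label every point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iterations=100):
    """Cluster from a precomputed n x n distance matrix.

    Returns (labels, medoids), where medoids are indices of data points.
    """
    medoids = list(range(k))  # simplistic seeding: the first k points
    for _ in range(max_iterations):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:  # keep the old medoid if a cluster went empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


values = [1, 2, 3, 10, 11, 12]
# All pairwise distances are computed exactly once, up front.
dist = [[abs(a - b) for b in values] for a in values]
labels, medoids = k_medoids(dist, k=2)
print(labels, [values[m] for m in medoids])
```

The trade-off is memory: the matrix is quadratic in the number of points, so for a very large document collection it may not fit.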


don · Accepted Answer · 2010-12-08 05:44:29Z

In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through the use of silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through the use of thresholds and other methods.
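Here is a small sketch of Levenshtein distance with an optional threshold, plus the pairwise matrix you would feed to k-medoids. The `max_distance` parameter and the function names are assumptions for illustration; the early exit relies on the fact that the final distance can never be smaller than the smallest value in any row of the dynamic-programming table.

```python
def levenshtein(a, b, max_distance=None):
    """Edit distance between a and b.

    If max_distance is given, return None as soon as the distance is known
    to exceed it, instead of finishing the computation.
    """
    if max_distance is not None and abs(len(a) - len(b)) > max_distance:
        return None  # the length difference alone is already too large
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        if max_distance is not None and min(current) > max_distance:
            return None  # no cell in this row can lead to a small enough result
        previous = current
    if max_distance is not None and previous[-1] > max_distance:
        return None
    return previous[-1]


def distance_matrix(strings):
    """Symmetric matrix of pairwise edit distances."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(strings[i], strings[j])
    return matrix


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("kitten", "sitting", max_distance=2))  # None
```

With a threshold, pairs that are obviously far apart are rejected cheaply, which matters because the matrix needs a quadratic number of comparisons.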

Another way to deal with the curse of dimensionality would be to find 'contrasting sets': conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or alongside a restricted number of attributes.
