4

What is the most popular text clustering algorithm which deals with large dimensions and huge dataset and is fast? I am getting confused after reading so many papers and so many approaches..now just want to know which one is used most, to have a good starting point for writing a clustering application for documents.

5 Answers 5

2

To deal with the curse of dimensionality you can try to determine the blind sources (ie topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.

PCA is what is used in Latent Semantic Indexing, since SVD can be demonstrated to be PCA : )

Remember that you can lose interpretation when you obtain the principal components of your dataset or its factors, so you maybe wanna go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.

Sign up to request clarification or add additional context in comments.

Comments

1

There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).

1 Comment

but how to deal with the curse of dimensionality?
1

The two most popular document clustering approaches, are hierarchical clustering and k-means. k-means is faster as it is linear in the number of documents, as opposed to hierarchical, which is quadratic, but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in similarity calculation. The cosine similarity is often used as a similarity measure.

A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.

The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say occuring in less than 1% and more than 60% of documents: this is somewhat arbitrary, you need to try different ranges for each dataset to see impact on results), b) stopping: throw out all words in a stop list of common english words: lists can be found online, and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is a stemmer designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.

Comments

-1

I will stick with kmedoids, since you can compute the distance from any point to anypoint at the beggining of the algorithm, You only need to do this one time, and it saves you time, specially if there are many dimensions. This algorithm works by choosing as a center of a cluster the point that is nearer to it, not a centroid calculated in base of the averages of the points belonging to that cluster. Therefore you have all possible distance calculations already done for you in this algorithm.

Comments

-1

In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through use of silhouette coefficients. Unfortunately, Levensthein can be quite slow, but there are ways to speed it up through uses of thresholds and other methods.

Another way to deal with the curse of dimensionality would be to find 'contrasting sets,', conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions either in lieu of the original attributes or with a restricted number of attributes.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.