How to select a proper clustering algorithm

Question

I am about to cluster feature vectors with 1000 dimensions. The feature vectors look like this:

a = {255, 2334, 436, ... , 5284};
b = {235, 434, 63, ... , 844};
...

I also have a metric that measures the distance between two feature vectors. However, I can't figure out which clustering algorithm works best on these vectors, because the high dimensionality prevents me from visualizing their distribution.

Does anyone know a method for visualizing such a distribution? Or, without knowing the distribution of the data, how can I select the best clustering algorithm?

Thanks in advance.

What kind of data do you have? Labeled or unlabeled? Do you know anything about the number of classes?
– purew, Commented Nov 26, 2013 at 17:15
For this, I collected experimental data, so I know the number of classes and the labels for that data. I then applied various clustering algorithms to it and evaluated their performance to find the best method. But the experimental data is neither sufficient nor general, so the method selected in the experimental step could fail in a real situation with large, general data. So I want to know how to find the algorithm that best fits general data. Thanks for your help.
– user2668204, Commented Nov 26, 2013 at 17:18
If you have labeled data, why not just run a few different clustering algorithms and see which one matches the labels best?
– purew, Commented Nov 26, 2013 at 17:21
Furthermore, I thought the experimental data had clearly separated clusters, so clustering performance should be 100%, but even the best method didn't reach that. So I am not sure whether the misclustered vectors simply can't be clustered correctly, or whether some algorithm can do it perfectly. How can I solve my problem?
– user2668204, Commented Nov 26, 2013 at 17:38
OK, so you really want to classify new values as belonging to one of these classes (clusters)?
– purew, Commented Nov 26, 2013 at 18:03

purew · Accepted Answer · 2013-11-26 18:10:21Z

You should split your labeled data into training and test sets. Using the training set, you train a classifier whose performance you can then measure against the labeled test set.

As a classifier, a reasonable first try could be an SVC (support vector classifier).
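A minimal sketch of the split, train, and measure step described above, assuming scikit-learn is available. The synthetic data from `make_classification` is only a stand-in for the real labeled feature vectors, and the scaling step, data sizes, and SVC parameters are illustrative assumptions rather than part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labeled feature vectors (assumption: real data
# would be loaded into X with shape (n_samples, n_features) and labels y).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)

# Hold out 25% of the labeled data; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features before the SVC, since raw values such as 255 or 5284
# differ widely in magnitude (the scaling step is an assumption).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)

# Accuracy on data the classifier never saw during training.
test_accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

The held-out accuracy estimates how the classifier would behave on new vectors, which a score computed on the training data itself cannot do.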

For better reliability, you should repeat this procedure with different training and test sets. This is known as cross-validation.
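Repeating the split over several folds could be sketched as follows, again assuming scikit-learn. The data is synthetic, and the choice of 5 stratified folds, the scaling step, and the SVC parameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labeled feature vectors (assumption).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)

# Putting the scaler inside the pipeline means it is re-fit on each
# training fold, so no information leaks from the held-out fold.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Five different train/test partitions, each preserving class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)

print("accuracy per fold:", scores)
print(f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much the estimate depends on which vectors happened to land in the test set. Swapping another classifier into `clf` allows a like-for-like comparison on the same folds.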

2 Comments

mtrw Over a year ago

+1 for the link to the scikit-learn flowchart. I didn't even know I was looking for that.

user2668204 Over a year ago

I know about that. But what if some algorithm outperforms the best one I found in the cross-validation experiment? That's possible, because I can't test every algorithm with cross-validation. That's why I wanted to visualize the distribution, or to learn how to find a suitable algorithm from the labeled data and the distance metric, so that I can be sure the result is reasonable.
