I am about to cluster 1000-dimensional feature vectors. The feature vectors look like this:

a = {255, 2334, 436, ... , 5284};
b = {235, 434, 63, ... , 844};
...

I also have a metric that measures the distance between two feature vectors. However, I can't figure out which clustering algorithm would work best on these vectors, because the high dimensionality prevents me from visualizing their distribution.

Does anyone know a method for visualizing such a distribution, or, without knowing the distribution of the data, how to select the best clustering algorithm? Thanks in advance.
-
What kind of data do you have? Labeled, unlabeled? Do you know anything about the number of classes? – purew, commented Nov 26, 2013 at 17:15
-
To find out, I collected experimental data, so I know the number of classes and the labels for that data. I then applied various clustering algorithms to it and evaluated their performance to pick the best method. However, the experimental data is neither large enough nor general enough, so the method selected in the experiment could fail in a real situation with large, general data. I want to know how to find the algorithm that best fits general data. Thanks for your help. – user2668204, commented Nov 26, 2013 at 17:18
-
If you have labeled data, why not just run a few different clustering algorithms and see which one agrees best with the labels? – purew, commented Nov 26, 2013 at 17:21
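The comparison suggested in this comment could be sketched roughly as follows. This is only an illustration: it assumes scikit-learn, uses synthetic blobs from `make_blobs` as a stand-in for the asker's labeled vectors, and scores each candidate with the adjusted Rand index against the known labels. Both candidates here use Euclidean distance; the asker's own metric would need an algorithm that accepts precomputed distances.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the labeled 1000-dimensional vectors (assumption).
X, y_true = make_blobs(n_samples=300, n_features=1000, centers=4, random_state=0)

# Candidate algorithms to compare; the number of classes is known from the labels.
candidates = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
}

# Adjusted Rand index: 1.0 means the clustering matches the labels exactly,
# values near 0 mean agreement no better than chance.
ari = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in candidates.items()
}

for name, score in ari.items():
    print(f"{name}: ARI = {score:.3f}")
```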
-
Furthermore, I thought the experimental data had clearly separated clusters, so the clustering performance should be 100%, but even the best method did not reach that. So I am not sure whether the misclustered vectors simply cannot be clustered correctly, or whether some algorithm could do it perfectly. How can I solve my problem? – user2668204, commented Nov 26, 2013 at 17:38
-
OK, so you really want to classify new values as belonging to one of these classes (clusters)? – purew, commented Nov 26, 2013 at 18:03
1 Answer
You should split your labeled data into training and test sets. Using the training set, you train a classifier whose performance you can then measure against your labeled test set.
As a classifier, a first try could be an SVC (support vector classifier).
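A minimal sketch of that split-and-train step, assuming scikit-learn. The data from `make_classification` is a synthetic stand-in for the labeled feature vectors, and the 25% test size, the feature scaling, and the RBF kernel are illustrative choices rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labeled 1000-dimensional feature vectors (assumption).
X, y = make_classification(
    n_samples=300, n_features=1000, n_informative=20,
    n_classes=3, random_state=0,
)

# Hold out 25% of the labeled data; stratify keeps class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a support vector classifier on the training set only.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

# Accuracy on data the classifier has never seen.
test_accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```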
For better reliability, you should repeat this procedure with different training and test sets. This is known as cross-validation.
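Cross-validation could look roughly like this with scikit-learn's `cross_val_score`. Again the data is synthetic, and the five stratified folds are an assumption made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labeled 1000-dimensional feature vectors (assumption).
X, y = make_classification(
    n_samples=300, n_features=1000, n_informative=20,
    n_classes=3, random_state=0,
)

# Five different train/test partitions, each preserving the class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The pipeline is refit from scratch on the training part of every fold.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=cv)

print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough idea of how much the estimate depends on which samples happened to land in the test set.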
2 Comments
mtrw
+1 for the link to the scikit-learn flowchart. I didn't even know I was looking for that.
user2668204
I know about it. But what if some algorithm outperforms the best one I found in the cross-validation experiment? That is possible, because I can't test every algorithm in cross-validation. That is why I wanted to visualize the distribution, or learn how to find the proper algorithm from the labeled data and the distance metric, so that I can be sure the result I got is reasonable.