Clustering binary data

Question

I want to perform cluster analysis on the following data (sample):

    ID     CODE1     CODE2     CODE3     CODE4      CODE5      CODE6
   ------------------------------------------------------------------
   00001     0         1         1         0          0          0
   00002     1         0         0         0          1          1
   00003     0         1         0         1          1          1
   00004     1         1         1         0          1          0
    ...

Here 1 indicates the presence of that code for a person, and 0 its absence. Is k-means or hierarchical clustering more appropriate for clustering the codes in this kind of data (about a million distinct ids), and with which distance measure? If neither of these methods is appropriate, what do you think is most appropriate?

Thank you

Has QUIT--Anony-Mousse · Accepted Answer · 2013-07-27 17:24:19Z


No, k-means does not make a lot of sense for binary data, because k-means computes means. But what is the mean vector for binary data?

Your cluster "centers" will not be part of your data space, and will look nothing like your input data. That doesn't seem like a proper "center" to me when it's totally different from your objects.

Most likely, your cluster "centers" will end up being more similar to each other than to the actual cluster members, because the centers sit somewhere in the interior of the data space, while all of your data sits in its corners.
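The point about centers leaving the data space can be checked on the four sample rows from the question. This is only a minimal sketch: the `centroid` helper is a name made up for illustration, and treating all four rows as one cluster is an assumption made to keep the arithmetic visible.

```python
# Sample rows from the question (CODE1..CODE6 for ids 00001..00004).
rows = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
]


def centroid(vectors):
    """Component-wise mean, which is what k-means uses as a cluster center."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical cluster: all four sample rows together.
center = centroid(rows)
print(center)  # [0.5, 0.75, 0.5, 0.25, 0.75, 0.5]

# The center is not a valid binary vector, so it is not a possible data point.
is_binary = all(value in (0, 1) for value in center)
print(is_binary)  # False
```

Every coordinate of the center is a fraction, so it describes no person that could actually occur in the data.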

Seriously, investigate similarity functions for your data type. Then choose a clustering algorithm that works with the distance function you picked. Hierarchical clustering is quite general, but really slow. You don't have to use a 40-year-old algorithm, though; you may want to look into more modern methods.
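As one possible starting point, here is a sketch of two common dissimilarities for binary vectors, Jaccard and simple matching (normalized Hamming), applied to the sample rows. The choice of these two measures is my assumption, not a recommendation from the answer above; the function names are illustrative, and which measure fits depends on whether a shared absence (0/0) should count as agreement.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |shared codes| / |codes present in either|; shared absences are ignored."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1.0 - both / either


def hamming_distance(a, b):
    """Fraction of positions that differ; shared absences count as agreement."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


# Sample rows from the question, keyed by id.
rows = {
    "00001": [0, 1, 1, 0, 0, 0],
    "00002": [1, 0, 0, 0, 1, 1],
    "00003": [0, 1, 0, 1, 1, 1],
    "00004": [1, 1, 1, 0, 1, 0],
}

# Pairwise distances: the input a distance-based method such as
# hierarchical clustering works from.
pairwise = {
    (i, j): jaccard_distance(rows[i], rows[j])
    for i, j in combinations(sorted(rows), 2)
}

print(pairwise[("00001", "00004")])  # 0.5
```

A full pairwise table has n(n-1)/2 entries, which is roughly 5 × 10^11 for a million ids, so this is practical only on a sample or when clustering the six code columns rather than the rows.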

