I want to perform cluster analysis on the following data (sample):

    ID     CODE1     CODE2     CODE3     CODE4      CODE5      CODE6
   ------------------------------------------------------------------
   00001     0         1         1         0          0          0
   00002     1         0         0         0          1          1
   00003     0         1         0         1          1          1
   00004     1         1         1         0          1          0
    ...

Here 1 indicates the presence of that code for a person, and 0 its absence. Is k-means or hierarchical clustering more appropriate for clustering the codes in this kind of data (about a million distinct IDs), and with which distance measure? If neither method is appropriate, what do you think is most appropriate?

Thank you

1 Answer

No, k-means does not make a lot of sense for binary data, because k-means computes means. But what is the mean vector of binary data?

Your cluster "centers" will not be part of your data space, and will look nothing like your input data. That doesn't seem like a proper "center" to me when it's totally different from your objects.
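To make this concrete, here is a minimal sketch that averages the four sample rows from the question as if they formed one k-means cluster. The variable names are only for illustration, and grouping these four rows together is an assumption, not something k-means was actually run to produce.

```python
# The four sample rows from the question (CODE1..CODE6 for IDs 00001..00004).
rows = [
    [0, 1, 1, 0, 0, 0],  # 00001
    [1, 0, 0, 0, 1, 1],  # 00002
    [0, 1, 0, 1, 1, 1],  # 00003
    [1, 1, 1, 0, 1, 0],  # 00004
]

# k-means would represent this group by its component-wise mean.
n = len(rows)
mean = [sum(col) / n for col in zip(*rows)]

# Check whether the "center" is itself a valid binary vector.
is_binary = all(v in (0.0, 1.0) for v in mean)

print(mean)       # fractional values, e.g. 0.75 for CODE2
print(is_binary)  # False: the center is not a point in the data space
```

Each coordinate of the mean is the fraction of members that have the code, so it is only binary when every member agrees on every code.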

Most likely, your cluster "centers" will end up being more similar to each other than to the actual cluster members, because they sit somewhere in the interior of the space, while all your data is in the corners.

Seriously, investigate similarity functions for your data type. Then choose a clustering algorithm that works with this distance function. Hierarchical clustering is quite general, but really slow: the standard algorithms need at least quadratic time and memory in the number of objects, which hurts with a million IDs. But you don't have to use a 40-year-old algorithm; you may want to look into more modern stuff.
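As one example of a similarity function designed for presence/absence data, the sketch below computes the Jaccard distance in plain Python. The answer does not name a specific measure, so Jaccard is an illustrative choice here, and `jaccard_distance` is a hypothetical helper, not a library function. Since the question asks about clustering the codes, the example also transposes the sample so that each code becomes a vector over persons.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 vectors.

    1 - |positions where both are 1| / |positions where at least one is 1|.
    Two all-zero vectors are treated as identical (distance 0.0);
    that convention is an assumption of this sketch.
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0
    return 1.0 - both / either


# The four sample rows from the question (CODE1..CODE6 for IDs 00001..00004).
rows = [
    [0, 1, 1, 0, 0, 0],  # 00001
    [1, 0, 0, 0, 1, 1],  # 00002
    [0, 1, 0, 1, 1, 1],  # 00003
    [1, 1, 1, 0, 1, 0],  # 00004
]

# Distance between two persons: 00001 and 00004 share 2 of 4 active codes.
d_persons = jaccard_distance(rows[0], rows[3])

# Transpose so each code is a vector over persons, then compare codes.
codes = list(zip(*rows))
d_code1_code4 = jaccard_distance(codes[0], codes[3])  # no person has both
d_code2_code3 = jaccard_distance(codes[1], codes[2])  # heavily overlapping

print(d_persons, d_code1_code4, d_code2_code3)
```

A matrix of such pairwise distances can be fed to any clustering algorithm that accepts a precomputed distance matrix, hierarchical clustering included. Clustering the codes keeps that matrix small (one row per code), whereas clustering a million IDs this way would not.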
