
I am performing a hierarchical clustering analysis in Python. My variables are binary, so I was wondering how to calculate the binary Euclidean distance. According to the literature, this distance metric can be used with this clustering technique:

Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.

I was using scipy.spatial.distance.pdist(X, metric='euclidean'), but as far as I can tell this function computes the Euclidean distance for non-binary data.

Is there any Python library that calculates distance matrices based on the binary Euclidean distance metric?

  • This may help for converting binary data to a bitarray: pypi.org/project/bitarray Commented Aug 16, 2018 at 7:12
  • You can calculate the Euclidean distance between two binary vectors like this: scipy.spatial.distance.euclidean([1, 0, 0], [0, 1, 0]) Commented Aug 16, 2018 at 7:14

2 Answers


The paper you referenced has a formula which is simply a faster way to compute the standard Euclidean distance for binary data. In that case the scipy method will work fine. Is there a different distance you would like to use, or is your data somehow formatted so that pdist() doesn't work on it natively?


3 Comments

I wanted to confirm whether this function is valid to use with binary data or not. Indeed, for me it is not so clear that the formula referenced in the paper is a faster way to compute the standard formula.
The validity depends on what kind of data it is (in terms of domain knowledge, not just whether it's binary or not) and what you're doing with it. The Euclidean distance induces the same topology as most other useful metrics, so in some sense the worst thing that can happen is that you get the right answer plus a distortion. That's fine in some domains and not in others. As to the speed, all the paper is doing in that section is noting that for binary vectors v and w, the elementwise |v-w| is the same as (v XOR w), so the squared Euclidean distance is just the number of positions where the two vectors differ. If your data is stored bitwise, this can be really fast.
Note that the speed comment doesn't apply to, e.g., a list of floats which happen to only be 0 or 1. In Python, that carries the extra overhead of everything being an object. In most languages (Python included), it at least carries the extra bits needed to represent the floats. To help you better, we really need an example of what you mean by "binary data" to be able to suggest which methods to use.
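The XOR observation in the comments above can be checked numerically. The following is a minimal sketch, assuming the binary data is held in a small NumPy boolean array (the array `X` and its values are made up for illustration): it compares `pdist(..., metric='euclidean')` against the square root of the XOR count for every pair of rows.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical binary data: 3 observations, 4 binary variables
X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=bool)

# Standard Euclidean distance on the same data cast to floats
d_euclid = pdist(X.astype(float), metric="euclidean")

# "Binary" Euclidean distance: sqrt of the number of differing positions.
# combinations() yields row pairs in the same order as pdist's condensed output.
d_xor = np.array([
    np.sqrt(np.count_nonzero(np.logical_xor(X[i], X[j])))
    for i, j in combinations(range(len(X)), 2)
])

print(np.allclose(d_euclid, d_xor))
```

If the two arrays agree, the plain `euclidean` metric is already giving the binary Euclidean distance, just without the bitwise speed-up.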

Solution 1 - numpy

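from numpy import linalg, array

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

# For a 2-D input, linalg.norm defaults to the Frobenius norm, i.e. the
# Euclidean distance taken over all entries of the difference
print(linalg.norm(array(M1) - array(M2)))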

Solution 2 - custom

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

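def binary_dist(m1, m2):
    # Count the positions where the two binary matrices differ
    n_diff = 0
    for i in range(len(m1)):
        for j in range(len(m1[i])):
            if m1[i][j] != m2[i][j]:
                n_diff += 1
    # For 0/1 data, the Euclidean distance is the square root of that count
    return n_diff ** .5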


print(binary_dist(M1, M2))
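Both solutions above compare a single pair of inputs. To tie this back to the hierarchical clustering in the question, here is a rough sketch that builds the full pairwise distance vector with `pdist` and passes it to `scipy.cluster.hierarchy.linkage`. The data, the choice of average linkage, and the cut into two clusters are all assumptions made for illustration, not something the question specifies.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical binary data: 4 observations, 4 binary variables
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)

# Condensed pairwise Euclidean distances; for 0/1 data this equals
# sqrt(number of differing positions) for each pair of rows
D = pdist(X, metric="euclidean")

# Agglomerative clustering on the precomputed distances
# (average linkage is an arbitrary choice here)
Z = linkage(D, method="average")

# Cut the dendrogram into at most two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Other linkage methods can be swapped in the same way; the distance computation itself does not change.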
