Computation of Distance Matrices for Binary Data in Python

Question

I am performing a hierarchical clustering analysis in Python. My variables are binary, so I was wondering how to calculate the binary Euclidean distance. According to the literature, this distance metric can be used with this clustering technique:

Choi, S. S., Cha, S. H., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48.

I was using scipy.spatial.distance.pdist(X, metric='euclidean'), but as far as I can tell this function computes the Euclidean distance for non-binary data.

Is there a Python library that calculates distance matrices based on the binary Euclidean distance metric?

This may help: pypi.org/project/bitarray (for converting binary data to a bitarray)
– EunChong Lee, Commented Aug 16, 2018 at 7:12
You can calculate the Euclidean distance between two binary vectors like this: scipy.spatial.distance.euclidean([1, 0, 0], [0, 1, 0])
– EunChong Lee, Commented Aug 16, 2018 at 7:14

Hans Musgrave · Accepted Answer · 2018-08-16 07:04:49Z

The paper you referenced has a formula which is simply a faster way to compute the standard Euclidean distance for binary data. In that case the scipy method will work fine. Is there a different distance you would like to use, or is your data somehow formatted so that pdist() doesn't work on it natively?
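As a quick check of that claim, here is a minimal sketch, assuming the data is a NumPy array of 0/1 values with one observation per row (the array X below is made up for illustration). It compares pdist's standard Euclidean distance with the square root of the number of mismatching positions, which is what the binary formula reduces to.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical 0/1 data: 4 observations, 5 binary variables
X = np.array([
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

n_features = X.shape[1]

# Standard Euclidean distance, as used in the question
d_euclid = pdist(X, metric="euclidean")

# pdist's "hamming" metric returns the fraction of mismatching positions,
# so multiplying by n_features recovers the mismatch count
mismatches = pdist(X, metric="hamming") * n_features
d_binary = np.sqrt(mismatches)

print(np.allclose(d_euclid, d_binary))  # prints True
```

For 0/1 entries each squared difference is either 0 or 1, so the sum of squared differences is just the count of positions where two rows disagree.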

3 Comments

Jorge Rodriguez Over a year ago

I wanted to confirm whether this function is valid to use with binary data. Also, it is not clear to me that the formula referenced in the paper is just a faster way to compute the standard formula.

Hans Musgrave Over a year ago

The validity depends on what kind of data it is (in terms of domain knowledge, not just whether it's binary or not) and what you're doing with it. The Euclidean distance induces the same topology as most other useful metrics, so in some sense the worst thing that can happen is that you get the right answer plus a distortion. That's fine in some domains and not in others. As to the speed, all the paper is doing in that section is noting that for binary vectors v and w, the elementwise |v-w| is the same as (v XOR w). If your data is stored bitwise, this can be really fast.
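The XOR observation can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the vectors v and w are made up, and NumPy's packbits is used here as one possible bitwise storage.

```python
import numpy as np

# Two hypothetical binary vectors
v = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=np.uint8)
w = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 1], dtype=np.uint8)

# Store 8 entries per byte; both vectors get the same zero padding,
# so the padding bits cancel out under XOR
v_bits = np.packbits(v)
w_bits = np.packbits(w)

# XOR marks the positions where the vectors differ
xor = np.bitwise_xor(v_bits, w_bits)

# Counting the set bits gives the squared Euclidean distance
mismatches = int(np.unpackbits(xor).sum())
dist_xor = mismatches ** 0.5

# Reference value from the standard formula
dist_ref = float(np.linalg.norm(v.astype(float) - w.astype(float)))

print(mismatches, dist_xor, dist_ref)  # prints 4 2.0 2.0
```

Unpacking the XOR result to count bits is only the simplest way to show the idea; a dedicated popcount would be the faster choice on large bit-packed data.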

Hans Musgrave Over a year ago

Note that the speed comment doesn't apply to, e.g., a list of floats which happen to only be 0 or 1. In Python, that carries the extra overhead of everything being an object. In most languages (Python included), it at least carries the extra bits needed to represent the floats. To help you better, we really need an example of what you mean by "binary data" to be able to suggest which methods to use.
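To make the storage point concrete, here is a small sketch comparing the memory used by the same 0/1 values in three representations. The array shape and the random data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 0/1 data: 1000 observations, 64 binary variables
X_int = rng.integers(0, 2, size=(1000, 64))

as_float = X_int.astype(np.float64)      # 8 bytes per entry
as_bool = X_int.astype(bool)             # 1 byte per entry
as_bits = np.packbits(as_bool, axis=1)   # 1 bit per entry, 8 entries per byte

print(as_float.nbytes, as_bool.nbytes, as_bits.nbytes)  # prints 512000 64000 8000
```

A plain Python list of floats would cost more than the float64 array, since each element is a separate object.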

Omar Cusma Fait · Accepted Answer · 2018-08-16 06:59:04Z

Solution 1 - numpy

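from numpy import linalg, array

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

# The default norm of a 2-D array is the Frobenius norm, i.e. the Euclidean
# distance between the two matrices treated as flat vectors
print(linalg.norm(array(M1) - array(M2)))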

Solution 2 - custom

M1 = [[1, 1], [0, 1]]
M2 = [[0, 1], [1, 1]]

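def binary_dist(m1, m2):
    # Count the positions where the two binary matrices differ
    # (named "mismatches" to avoid shadowing the built-in sum)
    mismatches = 0
    for i in range(len(m1)):
        for j in range(len(m1[i])):
            if m1[i][j] != m2[i][j]:
                mismatches += 1
    # For 0/1 entries, each mismatch adds exactly 1 to the squared distance
    return mismatches ** .5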


print(binary_dist(M1, M2))
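Both solutions above compute a single distance between two inputs. Since the question asks for a distance matrix to feed into hierarchical clustering, here is a hedged sketch of one way to get there with SciPy, assuming each row of a made-up array X is one observation. The average linkage method and the choice of two clusters are arbitrary choices for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 0/1 data: 4 observations, 4 binary variables
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])

# Condensed vector of pairwise Euclidean distances between rows
condensed = pdist(X, metric="euclidean")

# Full symmetric distance matrix, if one is needed for inspection
D = squareform(condensed)
print(D)

# linkage takes the condensed form directly
Z = linkage(condensed, method="average")

# Cut the tree into at most two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Here rows 0 and 1 differ in one position, as do rows 2 and 3, so the two pairs end up in separate clusters.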
