calculating distance between two numpy arrays

Question

I was interested in calculating various spatial distances between two numpy arrays (x and y).

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cdist.html

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])
i,j,k = x.shape

xx = x.reshape(i,j*k).T

y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])

yy = y.reshape(i,j*k).T

results =  cdist(xx,yy,'euclidean')
print results

However, above results produces too many unwanted results. How can I limit it for my required results only.

I want to calculate distance between [1,11] and [31,41]; [2,22] and [32,42],...and so on.

Are you only interested in the Euclidean distance, or do you also want the option of computing the other distances provided by cdist? If just the Euclidean distance, that's a one-liner: np.sqrt(((xx - yy)**2).sum(axis=1)). — Warren Weckesser
– Warren Weckesser, Commented Dec 28, 2014 at 17:10
I think your question points out a gap in the API. pdist and cdist compute distances for all combinations of the input points. That is, they apply the distance calculation to the outer product of the input collections. There isn't a corresponding function that applies the distance calculation to the inner product of the input arguments (i.e. the pairwise calculation that you want). For any given distance, you can "roll your own", but that defeats the purpose of a having a module such as scipy.spatial.distance. — Warren Weckesser
– Warren Weckesser, Commented Dec 28, 2014 at 17:28
@WarrenWeckesser - Alternatively, the individual functions in scipy.spatial.distance could be given an axis argument or something similar. It would avoid the hack of having to use apply_along_axis. It looks like it would only require a few tweaks to scipy.spatial.distance._validate_vector. — Joe Kington
– Joe Kington, Commented Dec 28, 2014 at 17:34
@JoeKington: That's one of the alternatives I was just thinking about. How about submitting a pull request at github.com/scipy/scipy? :) — Warren Weckesser
– Warren Weckesser, Commented Dec 28, 2014 at 17:38
@WarrenWeckesser - As always, it's a touch more complicated than I thought it was. I'll keep working at it though. It would definitely be useful! — Joe Kington
– Joe Kington, Commented Dec 28, 2014 at 18:24

Joe Kington · Accepted Answer · 2014-12-28 17:42:47Z

10

If you just want the distances between each pair of points, then you don't need to calculate a full distance matrix.

Instead, calculate it directly:

import numpy as np

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])

y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])

xx = x.reshape(2, -1)
yy = y.reshape(2, -1)
dist = np.hypot(*(xx - yy))

print dist

To explain a bit more about what's going on, first we reshape the arrays such that they have a 2xN shape (-1 is a placeholder that tells numpy to calculate the correct size along that axis automatically):

In [2]: x.reshape(2, -1)
Out[2]: 
array([[ 1,  2,  3,  4,  5,  5,  6,  7,  8,  5,  5,  6,  7,  8,  5],
       [11, 22, 23, 24,  5, 25, 26, 27, 28,  5,  5,  6,  7,  8,  5]])

Therefore, when we subtract xx and yy, we'll get a 2xN array:

In [3]: xx - yy
Out[3]: 
array([[-30, -30, -30, -30,   0, -30, -30, -30, -30,   0,   0,   0,   0,
          0,   0],
       [-30, -20, -20, -20,   0, -20, -20, -20, -20,   0,   0,   0,   0,
          0,   0]])

We can then unpack this in to dx and dy components:

In [4]: dx, dy = xx - yy

In [5]: dx
Out[5]: 
array([-30, -30, -30, -30,   0, -30, -30, -30, -30,   0,   0,   0,   0,
         0,   0])

In [6]: dy
Out[6]: 
array([-30, -20, -20, -20,   0, -20, -20, -20, -20,   0,   0,   0,   0,
         0,   0])

And calculate the distance (np.hypot is equivalent to np.sqrt(dx**2 + dy**2)):

In [7]: np.hypot(dx, dy)
Out[7]: 
array([ 42.42640687,  36.05551275,  36.05551275,  36.05551275,
         0.        ,  36.05551275,  36.05551275,  36.05551275,
        36.05551275,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ])

Or we can have the unpacking done automatically and do it all in one step:

In [8]: np.hypot(*(xx - yy))
Out[8]: 
array([ 42.42640687,  36.05551275,  36.05551275,  36.05551275,
         0.        ,  36.05551275,  36.05551275,  36.05551275,
        36.05551275,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ])

If you want to calculate other types of distances, just change np.hypot to the function you'd like to use. For example, for Manhattan/city-block distances:

In [9]: dist = np.sum(np.abs(xx - yy), axis=0)

In [10]: dist
Out[10]: array([60, 50, 50, 50,  0, 50, 50, 50, 50,  0,  0,  0,  0,  0,  0])

edited Dec 28, 2014 at 17:42

answered Dec 28, 2014 at 17:08

Joe Kington

287k73 gold badges621 silver badges474 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Borys Over a year ago

Actually I wan to calculate many other distances as mentioned in cdist

Borys Over a year ago

Thanks for your nice explaination. I also want to calculate other distances: mahalanobis, correlation. So is not it possible to reduce the matix obtained from cdist into required ones.

Joe Kington Over a year ago

@simen - Sure, just call np.diag(dist_matrix). However, that will make many unnecessary calculations.

Borys Over a year ago

yes, for performance using cdist is too slower for my need. if I had known the equations for other distances as well, your method is super fast.

Borys Over a year ago

accepted for your contribution, and look forward to seeing the required functionality in scipy.spatial.distance

Collectives™ on Stack Overflow

calculating distance between two numpy arrays

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related