7

I was interested in calculating various spatial distances between two numpy arrays (x and y).

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cdist.html

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])
i,j,k = x.shape

xx = x.reshape(i,j*k).T

y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])

yy = y.reshape(i,j*k).T

results =  cdist(xx,yy,'euclidean')
print results

However, above results produces too many unwanted results. How can I limit it for my required results only.

I want to calculate distance between [1,11] and [31,41]; [2,22] and [32,42],...and so on.

10
  • Are you only interested in the Euclidean distance, or do you also want the option of computing the other distances provided by cdist? If just the Euclidean distance, that's a one-liner: np.sqrt(((xx - yy)**2).sum(axis=1)). Commented Dec 28, 2014 at 17:10
  • 2
    I think your question points out a gap in the API. pdist and cdist compute distances for all combinations of the input points. That is, they apply the distance calculation to the outer product of the input collections. There isn't a corresponding function that applies the distance calculation to the inner product of the input arguments (i.e. the pairwise calculation that you want). For any given distance, you can "roll your own", but that defeats the purpose of a having a module such as scipy.spatial.distance. Commented Dec 28, 2014 at 17:28
  • 1
    @WarrenWeckesser - Alternatively, the individual functions in scipy.spatial.distance could be given an axis argument or something similar. It would avoid the hack of having to use apply_along_axis. It looks like it would only require a few tweaks to scipy.spatial.distance._validate_vector. Commented Dec 28, 2014 at 17:34
  • 1
    @JoeKington: That's one of the alternatives I was just thinking about. How about submitting a pull request at github.com/scipy/scipy? :) Commented Dec 28, 2014 at 17:38
  • 1
    @WarrenWeckesser - As always, it's a touch more complicated than I thought it was. I'll keep working at it though. It would definitely be useful! Commented Dec 28, 2014 at 18:24

1 Answer 1

10

If you just want the distances between each pair of points, then you don't need to calculate a full distance matrix.

Instead, calculate it directly:

import numpy as np

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])

y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])

xx = x.reshape(2, -1)
yy = y.reshape(2, -1)
dist = np.hypot(*(xx - yy))

print dist

To explain a bit more about what's going on, first we reshape the arrays such that they have a 2xN shape (-1 is a placeholder that tells numpy to calculate the correct size along that axis automatically):

In [2]: x.reshape(2, -1)
Out[2]: 
array([[ 1,  2,  3,  4,  5,  5,  6,  7,  8,  5,  5,  6,  7,  8,  5],
       [11, 22, 23, 24,  5, 25, 26, 27, 28,  5,  5,  6,  7,  8,  5]])

Therefore, when we subtract xx and yy, we'll get a 2xN array:

In [3]: xx - yy
Out[3]: 
array([[-30, -30, -30, -30,   0, -30, -30, -30, -30,   0,   0,   0,   0,
          0,   0],
       [-30, -20, -20, -20,   0, -20, -20, -20, -20,   0,   0,   0,   0,
          0,   0]])

We can then unpack this in to dx and dy components:

In [4]: dx, dy = xx - yy

In [5]: dx
Out[5]: 
array([-30, -30, -30, -30,   0, -30, -30, -30, -30,   0,   0,   0,   0,
         0,   0])

In [6]: dy
Out[6]: 
array([-30, -20, -20, -20,   0, -20, -20, -20, -20,   0,   0,   0,   0,
         0,   0])

And calculate the distance (np.hypot is equivalent to np.sqrt(dx**2 + dy**2)):

In [7]: np.hypot(dx, dy)
Out[7]: 
array([ 42.42640687,  36.05551275,  36.05551275,  36.05551275,
         0.        ,  36.05551275,  36.05551275,  36.05551275,
        36.05551275,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ])

Or we can have the unpacking done automatically and do it all in one step:

In [8]: np.hypot(*(xx - yy))
Out[8]: 
array([ 42.42640687,  36.05551275,  36.05551275,  36.05551275,
         0.        ,  36.05551275,  36.05551275,  36.05551275,
        36.05551275,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ])

If you want to calculate other types of distances, just change np.hypot to the function you'd like to use. For example, for Manhattan/city-block distances:

In [9]: dist = np.sum(np.abs(xx - yy), axis=0)

In [10]: dist
Out[10]: array([60, 50, 50, 50,  0, 50, 50, 50, 50,  0,  0,  0,  0,  0,  0])
Sign up to request clarification or add additional context in comments.

5 Comments

Actually I wan to calculate many other distances as mentioned in cdist
Thanks for your nice explaination. I also want to calculate other distances: mahalanobis, correlation. So is not it possible to reduce the matix obtained from cdist into required ones.
@simen - Sure, just call np.diag(dist_matrix). However, that will make many unnecessary calculations.
yes, for performance using cdist is too slower for my need. if I had known the equations for other distances as well, your method is super fast.
accepted for your contribution, and look forward to seeing the required functionality in scipy.spatial.distance

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.