Vectorizing Outer Loop of euclidean distance using numpy on multi-dimensional data

Question

I have a 2D matrix of values. Each row is a data point.

data = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4]])

Now if my test point is a single 1D numpy array like:

test = np.array([2,3,3])

I can do something simple like np.sqrt(np.sum((test-data)**2,axis=1)) to calculate the distance of the test point relative to all three data points.

However, if test is itself a 2D array of points to be tested, the above doesn't work, and I have been using something like:

test = np.array([[2,3,3],[4,1,2]])    
for i in range(len(test)):
    print(np.sqrt(np.sum((test[i]-data)**2,axis=1)))

>>> [ 1.          2.44948974  2.44948974]
    [ 2.44948974  2.23606798  3.60555128]

This calculates the distance from each point in my test set to all the points in the data set. It seems like there should be a way to vectorize this whole operation so that I get a (2,3) matrix of corresponding distances back without the outer for loop.

(Note: While this particular example is about Euclidean Distance, I find myself with similar type operations where I would like to perform an operation on all elements of one matrix with the individual elements of another matrix, so I'm hoping there's a generalized way to set up problems of this nature using Numpy.)

This seems to work, but I'm concerned about memory usage on larger data sets, as it seems to require duplicating each test point N times, where N is the number of data points. Thus if there are 1000 data points, I need to build a 2000-point matrix to test two values:
print np.reshape(np.sqrt(np.sum((np.reshape(np.repeat(test, len(data), axis=0), (len(test) * len(data), Xdims)) - ml.repmat(data, 2, 1)) ** 2, axis=1)), (2, len(data))).T
– Phil Glau, Commented Mar 30, 2016 at 5:32
Just use SciPy's cdist: from scipy.spatial.distance import cdist; out = cdist(test, data). It's super efficient.
– Divakar, Commented Mar 30, 2016 at 6:59
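The cdist suggestion in the comment above can be tried on the question's arrays as follows. This is a minimal sketch that assumes SciPy is installed; it relies only on the default Euclidean metric of cdist.

```python
import numpy as np
from scipy.spatial.distance import cdist

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# Rows of the result follow `test`, columns follow `data`.
# The default metric is 'euclidean'.
out = cdist(test, data)
print(out.shape)
print(out)
```

cdist also takes a metric argument (for example metric='cityblock'), which may cover some of the "similar type operations" mentioned in the question without any hand-written broadcasting.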

B. M. · Accepted Answer · 2016-03-30 17:05:08Z

2

Use broadcasting to do that:

from numpy.linalg import norm
norm(data-test[:,None],axis=2)

which gives:

[ 1.          2.44948974  2.44948974]
[ 2.44948974  2.23606798  3.60555128]

Some explanation: it is easier to understand with different shapes, for example four points and two points:

ens1 = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4],
    [2, 4, 5]])


ens2 = np.array([[2,3,3],
                 [4,1,2]])  


In [16]: ens1.shape
Out[16]: (4, 3)

In [17]: ens2.shape
Out[17]: (2, 3)

Then:

In [21]: ens2[:,None].shape 
Out[21]: (2, 1, 3)

adds a new dimension. Now we can make the 2 x 4 = 8 subtractions:

In [22]: (ens1-ens2[:,None]).shape
Out[22]: (2, 4, 3)

and take the norm along the last axis, for 8 distances:

In [23]: norm(ens1-ens2[:,None],axis=2)
Out[23]: 
array([[ 1.        ,  2.44948974,  2.44948974,  2.23606798],
       [ 2.44948974,  2.23606798,  3.60555128,  4.69041576]])
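The broadcast subtraction builds a temporary of shape (len(ens2), len(ens1), dims), which is the memory cost the question's comment worries about. One possible way around it, not part of this answer, is the identity ||t - d||^2 = ||t||^2 + ||d||^2 - 2 t·d, which needs only 2D arrays. The sketch below assumes that approach; pairwise_euclidean is a hypothetical helper name.

```python
import numpy as np

def pairwise_euclidean(test, data):
    """Distances of shape (len(test), len(data)) without a 3D temporary.

    Hypothetical helper for illustration, not a NumPy function.
    """
    test = np.asarray(test, dtype=float)
    data = np.asarray(data, dtype=float)
    # Squared norms of every row, shaped (m, 1) and (1, n) so they broadcast.
    test_sq = np.sum(test ** 2, axis=1)[:, None]
    data_sq = np.sum(data ** 2, axis=1)[None, :]
    # ||t - d||^2 = ||t||^2 + ||d||^2 - 2 t.d, with all t.d from one matrix product.
    sq = test_sq + data_sq - 2.0 * (test @ data.T)
    # Rounding can leave tiny negative values, so clip before the square root.
    return np.sqrt(np.maximum(sq, 0.0))

ens1 = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4],
                 [2, 4, 5]])
ens2 = np.array([[2, 3, 3],
                 [4, 1, 2]])

print(pairwise_euclidean(ens2, ens1))
```

The trade-off is accuracy: for points that are very close together relative to their magnitude, the subtraction of large squared norms can lose precision compared with the direct difference.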

edited Mar 30, 2016 at 17:05

answered Mar 30, 2016 at 16:39

B. M.


roadrunner66 · Accepted Answer · 2016-03-30 05:28:06Z

1

What about np.meshgrid?

import numpy as np

data = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4]])


test = np.array([[2,3,3],
                 [4,1,2]])   


d = np.arange(0,3)
t = np.arange(0,2)
d, t = np.meshgrid(d, t)

# print(test[t])
# print(data[d])
print(np.sqrt(np.sum((test[t]-data[d])**2,axis=2)))

output:

[[ 1.          2.44948974  2.44948974]
 [ 2.44948974  2.23606798  3.60555128]]
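To see what the index grids are doing, it may help to print the intermediate shapes. The sketch below reuses the answer's arrays and compares the result with a plain broadcast; the variable names gathered_test, gathered_data, via_meshgrid and via_broadcast are introduced here for illustration only.

```python
import numpy as np

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# Index grids: d[i, j] == j selects a data row, t[i, j] == i selects a test row.
d, t = np.meshgrid(np.arange(len(data)), np.arange(len(test)))
print(d.shape, t.shape)  # (2, 3) (2, 3)

# Fancy indexing gathers full copies of shape (len(test), len(data), dims).
gathered_test = test[t]
gathered_data = data[d]
print(gathered_test.shape, gathered_data.shape)  # (2, 3, 3) (2, 3, 3)

via_meshgrid = np.sqrt(np.sum((gathered_test - gathered_data) ** 2, axis=2))
# Broadcasting reaches the same result without gathering the operands first.
via_broadcast = np.sqrt(np.sum((test[:, None] - data) ** 2, axis=2))
print(np.allclose(via_meshgrid, via_broadcast))  # True
```

Both operands are materialized at full size here, so this route uses at least as much memory as the broadcast version.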

answered Mar 30, 2016 at 5:28

roadrunner66


1 Comment

roadrunner66 Over a year ago

After seeing Divakar's post, I'd go with scipy cdist.

Christian · Accepted Answer · 2016-03-30 05:13:16Z

-2

You could use a list comprehension:

result = np.array([np.sqrt(np.sum((t - data)**2, axis=1)) for t in test])

answered Mar 30, 2016 at 5:13

Christian


1 Comment

Phil Glau Over a year ago

My understanding is that a comprehension is just a fancy FOR loop. My hope is to exploit the speed of numpy and avoid a loop in Python.
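A possible middle ground between a per-point Python loop and one large broadcast is to loop over blocks of test points, so that each iteration is still vectorized but the temporary array stays bounded. This is only a sketch under that assumption; chunked_distances and its chunk parameter are hypothetical names, and a good chunk size would depend on available memory.

```python
import numpy as np

def chunked_distances(test, data, chunk=1024):
    """Euclidean distances, shape (len(test), len(data)), computed block by block.

    Hypothetical helper: `chunk` caps how many test points are broadcast at once.
    """
    test = np.asarray(test, dtype=float)
    data = np.asarray(data, dtype=float)
    out = np.empty((len(test), len(data)))
    for start in range(0, len(test), chunk):
        block = test[start:start + chunk]
        # Temporary of shape (len(block), len(data), dims), never larger than one chunk.
        diff = block[:, None, :] - data[None, :, :]
        out[start:start + chunk] = np.sqrt(np.sum(diff ** 2, axis=2))
    return out

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# chunk=1 reduces to the original per-point loop; larger chunks do more work per iteration.
print(chunked_distances(test, data, chunk=1))
```

With a chunk size in the hundreds or thousands, the Python-level loop runs only a handful of times, so its overhead is likely small next to the array work done inside each iteration.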
