0

I have a 2D matrix of values. Each row is a data point.

data = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4]])

Now if my test point is a single 1D numpy array like:

test = np.array([2,3,3])

I can do something simple like np.sqrt(np.sum((test-data)**2,axis=1)) to calculate the distance of the test point relative to all three data points.

However, if test is itself a 2D array of points to be tested, the above doesn't work, and I have been using something like:

test = np.array([[2,3,3],[4,1,2]])
for i in range(len(test)):
    print(np.sqrt(np.sum((test[i]-data)**2,axis=1)))

>>> [ 1.          2.44948974  2.44948974]
    [ 2.44948974  2.23606798  3.60555128]

I do this in order to compare each point in my test set against all the points in the data set. It seems like there should be a way to vectorize the whole operation so that I get back a (2,3) matrix of the corresponding distances without the outer for loop.

(Note: While this particular example is about Euclidean Distance, I find myself with similar type operations where I would like to perform an operation on all elements of one matrix with the individual elements of another matrix, so I'm hoping there's a generalized way to set up problems of this nature using Numpy.)

2
  • This seems to work, but I'm concerned about memory usage on larger data sets, as it seems to require duplicating each test point N times, where N is the number of data points. Thus, if there are 1000 data points, I need to build a 2000-point matrix to test two values. print np.reshape(np.sqrt(np.sum((np.reshape(np.repeat(test, len(data), axis=0), (len(test) * len(data), Xdims)) - ml.repmat(data, 2, 1)) ** 2, axis=1)), (2, len(data))).T Commented Mar 30, 2016 at 5:32
  • 2
    Just use scipy's cdist : from scipy.spatial.distance import cdist ; out = cdist(test,data). It's super efficient. Commented Mar 30, 2016 at 6:59
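
A minimal sketch of the cdist suggestion from the comment above, assuming SciPy is installed. It reuses the question's data and test arrays; the name dists is only illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# Row i holds the distances from test[i] to every row of data.
# The default metric is Euclidean.
dists = cdist(test, data)
print(dists.shape)  # (2, 3)
```

cdist also takes a metric argument (for example metric="cityblock"), so the same call covers several other distance measures.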

3 Answers

2

Use broadcasting to do that:

from numpy.linalg import norm
norm(data-test[:,None],axis=2)

which gives:

[ 1.          2.44948974  2.44948974]
[ 2.44948974  2.23606798  3.60555128]

Some explanation: it is easier to understand with arrays of different shapes, for example four points and two points:

ens1 = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4],
    [2, 4, 5]])


ens2 = np.array([[2,3,3],
                 [4,1,2]])  


In [16]: ens1.shape
Out[16]: (4, 3)

In [17]: ens2.shape
Out[17]: (2, 3)   

Then:

In [21]: ens2[:,None].shape 
Out[21]: (2, 1, 3) 

This adds a new axis. Now broadcasting can perform the 2 x 4 = 8 subtractions:

In [22]: (ens1-ens2[:,None]).shape
Out[22]: (2, 4, 3)       

Then take the norm along the last axis to get the 8 distances:

In [23]: norm(ens1-ens2[:,None],axis=2)
Out[23]: 
array([[ 1.        ,  2.44948974,  2.44948974,  2.23606798],
       [ 2.44948974,  2.23606798,  3.60555128,  4.69041576]])     
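
The same pattern should carry over to other pairwise operations, which is what the note in the question asks about. Below is a hedged sketch, not part of the original answer: the names diff, euclidean and manhattan are illustrative, and Manhattan distance is just one example of swapping in a different reduction.

```python
import numpy as np

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# (2, 1, 3) against (1, 3, 3) broadcasts to (2, 3, 3):
# one length-3 difference vector per (test point, data point) pair.
diff = test[:, None, :] - data[None, :, :]

# Any elementwise operation followed by a reduction over the last axis
# yields a (2, 3) result.
euclidean = np.sqrt((diff ** 2).sum(axis=2))
manhattan = np.abs(diff).sum(axis=2)

print(manhattan)
# [[1 4 4]
#  [4 3 5]]
```

Note that diff holds len(test) * len(data) * 3 values, so memory grows with the product of the two set sizes even though no Python loop is involved.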

1

What about np.meshgrid?

import numpy as np

data = np.array(
   [[2, 2, 3],
    [4, 2, 4],
    [1, 1, 4]])


test = np.array([[2,3,3],
                 [4,1,2]])   


d = np.arange(0,3)
t = np.arange(0,2)
d, t = np.meshgrid(d, t)

# print(test[t])
# print(data[d])
print(np.sqrt(np.sum((test[t]-data[d])**2,axis=2)))

output:

[[ 1.          2.44948974  2.44948974]
 [ 2.44948974  2.23606798  3.60555128]]
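
To see why this works, it may help to look at the shapes involved. The sketch below is an illustration rather than part of the answer; it builds the index grids from len(data) and len(test) instead of hard-coding 3 and 2.

```python
import numpy as np

data = np.array([[2, 2, 3],
                 [4, 2, 4],
                 [1, 1, 4]])
test = np.array([[2, 3, 3],
                 [4, 1, 2]])

# Index grids: d[i, j] == j (data row), t[i, j] == i (test row).
d, t = np.meshgrid(np.arange(len(data)), np.arange(len(test)))
print(d.shape, t.shape)              # (2, 3) (2, 3)

# Fancy indexing builds a full (2, 3, 3) copy of each array,
# so every (test, data) pair is lined up explicitly.
print(test[t].shape, data[d].shape)  # (2, 3, 3) (2, 3, 3)

dists = np.sqrt(np.sum((test[t] - data[d]) ** 2, axis=2))
```

Because test[t] and data[d] are both materialized in full, this variant likely uses more memory than broadcasting with a length-1 axis, which only materializes the difference array.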

1 Comment

After seeing Divakar's post, I'd go with scipy cdist.
-2

You could use a list comprehension:

result = np.array([np.sqrt(np.sum((t - data)**2, axis=1)) for t in test])

1 Comment

My understanding is that a comprehension is just a fancy FOR loop. My hope is to exploit the speed of numpy and avoid a loop in Python.
