12

I am new to Python so this question might look trivia. However, I did not find a similar case to mine. I have a matrix of coordinates for 20 nodes. I want to compute the euclidean distance between all pairs of nodes from this set and store them in a pairwise matrix. For example, If I have 20 nodes, I want the end result to be a matrix of (20,20) with values of euclidean distance between each pairs of nodes. I tried to used a for loop to go through each element of the coordinate set and compute euclidean distance as follows:

ncoord=numpy.matrix('3225   318;2387    989;1228    2335;57      1569;2288  8138;3514   2350;7936   314;9888    4683;6901   1834;7515   8231;709   3701;1321    8881;2290   2350;5687   5034;760    9868;2378   7521;9025   5385;4819   5943;2917   9418;3928   9770')
n=20 
c=numpy.zeros((n,n))
for i in range(0,n):
    for j in range(i+1,n):
        c[i][j]=math.sqrt((ncoord[i][0]-ncoord[j][0])**2+(ncoord[i][1]-ncoord[j][1])**2)

How ever, I am getting an error of "input must be a square array ". I wonder if anybody knows what is happening here. Thanks

6
  • Please edit your question to include the definition of ncoord. Thanks for improving the question's reference value and making it more answerable! Commented Feb 24, 2015 at 2:50
  • What is your n? for j in range(i+1,n-1) will do j=i+1, i+2, ..., n-2. My guess is you want both those ranges to go up to n, not n-1. Commented Feb 24, 2015 at 2:52
  • @MarkG yes I have 20 nodes (n=20) and I want both indices go up to n. I tried n instead of n-1 but I got the same error. I can easily code this in MATLAB but I have to use Python. Indexing in Python is different so I might be wrong. Commented Feb 24, 2015 at 3:03
  • Then both your for loops should go up to n: for i in range(0,n):, and for j in range(i+1,n): If this isn't your bug you then will need to show more code. Commented Feb 24, 2015 at 3:06
  • @MarkG yes this is not my bug. my code is what I mentioned in the main question. I have nothing more Commented Feb 24, 2015 at 3:09

4 Answers 4

29

There are much, much faster alternatives to using nested for loops for this. I'll show you two different approaches - the first will be a more general method that will introduce you to broadcasting and vectorization, and the second uses a more convenient scipy library function.


  1. The general way, using broadcasting & vectorization

One of the first things I'd suggest doing is switching to using np.array rather than np.matrix. Arrays are preferred for a number of reasons, most importantly because they can have >2 dimensions, and they make element-wise multiplication much less awkward.

import numpy as np

ncoord = np.array(ncoord)

With an array, we can eliminate the nested for loops by inserting a new singleton dimension and broadcasting the subtraction over it:

# indexing with None (or np.newaxis) inserts a new dimension of size 1
print(ncoord[:, :, None].shape)
# (20, 2, 1)

# by making the 'inner' dimensions equal to 1, i.e. (20, 2, 1) - (1, 2, 20),
# the subtraction is 'broadcast' over every pair of rows in ncoord
xydiff = ncoord[:, :, None] - ncoord[:, :, None].T

print(xydiff.shape)
# (20, 2, 20)

This is equivalent to looping over every pair of rows using nested for loops, but much, much faster!

xydiff2 = np.zeros((20, 2, 20), dtype=xydiff.dtype)
for ii in range(20):
    for jj in range(20):
        for kk in range(2):
            xydiff[ii, kk, jj] = ncoords[ii, kk] - ncoords[jj, kk]

# check that these give the same result
print(np.all(xydiff == xydiff2))
# True

The rest we can also do using vectorized operations:

# we square the differences and sum over the 'middle' axis, equivalent to
# computing (x_i - x_j) ** 2 + (y_i - y_j) ** 2
ssdiff = (xydiff * xydiff).sum(1)

# finally we take the square root
D = np.sqrt(ssdiff)

The whole thing could be done in one line like this:

D = np.sqrt(((ncoord[:, :, None] - ncoord[:, :, None].T) ** 2).sum(1))

  1. The lazy way, using pdist

It turns out that there's already a fast and convenient function for computing all pairwise distances: scipy.spatial.distance.pdist.

from scipy.spatial.distance import pdist, squareform

d = pdist(ncoord)

# pdist just returns the upper triangle of the pairwise distance matrix. to get
# the whole (20, 20) array we can use squareform:

print(d.shape)
# (190,)

D2 = squareform(d)
print(D2.shape)
# (20, 20)

# check that the two methods are equivalent
print np.all(D == D2)
# True
Sign up to request clarification or add additional context in comments.

3 Comments

This broadcasting is magical to me. How can I get some intuition about it?
Thanks for the amazing method but it is still much slower than cross product whereas the complexity looks the same.
I also had some memory issues when computing on a large matrix (1000 * 20000) with the method 1 which i did not have with the method 2 (scipy).
5
for i in range(0, n):
    for j in range(i+1, n):
        c[i, j] = math.sqrt((ncoord[i, 0] - ncoord[j, 0])**2 
        + (ncoord[i, 1] - ncoord[j, 1])**2)

Note: ncoord[i, j] is not the same as ncoord[i][j] for a Numpy matrix. This appears to be the source of confusion. If ncoord is a Numpy array then they will give the same result.

For a Numpy matrix, ncoord[i] returns the ith row of ncoord, which itself is a Numpy matrix object with shape 1 x 2 in your case. Therefore, ncoord[i][j] actually means: take the ith row of ncoord and take the jth row of that 1 x 2 matrix. This is where your indexing problems comes about when j > 0.

Regarding your comments on assigning to c[i][j] "working", it shouldn't. At least on my build of Numpy 1.9.1 it shouldn't work if your indices i and j iterates up to n.

As an aside, remember to add the transpose of the matrix c to itself.

It is recommended to use Numpy arrays instead of matrix. See this post.

If your coordinates are stored as a Numpy array, then pairwise distance can be computed as:

from scipy.spatial.distance import pdist

pairwise_distances = pdist(ncoord, metric="euclidean", p=2)

or simply

pairwise_distances = pdist(ncoord)

since the default metric is "euclidean", and default "p" is 2.

In a comment below I mistakenly mentioned that the result of pdist is a n x n matrix. To get a n x n matrix, you will need to do the following:

from scipy.spatial.distance import pdist, squareform

pairwise_distances = squareform(pdist(ncoord))

or

from scipy.spatial.distance import cdist

pairwise_distances = cdist(ncoord, ncoord)

8 Comments

I actually did that but did not put it here. The last line of my code is: c[j][i]=c[i][j]
Thank you it is working now. However I am misunderstood now. I thought when we want to call an element of a matrix in Python, we need to call it as a[][], but you are using a[,]. Why did you use the second format to read data from ncoord but saved the distance in c matrix by calling c elements like c[][]?
Thank you so much for the complete information. I will try the other method you mentioned to see if I can get the same matrix size as the result ( I assume that pairwise_distances is a n*n matrix)
Yes pairwise_distances will be a n x n matrix if n is the number of points you have.
Thanks. But I still don't understand the difference between ncoord[i,j] and ncoord[i][j]
|
1

What I figure you wanted to do: You said you wanted a 20 by 20 matrix... but the one you coded is triangular.

Thus I coded a complete 20x20 matrix instead.

distances = []
for i in range(len(ncoord)):
    given_i = []
    for j in range(len(ncoord)):
        d_val = math.sqrt((ncoord[i, 0]-ncoord[j,0])**2+(ncoord[i,1]-ncoord[j,1])**2)
        given_i.append(d_val)

    distances.append(given_i)

    # distances[i][j] = distance from i to j

SciPy way:

from scipy.spatial.distance import cdist
# Isn't scipy nice - can also use pdist... works in the same way but different recall method.
distances = cdist(ncoord, ncoord, 'euclidean')

2 Comments

Thank you for your comment. I will try your method as well.
Anytime you have to do a double loop through an array in numpy, you're losing the speed advantage afforded by NumPy in the firstplace. You want to broadcast whenever possible. However, for some operations, I think including this one, you can't broadcast because the values at each step depend on their neighbors. In these cases, SciPy solutions are usually optimized at the c-level (see cython), so they can still be much faster. I'd expect the cdist function to be much faster than the double loop.
0

Using your own custom sqrt sum sqaures is not always safe, they can overflow or underflow. Speed wise they are same

np.hypot(
    np.subtract.outer(x, x),
    np.subtract.outer(y, y)
)

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

Refer

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.