
I have a NumPy array in Python that contains lots (10k+) of 3D vertex points (vectors with coordinates [x, y, z]). I need to calculate the distance between all possible pairs of these points.

It's easy to do using SciPy:

import scipy.spatial.distance as spdist
D = spdist.cdist(verts, verts)

But I can't use this because of a project policy on introducing new dependencies.

So I came up with this naive code:

def vert_dist(self, A, B):
    return ((B[0]-A[0])**2 + (B[1]-A[1])**2 + (B[2]-A[2])**2) ** 0.5

# Pairwise distance between verts.
# Use SciPy if available, otherwise fall back to pure Python.
try:
    import scipy.spatial.distance as spdist
    D = spdist.cdist(verts, verts)
except ImportError:
    # FIXME: this is VERY slow:
    D = np.empty((len(verts), len(verts)), dtype=np.float64)
    for i, v in enumerate(verts):
        for j in range(i, len(verts)):
            D[j][i] = D[i][j] = self.vert_dist(v, verts[j])

vert_dist() calculates the 3D distance between two vertices; the rest of the code iterates over the vertices in the 1D array, calculates each one's distance to every other vertex, and produces a 2D array of distances.
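For reference, the per-pair formula agrees with np.linalg.norm, which is a readable drop-in for the single-pair computation (a minimal sanity check; vert_dist here is a standalone copy without the self parameter):

```python
import numpy as np

def vert_dist(A, B):
    # Euclidean distance between two 3D points
    return ((B[0]-A[0])**2 + (B[1]-A[1])**2 + (B[2]-A[2])**2) ** 0.5

A = np.array([0.0, 0.0, 0.0])
B = np.array([1.0, 2.0, 2.0])
print(vert_dist(A, B))        # 3.0
print(np.linalg.norm(B - A))  # 3.0
```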

But that's extremely slow (about 1000 times) compared to SciPy's native C code. I wonder if I can speed it up using pure NumPy, at least to some extent.

Some more info: https://github.com/scipy/scipy/issues/9172

BTW, I've tried the PyPy JIT compiler and it was even slower (10 times) than pure Python.

UPDATE: I was able to speed things up a little like this:

def vert_dist_matrix(self, verts):
    # FIXME: still slower than SciPy's cdist():
    D = np.empty((len(verts), len(verts)), dtype=np.float64)
    for i, v in enumerate(verts):
        D[i] = D[:, i] = np.sqrt(np.sum(np.square(verts - v), axis=1))
    return D

This eliminates the inner loop by computing a whole row at once, which makes things quite a bit faster, but it's still noticeably slower than SciPy. So I'm still looking at @Divakar's solution.

  • Are you looking for something like this? stackoverflow.com/a/51352819/4909087 Commented Aug 26, 2018 at 21:51
  • This answer is also quite interesting, as the performance is discussed: stackoverflow.com/a/37903795/8069403 Commented Aug 26, 2018 at 21:57
  • @coldspeed I already saw this post, but it was very unclear to me how I could modify it to do the ((B[0]-A[0])**2+(B[1]-A[1])**2+(B[2]-A[2])**2)**(1.0/2) calculation rather than just the subtraction. Commented Aug 26, 2018 at 21:58
  • @xdze2 I've tried this, but it only replaces the vert_dist() function, which is fast enough. When I replace ((B[0]-A[0])**2+(B[1]-A[1])**2+(B[2]-A[2])**2)**(1.0/2) with np.sqrt(np.sum(np.square(B-A))) it's a bit more readable, but it still doesn't eliminate the 2D for loops, which are the actual slow part of this code. Commented Aug 26, 2018 at 22:06

1 Answer

There's the eucl_dist package (disclaimer: I am its author) that contains two methods for computing squared Euclidean distances more efficiently than SciPy's cdist, especially for large arrays (with a decent to large number of columns).

We will adapt some of the code from its source to our problem here, giving us two approaches.

Approach #1

Following the wiki contents, we can leverage matrix multiplication and some NumPy-specific optimizations for our first approach, like so -

import numpy as np

def pdist_squareformed_numpy(a):
    # ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 * (a_i . a_j)
    a_sumrows = np.einsum('ij,ij->i', a, a)  # squared norm of each row
    dist = a_sumrows[:, None] + a_sumrows - 2*np.dot(a, a.T)
    np.fill_diagonal(dist, 0)
    return dist
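The identity behind this is ||a_i - a_j||^2 = ||a_i||^2 + ||a_j||^2 - 2 * (a_i . a_j). A quick sanity check of that expansion against a direct broadcasted computation (a sketch, not part of the original answer):

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(4, 3)

# left side of the identity, computed as in the answer
sq = np.einsum('ij,ij->i', a, a)               # squared norm of each row
via_identity = sq[:, None] + sq - 2 * np.dot(a, a.T)

# reference: direct broadcasted squared pairwise distances
ref = np.sum((a[:, None, :] - a[None, :, :])**2, axis=-1)
print(np.allclose(via_identity, ref, atol=1e-10))  # True
```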

Approach #2

One more method is to create "extended" versions of the input arrays, again discussed in detail at that GitHub source code link. This gives our second approach, which does better with fewer columns, as is the case here -

def ext_arrs(A, B, precision="float64"):
    # Build "extended" arrays whose matrix product yields squared distances:
    # row i of A_ext is [1..1, A_i, A_i**2];
    # column j of B_ext is [B_j**2, -2*B_j, 1..1]
    nA, dim = A.shape
    A_ext = np.ones((nA, dim*3), dtype=precision)
    A_ext[:, dim:2*dim] = A
    A_ext[:, 2*dim:] = A**2

    nB = B.shape[0]
    B_ext = np.ones((dim*3, nB), dtype=precision)
    B_ext[:dim] = (B**2).T
    B_ext[dim:2*dim] = -2.0*B.T
    return A_ext, B_ext

def pdist_squareformed_numpy_v2(a):
    A_ext, B_ext = ext_arrs(a, a)
    dist = A_ext.dot(B_ext)  # one big matrix multiplication does all the work
    np.fill_diagonal(dist, 0)
    return dist
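The trick is that each row of A_ext is [1..1, a_i, a_i**2] and each column of B_ext is [b_j**2, -2*b_j, 1..1], so their dot product expands to sum_k (a_ik**2 - 2*a_ik*b_jk + b_jk**2), i.e. the squared distance. A compact check of this against direct broadcasting (a sketch, not part of the original answer):

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(4, 3)
nA, dim = a.shape

# extended arrays built inline, same layout as ext_arrs(a, a)
A_ext = np.hstack([np.ones((nA, dim)), a, a**2])
B_ext = np.vstack([(a**2).T, -2.0 * a.T, np.ones((dim, nA))])

# reference: direct broadcasted squared pairwise distances
ref = np.sum((a[:, None, :] - a[None, :, :])**2, axis=-1)
print(np.allclose(A_ext.dot(B_ext), ref, atol=1e-10))  # True
```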

Note that these give us squared Euclidean distances. So, for the actual distances, wrap the result in np.sqrt() if that's the final output needed.
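One caveat worth noting (my addition, not from the original answer): floating-point cancellation in these formulations can leave tiny negative entries where the true squared distance is essentially zero, and np.sqrt() then yields NaN, which matches the negative-value issue discussed in the comments below. A common guard is to clip at zero before taking the square root:

```python
import numpy as np

# A tiny negative value stands in for a cancellation artifact where the
# true squared distance is 0; sqrt of a negative float would give NaN.
d_sq = np.array([[0.0, 0.25],
                 [-1e-16, 0.0]])
D = np.sqrt(np.maximum(d_sq, 0))  # clip at zero before the square root
print(D)  # no NaNs; the -1e-16 entry maps to 0.0
```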

Sample runs -

In [380]: np.random.seed(0)
     ...: a = np.random.rand(5,3)

In [381]: from scipy.spatial.distance import cdist

In [382]: cdist(a,a)
Out[382]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

In [383]: np.sqrt(pdist_squareformed_numpy(a))
Out[383]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

In [384]: np.sqrt(pdist_squareformed_numpy_v2(a))
Out[384]: 
array([[0.  , 0.29, 0.42, 0.2 , 0.57],
       [0.29, 0.  , 0.58, 0.42, 0.76],
       [0.42, 0.58, 0.  , 0.45, 0.9 ],
       [0.2 , 0.42, 0.45, 0.  , 0.51],
       [0.57, 0.76, 0.9 , 0.51, 0.  ]])

Timings on 10k points -

In [385]: a = np.random.rand(10000,3)

In [386]: %timeit cdist(a,a)
1 loop, best of 3: 309 ms per loop

# Approach #1
In [388]: %timeit pdist_squareformed_numpy(a) # squared eucl distances
1 loop, best of 3: 668 ms per loop

In [389]: %timeit np.sqrt(pdist_squareformed_numpy(a)) # actual eucl distances
1 loop, best of 3: 812 ms per loop

# Approach #2
In [390]: %timeit pdist_squareformed_numpy_v2(a) # squared eucl distances
1 loop, best of 3: 237 ms per loop

In [391]: %timeit np.sqrt(pdist_squareformed_numpy_v2(a)) # actual eucl distances
1 loop, best of 3: 395 ms per loop

The second approach seems close to cdist on performance!


13 Comments

@Divakar This is really cool and blazing fast, thanks! But in pdist_squareformed_numpy_v2() I had to replace return dist with return np.abs(np.nan_to_num(dist)). It works better now, but it gives inaccurate results in some cases. This code is made so that vertices closer than 1e-5 are treated as identical, but I have to lower the precision to 1e-1 to get the same results as with SciPy. Probably there's some loss of accuracy.
@Harvie.CZ Why are you using np.nan_to_num? Would you have NaNs in your input array?
@Divakar I added it before np.abs(); I've since found that with np.abs() it's not needed anymore. Probably np.sqrt() was returning NaN for the negative values, which the rest of the code couldn't interpret (e.g. when comparing with <).
@Divakar Both are working! I'll do some more benchmarking and probably go with the first one, since it's shorter and easier to understand.
