1

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.

It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!

a=np.random.randint(0,2,size=(100,100,100))
# df with i,j,k of interest
df=pd.DataFrame(np.random.randint(100,size=(100,3)).tolist(),
                columns=['i','j','k'])

def find_nearest(a,df):
    import numpy as np
    import pandas as pd
    import time
    t0=time.time()
    nzi = np.nonzero(a)
    for i,r in df.iterrows():
        dist = ((r['k'] - nzi[0])**2 + \
              (r['i'] - nzi[1])**2 + \
              (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i,['nk','ni','nj']]=(nzi[0][nidx],
                                    nzi[1][nidx],
                                    nzi[2][nidx])
    print(time.time()-t0)
    return(df)

1 Answer 1

1

The problem that you are trying to solve looks like a nearest-neighbor search.

The worst-case complexity of the current code is O(n m) with n the number of point to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means about hundreds of million iterations. To solve this efficiently, one can use a better algorithm.

The common way to solve this kind of problem consists in putting all elements in a BSP-Tree data structure (such as Quadtree or Octree. Such a data structure helps you to locate the nearest elements near a location in a O(log(m)) time. As a result, the overall complexity of this method is O(n log(m))! SciPy already implement KD-trees.

Vectorization generally also help to speed up the computation.

def find_nearest_fast(a,df):
    from scipy.spatial import KDTree
    import numpy as np
    import pandas as pd
    import time
    t0=time.time()
    candidates = np.array(np.nonzero(a)).transpose().copy()
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices,:]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time()-t0)
    return df

This implementation is 16 times faster on my machine. Note the results differ a bit since there are multiple nearest points for a given input point (with the same distance).

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.