1

I need to loop over a numpy array and then do the following search. The following is taking almost 60(s) for an array (npArray1 and npArray2 in the example below) with around 300K values.

In other words, I am looking for the index of the first occurence in npArray2 for every value of npArray1.

for id in np.nditer(npArray1):                    
       newId=(np.where(npArray2==id))[0][0] 

Is there anyway I can make the above faster using numpy? I need to run the script above on much bigger arrays (50M). Please note that my two numpy arrays in the lines above, npArray1 and npArray2 are not necessarily the same size, but they are both 1d.

Thanks a lot for your help,

2
  • 2
    Could you add some short and complete code example? Define two arrays, etc.. It would help people to understand better your problem Commented Mar 11, 2016 at 11:14
  • Curious if any of the solutions posted here work for you? Commented Mar 16, 2016 at 7:43

3 Answers 3

1

The function np.unique will do much of the work for you:

npArray2 = np.random.randint(100,None,(1000,)) #1000-long vector of ints between 1 and 100, so lots of repeats
vals,idxs = np.unique(searchMe, return_index=True) #each unique value AND the index of its first appearance
for val in npArray1:
  newId = idxs[vals==val][0]

vals is an array containing the unique values in npArray2, while idxs gives the index of the first appearance of each value in npArray2. Searching in vals should be much faster than in npArray1 because it's smaller.

You can speed up the search further by taking advantage of the fact that vals is sorted:

import bisect  #we can use binary search since vals is sorted
for val in npArray1:
    newId = idxs[bisect.bisect_left(vals, val)]
Sign up to request clarification or add additional context in comments.

Comments

1

Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter option for a vectorized solution, like so -

arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]

Sample run to verify output -

In [154]: npArray1
Out[154]: array([77, 19,  0, 69])

In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20,  0])

In [156]: out = np.empty(npArray1.size,dtype=int)
     ...: for i,id in np.ndenumerate(npArray1):
     ...:     out[i] = (np.where(npArray2==id))[0][0]
     ...:     

In [157]: arr2_sortidx = npArray2.argsort()
     ...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...: out1 = arr2_sortidx[idx]
     ...: 

In [158]: out
Out[158]: array([4, 3, 9, 7])

In [159]: out1
Out[159]: array([4, 3, 9, 7])

Runtime test -

In [175]: def original_app(npArray1,npArray2):
     ...:     out = np.empty(npArray1.size,dtype=int)
     ...:     for i,id in np.ndenumerate(npArray1):
     ...:         out[i] = (np.where(npArray2==id))[0][0] 
     ...:     return out
     ...: 
     ...: def searchsorted_app(npArray1,npArray2):
     ...:   arr2_sortidx = npArray2.argsort()
     ...:   idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...:   return arr2_sortidx[idx]
     ...: 

In [176]: # Setup inputs
     ...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
     ...: maxn = 200000
     ...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
     ...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
     ...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
     ...: 

In [177]: out1 = original_app(npArray1,npArray2)

In [178]: out2 = searchsorted_app(npArray1,npArray2)

In [179]: np.allclose(out1,out2)
Out[179]: True

In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop

In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop

2 Comments

Really great answer. Though sorting a large array might be quite time and CPU consuming. In every other aspect you solution is just elegant and simple.
@RomanKh Thanks! I need to get into numba at some point too! Lovely blog on numba you got there!
0

In the task you specified you have to iterate over the array one way or another. So you can just think of a considerable performance improvement without changing your algorithm too much. This is where numba might be of a great help:

import numpy as np
from numba import jit

@jit
def numba_iter(npa1, npa2):
    for id in np.nditer(npa1):                    
        newId=(np.where(npa2==id))[0][0]

This simple approach might make your program much faster. Look at some examples and benchmarks here.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.