Looping and Searching in Numpy Array

Question

I need to loop over a numpy array and then do the following search. The following is taking almost 60(s) for an array (npArray1 and npArray2 in the example below) with around 300K values.

In other words, I am looking for the index of the first occurence in npArray2 for every value of npArray1.

for id in np.nditer(npArray1):                    
       newId=(np.where(npArray2==id))[0][0]

Is there anyway I can make the above faster using numpy? I need to run the script above on much bigger arrays (50M). Please note that my two numpy arrays in the lines above, npArray1 and npArray2 are not necessarily the same size, but they are both 1d.

Thanks a lot for your help,

Could you add some short and complete code example? Define two arrays, etc.. It would help people to understand better your problem — Juan Leni
– Juan Leni, Commented Mar 11, 2016 at 11:14

mtrw · Accepted Answer · 2016-03-11 12:02:15Z

1

The function np.unique will do much of the work for you:

npArray2 = np.random.randint(100,None,(1000,)) #1000-long vector of ints between 1 and 100, so lots of repeats
vals,idxs = np.unique(searchMe, return_index=True) #each unique value AND the index of its first appearance
for val in npArray1:
  newId = idxs[vals==val][0]

vals is an array containing the unique values in npArray2, while idxs gives the index of the first appearance of each value in npArray2. Searching in vals should be much faster than in npArray1 because it's smaller.

You can speed up the search further by taking advantage of the fact that vals is sorted:

import bisect  #we can use binary search since vals is sorted
for val in npArray1:
    newId = idxs[bisect.bisect_left(vals, val)]

edited Mar 11, 2016 at 12:02

answered Mar 11, 2016 at 11:38

mtrw

35.3k7 gold badges66 silver badges73 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Divakar · Accepted Answer · 2016-03-11 12:17:55Z

1

Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter option for a vectorized solution, like so -

arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]

Sample run to verify output -

In [154]: npArray1
Out[154]: array([77, 19,  0, 69])

In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20,  0])

In [156]: out = np.empty(npArray1.size,dtype=int)
     ...: for i,id in np.ndenumerate(npArray1):
     ...:     out[i] = (np.where(npArray2==id))[0][0]
     ...:     

In [157]: arr2_sortidx = npArray2.argsort()
     ...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...: out1 = arr2_sortidx[idx]
     ...: 

In [158]: out
Out[158]: array([4, 3, 9, 7])

In [159]: out1
Out[159]: array([4, 3, 9, 7])

Runtime test -

In [175]: def original_app(npArray1,npArray2):
     ...:     out = np.empty(npArray1.size,dtype=int)
     ...:     for i,id in np.ndenumerate(npArray1):
     ...:         out[i] = (np.where(npArray2==id))[0][0] 
     ...:     return out
     ...: 
     ...: def searchsorted_app(npArray1,npArray2):
     ...:   arr2_sortidx = npArray2.argsort()
     ...:   idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...:   return arr2_sortidx[idx]
     ...: 

In [176]: # Setup inputs
     ...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
     ...: maxn = 200000
     ...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
     ...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
     ...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
     ...: 

In [177]: out1 = original_app(npArray1,npArray2)

In [178]: out2 = searchsorted_app(npArray1,npArray2)

In [179]: np.allclose(out1,out2)
Out[179]: True

In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop

In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop

edited Mar 11, 2016 at 12:17

answered Mar 11, 2016 at 11:45

Divakar

222k19 gold badges273 silver badges374 bronze badges

2 Comments

Roman Kh Over a year ago

Really great answer. Though sorting a large array might be quite time and CPU consuming. In every other aspect you solution is just elegant and simple.

Divakar Over a year ago

@RomanKh Thanks! I need to get into numba at some point too! Lovely blog on numba you got there!

Roman Kh · Accepted Answer · 2016-03-11 11:48:47Z

0

In the task you specified you have to iterate over the array one way or another. So you can just think of a considerable performance improvement without changing your algorithm too much. This is where numba might be of a great help:

import numpy as np
from numba import jit

@jit
def numba_iter(npa1, npa2):
    for id in np.nditer(npa1):                    
        newId=(np.where(npa2==id))[0][0]

This simple approach might make your program much faster. Look at some examples and benchmarks here.

answered Mar 11, 2016 at 11:48

Roman Kh

2,7652 gold badges20 silver badges16 bronze badges

Collectives™ on Stack Overflow

Looping and Searching in Numpy Array

3 Answers 3

Comments

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related