
I have two lists, one of which is massive (millions of elements), the other several thousand. I want to do the following

bigArray = [0, 1, 0, 2, 3, 2, ...]

smallArray = [0, 1, 2, 3, 4]

for i in range(len(smallArray)):
    pts = np.where(bigArray == smallArray[i])
    # Do stuff with pts...

The above works, but is slow. Is there any way to do this more efficiently without resorting to writing something in C?

1 Comment

I really doubt you would get much speedup porting to C, since the comparison and the where operation are most likely already implemented in C. Commented Apr 25, 2012 at 17:43

3 Answers


NumPy provides the function numpy.searchsorted.

Example:

>>> import numpy as np
>>> order = np.argsort(big_list)
>>> r = np.searchsorted(big_list, small_list, side='right', sorter=order)
>>> l = np.searchsorted(big_list, small_list, side='left', sorter=order)
>>> for b, e in zip(l, r):
...     inds = order[b:e]
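
For instance, here is a self-contained version of the above on made-up sample data (the arrays are hypothetical; note that searchsorted with sorter= returns positions in sorted order, so the slices index into the permutation, not into big_list directly):

import numpy as np

big_list = np.array([0, 1, 0, 2, 3, 2, 5, 3])   # hypothetical sample data
small_list = np.array([0, 2, 3])

order = np.argsort(big_list)                     # permutation that sorts big_list
left = np.searchsorted(big_list, small_list, side='left', sorter=order)
right = np.searchsorted(big_list, small_list, side='right', sorter=order)

for value, b, e in zip(small_list, left, right):
    inds = order[b:e]                            # positions of value in big_list
    print(value, inds)                           # 0 -> [0 2], 2 -> [3 5], 3 -> [4 7]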

1 Comment

I didn't test the bisection version, but in my quick experiment this is at least faster than the defaultdict lookup answer; in my setup, by a bit more than a factor of 2.

In your case you may benefit from presorting your big array. Here is an example demonstrating how you can reduce the time from ~45 seconds to ~2 seconds on my laptop, for one particular pair of array lengths (5e6 vs 1e3). Obviously the solution won't be optimal if the array sizes are vastly different: with the brute-force solution the complexity is O(bigN*smallN), but for my suggested solution it is O((bigN+smallN)*log(bigN)).

import numpy as np, numpy.random as nprand, time, bisect

bigN = int(5e6)
smallN = 1000
maxn = int(1e7)
nprand.seed(1)
bigArr = nprand.randint(0, maxn, size=bigN)
smallArr = nprand.randint(0, maxn, size=smallN)

# brute force: scan the whole big array once per small element
t1 = time.time()
for i in range(len(smallArr)):
    inds = np.where(bigArr == smallArr[i])[0]
t2 = time.time()
print("Brute", t2 - t1)

# not brute force: sort once, then binary-search for each small element
t1 = time.time()
sortedind = np.argsort(bigArr)
sortedbigArr = bigArr[sortedind]
for i in range(len(smallArr)):
    i1 = bisect.bisect_left(sortedbigArr, smallArr[i])
    i2 = bisect.bisect_right(sortedbigArr, smallArr[i])
    inds = sortedind[i1:i2]
t2 = time.time()
print("Non-brute", t2 - t1)

Output:

Brute 42.5278530121

Non-brute 1.57193303108

1 Comment

I'm not entirely sure, but there is probably room for optimization by using np.searchsorted instead of the bisection loop.
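
Along those lines, here is a sketch of what that might look like, reusing the variable names from the answer above (untested against the original timing setup):

# vectorized variant: one searchsorted call over all of smallArr
# instead of one bisect call per element
i1 = np.searchsorted(sortedbigArr, smallArr, side='left')
i2 = np.searchsorted(sortedbigArr, smallArr, side='right')
for b, e in zip(i1, i2):
    inds = sortedind[b:e]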

So far I don't see any need for numpy; you can make use of defaultdict, provided your memory is sufficient. It should be, as long as the number of observations is not too many millions.

big_list = [0,1,0,2,3,2,5,6,7,5,6,4,5,3,4,3,5,6,5]
small_list = [0,1,2,3,4]

from collections import defaultdict

dicto = defaultdict(list) #dictionary stores all the relevant coordinates
                          #so you don't have to search for them later

for ind, ele in enumerate(big_list):
    dicto[ele].append(ind)

Result:

>>> for ele in small_list:
...     print(dicto[ele])
...
[0, 2]
[1]
[3, 5]
[4, 13, 15]
[11, 14]

This should give you some speed.
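
For instance, the question's original loop then reduces to a plain dictionary lookup (a sketch; note that pts here is a Python list of indices rather than the array np.where returns):

for ele in small_list:
    pts = dicto[ele]  # O(1) average lookup; a list of positions in big_list
    # Do stuff with pts...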
