Numpy.array indexing question

Question

I am trying to create a 'mask' of a numpy.array by specifying certain criteria. NumPy even has nice syntax for something like this:

>> A = numpy.array([1,2,3,4,5])
>> A > 3
array([False, False, False, True, True])

But if I have a list of criteria instead of a range:

>> A = numpy.array([1,2,3,4,5])
>> crit = [1,3,5]

I can't do this:

>> A in crit

I have to do something based on list comprehensions, like this:

>> numpy.array([a in crit for a in A])
array([True, False, True, False, True])

Which is correct.

Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?

EDIT: I was able to get a small speedup by making crit into a set.

EDIT2: For those who are interested:

Jouni's approach: 1000 loops, best of 3: 102 µs per loop

numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop

EDIT3: Just tested again with B = randint(10,size=100)

Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop

numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop

Conclusion: Use numpy.in1d() unless B is very small.
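
For anyone who wants to repeat the comparison, here is a rough timing sketch. The size of A is my assumption (the timings above do not state it), the function names are made up for illustration, and it calls numpy.isin, the newer spelling of numpy.in1d, so the absolute numbers will not match the ones above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)           # seeded only so runs are repeatable
A = rng.integers(0, 10, size=10000)      # assumed size, not given in the post
B = rng.integers(0, 10, size=100)        # same shape of B as in EDIT3

def or_masks(A, B):
    # Jouni's approach: OR together one equality mask per value in B.
    mask = np.zeros(A.shape, dtype=bool)
    for t in B:
        mask |= (A == t)
    return mask

def isin_mask(A, B):
    # Library membership test (numpy.in1d on older NumPy versions).
    return np.isin(A, B)

for fn in (or_masks, isin_mask):
    seconds = timeit.timeit(lambda: fn(A, B), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Both functions should return the same mask; only the running time differs.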

Justin Peel · Accepted Answer · 2010-10-21 22:08:54Z

6

I think that the numpy function in1d is what you are looking for:

>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A,B)
array([ True, False,  True, False,  True], dtype=bool)

as stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])"

Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.
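
A minimal sketch of that call, checked against the list comprehension from the question. It uses numpy.isin, which newer NumPy releases provide for the same job; as far as I know numpy.in1d is deprecated as of NumPy 2.0, so on an older NumPy substitute in1d for isin.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
B = [1, 3, 5]

# Element-wise membership test: True where an element of A occurs in B.
mask = np.isin(A, B)

# Reference result from the (slow) list comprehension in the question.
expected = np.array([a in B for a in A])

print(mask)
print((mask == expected).all())
```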

Another faster way

Here's another way to do it which is faster. Sort the B array first (it contains the elements you are looking for in A), turn it into a numpy array, and then do:

B[B.searchsorted(A)] == A

though if you have elements in A that are larger than the largest in B, you will need to do:

inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A

It may not be faster for small arrays (especially when B is small), but before long it will definitely be faster. Why? Because this is an O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in B, whereas putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, I just thought that you might like to know, especially if you are truly planning on using this on very large arrays.
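
Putting the pieces together, a sketch of the searchsorted approach as one function. The name membership_mask is made up for illustration, and using numpy.unique to sort B (and drop duplicates) is my choice rather than part of the steps above.

```python
import numpy as np

def membership_mask(A, B):
    """Return a boolean mask, True where an element of A occurs in B.

    Hypothetical helper; assumes A is at least one-dimensional.
    """
    A = np.asarray(A)
    B = np.unique(B)              # sorted, de-duplicated copy of B
    if B.size == 0:
        return np.zeros(A.shape, dtype=bool)
    inds = B.searchsorted(A)      # insertion position of each element of A in B
    inds[inds == len(B)] = 0      # elements above max(B): clamp so B[inds] is valid
    return B[inds] == A

A = np.array([1, 2, 3, 4, 5, 9])
mask = membership_mask(A, [5, 1, 3])
print(mask)
```

Clamping to index 0 is safe because an element larger than everything in B can never equal B[0], so those positions come out False.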

edited Oct 21, 2010 at 22:08

answered Oct 21, 2010 at 19:05

Justin Peel


3 Comments

Paul Over a year ago

looks like a recent addition to numpy (wasn't in version 1.3)

aduric Over a year ago

You are right. I only tested on B having a length of 3. If B is also large, numpy.in1d() definitely scales a lot better.

Justin Peel Over a year ago

@aduric and my second method is even faster than in1d.

Jouni K. Seppänen · Accepted Answer · 2010-10-21 18:07:43Z

3

Combine several comparisons with "or":

from numpy.random import randint

A = randint(10,size=10000)
mask = (A == 1) | (A == 3) | (A == 5)

Or if you have a list B and want to create the mask dynamically:

from numpy import zeros

B = [1, 3, 5]
mask = zeros(A.shape, dtype=bool)
for t in B: mask = mask | (A == t)
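
The loop above can also be written as a single reduction. A sketch, with A regenerated here so the block runs on its own; it does the same O(N * M) work, just without the explicit Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only so the example is repeatable
A = rng.integers(0, 10, size=10000)
B = [1, 3, 5]

# Build one boolean mask per value in B, then OR them together element-wise.
mask = np.logical_or.reduce([A == t for t in B])

print(mask[:10])
```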

answered Oct 21, 2010 at 18:07

Jouni K. Seppänen


2 Comments

dtlussier Over a year ago

just wondering why or how to anticipate when numpy will naturally do this ufunc enabled element-wise logical operation? When doing logical operations numpy sometimes throws back an exception: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
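
For what it's worth, a small sketch of when that exception shows up, as far as I understand it: the operators & and | are applied element-wise by NumPy, while the keywords and, or and if need a single truth value for the whole array.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

# Element-wise: & and | are overloaded by NumPy and return a boolean array.
elementwise = (A > 1) & (A < 5)

# `and` / `or` / `if` ask for one truth value, so Python calls bool() on the
# array, which raises ValueError when it has more than one element.
try:
    (A > 1) and (A < 5)
    raised = False
except ValueError:
    raised = True

print(elementwise, raised)
```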

aduric Over a year ago

this is certainly the fastest approach, albeit, not the cleanest one.

dr jimbob · Accepted Answer · 2010-10-21 18:12:42Z

0

Create a mask and use the compress function of the numpy array. It should be much faster. If you have complex criteria, remember to construct them from element-wise operations on the arrays.

a = numpy.array([3,1,2,4,5])
mask = a > 3
b = a.compress(mask)

or

a = numpy.random.random_integers(1,5,100000)
c=a.compress((a<=4)*(a>=2)) ## numbers with 2 <= n <= 4
d=a.compress(~((a<=4)*(a>=2))) ## numbers with n > 4 or n < 2

Ok, if you want a mask that has all a in [1,3,5] you can do something like

a = numpy.random.random_integers(1,5,100000)
mask=(a==1)+(a==3)+(a==5)

or

a = numpy.random.random_integers(1,5,100000)
mask = numpy.zeros(len(a), dtype=bool)
for num in [1,3,5]:
    mask += (a==num)

edited Oct 21, 2010 at 18:12

answered Oct 21, 2010 at 17:07

dr jimbob


2 Comments

aduric Over a year ago

I don't think that this is what I'm looking for. I don't want to get the actual contents of the array back, I just want to get a boolean mask that has the same length as the original array.

dr jimbob Over a year ago

Ok, edited it now that I know what you want. I guess the solution Jouni came up with while I was editing mine is equivalent, since for booleans True + True = True, True + False = True, and False + False = False, exactly the same as or-ing with |.
