
I am trying to create a 'mask' of a numpy.array by specifying certain criteria. NumPy even has nice syntax for something like this:

>> A = numpy.array([1,2,3,4,5])
>> A > 3
array([False, False, False, True, True])

But if I have a list of criteria instead of a range:

>> A = numpy.array([1,2,3,4,5])
>> crit = [1,3,5]

I can't do this:

>> A in crit

I have to do something based on list comprehensions, like this:

>> numpy.array([a in crit for a in A])
array([True, False, True, False, True])

Which is correct.

Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?

EDIT: I was able to get a small speedup by making crit into a set.

EDIT2: For those who are interested:

Jouni's approach: 1000 loops, best of 3: 102 µs per loop

numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop

EDIT3: Just tested again with B = randint(10,size=100)

Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop

numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop

Conclusion: Use numpy.in1d() unless B is very small.
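For anyone who wants to repeat the comparison, here is a rough timing sketch. The array sizes, the seed, and the helper name `or_loop` are assumptions for illustration, not the exact setup behind the numbers above, and `numpy.isin` is used as the newer spelling of `numpy.in1d`. Absolute timings will differ by machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily
A = rng.integers(10, size=10000)     # assumed size for A
B = rng.integers(10, size=100)       # mirrors B = randint(10, size=100)

def or_loop(A, B):
    # OR together one equality mask per element of B
    mask = np.zeros(A.shape, dtype=bool)
    for t in B:
        mask |= (A == t)
    return mask

def isin(A, B):
    return np.isin(A, B)

for fn in (or_loop, isin):
    seconds = timeit.timeit(lambda: fn(A, B), number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Both functions should return the same mask, so only the timing differs.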

3 Answers

6

I think that the numpy function in1d is what you are looking for:

>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A, B)
array([ True, False,  True, False,  True], dtype=bool)

As stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])".

Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.
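On newer NumPy releases the same test is usually spelled `numpy.isin`, and `numpy.in1d` is deprecated as of NumPy 2.0. A minimal sketch, assuming a NumPy version that provides `isin` (1.13 or later):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
B = [1, 3, 5]

# np.isin tests each element of A for membership in B and
# returns a boolean array with the same shape as A.
mask = np.isin(A, B)
print(mask)       # [ True False  True False  True]

# invert=True gives the complement without a separate ~mask step.
not_in_B = np.isin(A, B, invert=True)
```

Unlike `in1d`, `isin` keeps the shape of `A`, so it also works on multi-dimensional arrays.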

Another faster way

Here's another way to do it which is faster. Sort the B array first (it contains the elements you are looking for in A), turn it into a numpy array, and then do:

B[B.searchsorted(A)] == A

though if you have elements in A that are larger than the largest in B, you will need to do:

inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A

It may not be faster for small arrays (especially when B is small), but it pulls ahead as the arrays grow. Why? Because this is an O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in B, whereas putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, I thought you might like to know, especially if you are truly planning on using this on very large arrays.
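Put together, the searchsorted approach might look like the sketch below. The helper name `membership_mask` is made up for illustration, and `np.unique` is used here to sort B and drop duplicates in one step. It assumes A is at least one-dimensional.

```python
import numpy as np

def membership_mask(A, B):
    """Boolean mask marking elements of A that occur in B (hypothetical helper)."""
    A = np.asarray(A)
    B = np.unique(B)                 # sorted copy of B with duplicates removed
    if B.size == 0:
        return np.zeros(A.shape, dtype=bool)
    inds = np.searchsorted(B, A)     # insertion index in B for each element of A
    inds[inds == B.size] = 0         # values above max(B) would index past the end
    return B[inds] == A

A = np.array([1, 2, 3, 4, 5, 9])
mask = membership_mask(A, [5, 1, 3])
```

Redirecting out-of-range indices to 0 is safe because those elements of A are larger than every value in B, so the comparison against `B[0]` is always False.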


3 Comments

looks like a recent addition to numpy (wasn't in version 1.3)
You are right. I only tested on B having a length of 3. If B is also large, numpy.in1d() definitely scales a lot better.
@aduric and my second method is even faster than in1d.
3

Combine several comparisons with "or":

from numpy.random import randint
A = randint(10,size=10000)
mask = (A == 1) | (A == 3) | (A == 5)

Or if you have a list B and want to create the mask dynamically:

from numpy import zeros
B = [1, 3, 5]
mask = zeros(A.shape, dtype=bool)  # start with all False, same shape as A
for t in B:
    mask |= (A == t)
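The loop above can also be written without an explicit accumulator. This is a sketch of two equivalent spellings, not a claim about which is fastest; the seed and sizes are arbitrary, and it assumes B is non-empty.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(10, size=10000)
B = [1, 3, 5]

# Build one comparison per target and OR them together in a single call.
mask = np.logical_or.reduce([A == t for t in B])

# Equivalent broadcasting form: compare every element of A against every
# element of B (shape (len(A), len(B))), then collapse along the B axis.
mask_bcast = (A[:, None] == np.asarray(B)).any(axis=1)
```

The broadcasting form allocates a len(A) by len(B) temporary, so it is only practical when B is short.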

2 Comments

just wondering why or how to anticipate when numpy will naturally do this ufunc enabled element-wise logical operation? When doing logical operations numpy sometimes throws back an exception: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
this is certainly the fastest approach, albeit, not the cleanest one.
0

Create a mask and use the compress function of the numpy array. It should be much faster. If you have complex criteria, remember to build them from operations on whole arrays.

a = numpy.array([3,1,2,4,5])
mask = a > 3
b = a.compress(mask)

or

a = numpy.random.randint(1, 6, 100000)  # integers 1..5; random_integers is deprecated
c = a.compress((a <= 4) & (a >= 2))     # numbers with 2 <= n <= 4
d = a.compress(~((a <= 4) & (a >= 2)))  # numbers with n > 4 or n < 2
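For 1-D arrays, plain boolean indexing gives the same result as `compress` and is the more common spelling. A small sketch with a fixed array so the output is predictable:

```python
import numpy as np

a = np.array([3, 1, 2, 4, 5])
mask = (a >= 2) & (a <= 4)    # True where 2 <= a <= 4

selected = a[mask]            # same result as a.compress(mask)
rest = a[~mask]               # elements outside the range
```

Here `selected` holds 3, 2, 4 and `rest` holds 1, 5, in their original order.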

OK, if you want a mask that is True wherever a is in [1,3,5], you can do something like:

a = numpy.random.randint(1, 6, 100000)
mask=(a==1)+(a==3)+(a==5)

or

a = numpy.random.randint(1, 6, 100000)
mask = numpy.zeros(len(a), dtype=bool)
for num in [1,3,5]:
    mask += (a==num)

2 Comments

I don't think that this is what I'm looking for. I don't want to get the actual contents of the array back, I just want to get a boolean mask that has the same length as the original array.
Ok, edited it now that I know what you want. I guess Jouni's solution that he came up with while I was editing mine was equivalent, as True= True + True, True = True + False, False = False + False, exactly the same as or using |.
