Pythonic and efficient way to do an elementwise "in" using numpy

Question

I'm looking for a way to efficiently get an array of booleans, where given two arrays with equal size a and b, each element is true if the corresponding element of a appears in the corresponding element of b.

For example, the following program:

a = numpy.array([1, 2, 3, 4])
b = numpy.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]])
print(numpy.magic_function(a, b))

Should print

[True, True, False, False]

Keep in mind this function should be the equivalent of

[x in y for x, y in zip(a, b)]

Only numpy-optimized for cases when a and b are big, and each element of b is reasonably small.

Aside: if you care about efficiency, you shouldn't be using an array like your b: instead of a fast numpy 2D array of int dtype you only have a 1D array of object dtype. — DSM
– DSM, Commented Jul 24, 2015 at 19:37
To be fair, the brunt of the work (checking if an integer is in a numpy array) is handled by your list comprehension. That may be the most efficient (and definitely the most Pythonic) way to do it. — Alyssa Haroldsen
– Alyssa Haroldsen, Commented Jul 24, 2015 at 19:43
@Kupiakos Good point, I forgot to add that the usecase I'm looking for this is for big a and b, and each element of b having few elements. — Martín Fixman
– Martín Fixman, Commented Jul 24, 2015 at 19:45
Based on docs.scipy.org/doc/numpy/reference/arrays.nditer.html, I think you want to use nditer or maybe in1d docs.scipy.org/doc/numpy/reference/generated/…. — Bennett Brown
– Bennett Brown, Commented Jul 24, 2015 at 19:48
Since the elements of b differ in length, there's little point in it being an array. Either turn it into a real 2d array, or treat it as a list. — hpaulj
– hpaulj, Commented Jul 24, 2015 at 23:17

Bennett Brown · Accepted Answer · 2015-07-26 05:57:18Z

4

To take advantage of NumPy's broadcasting rules you should make array b squared first, which can be achieved using itertools.izip_longest:

from itertools import izip_longest

c = np.array(list(izip_longest(*b))).astype(float)

resulting in:

array([[  1.,   2.,   5.,   7.],
       [  2.,   8.,   6.,  nan],
       [ 13.,   9.,  nan,  nan]])

Then, by doing np.isclose(c, a) you get a 2D array of Booleans showing the difference between each c[:, i] and a[i], according to the broadcasting rules, giving:

array([[ True,  True, False, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

Which can be used to obtain your answer:

np.any(np.isclose(c, a), axis=0)
#array([ True,  True, False, False], dtype=bool)

edited Jul 26, 2015 at 5:57

Bennett Brown

5,3911 gold badge29 silver badges35 bronze badges

answered Jul 25, 2015 at 23:15

Saullo G. P. Castro

59.4k28 gold badges191 silver badges244 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Bas Swinckels · Accepted Answer · 2015-07-24 20:18:06Z

Is there an upper limit to the length of the small lists in b? If so, maybe you could make b a matrix of say 1000x5, and use nan to fill the gaps for the sub-arrays that are too short. You can then use numpy.any to get the answer you want, something like this:

In [42]: a = np.array([1, 2, 3, 4])
    ...: b = np.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]])

In [43]: bb = np.full((len(b), max(len(i) for i in b)), np.nan)

In [44]: for irow, row in enumerate(b):
    ...:     bb[irow, :len(row)] = row

In [45]: bb
Out[45]: 
array([[  1.,   2.,  13.],
       [  2.,   8.,   9.],
       [  5.,   6.,  nan],
       [  7.,  nan,  nan]])

In [46]: a[:,np.newaxis] == bb
Out[46]: 
array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

In [47]: np.any(a[:,np.newaxis] == bb, axis=1)
Out[47]: array([ True,  True, False, False], dtype=bool)

No idea if this is faster for your data.

Bennett Brown · Accepted Answer · 2015-07-26 05:20:40Z

Summary

The approach from Sauldo Castro runs most quickly among those posted so far. The generator expression in the original post is second fastest.

Code to generate test data:

import numpy
import random

alength = 100
a = numpy.array([random.randint(1, 6) for i in range(alength)])
b = []
for i in range(alength):
    length = random.randint(1, 5)
    element = []
    for i in range(length):
        element.append(random.randint(1, 6))
    b.append(element)
b = numpy.array(b)
print a, b

The options:

from itertools import izip_longest
def magic_function1(a, b): # From OP Martin Fixman
    return [x in y for x, y in zip(a, b)]  

def magic_function2(a, b): # What I thought might be better.
    bools = []
    for x, y in zip(a,b):
        found = False
        for j in y:
            if x == j:
                found=True
                break
        bools.append(found)

def magic_function3(a, b): # What I tried first
    bools = []
    for i in range(len(a)):
        found = False
        for j in range(len(b[i])):
            if a[i] == b[i][j]:
                found=True
                break
        bools.append(found)

def magic_function4(a, b): # From Bas Swinkels
    bb = numpy.full((len(b), max(len(i) for i in b)), numpy.nan)
    for irow, row in enumerate(b):
        bb[irow, :len(row)] = row
    a[:,numpy.newaxis] == bb
    return numpy.any(a[:,numpy.newaxis] == bb, axis=1)

def magic_function5(a, b): # From Sauldo Castro, revised version
    c = numpy.array(list(izip_longest(*b))).astype(float)
    return numpy.isclose(c, a), axis=0)

Time n_executions

n_executions = 100
clock = timeit.Timer(stmt="magic_function1(a, b)", setup="from __main__ import magic_function1, a, b")
print clock.timeit(n_executions), "seconds"
# Repeat with each candidate function

The results:

0.158078225475 seconds for magic_function1
0.181080926835 seconds for magic_function2
0.259621047822 seconds for magic_function3
0.287054750224 seconds for magic_function4
0.0839162196207 seconds for magic_function5

Collectives™ on Stack Overflow

Pythonic and efficient way to do an elementwise "in" using numpy

3 Answers 3

Comments

Comments

Summary

Code to generate test data:

The options:

Time n_executions

The results:

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Summary

Code to generate test data:

The options:

Time n_executions

The results:

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related