13

I'm looking for a way to efficiently get an array of booleans, where given two arrays with equal size a and b, each element is true if the corresponding element of a appears in the corresponding element of b.

For example, the following program:

a = numpy.array([1, 2, 3, 4])
b = numpy.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]])
print(numpy.magic_function(a, b))

Should print

[True, True, False, False]

Keep in mind this function should be the equivalent of

[x in y for x, y in zip(a, b)]

Only numpy-optimized for cases when a and b are big, and each element of b is reasonably small.

6
  • 3
    Aside: if you care about efficiency, you shouldn't be using an array like your b: instead of a fast numpy 2D array of int dtype you only have a 1D array of object dtype. Commented Jul 24, 2015 at 19:37
  • To be fair, the brunt of the work (checking if an integer is in a numpy array) is handled by your list comprehension. That may be the most efficient (and definitely the most Pythonic) way to do it. Commented Jul 24, 2015 at 19:43
  • @Kupiakos Good point, I forgot to add that the usecase I'm looking for this is for big a and b, and each element of b having few elements. Commented Jul 24, 2015 at 19:45
  • Based on docs.scipy.org/doc/numpy/reference/arrays.nditer.html, I think you want to use nditer or maybe in1d docs.scipy.org/doc/numpy/reference/generated/…. Commented Jul 24, 2015 at 19:48
  • 1
    Since the elements of b differ in length, there's little point in it being an array. Either turn it into a real 2d array, or treat it as a list. Commented Jul 24, 2015 at 23:17

3 Answers 3

4

To take advantage of NumPy's broadcasting rules you should make array b squared first, which can be achieved using itertools.izip_longest:

from itertools import izip_longest

c = np.array(list(izip_longest(*b))).astype(float)

resulting in:

array([[  1.,   2.,   5.,   7.],
       [  2.,   8.,   6.,  nan],
       [ 13.,   9.,  nan,  nan]])

Then, by doing np.isclose(c, a) you get a 2D array of Booleans showing the difference between each c[:, i] and a[i], according to the broadcasting rules, giving:

array([[ True,  True, False, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

Which can be used to obtain your answer:

np.any(np.isclose(c, a), axis=0)
#array([ True,  True, False, False], dtype=bool)
Sign up to request clarification or add additional context in comments.

Comments

3

Is there an upper limit to the length of the small lists in b? If so, maybe you could make b a matrix of say 1000x5, and use nan to fill the gaps for the sub-arrays that are too short. You can then use numpy.any to get the answer you want, something like this:

In [42]: a = np.array([1, 2, 3, 4])
    ...: b = np.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]])

In [43]: bb = np.full((len(b), max(len(i) for i in b)), np.nan)

In [44]: for irow, row in enumerate(b):
    ...:     bb[irow, :len(row)] = row

In [45]: bb
Out[45]: 
array([[  1.,   2.,  13.],
       [  2.,   8.,   9.],
       [  5.,   6.,  nan],
       [  7.,  nan,  nan]])

In [46]: a[:,np.newaxis] == bb
Out[46]: 
array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

In [47]: np.any(a[:,np.newaxis] == bb, axis=1)
Out[47]: array([ True,  True, False, False], dtype=bool)

No idea if this is faster for your data.

Comments

1

Summary

The approach from Sauldo Castro runs most quickly among those posted so far. The generator expression in the original post is second fastest.

Code to generate test data:

import numpy
import random

alength = 100
a = numpy.array([random.randint(1, 6) for i in range(alength)])
b = []
for i in range(alength):
    length = random.randint(1, 5)
    element = []
    for i in range(length):
        element.append(random.randint(1, 6))
    b.append(element)
b = numpy.array(b)
print a, b

The options:

from itertools import izip_longest
def magic_function1(a, b): # From OP Martin Fixman
    return [x in y for x, y in zip(a, b)]  

def magic_function2(a, b): # What I thought might be better.
    bools = []
    for x, y in zip(a,b):
        found = False
        for j in y:
            if x == j:
                found=True
                break
        bools.append(found)

def magic_function3(a, b): # What I tried first
    bools = []
    for i in range(len(a)):
        found = False
        for j in range(len(b[i])):
            if a[i] == b[i][j]:
                found=True
                break
        bools.append(found)

def magic_function4(a, b): # From Bas Swinkels
    bb = numpy.full((len(b), max(len(i) for i in b)), numpy.nan)
    for irow, row in enumerate(b):
        bb[irow, :len(row)] = row
    a[:,numpy.newaxis] == bb
    return numpy.any(a[:,numpy.newaxis] == bb, axis=1)

def magic_function5(a, b): # From Sauldo Castro, revised version
    c = numpy.array(list(izip_longest(*b))).astype(float)
    return numpy.isclose(c, a), axis=0)  

Time n_executions

n_executions = 100
clock = timeit.Timer(stmt="magic_function1(a, b)", setup="from __main__ import magic_function1, a, b")
print clock.timeit(n_executions), "seconds"
# Repeat with each candidate function

The results:

  • 0.158078225475 seconds for magic_function1
  • 0.181080926835 seconds for magic_function2
  • 0.259621047822 seconds for magic_function3
  • 0.287054750224 seconds for magic_function4
  • 0.0839162196207 seconds for magic_function5

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.