
I have two 2D numpy arrays, for example:

A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])

and

B = numpy.array([[1, 2], [32, 32]])

I want to keep all lines of A that contain all elements of any single line of B. Where a line of B contains the same element twice, the line of A must contain it at least twice as well. For my example, I want to achieve this:

A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]

I have control over the value representation, so I chose numbers whose binary representation has exactly one bit set (for example 0b00000001, 0b00000010, etc.). This way I can easily check whether all types of values are present in a row by using np.logical_or.reduce(), but I cannot check that a row of A contains at least as many copies of the same element as a row of B does. I was really hoping to avoid a plain for loop and deep copies of the arrays, since performance is a very important aspect for me.
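For illustration, this is roughly the kind of check I mean (a simplified sketch, not my exact code):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
b_row = np.array([32, 32])

row_masks = np.bitwise_or.reduce(A, axis=1)   # one combined bitmask per row of A
hand_mask = np.bitwise_or.reduce(b_row)       # combined bitmask of one row of B
has_all_types = (row_masks & hand_mask) == hand_mask
# has_all_types == [False, True, True]: the last row also passes,
# because the bitmask cannot distinguish one 32 from two 32s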

How can I do that in numpy in an efficient way?


Update:

A solution from here may work, but I think performance is a big concern for me: A can be really big (>300000 rows) and B can be moderate (>30 rows):

[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]

Update 2:

The set() solution does not work, since set() drops all duplicated values: set([16, 32, 32, 8]) is {8, 16, 32}, so the "at least two 32s" requirement can no longer be expressed.

  • 2 of the same element in B or any row of B? Commented Jul 9, 2018 at 18:38
  • I would check out this question: check if numpy array is subset of another array Commented Jul 9, 2018 at 18:38
  • Is the 'at least 2' important? If lines in A and B need an equal count of each token in B, I think I know an elegant solution. Commented Jul 9, 2018 at 18:50
  • 2 of the same element of any row. 'at least 2' is really important. In A, a line of [32, 32, 32, 8] shall be accepted Commented Jul 9, 2018 at 18:53
  • 1
    I dont think sets fully match your description, since the distinction between single and multiple occurences gets lost. Commented Jul 9, 2018 at 19:54

2 Answers


I hope I got your question right. At least it works for the problem you described in your question. If the order of the elements in the output should stay the same as in the input, replace the in-place sort with sorting on copies.

The code looks quite ugly, but it should perform well and shouldn't be too hard to understand.

Code

import time
import numba as nb
import numpy as np

@nb.njit(fastmath=True, parallel=True)
def filter(A, B):
    # Boolean mask marking the rows of A that contain some row of B
    iFilter = np.zeros(A.shape[0], dtype=nb.bool_)

    for i in nb.prange(A.shape[0]):
        break_loop = False

        for j in range(B.shape[0]):
            # Rows of A and B are sorted, so B[j] is contained in A[i]
            # (with multiplicities) exactly if B[j] is a subsequence of A[i]
            ind_to_B = 0
            for k in range(A.shape[1]):
                if A[i, k] == B[j, ind_to_B]:
                    ind_to_B += 1

                if ind_to_B == B.shape[1]:
                    # Every element of B[j], duplicates included, was found in A[i]
                    iFilter[i] = True
                    break_loop = True
                    break

            if break_loop:
                break

    return A[iFilter, :]
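For reference, applying this to the example arrays from the question (note that filter expects both A and B to be row-sorted before the call, and that the in-place sort reorders the elements within each row):

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

A.sort()  # rows become [1, 2, 4, 8], [8, 16, 32, 32], [8, 16, 32, 64]
B.sort()
print(filter(A, B))
# [[ 1  2  4  8]
#  [ 8 16 32 32]]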

Measuring performance

#### First call has some compilation overhead ####
A = np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000, 4)
B = np.random.randint(low=0, high=60, size=30*2).reshape(30, 2)

t1 = time.time()
# Sort the arrays first (required by filter)
A.sort()
B.sort()
A_filtered = filter(A, B)
print(time.time() - t1)

#### Let's measure the second call too ####
A = np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000, 4)
B = np.random.randint(low=0, high=60, size=30*2).reshape(30, 2)

t1 = time.time()
# Sort the arrays first (required by filter)
A.sort()
B.sort()
A_filtered = filter(A, B)
print(time.time() - t1)

Results

46 ms after the first run on a dual-core notebook (sorting included)
32 ms (sorting excluded)

2 Comments

I tried it and it indeed sped up my code, thanks! I am new to numba and have only read a little about it. Could I use nopython mode in this case, or does numpy require object mode?
@doodoroma @njit is a shortcut for @jit(nopython=True)

I think this should work:

First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):

Make A shape [n_rows, n_tokens], dtype int8, where each element counts the number of tokens. Encode B in the same way, with shape [n_hands, n_tokens]

This allows for a single vectorized expression of your output: matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (Exactly how to map this matches array to your desired output format is left as an exercise for the reader, since the question leaves it undefined for multiple matches.)

But we are talking about >10 MByte of memory per token here. Even with 32 tokens that should not be unthinkable; but in a situation like this it tends to be better not to vectorize the loop over n_tokens or n_hands, or both. For loops are fine for small n, or if there is sufficient work to be done in the body, such that the looping overhead is insignificant.

As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution, if staying within the realm of pure python and numpy.
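A minimal sketch of this approach (the encode_counts helper and the use of np.union1d to build the token vocabulary are illustrative choices, not prescribed above):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

# Token vocabulary: one column per distinct value occurring in A or B
tokens = np.union1d(A, B)                       # [1, 2, 4, 8, 16, 32, 64]

def encode_counts(X, tokens):
    # counts[i, t] = how many times tokens[t] occurs in row X[i]
    return (X[:, :, None] == tokens[None, None, :]).sum(axis=1).astype(np.int8)

A_counts = encode_counts(A, tokens)             # shape [n_rows, n_tokens]
B_counts = encode_counts(B, tokens)             # shape [n_hands, n_tokens]

# matches[j, i] is True if row i of A contains row j of B with multiplicities
matches = (A_counts[None, :, :] >= B_counts[:, None, :]).all(axis=-1)
A_filtered = A[matches.any(axis=0)]
# A_filtered == [[1, 2, 4, 8], [16, 32, 32, 8]]

For a large A, the broadcasted comparison inside encode_counts itself allocates an [n_rows, n_cols, n_tokens] temporary, so in practice you would probably encode A offline or in chunks.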

4 Comments

What do you mean by where each element counts the number of tokens?
Your example A array has 7 unique tokens. Assigning each token a column, and assigning each element the count of that token in the row, gives A = numpy.array([[1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 2, 0], [0, 0, 0, 1, 1, 1, 1]])
Yes, this is the kind of solution I was looking for. I can only convert A offline; B must be converted every time. I just counted: I have 71 tokens, which seems to be a lot.
Even though I was thinking of a solution like this answer, I chose @max9111's idea because of the memory constraint. Having 71 tokens would require too big a matrix to handle (300000×71). Anyway, it was really useful!
