
I have two 2D numpy arrays, for example:

A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])

and

B = numpy.array([[1, 2], [32, 32]])

I want to keep all lines of A that contain all elements of any single line of B. Where a line of B contains the same element twice, the line of A must contain it at least twice as well. For my example, I want to achieve this:

A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]

I have control over the value representation, so I chose numbers whose binary representation has exactly one bit set (for example 0b00000001, 0b00000010, etc.). This way I can easily check whether all types of values are present in a row by using np.logical_or.reduce(), but I cannot check that a row of A contains at least as many copies of the same element as a row of B does. I was really hoping to avoid a plain for loop and deep copies of the arrays, since performance is a very important aspect for me.
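For illustration, this is roughly the kind of check I mean (a simplified sketch, not my exact code):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
b_row = np.array([32, 32])

row_masks = np.bitwise_or.reduce(A, axis=1)   # one combined bitmask per row of A
hand_mask = np.bitwise_or.reduce(b_row)       # combined bitmask of one row of B
has_all_types = (row_masks & hand_mask) == hand_mask
# has_all_types == [False, True, True]: the last row also passes,
# because the bitmask cannot distinguish one 32 from two 32s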

How can I do that in numpy in an efficient way?


Update:

A solution from here may work, but I think performance is a big concern for me: A can be really big (>300000 rows) and B can be moderate (>30 rows):

[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]

Update 2:

The set() solution does not work, since set() drops all duplicated values: set([16, 32, 32, 8]) is {8, 16, 32}, so the "at least two 32s" requirement can no longer be expressed.

  • 2 of the same element in B or any row of B? Commented Jul 9, 2018 at 18:38
  • I would check out this question: check if numpy array is subset of another array Commented Jul 9, 2018 at 18:38
  • Is the 'at least 2' important? If lines in A and B need an equal count of each token in B, I think I know an elegant solution. Commented Jul 9, 2018 at 18:50
  • 2 of the same element of any row. 'at least 2' is really important. In A, a line of [32, 32, 32, 8] shall be accepted Commented Jul 9, 2018 at 18:53
  • 1
    I dont think sets fully match your description, since the distinction between single and multiple occurences gets lost. Commented Jul 9, 2018 at 19:54

2 Answers


I hope I got your question right. At least it works for the problem you described in your question. If the order of the elements in the output should stay the same as in the input, replace the in-place sort with sorting on copies.

The code looks quite ugly, but it should perform well and shouldn't be too hard to understand.

Code

import time
import numba as nb
import numpy as np

@nb.njit(fastmath=True, parallel=True)
def filter(A, B):
    # Boolean mask marking the rows of A that contain some row of B
    iFilter = np.zeros(A.shape[0], dtype=nb.bool_)

    for i in nb.prange(A.shape[0]):
        break_loop = False

        for j in range(B.shape[0]):
            # Rows of A and B are sorted, so B[j] is contained in A[i]
            # (with multiplicities) exactly if B[j] is a subsequence of A[i]
            ind_to_B = 0
            for k in range(A.shape[1]):
                if A[i, k] == B[j, ind_to_B]:
                    ind_to_B += 1

                if ind_to_B == B.shape[1]:
                    # Every element of B[j], duplicates included, was found in A[i]
                    iFilter[i] = True
                    break_loop = True
                    break

            if break_loop:
                break

    return A[iFilter, :]
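For reference, applying this to the example arrays from the question (note that filter expects both A and B to be row-sorted before the call, and that the in-place sort reorders the elements within each row):

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

A.sort()  # rows become [1, 2, 4, 8], [8, 16, 32, 32], [8, 16, 32, 64]
B.sort()
print(filter(A, B))
# [[ 1  2  4  8]
#  [ 8 16 32 32]]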

Measuring performance

#### First call has some compilation overhead ####
A = np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000, 4)
B = np.random.randint(low=0, high=60, size=30*2).reshape(30, 2)

t1 = time.time()
# Sort the arrays first (required by filter)
A.sort()
B.sort()
A_filtered = filter(A, B)
print(time.time() - t1)

#### Let's measure the second call too ####
A = np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000, 4)
B = np.random.randint(low=0, high=60, size=30*2).reshape(30, 2)

t1 = time.time()
# Sort the arrays first (required by filter)
A.sort()
B.sort()
A_filtered = filter(A, B)
print(time.time() - t1)

Results

46 ms after the first run on a dual-core notebook (sorting included)
32 ms (sorting excluded)

2 Comments

I tried it and it indeed sped up my code, thanks! I am new to numba and have only read a little about it. Could I use nopython mode in this case, or does numpy require object mode?
@doodoroma @njit is a shortcut for @jit(nopython=True)

I think this should work:

First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):

Make A shape [n_rows, n_tokens], dtype int8, where each element counts the number of tokens. Encode B in the same way, with shape [n_hands, n_tokens]

This allows for a single vectorized expression of your output: matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (Exactly how to map this matches array to your desired output format is left as an exercise for the reader, since the question leaves it undefined for multiple matches.)

But we are talking about >10 MByte of memory per token here. Even with 32 tokens that should not be unthinkable; but in a situation like this it tends to be better not to vectorize the loop over n_tokens or n_hands, or both. For loops are fine for small n, or if there is sufficient work to be done in the body, such that the looping overhead is insignificant.

As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution, if staying within the realm of pure python and numpy.
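A minimal sketch of this approach (the encode_counts helper and the use of np.union1d to build the token vocabulary are illustrative choices, not prescribed above):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

# Token vocabulary: one column per distinct value occurring in A or B
tokens = np.union1d(A, B)                       # [1, 2, 4, 8, 16, 32, 64]

def encode_counts(X, tokens):
    # counts[i, t] = how many times tokens[t] occurs in row X[i]
    return (X[:, :, None] == tokens[None, None, :]).sum(axis=1).astype(np.int8)

A_counts = encode_counts(A, tokens)             # shape [n_rows, n_tokens]
B_counts = encode_counts(B, tokens)             # shape [n_hands, n_tokens]

# matches[j, i] is True if row i of A contains row j of B with multiplicities
matches = (A_counts[None, :, :] >= B_counts[:, None, :]).all(axis=-1)
A_filtered = A[matches.any(axis=0)]
# A_filtered == [[1, 2, 4, 8], [16, 32, 32, 8]]

For a large A, the broadcasted comparison inside encode_counts itself allocates an [n_rows, n_cols, n_tokens] temporary, so in practice you would probably encode A offline or in chunks.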

4 Comments

What do you mean by where each element counts the number of tokens?
Your example A array has 7 unique tokens. Assigning each token a column, and assigning each element the count of that token in the row, gives A = numpy.array([[1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 2, 0], [0, 0, 0, 1, 1, 1, 1]])
Yes, this is the kind of solution I was looking for. I can only convert A offline; B must be converted every time. I just counted: I have 71 tokens, which seems to be a lot.
Even though I was thinking of a solution like this answer, I chose @max9111's idea because of the memory constraint. Having 71 tokens would require too big a matrix to handle (300000×71). Anyway, it was really useful!
