3

The actual problem is in some machine learning application, and the data gets a little complex. So here's an MWE that captures the essence of the problem:

I have two arrays made as follows:

L = np.arange(12).reshape(4,3)
M = np.arange(12).reshape(6,2)

Now, I want to find the rows R in L, such that there exists some row in M that is made up of all the elements in R except the last one.

From the above example code, L and M look like this:

array([[ 0,  1,  2],  # L
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

array([[ 0,  1],  # M
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

I would like from these, the marked rows in L, as a numpy array:

array([[ 0,  1,  2],
       [ 6,  7,  8]])

If I were representing L and M as python lists, I would have done this:

L = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
M = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
answer = [R for R in L if R[:-1] in M]

Now, I know that I can use a similar list comprehension in numpy and cast the result into an array, numpy being as awesome as it is, probably has a more elegant way to do this that I don't know about.

I tried looking into np.where (to get the required indices, which I can then subscipt L with), but that doesn't seem to do what I need.

I'd appreciate any help

4
  • If a row of M contains the first two elements of a row of L, it can't contain the last element, because it's out of room. Will this also be true in your actual application? Commented Mar 22, 2014 at 21:42
  • @user2357112: absolutely true. This is why I am testing for "some row in M that is made up of all the elements in R except the last one". The last element of a row in L is an added dimension based on some extra computation Commented Mar 22, 2014 at 21:45
  • When I see problems like this, I wish NumPy's set operations weren't all 1D-only. I think you might be able to do something by putting together several sort and in1d calls. The idea is vague right now, though. Commented Mar 22, 2014 at 21:51
  • @user2357112: that's beyond my numpy-fu. I'd appreciate an example Commented Mar 22, 2014 at 21:53

4 Answers 4

4

Ok, I think I got it. The trick is to add another dimension to M, and then you can use broadcasting:

M.shape += (1,)
E = np.all(L[:,:-1].T == M, 1)

and you get a 6x4 boolean matrix E that gives you the results of comparing all rows of L with all rows of M.

From here it is easy to finish:

result = L[np.any(E,0)]

This way the solution is streamlined and you don't need any lambda functions or "implicit loops" (e.g. np.apply_along_axis()).

Yes, numpy vectorization is beautiful (but you have to think quite abstract sometimes)...

Sign up to request clarification or add additional context in comments.

3 Comments

Either I don't get this, or it doesn't work. Could you please implement the easy finish to show me how I could get the desired values?
Ok I made it explicit - I usually try not to give an answer that people can just copy and paste.
Awesome answer ! I updated mine with a detailed explanation of the array broadcast magic. Thanks for increasing my numpy-fu ! :)
3

Quite similar to Bitwise's answer :

def fn(a):
    return lambda b: np.all(a==b, axis=1)
matches = np.apply_along_axis(fn(M), 1, L[:,:2])
result = L[np.any(matches, axis=1)]

What happens under the hood is something like this (I'll use Bitwise's example, which is easier to demonstrate) :

>>> M
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])
>>> M.shape+=(1,)
>>> M
array([[[ 0],
        [ 1]],

       [[ 2],
        [ 3]],

       [[ 4],
        [ 5]],

       [[ 6],
        [ 7]],

       [[ 8],
        [ 9]],

       [[10],
        [11]]])

Here we have added another dimension to the M array, which is now (6,2,1).

>>> L2 = L[:,:-1].T

Then we get rid of the last column of 2, and transpose the array, so that the dimension is (2,4)

And here is the magic, M and L2 are now broadcastable to arrays of dimension (6,2,4).

As numpy's doc states :

A set of arrays is called “broadcastable” to the same shape if the above rules produce a valid result, i.e., one of the following is true:

The arrays all have exactly the same shape.
The arrays all have the same number of dimensions and the length of each dimensions is either a common length or 1.
The arrays that have too few dimensions can have their shapes prepended with a dimension of length 1 to satisfy property 2.

Example

If a.shape is (5,1), b.shape is (1,6), c.shape is (6,) and d.shape is () so that d is a scalar, then a, b, c, and d are all broadcastable to dimension (5,6); and

a acts like a (5,6) array where a[:,0] is broadcast to the other columns,
b acts like a (5,6) array where b[0,:] is broadcast to the other rows,
c acts like a (1,6) array and therefore like a (5,6) array where c[:] is broadcast to every row, and finally,
d acts like a (5,6) array where the single value is repeated.

M[:,:,0] will be repeated 4 times to fill the 3 dim, and L2 will be prepended a new dimension and be repeated 6 times to fill it.

>>> B = np.broadcast_arrays(L2,M)
>>> B
[array([[[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]],

       [[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]],

       [[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]],

       [[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]],

       [[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]],

       [[ 0,  3,  6,  9],
        [ 1,  4,  7, 10]]]),


array([[[ 0,  0,  0,  0],
        [ 1,  1,  1,  1]],

       [[ 2,  2,  2,  2],
        [ 3,  3,  3,  3]],

       [[ 4,  4,  4,  4],
        [ 5,  5,  5,  5]],

       [[ 6,  6,  6,  6],
        [ 7,  7,  7,  7]],

       [[ 8,  8,  8,  8],
        [ 9,  9,  9,  9]],

       [[10, 10, 10, 10],
        [11, 11, 11, 11]]])]

We can now compare them element-wise :

>>> np.equal(*B)
array([[[ True, False, False, False],
        [ True, False, False, False]],

       [[False, False, False, False],
        [False, False, False, False]],

       [[False, False, False, False],
        [False, False, False, False]],

       [[False, False,  True, False],
        [False, False,  True, False]],

       [[False, False, False, False],
        [False, False, False, False]],

       [[False, False, False, False],
        [False, False, False, False]]], dtype=bool)

Row to row (axis = 1):

>>> np.all(np.equal(*B), axis=1)
array([[ True, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

Aggregate on L's :

>>> C = np.any(np.all(np.equal(*B), axis=1), axis=0)
>>> C
array([ True, False,  True, False], dtype=bool)

And this gives you the boolean mask to apply to L.

>>> L[C]
array([[0, 1, 2],
       [6, 7, 8]])

apply_along_axis will leverage the same feature, but reducing L's dimension instead of increasing M's (thus adding implicit loops).

1 Comment

Thank you for such a detailed post. I'm going to stay with behzad.nouri's answer, just for its simplicity. But I really like yours for how the amount of detail and teaching me a few more things. I'm bookmarking this for future reference
0
>>> import hashlib
>>> fn = lambda xs: hashlib.sha1(xs).hexdigest()
>>> m = np.apply_along_axis(fn, 1, M)
>>> l = np.apply_along_axis(fn, 1, L[:,:-1])
>>> L[np.in1d(l, m)]
array([[0, 1, 2],
       [6, 7, 8]])

2 Comments

Could you add an explanation please? I don't fully understand the solution
@inspectorG4dget since in1d only works on 1 dimensional arrays, i am hashing the rows, and then apply in1d.
0
>>> print np.array([row for row in L if row[:-1] in M])
[[0 1 2]
 [6 7 8]]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.