Mask numpy array based on index

Question

How do I mask an array based on the actual index values?

That is, if I have a 10 x 10 x 30 matrix and I want to mask the array when the first and second index equal each other.

For example, [1, 1 , :] should be masked because 1 and 1 equal each other but [1, 2, :] should not because they do not.

I'm only asking this with the third dimension because it resembles my current problem and may complicate things. But my main question is, how to mask arrays based on the value of the indices?

Off topic, your avatar matches your question topic nicely :P — askewchan
– askewchan, Commented Sep 17, 2013 at 22:13
Yep, just trying to draw Z's all over my code. Thanks for helping me with that. — aleph4
– aleph4, Commented Sep 17, 2013 at 23:56

askewchan · Accepted Answer · 2013-09-25 17:00:19Z

7

In general, to access the value of the indices, you can use np.meshgrid:

i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i == j)

The advantage of this method is that it works for arbitrary boolean functions on i, j, and k. It is a bit slower than the use of the identity special case.

In [56]: %%timeit
   ....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
   ....: i == j
10000 loops, best of 3: 96.8 µs per loop

As @Jaime points out, meshgrid supports a sparse option, which doesn't do so much duplication, but requires a bit more care in some cases because they don't broadcast. It will save memory and speed things up a little. For example,

In [77]: x = np.arange(5)

In [78]: np.meshgrid(x, x)
Out[78]: 
[array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]]),
 array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])]

In [79]: np.meshgrid(x, x, sparse=True)
Out[79]: 
[array([[0, 1, 2, 3, 4]]),
 array([[0],
       [1],
       [2],
       [3],
       [4]])]

So, you can use the sparse version as he says, but you must force the broadcasting as such:

i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
m.mask = np.repeat(i==j, k.size, axis=2)

And the speedup:

In [84]: %%timeit
   ....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
   ....: np.repeat(i==j, k.size, axis=2)
10000 loops, best of 3: 73.9 µs per loop

edited Sep 25, 2013 at 17:00

answered Sep 17, 2013 at 22:28

askewchan

46.7k18 gold badges125 silver badges135 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Daniel Over a year ago

Why not use np.arange instead of range, should save time.

Jaime Over a year ago

Nice use of map... But you are creating 3 full masks of (10, 10, 30)... If you set sparse=True in the call to np.meshgrid, you reduce the memory footprint substantially, but you need to include all dimensions in the creation of the mask, e.g. (i == j) & (k >= 0) runs twice as fast as your solution.

askewchan Over a year ago

Thanks @Jaime, I was trying to figure out how to do it with the sparser versions but couldn't!

Jaime Over a year ago

It's a pity that boolean mask indexing does not support broadcasting. There may be some good reason for it, but it would make some operations ridiculously easy.

Bi Rico Over a year ago

sparse=True seems like a lot of trouble for 8us, at least in this case :).

|

askewchan · Accepted Answer · 2013-09-17 22:41:09Z

0

In your special case of wanting to mask the diagonals, you can use the np.identity() function which returns ones along the diagonal. Since you have the third dimension, we have to add that third dimension to the the identity matrix:

m.mask = np.identity(10)[...,None]*np.ones((1,1,30))

There might be a better way of constructing that array, but it is basically stacking 30 of the np.identity(10) array. For example, this is equivalent:

np.dstack((np.identity(10),)*30)

but slower:

In [30]: timeit np.identity(10)[...,None]*np.ones((1,1,30))
10000 loops, best of 3: 40.7 µs per loop

In [31]: timeit np.dstack((np.identity(10),)*30)
1000 loops, best of 3: 219 µs per loop

And @Ophion's suggestions

In [33]: timeit np.tile(np.identity(10)[...,None], 30)
10000 loops, best of 3: 63.2 µs per loop

In [71]: timeit np.repeat(np.identity(10)[...,None], 30)
10000 loops, best of 3: 45.3 µs per loop

edited Sep 17, 2013 at 22:41

answered Sep 17, 2013 at 22:12

askewchan

46.7k18 gold badges125 silver badges135 bronze badges

4 Comments

Daniel Over a year ago

np.tile(np.identity(10)[...,None], 30)

Daniel Over a year ago

Well thats interesting. tile is basically a wrapper on repeat. So if you run np.repeat(np.identity(10)[...,None], 30, axis=-1) you skip some extra if statements, but np.identity(10)[...,None]*np.ones((1,1,30)) is universally faster for any array size in 1 dimension. Good to know.

askewchan Over a year ago

@Ophion Cool, I'd never considered that along dimensions of length 1, tile and repeat are equivalent.

Daniel Over a year ago

You do have to specify an axis as np.repeat is odd, instead of the default axis=-1 as with most function it is axis=None. If you do not explicitly pass the axis keyword you get a 1D array out and will have to reshape it manually.

Collectives™ on Stack Overflow

Mask numpy array based on index

2 Answers 2

6 Comments

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related