Selecting columns in numpy based on a Boolean vector

Question

I have two NumPy arrays, a and b, each with dimensions m by n. I also have a Boolean vector v of length n, and I want to produce a new array c that takes each of its n columns from either a or b: if v[i] is true, I take column i from b, otherwise from a.

How do I do this in the most efficient way possible? I've looked at select, where and choose.

Could you provide some sample (dummy) data and your expected output? I know you have explained it quite clearly, but it does help others (me, actually) better understand your question in a visual sense.
– Anzel, Commented Jan 3, 2015 at 23:27
Please forget what I said; others are just so good at understanding and have already come up with solutions :)
– Anzel, Commented Jan 3, 2015 at 23:28

Joe Kington · Accepted Answer · 2015-01-03 23:34:24Z

5

First off, let's set up some example code:

import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
b = np.ones((m, n))

boolvec = np.random.randint(0, 2, m).astype(bool)

Just to show what this data might look like:

In [2]: a
Out[2]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [3]: b
Out[3]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [4]: boolvec
Out[4]: array([ True,  True, False, False, False], dtype=bool)

np.where is the most efficient tool for this. However, boolvec must have a shape that can broadcast against a and b. Since boolvec here has length m, we can turn it into a column vector by indexing with np.newaxis or None (they're the same):

In [5]: boolvec[:,None]
Out[5]: 
array([[ True],
       [ True],
       [False],
       [False],
       [False]], dtype=bool)

And then we can make the final result using np.where, passing b first so that True selects from b:

In [6]: c = np.where(boolvec[:, None], b, a)

In [7]: c
Out[7]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
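
For the n-length (per-column) case described in the question, a minimal sketch might look like the following. The mask names cols and rows and their values are made up for illustration. Keep in mind that np.where(cond, x, y) takes x where cond is True, so b goes first when True should mean "take from b".

```python
import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
b = np.ones((m, n))

# Hypothetical n-length mask: one flag per column.
cols = np.array([True, False, True])

# An n-length mask lines up with the last axis, so it broadcasts as-is.
c_cols = np.where(cols, b, a)

# Hypothetical m-length mask: one flag per row. It needs a trailing axis
# (shape (m, 1)) before it can broadcast against (m, n) arrays.
rows = np.array([True, True, False, False, False])
c_rows = np.where(rows[:, np.newaxis], b, a)

print(c_cols)  # columns 0 and 2 come from b, column 1 from a
print(c_rows)  # rows 0 and 1 come from b, the rest from a
```

If m happens to equal n, both forms broadcast without error, so it is worth being explicit about which axis the mask is meant to follow.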

edited Jan 3, 2015 at 23:34

answered Jan 3, 2015 at 23:27

Joe Kington

5 Comments

Neil G Over a year ago

Thanks, where is faster than choose. I did not need to use np.newaxis as broadcasting seemed to work?

Joe Kington Over a year ago

@NeilG - It depends on whether boolvec is m-length or n-length. np.where assumes slightly different things than "normal" indexing would. You can index with boolvec if it's m-length, but np.where expects it to broadcast, which applies along the last axis. Therefore, if boolvec is m-length, you'll need to slice with np.newaxis, and if it's n-length, you won't (or, rather, you can, but you'd do boolvec[None, :]).

Joe Kington Over a year ago

Also, I just realized I misread your original question. You were asking about an n-length vector, in which case np.where works as-is.

Neil G Over a year ago

Right, thanks. In general, I prefer writing np.newaxis for readability. :)

Joe Kington Over a year ago

Absolutely! It's quite arguably better! (I was just lazy)

Alex Riley · Accepted Answer · 2015-01-03 23:33:26Z

4

You could use np.choose for this.

For example, given these a and b arrays:

>>> import numpy as np
>>> a = np.arange(12).reshape(3,4)
>>> b = np.arange(12).reshape(3,4) + 100
>>> a_and_b = np.array([a, b])

To use np.choose, we want a 3D array with both arrays; a_and_b looks like this:

array([[[  0,   1,   2,   3],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[100, 101, 102, 103],
        [104, 105, 106, 107],
        [108, 109, 110, 111]]])

Now let the Boolean vector be given as integers, bl = np.array([0, 1, 1, 0]), so that each entry is an index into a_and_b (0 selects a, 1 selects b). Then:

>>> np.choose(bl, a_and_b)
array([[  0, 101, 102,   3],
       [  4, 105, 106,   7],
       [  8, 109, 110,  11]])
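
If the selector starts out as a true Boolean vector, one possible way to feed it to np.choose is sketched below, together with a check against np.where. The name mask is hypothetical, and the explicit cast to int is a choice made here for clarity rather than something the answer above requires.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = a + 100

# Hypothetical Boolean column mask matching the 0/1 selector above.
mask = np.array([False, True, True, False])

# Cast to int so each entry is an explicit index into the choices [a, b].
via_choose = np.choose(mask.astype(int), [a, b])

# np.where picks from b where the mask is True, otherwise from a.
via_where = np.where(mask, b, a)

assert np.array_equal(via_choose, via_where)
print(via_choose)
```

Both calls broadcast the length-4 selector across the three rows, so they produce the same (3, 4) result.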

edited Jan 3, 2015 at 23:33

answered Jan 3, 2015 at 23:27

Alex Riley

hpaulj · Accepted Answer · 2015-01-04 18:29:00Z

4

Timings for (5000, 3000) arrays, with boolvec an m-length row mask, are:

In [107]: timeit np.where(boolvec[:,None],b,a)
1 loops, best of 3: 993 ms per loop

In [108]: timeit np.choose(boolvec[:,None],[a,b])
1 loops, best of 3: 929 ms per loop

In [109]: timeit c=a[:];c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 786 ms per loop

where and choose are essentially the same; Boolean indexing is slightly faster. select uses choose, so I didn't time it.

My timings for column selection (with cols an n-length Boolean vector) are similar, except that this time the indexing is slower:

In [119]: timeit np.where(cols,b,a)
1 loops, best of 3: 878 ms per loop

In [120]: timeit np.choose(cols,[a,b])
1 loops, best of 3: 915 ms per loop

In [121]: timeit c=a[:];c[:,cols]=b[:,cols]
1 loops, best of 3: 1.25 s per loop

Correction: for the indexing I should be using a.copy(), since a[:] is only a view and the assignment writes into a itself.

In [32]: timeit c=a.copy();c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 783 ms per loop
In [33]: timeit c=a.copy();c[:,cols]=b[:,cols]
1 loops, best of 3: 1.44 s per loop

I get the same timings on Python 2.7 and 3, with NumPy 1.8.2 and 1.9.0 dev.
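
The difference between a[:] and a.copy() can be checked directly. This is a small sketch with a made-up mask cols and tiny arrays, not the benchmark setup used for the timings:

```python
import numpy as np

m, n = 4, 3
a = np.zeros((m, n))
b = np.ones((m, n))
cols = np.array([True, False, True])  # hypothetical n-length mask

# Basic slicing returns a view: writing through it would change a itself.
view = a[:]
assert np.shares_memory(view, a)

# a.copy() allocates new storage, so a stays untouched by the assignment.
c = a.copy()
c[:, cols] = b[:, cols]

assert not np.shares_memory(c, a)
assert not a.any()  # a is still all zeros
assert np.array_equal(c, np.where(cols, b, a))
```

The last assertion confirms that copy-then-assign gives the same result as np.where for this mask.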

edited Jan 4, 2015 at 18:29

answered Jan 4, 2015 at 2:36

hpaulj

1 Comment

Neil G Over a year ago

On my computer choose was twice as slow. Not sure why.
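
Since the numbers clearly vary between machines, a rough harness like the one below can be used to repeat the comparison locally. It is only a sketch: the array size is reduced from (5000, 3000) to keep the run short, the mask is random, and the function names are invented for this example.

```python
import timeit
import numpy as np

m, n = 500, 300  # smaller than the (5000, 3000) arrays timed above
rng = np.random.default_rng(0)
a = np.zeros((m, n))
b = np.ones((m, n))
cols = rng.integers(0, 2, n).astype(bool)  # hypothetical n-length mask

def use_where():
    return np.where(cols, b, a)

def use_choose():
    return np.choose(cols.astype(int), [a, b])

def use_indexing():
    c = a.copy()  # copy first so that a is not modified
    c[:, cols] = b[:, cols]
    return c

timings = {}
calls = 20
for func in (use_where, use_choose, use_indexing):
    # Best of 3 repeats, each running the function `calls` times.
    best = min(timeit.repeat(func, number=calls, repeat=3))
    timings[func.__name__] = best / calls
    print(f"{func.__name__}: {timings[func.__name__] * 1e3:.3f} ms per call")
```

No expected timings are given here, because they depend on the hardware, the NumPy version, and the fraction of True entries in the mask.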
