4

I have two NumPy arrays a and b, each with dimensions m by n. I also have a Boolean vector mask of length n, and I want to produce a new array c that picks each of the n columns from either a or b: if mask[i] is true, I take column i from b; otherwise I take it from a.

How do I do this in the most efficient way possible? I've looked at select, where and choose.

  • Could you provide some sample (dummy) data and your expected output? You have explained the problem quite clearly, but an example helps others (me, at least) picture the question. Commented Jan 3, 2015 at 23:27
  • Please forget what I said; others understood the question well enough and have already come up with solutions :) Commented Jan 3, 2015 at 23:28

3 Answers

5

First off, let's set up some example code:

import numpy as np

m, n = 5, 3
a = np.zeros((m, n))
b = np.ones((m, n))

boolvec = np.random.randint(0, 2, m).astype(bool)

Just to show what this data might look like:

In [2]: a
Out[2]: 
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [3]: b
Out[3]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [4]: boolvec
Out[4]: array([ True,  True, False, False, False], dtype=bool)

np.where is the most efficient tool here. However, boolvec must have a shape that broadcasts against a and b. Because this boolvec has length m (one entry per row), we turn it into a column vector by indexing with np.newaxis or None (they are the same object):

In [5]: boolvec[:,None]
Out[5]: 
array([[ True],
       [ True],
       [False],
       [False],
       [False]], dtype=bool)

And then we can build the final result with np.where, passing b first so that True entries take their values from b, as the question asks:

In [6]: c = np.where(boolvec[:, None], b, a)

In [7]: c
Out[7]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
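
The question itself asks about an n-length vector that selects columns. Here is a minimal sketch of that case, assuming a hypothetical mask named colmask (the name and the sample values are illustrative, not from the question). An (n,) mask already broadcasts against (m, n) arrays along the last axis, so no np.newaxis is needed:

```python
import numpy as np

m, n = 3, 4
a = np.zeros((m, n))
b = np.ones((m, n))

# Hypothetical n-length mask: True means "take this column from b".
colmask = np.array([False, True, True, False])

# Shape (n,) broadcasts against (m, n) along the last axis,
# so every row uses the same per-column choice.
c = np.where(colmask, b, a)
print(c)
```

Writing colmask[np.newaxis, :] instead gives the same result and makes the intended axis explicit.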

5 Comments

Thanks, where is faster than choose. I did not need to use np.newaxis as broadcasting seemed to work?
@NeilG - It depends on whether boolvec is m-length or n-length. np.where assumes slightly different things than "normal" indexing would. You can index with boolvec if it's m-length, but np.where expects it to broadcast, which applies along the last axis. Therefore, if boolvec is m-length, you'll need to slice with np.newaxis, and if it's n-length, you won't (or, rather, you can, but you'd do boolvec[None, :]).
Also, I just realized I misread your original question. You were asking about an n-length vector, in which case np.where works as-is.
Right, thanks. In general, I prefer writing np.newaxis for readability. :)
Absolutely! It's quite arguably better! (I was just lazy)
4

You could use np.choose for this.

Take these example arrays a and b:

>>> a = np.arange(12).reshape(3,4)
>>> b = np.arange(12).reshape(3,4) + 100
>>> a_and_b = np.array([a, b])

To use np.choose, we want a 3D array with both arrays; a_and_b looks like this:

array([[[  0,   1,   2,   3],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[100, 101, 102, 103],
        [104, 105, 106, 107],
        [108, 109, 110, 111]]])

Now let the selector be bl = np.array([0, 1, 1, 0]), the Boolean vector written as integers (0 picks from a, 1 picks from b). Then:

>>> np.choose(bl, a_and_b)
array([[  0, 101, 102,   3],
       [  4, 105, 106,   7],
       [  8, 109, 110,  11]])
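
If the selector starts out as a real Boolean array, one way to use it with np.choose is to convert it to integers first. The sketch below assumes a hypothetical mask variable and passes the two arrays as a plain list rather than a stacked 3D array; it also checks that the result matches np.where:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = a + 100

# Hypothetical Boolean selector; True means "take this column from b".
mask = np.array([False, True, True, False])

# Cast to int so False -> index 0 (a) and True -> index 1 (b).
c = np.choose(mask.astype(int), [a, b])

# np.where gives the same result when b is passed as the "True" branch.
same = np.array_equal(c, np.where(mask, b, a))
print(c)
print(same)
```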


4

Timings for (5000,3000) arrays, selecting rows with an m-length boolvec:

In [107]: timeit np.where(boolvec[:,None],b,a)
1 loops, best of 3: 993 ms per loop

In [108]: timeit np.choose(boolvec[:,None],[a,b])
1 loops, best of 3: 929 ms per loop

In [109]: timeit c=a[:];c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 786 ms per loop

where and choose are essentially the same; boolean indexing is slightly faster. select uses choose internally, so I didn't time it.

My timings for column selection (cols being an n-length Boolean vector) are similar, except that this time the indexing is slower:

In [119]: timeit np.where(cols,b,a)
1 loops, best of 3: 878 ms per loop

In [120]: timeit np.choose(cols,[a,b])
1 loops, best of 3: 915 ms per loop

In [121]: timeit c=a[:];c[:,cols]=b[:,cols]
1 loops, best of 3: 1.25 s per loop

Correction: for the indexing I should be using a.copy(), because a[:] is only a view and the assignment would overwrite a itself.

In [32]: timeit c=a.copy();c[boolvec,:]=b[boolvec,:]
1 loops, best of 3: 783 ms per loop
In [33]: timeit c=a.copy();c[:,cols]=b[:,cols]
1 loops, best of 3: 1.44 s per loop

I get the same timings for Python 2.7 and 3, and for NumPy 1.8.2 and 1.9.0 dev.
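
To reproduce a comparison like this outside IPython, something along these lines may work. It is a rough sketch, not the setup used for the numbers above: the array sizes are smaller, the random data and the function names are my own, and the absolute times will differ by machine and NumPy version.

```python
import timeit
import numpy as np

rs = np.random.RandomState(0)   # fixed seed so runs are repeatable
m, n = 500, 300                 # smaller than the (5000, 3000) arrays above
a = rs.rand(m, n)
b = rs.rand(m, n)
cols = rs.rand(n) < 0.5         # hypothetical n-length Boolean column mask

def with_where():
    return np.where(cols, b, a)

def with_choose():
    return np.choose(cols.astype(int), [a, b])

def with_indexing():
    c = a.copy()                # a.copy(), not a[:], so a is left untouched
    c[:, cols] = b[:, cols]
    return c

for f in (with_where, with_choose, with_indexing):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(f, number=10, repeat=3)) / 10
    print("{}: {:.3f} ms per call".format(f.__name__, best * 1e3))
```

All three functions return the same array, so the choice between them comes down to speed and readability.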

1 Comment

On my computer choose was twice as slow. Not sure why.
