How to exclude rows/columns from numpy.ndarray data

Question

Assume we have a numpy.ndarray, say with shape (100, 200), and a list of row indices that we want to exclude from it. How would you do that? Something like this:

import numpy
a = numpy.random.rand(100,200)
indices = numpy.random.randint(100,size=20)
b = a[-indices,:] # imaginary code, what to replace here?

Thanks.

Brad Solomon · Accepted Answer · 2018-01-23 04:05:15Z

18

You can use b = numpy.delete(a, indices, axis=0)

Source: NumPy docs.
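
As a small illustration of the call above (the array and the index values are made up for this example), numpy.delete returns a new array and leaves its input untouched; axis=0 drops rows and axis=1 drops columns:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)   # illustrative 4x3 array
rows_to_drop = [0, 2]             # hypothetical indices to exclude

b = np.delete(a, rows_to_drop, axis=0)   # drop rows 0 and 2
c = np.delete(a, [1], axis=1)            # drop column 1

print(b.shape)  # (2, 3)
print(c.shape)  # (4, 2)
print(a.shape)  # (4, 3), the original is unchanged
```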

edited Jan 23, 2018 at 4:05

Brad Solomon

answered May 16, 2015 at 8:24

Bang


3 Comments

hpaulj Over a year ago

For a numeric list of indices, np.delete uses the mask solution that you earlier rejected as taking up too much memory.

Thomas Arildsen Over a year ago

@hpaulj the documentation for delete says: "out : ndarray A copy of arr with the elements specified by obj removed." Do you mean that it uses a numpy.ma masked array? It does not sound like it to me.

hpaulj Over a year ago

No, not masked array; mask as in boolean index.

Community · Accepted Answer · 2017-05-23 10:33:58Z

You could try:

import numpy as np

a = np.random.rand(100, 200)
indices = np.random.randint(100, size=20)
# Keep only the row numbers that are not in `indices`
b = a[np.setdiff1d(np.arange(100), indices), :]

This avoids creating a mask array of the same size as your data, as the answer at https://stackoverflow.com/a/21022753/865169 does. Note that this example produces a 2D array b, whereas that answer produces a flattened array.

A crude comparison of runtime and memory cost between this approach and numpy.delete (https://stackoverflow.com/a/30273446/865169) suggests that delete is faster, while indexing with setdiff1d shows a smaller memory increment:

In [75]: %timeit b = np.delete(a, indices, axis=0)
The slowest run took 7.47 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 24.7 µs per loop

In [76]: %timeit c = a[np.setdiff1d(np.arange(100),indices),:]
10000 loops, best of 3: 48.4 µs per loop

In [77]: %memit b = np.delete(a, indices, axis=0)
peak memory: 52.27 MiB, increment: 0.85 MiB

In [78]: %memit c = a[np.setdiff1d(np.arange(100),indices),:]
peak memory: 52.39 MiB, increment: 0.12 MiB
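
The magics above need an interactive IPython session (and %memit comes from the memory_profiler extension). A rough way to repeat the timing part in a plain script, assuming only the standard-library timeit module, might look like the sketch below; the helper function names are made up here, and absolute numbers will differ by machine and NumPy version:

```python
import timeit
import numpy as np

a = np.random.rand(100, 200)
indices = np.random.randint(100, size=20)

def with_delete():
    return np.delete(a, indices, axis=0)

def with_setdiff1d():
    return a[np.setdiff1d(np.arange(a.shape[0]), indices), :]

# Both approaches should keep the same rows in the same order
assert np.array_equal(with_delete(), with_setdiff1d())

for fn in (with_delete, with_setdiff1d):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds / 1000 * 1e6:.1f} µs per call")
```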

Andrey Shokhin · Accepted Answer · 2014-01-09 14:17:22Z

3

It's ugly but works:

b = np.array([a[i] for i in range(a.shape[0]) if i not in indices])
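
One caveat with the comprehension: when indices is a list or an array, every `i not in indices` test scans the whole collection. A possible variant, sketched here with made-up data, converts the indices to a set first so that each membership test is a hash lookup:

```python
import numpy as np

a = np.arange(20).reshape(5, 4)   # illustrative 5x4 array
indices = [1, 3, 3]               # hypothetical indices, duplicates allowed

excluded = set(indices)           # built once, reused for every row
b = np.array([a[i] for i in range(a.shape[0]) if i not in excluded])

print(b.shape)  # (3, 4)
```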

answered Jan 9, 2014 at 14:17

Andrey Shokhin

MB-F · Accepted Answer · 2014-01-09 14:19:56Z

1

You could try something like this:

a = numpy.random.rand(100,200)
indices = numpy.random.randint(100,size=20)
mask = numpy.ones(a.shape, dtype=bool)
mask[indices,:] = False
b = a[mask]

answered Jan 9, 2014 at 14:19

MB-F

2 Comments

adrin Over a year ago

This solution needs an array of the exact same size as my original data, which in my case is gigantic. The time and space complexity of this solution is O(n^2), which is not really practical for my data.

hpaulj Over a year ago

This is essentially the method that np.delete uses. Look where it constructs keep = ones(N, dtype=bool); keep[obj,] = False.
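
The row-mask idea mentioned in this comment can be sketched as follows. This illustrates the technique and is not NumPy's actual source; the mask has one entry per row rather than one per element, so it stays small and the result remains 2D:

```python
import numpy as np

a = np.random.rand(100, 200)
indices = np.random.randint(100, size=20)

keep = np.ones(a.shape[0], dtype=bool)  # one flag per row, not per element
keep[indices] = False                   # mark the rows to exclude
b = a[keep]                             # same as a[keep, :]

# Compare against numpy.delete on the same inputs
print(np.array_equal(b, np.delete(a, indices, axis=0)))
```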
