
Given a large 2d numpy array, I would like to remove a range of rows, say rows 10000:10010 efficiently. I have to do this multiple times with different ranges, so I would like to also make it parallelizable.

Using something like numpy.delete() is not efficient, since it needs to copy the array, taking too much time and memory. Ideally I would want to do something like create a view, but I am not sure how I could do this in this case. A masked array is also not an option since the downstream operations are not supported on masked arrays.

Any ideas?

What are the downstream operations? You could try to fake the deletion by keeping track of the to-be-deleted rows... Commented Nov 1, 2013 at 7:06

3 Answers


Because of the strided data structure that defines a numpy array, what you want will not be possible without using a masked array. Your best option might be to use a masked array (or perhaps your own boolean array) to mark the deleted rows, and then do a single real delete operation of all the rows to be deleted before passing the array downstream.
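To see why the strided layout rules out a "deleted rows" view: a view is described by a base pointer plus one constant stride per axis, so a slice like "rows 10 onward" is free, but "all rows except 10:20" cannot be expressed that way and must be materialized as a copy. A minimal sketch (array shapes here are illustrative):

```python
import numpy as np

A = np.zeros((100, 4))

# Slices are views: a constant stride can describe "every row from 10 on".
tail = A[10:]
print(tail.base is A)            # True: no copy was made

# But "all rows except 10:20" has no single constant stride, so any
# array holding those rows must be materialized with a copy:
kept = np.concatenate([A[:10], A[20:]])
print(kept.base is A)            # False: new memory
```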


2 Comments

Thanks, I suspected there is no way around this (but let's see if someone comes up with a creative solution). I don't understand though why you proposed to mask and then delete - how is that better than just deleting?
That's partly a guess about how the code that figures out what to delete will have to work. As you pointed out, repeatedly deleting ranges of rows will be inefficient (in terms of both memory and time). Also, I interpreted "I have to do this multiple times with different ranges" as the part that might be parallelizable. To do that in parallel, you'll want to keep the underlying array unchanged and just flip the appropriate "deleted" bits. Then, once you've figured out all the rows to be deleted, you can do the real "delete" operation in a final non-parallel step.
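The workflow described in these comments can be sketched as follows (the array shape and the deleted ranges are made up for illustration): flipping bits in the boolean array is cheap and independent per range, and only the final compaction pays for a copy.

```python
import numpy as np

# Hypothetical data: 1,000 rows, 4 columns.
A = np.arange(4000, dtype=float).reshape(1000, 4)

# "Deleted" bits: True means keep. Flipping entries is cheap and can be
# done independently for each range, so this step parallelizes well.
keep = np.ones(A.shape[0], dtype=bool)
for start, end in [(100, 110), (500, 505)]:
    keep[start:end] = False

# One real copy at the end, instead of one per deletion.
result = A[keep]
print(result.shape)  # (985, 4)
```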

There isn't really a good way to speed up the delete operation itself; as you've already alluded to, this kind of deletion requires the data to be copied in memory. The one thing you can do, as suggested by @WarrenWeckesser, is combine multiple delete operations and apply them all at once. Here's an example:

import numpy as np

ranges = [(10, 20), (25, 30), (50, 100)]
mask = np.ones(len(array), dtype=bool)

# Mark all the rows you want to delete
for start, end in ranges:
    mask[start:end] = False

# Apply all the deletions in a single copy
new_array = array[mask]

It doesn't really make sense to parallelize this: you're just copying data in memory, so the operation is memory-bound anyway, and adding more CPUs will not help.



I don't know how fast this is relative to the above, but suppose you have a list L of the row indices you wish to keep from array A (by "rows" I mean indices along the first axis, for higher-dimensional arrays). All other rows will be deleted, and A will hold the result.

A = A[np.ix_(L)]
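For whole-row selection like this, np.ix_ with a single index list is equivalent to plain fancy indexing A[L]; either way the selected rows are copied into a new array. A small worked example with made-up data:

```python
import numpy as np

A = np.arange(12).reshape(4, 3)
L = [0, 2, 3]           # rows to keep; row 1 is dropped

B = A[np.ix_(L)]        # same result as A[L] for selecting whole rows
print(B)
# [[ 0  1  2]
#  [ 6  7  8]
#  [ 9 10 11]]
```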
