
Given a large 2d numpy array, I would like to remove a range of rows, say rows 10000:10010 efficiently. I have to do this multiple times with different ranges, so I would like to also make it parallelizable.

Using something like numpy.delete() is not efficient, since it needs to copy the array, taking too much time and memory. Ideally I would want to do something like create a view, but I am not sure how I could do this in this case. A masked array is also not an option since the downstream operations are not supported on masked arrays.

Any ideas?

What are the downstream operations? You could try to fake the deletion by keeping track of the to-be-deleted rows... Commented Nov 1, 2013 at 7:06

3 Answers


Because of the strided data structure that defines a numpy array, what you want will not be possible without using a masked array. Your best option might be to use a masked array (or perhaps your own boolean array) to mark the deleted rows, and then do a single real delete operation of all the rows to be deleted before passing the array downstream.
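To see why the strided layout rules out a "deleted rows" view: a view is described by a base pointer plus one constant stride per axis, so a slice like "rows 10 onward" is free, but "all rows except 10:20" cannot be expressed that way and must be materialized as a copy. A minimal sketch (array shapes here are illustrative):

```python
import numpy as np

A = np.zeros((100, 4))

# Slices are views: a constant stride can describe "every row from 10 on".
tail = A[10:]
print(tail.base is A)            # True: no copy was made

# But "all rows except 10:20" has no single constant stride, so any
# array holding those rows must be materialized with a copy:
kept = np.concatenate([A[:10], A[20:]])
print(kept.base is A)            # False: new memory
```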


2 Comments

Thanks, I suspected there is no way around this (but let's see if someone comes up with a creative solution). I don't understand though why you proposed to mask and then delete - how is that better than just deleting?
That's partly a guess about how the code that figures out what to delete will have to work. As you pointed out, repeatedly deleting ranges of rows will be inefficient (in terms of both memory and time). Also, I interpreted "I have to do this multiple times with different ranges" as the part that might be parallelizable. To do that in parallel, you'll want to keep the underlying array unchanged and just flip the appropriate "deleted" bits. Then, once you've figured out all the rows to be deleted, you can do the real "delete" operation in a final non-parallel step.
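The workflow described in these comments can be sketched as follows (the array shape and the deleted ranges are made up for illustration): flipping bits in the boolean array is cheap and independent per range, and only the final compaction pays for a copy.

```python
import numpy as np

# Hypothetical data: 1,000 rows, 4 columns.
A = np.arange(4000, dtype=float).reshape(1000, 4)

# "Deleted" bits: True means keep. Flipping entries is cheap and can be
# done independently for each range, so this step parallelizes well.
keep = np.ones(A.shape[0], dtype=bool)
for start, end in [(100, 110), (500, 505)]:
    keep[start:end] = False

# One real copy at the end, instead of one per deletion.
result = A[keep]
print(result.shape)  # (985, 4)
```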

There isn't really a good way to speed up the delete operation itself; as you've already alluded to, this kind of deletion requires the data to be copied in memory. The one thing you can do, as suggested by @WarrenWeckesser, is combine multiple delete operations and apply them all at once. Here's an example:

import numpy as np

ranges = [(10, 20), (25, 30), (50, 100)]
mask = np.ones(len(array), dtype=bool)

# Mark all the rows you want to delete
for start, end in ranges:
    mask[start:end] = False

# Apply all the deletions in a single copy
new_array = array[mask]

It doesn't really make sense to parallelize this: you're just copying data in memory, so the operation is memory-bound anyway, and adding more CPUs will not help.



I don't know how fast this is relative to the above, but suppose you have a list L of the row indices you wish to keep from array A (by "rows" I mean indices along the first axis, for higher-dimensional arrays). All other rows will be deleted, and A will hold the result.

A = A[np.ix_(L)]
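For whole-row selection like this, np.ix_ with a single index list is equivalent to plain fancy indexing A[L]; either way the selected rows are copied into a new array. A small worked example with made-up data:

```python
import numpy as np

A = np.arange(12).reshape(4, 3)
L = [0, 2, 3]           # rows to keep; row 1 is dropped

B = A[np.ix_(L)]        # same result as A[L] for selecting whole rows
print(B)
# [[ 0  1  2]
#  [ 6  7  8]
#  [ 9 10 11]]
```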
