I know that in order to add an element to a set it must be hashable, and numpy arrays seemingly are not. This is causing me some problems because I have the following bit of code:

fill_set = set()
for i in list_of_np_1D:
    vecs = i + np_2D
    for j in range(N):
        tup = tuple(vecs[j,:])
        fill_set.add(tup)

# list_of_np_1D is a list of 1D numpy arrays
# np_2D is a 2D numpy array
# np_2D could also be converted to a list of 1D arrays if it helped.

I need to get this running faster and nearly 50% of the run-time is spent converting slices of the 2D numpy array to tuples so they can be added to the set.

So I've been trying to find out the following:

  • Is there any way to make numpy arrays, or something that functions like numpy arrays (has vector addition) hashable so they can be added to sets?
  • If not, is there a way I can speed up the process of making the tuple conversion?

Thanks for any help!

  • Not only are NumPy arrays not hashable, they're not even really equatable. a == b doesn't produce a boolean representing whether a equals b if either a or b is an array, and set has no idea what to do with an array of elementwise comparison results or how to call np.array_equal. Commented Feb 16, 2016 at 20:07
  • Do you really need to convert your arrays to Python sets? NumPy natively supports various set operations on arrays (see numpy.lib.arraysetops). Commented Feb 16, 2016 at 20:16
  • @ali_m I wasn't aware of that, thanks. I'll go check it out now. Ultimately I have two large collections of 1D arrays of integers; I need to be able to add more arrays to those collections and do something equivalent to the .difference_update operation that sets have. Commented Feb 16, 2016 at 20:27
  • You can use tuple(vecs[j,:].tolist()) to reduce the conversion time. You can even convert the array to a bytes object with vecs[j, :].tobytes() if you only want to store the array in a set. Commented Feb 17, 2016 at 3:11
  • @HYRY thanks man I'll go try those out now. Commented Feb 17, 2016 at 10:32
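
A minimal sketch of the tobytes() idea from the comment above, including the set-style .difference_update the asker mentions. The helper names rows_to_keys and keys_to_rows are hypothetical, and the sketch assumes every row has the same dtype and length, so that equal rows always produce equal byte strings:

```python
import numpy as np

def rows_to_keys(arr_2d):
    """Return a set with one bytes key per row of a 2D array (hypothetical helper)."""
    arr_2d = np.ascontiguousarray(arr_2d)
    return {row.tobytes() for row in arr_2d}

def keys_to_rows(keys, dtype, width):
    """Rebuild a 2D array from bytes keys (hypothetical helper); row order is arbitrary."""
    if not keys:
        return np.empty((0, width), dtype=dtype)
    return np.frombuffer(b"".join(keys), dtype=dtype).reshape(-1, width)

a = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 3]], dtype=np.int64)
b = np.array([[4, 5, 6]], dtype=np.int64)

keys = rows_to_keys(a)                   # duplicate rows collapse into one key
keys.difference_update(rows_to_keys(b))  # drop every row that also appears in b
remaining = keys_to_rows(keys, a.dtype, a.shape[1])
```

The bytes keys are hashable, so ordinary set operations work on them; the array returned by np.frombuffer is read-only, so copy it if you need to modify it.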

1 Answer

Create some data first:

import numpy as np
np.random.seed(1)
list_of_np_1D = np.random.randint(0, 5, size=(500, 6))
np_2D = np.random.randint(0, 5, size=(20, 6))

Run your code:

%%time
fill_set = set()
for i in list_of_np_1D:
    vecs = i + np_2D
    for v in vecs:
        tup = tuple(v)
        fill_set.add(tup)
res1 = np.array(list(fill_set))

output:

CPU times: user 161 ms, sys: 2 ms, total: 163 ms
Wall time: 167 ms

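Here is a faster version. It uses broadcasting to compute all the sums at once, then uses the .view() method to reinterpret each row as a single fixed-length byte string. After calling set() to remove duplicates, it converts the byte strings back to an array: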

%%time
r = list_of_np_1D[:, None, :] + np_2D[None, :, :]
stype = "S%d" % (r.itemsize * np_2D.shape[1])
fill_set2 = set(r.ravel().view(stype).tolist())
res2 = np.zeros(len(fill_set2), dtype=stype)
res2[:] = list(fill_set2)
res2 = res2.view(r.dtype).reshape(-1, np_2D.shape[1])

output:

CPU times: user 13 ms, sys: 1 ms, total: 14 ms
Wall time: 14.6 ms
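
To make the broadcasting step concrete, here is a small sketch showing that indexing with None produces the same values as the explicit loop. The arrays a and b are made-up stand-ins for list_of_np_1D and np_2D, not data from the question:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # stand-in for list_of_np_1D, shape (2, 3)
b = np.arange(12).reshape(4, 3)   # stand-in for np_2D, shape (4, 3)

# (2, 1, 3) + (1, 4, 3) broadcasts to (2, 4, 3):
# r[i, j] holds a[i] + b[j] for every pair (i, j).
r = a[:, None, :] + b[None, :, :]

# The same values computed one outer row at a time, as in the original loop.
looped = np.array([row + b for row in a])
```

Each r[i] corresponds to one vecs array from the loop version, so reshaping r to (-1, 3) gives every candidate row at once.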

To check the result:

np.all(res1[np.lexsort(res1.T), :] == res2[np.lexsort(res2.T), :])

You can also use lexsort() to remove duplicate rows. Sorting puts identical rows next to each other, so any row equal to its predecessor can be deleted:

%%time
r = list_of_np_1D[:, None, :] + np_2D[None, :, :]
r = r.reshape(-1, r.shape[-1])

r = r[np.lexsort(r.T)]
idx = np.where(np.all(np.diff(r, axis=0) == 0, axis=1))[0] + 1
res3 = np.delete(r, idx, axis=0)

output:

CPU times: user 13 ms, sys: 3 ms, total: 16 ms
Wall time: 16.1 ms

To check the result:

np.all(res1[np.lexsort(res1.T), :] == res3)
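
As a possible alternative that is not part of the answer above: NumPy 1.13 and later accept an axis argument in np.unique, which removes duplicate rows directly. A minimal sketch, assuming such a NumPy version and using the same kind of test data (the names res4 and expected are made up for this example):

```python
import numpy as np

rng = np.random.RandomState(1)
list_of_np_1D = rng.randint(0, 5, size=(500, 6))
np_2D = rng.randint(0, 5, size=(20, 6))

# Broadcast to get every sum, then flatten to one candidate row per line.
r = list_of_np_1D[:, None, :] + np_2D[None, :, :]
rows = r.reshape(-1, r.shape[-1])

# axis=0 treats each row as one element and returns the unique rows, sorted.
res4 = np.unique(rows, axis=0)

# Reference result built with plain tuples and a Python set.
expected = {tuple(v) for v in rows.tolist()}
```

Timings are not shown here because they depend on the NumPy version and the data; measure on your own arrays before choosing between the approaches.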