
I have a numpy array data of shape (N, 20, 20), with N being some very large number. I want to get the number of unique values in each of the 20x20 sub-arrays. With a loop, that would be:

import numpy as np

values = []
for i in data:
    values.append(len(np.unique(i)))

How could I vectorize this loop? Speed is a concern.

If I try np.unique(data) I get the unique values for the whole data array, not for the individual 20x20 blocks, so that's not what I need.
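
For reference, here is a small self-contained illustration of the difference; the array size and value range are just placeholders, not my real data:

import numpy as np

data = np.random.randint(0, 100, size=(1000, 20, 20))

whole = np.unique(data)                        # distinct values over the entire array
per_block = [len(np.unique(b)) for b in data]  # what I actually want: one count per block
print(whole.shape)      # at most (100,), regardless of N
print(len(per_block))   # 1000, one entry per 20x20 sub-array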

3 Comments
  • Have you considered writing a Fortran function for this and wrapping it with f2py? Within the Fortran subroutine you can parallelize quite easily with OpenMP. I often resort to this approach when I need to speed up a computationally intensive loop. Commented Sep 3, 2015 at 16:26
  • Another approach might be to use numba (numba.pydata.org). It has a vectorize decorator that I believe might be applicable to this case. I am not an expert on numba, so you might like to take a look at it. Commented Sep 3, 2015 at 16:28
  • Thanks, deepak. I don't know Fortran; I would rather try Cython if I have to use another language. I may explore how to use numba. Commented Sep 3, 2015 at 17:05

1 Answer


First, you can work with data.reshape(N, -1), since you are interested in the values of the last 2 dimensions, not their 20x20 layout.

An easy way to get the number of unique values for each row is to dump each row into a set and let it weed out the duplicates:

[len(set(i)) for i in data.reshape(data.shape[0],-1)]

But this is still an iteration, though probably a fast one.

A problem with 'vectorizing' is that the set or list of unique values will differ in length from row to row. 'Rows with differing lengths' is a red flag when it comes to 'vectorizing': you no longer have the 'rectangular' data layout that makes most vectorizing possible (a small sketch of the problem follows below).
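
To see this concretely (a small sketch with made-up integer data), the per-row unique arrays cannot be stacked into one rectangular array because their lengths vary:

import numpy as np

rows = np.random.randint(0, 10, size=(5, 10))
uniques = [np.unique(r) for r in rows]   # one 1-D array per row
print([u.size for u in uniques])         # lengths generally differ from row to row
# np.stack(uniques) would raise an error whenever those lengths are unequal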

You could sort each row:

np.sort(data.reshape(N,-1))

array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
       [1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
       [0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
       [2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
       [0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])

But how do you count the unique values in each row without iterating? Counting the nonzero differences between adjacent sorted values might just do the trick:

In [530]: data=np.random.randint(10,size=(5,10))

In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
Out[531]: [7, 6, 6, 8, 6]

In [532]: sdata=np.sort(data,axis=1)
In [533]: (np.diff(sdata)>0).sum(axis=1)+1            
Out[533]: array([7, 6, 6, 8, 6])

I was going to add a warning about floats, but if np.unique is working for your data, my approach should work just as well.
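
Putting the sort/diff pieces together for the full (N, 20, 20) input, a minimal sketch could look like this (the function name and the test array are just illustrations):

import numpy as np

def unique_counts_sortdiff(data):
    # Count distinct values in each 20x20 block: sort each row, then count
    # the strictly positive steps between adjacent sorted values, plus one.
    flat = data.reshape(data.shape[0], -1)      # (N, 400)
    s = np.sort(flat, axis=1)
    return (np.diff(s, axis=1) > 0).sum(axis=1) + 1

data = np.random.randint(0, 100, size=(10000, 20, 20))
counts = unique_counts_sortdiff(data)           # shape (N,)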


[(np.bincount(i)>0).sum() for i in data]

This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with the diff...sort.

In [585]: data.shape
Out[585]: (10000, 400)

In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
1 loops, best of 3: 248 ms per loop

In [587]: %%timeit                                       
sdata=np.sort(data,axis=1)
(np.diff(sdata)>0).sum(axis=1)+1
   .....: 
1 loops, best of 3: 280 ms per loop

I just found a faster way to use bincount: np.count_nonzero.

In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
10 loops, best of 3: 59.6 ms per loop

I was surprised at the speed improvement. But then I recalled that count_nonzero is used in other functions (e.g. np.nonzero) to allocate space for their return results. So it makes sense that this function would be coded for maximum speed. (It doesn't help in the diff...sort case because it does not take an axis parameter).
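
For completeness, here is the bincount/count_nonzero idea wrapped up for the original (N, 20, 20) input; the wrapper function and test array are illustrative, and note that np.bincount requires non-negative integer data:

import numpy as np

def unique_counts_bincount(data):
    # Per row: histogram the values with bincount, then count the non-empty bins.
    # Only valid for non-negative integer data.
    flat = data.reshape(data.shape[0], -1)
    return np.array([np.count_nonzero(np.bincount(row)) for row in flat])

data = np.random.randint(0, 100, size=(10000, 20, 20))
counts = unique_counts_bincount(data)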


6 Comments

Thanks. It does work! Although I was hoping to get it to run faster: on my data it takes about 7 s, and I'd like to get it done in less than 1 s. If there isn't another answer with a faster method I'll accept yours.
So 'vectorize' is really code for 'fastest'? :) np.sort takes about 3/4 of the time; the diff part only 1/4. Sorting by row takes just as long as sorting the whole flattened array.
I found a bincount version that's slightly faster - but it iterates by row.
I tried the bincount version, and when I adapt it to my script it's actually slower (~11 s) than the sort/diff (~7 s). I need to place the results in specific parts of a numpy array; I'm not sure if that takes much time. In any case, I think the sort/diff version is probably as fast as it can get with pure Python, so I accept the answer. I'm going to try Cython, as I need this to work in less than a second.
I found that count_nonzero can speed up the bincount solution quite a bit.
