Efficient way to create an array that is a sequence of variable length ranges in numpy

Question

Suppose I have an array

import numpy as np
x=np.array([5,7,2])

I want to create an array that contains a sequence of ranges stacked together with the length of each range given by x:

y=np.hstack([np.arange(1,n+1) for n in x])

Is there some way to do this without the speed penalty of a list comprehension or looping. (x could be a very large array)

The result should be

y == np.array([1,2,3,4,5,1,2,3,4,5,6,7,1,2])

do you mean an array with a different number of rows/cols for each col/row? — Saullo G. P. Castro
– Saullo G. P. Castro, Commented Aug 6, 2013 at 15:05
Can you show an example of the result you are looking for? The current code for y doesnt work because elements of the numpy array will be different lengths. — Daniel
– Daniel, Commented Aug 6, 2013 at 15:48
The result should be an array. Sorry for the errors in the original code! Should be fixed now — dlm
– dlm, Commented Aug 6, 2013 at 15:52
What is the a maximum length of x and maximum of x? Also is x unique? — Daniel
– Daniel, Commented Aug 6, 2013 at 16:11
If you are planning on iterating over this, maybe turn it into a generator so you don't have to make it all at once and store it in memory. It may not be much faster, but it won't happen all at once. — gggg
– gggg, Commented Aug 6, 2013 at 16:32

seberg · Accepted Answer · 2013-08-06 19:05:37Z

5

You could use accumulation:

def my_sequences(x):
    x = x[x != 0] # you can skip this if you do not have 0s in x.

    # Create result array, filled with ones:
    y = np.cumsum(x, dtype=np.intp)
    a = np.ones(y[-1], dtype=np.intp)

    # Set all beginnings to - previous length:
    a[y[:-1]] -= x[:-1]

    # and just add it all up (btw. np.add.accumulate is equivalent):
    return np.cumsum(a, out=a) # here, in-place should be safe.

(One word of caution: If you result array would be larger then the possible size np.iinfo(np.intp).max this might with some bad luck return wrong results instead of erroring out cleanly...)

And because everyone always wants timings (compared to Ophion's) method:

In [11]: x = np.random.randint(0, 20, 1000000)

In [12]: %timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
1 loops, best of 3: 753 ms per loop

In [13]: %timeit my_sequences(x)
1 loops, best of 3: 191 ms per loop

of course the my_sequences function will not ill-perform when the values of x get large.

edited Aug 6, 2013 at 19:05

answered Aug 6, 2013 at 18:18

seberg

9,0352 gold badges34 silver badges30 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Daniel Over a year ago

Really demonstrates how slow concatenate is. When I was timing my data it was the majority of the final codes time. +1!

dlm Over a year ago

Yes, this is exactly the sort of numpy voodoo I was looking for. :)

Daniel · Accepted Answer · 2013-08-06 17:24:40Z

First idea; prevent multiple calls to np.arange and concatenate should be much faster then hstack:

import numpy as np
x=np.array([5,7,2])

>>>a=np.arange(1,x.max()+1)
>>> np.hstack([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])

>>> np.concatenate([a[:k] for k in x])
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])

If there are many nonunique values this seems more efficient:

>>>ua,uind=np.unique(x,return_inverse=True)
>>>a=[np.arange(1,k+1) for k in ua]
>>>np.concatenate(np.take(a,uind))

array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 1, 2])

Some timings for your case:

x=np.random.randint(0,20,1000000)

Original code

#Using hstack
%timeit np.hstack([np.arange(1,n+1) for n in x])
1 loops, best of 3: 7.46 s per loop

#Using concatenate
%timeit np.concatenate([np.arange(1,n+1) for n in x])
1 loops, best of 3: 5.27 s per loop

First code:

#Using hstack
%timeit a=np.arange(1,x.max()+1);np.hstack([a[:k] for k in x])
1 loops, best of 3: 3.03 s per loop

#Using concatenate
%timeit a=np.arange(1,x.max()+1);np.concatenate([a[:k] for k in x])
10 loops, best of 3: 998 ms per loop

Second code:

%timeit ua,uind=np.unique(x,return_inverse=True);a=[np.arange(1,k+1) for k in ua];np.concatenate(np.take(a,uind))
10 loops, best of 3: 522 ms per loop

Looks like we gain a 14x speedup with the final code.

Small sanity check:

ua,uind=np.unique(x,return_inverse=True)
a=[np.arange(1,k+1) for k in ua]
out=np.concatenate(np.take(a,uind))

>>>out.shape
(9498409,)

>>>np.sum(x)
9498409

Both answers look good. I think your second answer works for my use case of x with large length, but small maximum value.

Collectives™ on Stack Overflow

Efficient way to create an array that is a sequence of variable length ranges in numpy

2 Answers 2

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related