For a massive performance boost, we can leverage NumPy strides here. The trick is to concatenate the original array with a slice of itself that ends at the second-to-last element, and then take sliding windows of the same length as the original array — each window is one rotation.
Hence, the implementation would be -
import numpy as np

def strided_method(ar):
    # Concatenate so that every rotation appears as a contiguous window
    a = np.concatenate((ar, ar[:-1]))
    L = len(ar)
    n = a.strides[0]  # step (in bytes) between consecutive elements
    return np.lib.stride_tricks.as_strided(a, (L, L), (n, n), writeable=False)
The output is a read-only view into the concatenated array, so no per-row data is copied and the runtime is essentially constant irrespective of the array size — a hugely efficient solution. If you need a writable output with its own memory space, make a copy, as shown in the timings later on.
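To make the read-only behaviour concrete, here's a small sketch (restating `strided_method` so it runs standalone) — writing into the view raises, while a copy is an ordinary writable array:

```python
import numpy as np

def strided_method(ar):
    # Concatenate so that every rotation appears as a contiguous window
    a = np.concatenate((ar, ar[:-1]))
    L = len(ar)
    n = a.strides[0]  # step (in bytes) between consecutive elements
    return np.lib.stride_tricks.as_strided(a, (L, L), (n, n), writeable=False)

rot = strided_method(np.array([1, 2, 3, 4]))
rot_copy = rot.copy()    # owns its memory, writable
rot_copy[0, 0] = 99      # fine

try:
    rot[0, 0] = 99       # ValueError: assignment destination is read-only
except ValueError:
    pass
```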
Sample run -
In [51]: foo = np.array([1,2,3,4])
In [52]: strided_method(foo)
Out[52]: 
array([[1, 2, 3, 4],
       [2, 3, 4, 1],
       [3, 4, 1, 2],
       [4, 1, 2, 3]])
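As a quick sanity check (not part of the original answer), the strided windows can be compared element-for-element against the loopy `np.roll` version — row `i` should equal the array rolled left by `i`:

```python
import numpy as np

foo = np.array([1, 2, 3, 4])

# Strided rotations: windows of length 4 over the concatenated array
a = np.concatenate((foo, foo[:-1]))
n = a.strides[0]
rolled = np.lib.stride_tricks.as_strided(a, (4, 4), (n, n), writeable=False)

# Loopy reference
loopy = np.array([np.roll(foo, -x) for x in range(4)])

assert np.array_equal(rolled, loopy)
```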
Runtime test -
In [53]: foo = np.random.randint(0,9,(1000))
# @cᴏʟᴅsᴘᴇᴇᴅ's loopy soln
In [54]: %timeit np.array([np.roll(foo, -x) for x in np.arange(foo.shape[0])])
100 loops, best of 3: 12.7 ms per loop
In [55]: %timeit strided_method(foo)
100000 loops, best of 3: 7.46 µs per loop
In [56]: %timeit strided_method(foo).copy()
1000 loops, best of 3: 454 µs per loop