>>> import numpy as np
>>> a = np.arange(5)
>>> b = desired_function(a, 4)
>>> b
array([[0, 3, 4, 1],
       [1, 2, 1, 3],
       [2, 4, 2, 4],
       [3, 1, 3, 0],
       [4, 0, 0, 2]])

What I've tried so far

def repeat_and_shuffle(a, ncols):
    nrows, = a.shape
    m = np.tile(a.reshape(nrows, 1), (1, ncols))
    return m

Somehow I have to shuffle m[:,1:ncols] efficiently by column.
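One direct (if not especially fast) way to do that is to give each later column a permuted copy of itself in a plain Python loop. This is only a naive baseline sketch (the _naive suffix is mine), not the vectorized solution I'm after:

def repeat_and_shuffle_naive(a, ncols):
    nrows, = a.shape
    m = np.tile(a.reshape(nrows, 1), (1, ncols))
    # Replace every column after the first with a permuted copy of itself;
    # column 0 keeps the original order of `a`.
    for j in range(1, ncols):
        m[:, j] = np.random.permutation(m[:, j])
    return m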

4 Answers


Here is one way to create such an array:

>>> a = np.arange(5)
>>> perms = np.argsort(np.random.rand(a.shape[0], 3), axis=0) # 3 columns
>>> np.hstack((a[:,np.newaxis], a[perms]))
array([[0, 3, 1, 4],
       [1, 2, 3, 0],
       [2, 1, 4, 1],
       [3, 4, 0, 3],
       [4, 0, 2, 2]])

This creates an array of random values of the required shape and then sorts the indices in each column by their corresponding value. This array of indices is then used to index a.

(The idea of using np.argsort to create an array of columns of permuted indices came from @jme's answer here.)
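If you need reproducible output, the same trick works with a seeded Generator (NumPy 1.17+); a minimal sketch, with the function name and seed chosen just for illustration:

def repeat_and_shuffle(a, ncols, rng=None):
    # argsort of a column of i.i.d. uniform random values yields an
    # independent, uniformly distributed permutation of the row indices
    rng = np.random.default_rng(0) if rng is None else rng
    perms = np.argsort(rng.random((a.shape[0], ncols - 1)), axis=0)
    return np.hstack((a[:, np.newaxis], a[perms]))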


2 Comments

I was about to suggest something similar when you undeleted! I did some tests and yours is the fastest by far.
Thanks for supplying the timings! I wasn't sure how performant argsort would be for arrays with many rows; it's good to see that it fares well against other methods.

Build the new array using random permutations of the original.

>>> a = np.arange(5)
>>> n = 4
>>> z = np.array([a] + [np.random.permutation(a) for _ in range(n - 1)])
>>> z.T
array([[0, 0, 4, 3],
       [1, 1, 3, 0],
       [2, 3, 2, 4],
       [3, 2, 0, 2],
       [4, 4, 1, 1]])

Duplicate columns are possible because of the randomness.
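On newer NumPy (1.20+, which added Generator.permuted) the column-wise shuffle can also be written as a single call; a small sketch of that variant, using the same a and n as above:

>>> rng = np.random.default_rng()
>>> m = np.tile(a[:, None], (1, n))
>>> m[:, 1:] = rng.permuted(m[:, 1:], axis=0)  # shuffles each column independently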

3 Comments

Don't want to shuffle the first column.
Yours is actually faster than Ashwini's for all inputs I tried. See my tests.
@senderle ... I was using np.arange(100) with ten rows and np.arange(1000) with 100 rows.

This is a version of Ashwini Chaudhary's solution:

>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> a = numpy.tile(a[:,None], 5)
>>> a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
>>> a
array([['a', 'c', 'a', 'd', 'c'],
       ['b', 'd', 'b', 'e', 'a'],
       ['c', 'e', 'd', 'a', 'e'],
       ['d', 'a', 'e', 'b', 'd'],
       ['e', 'b', 'c', 'c', 'b']],
      dtype='|S1')

I think it's well-conceived and pedagogically useful (and I hope he undeletes it). But somewhat surprisingly, it's consistently the slowest one in the tests I've performed. Definitions:

>>> def column_perms_along(a, cols):
...     a = numpy.tile(a[:,None], cols)
...     a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
...     return a
... 
>>> def column_perms_argsort(a, cols):
...     perms = np.argsort(np.random.rand(a.shape[0], cols - 1), axis=0)
...     return np.hstack((a[:,None], a[perms]))
... 
>>> def column_perms_lc(a, cols):
...     z = np.array([a] + [np.random.permutation(a) for _ in range(cols - 1)])
...     return z.T
... 

For small arrays and few columns (re-defining a as a 1-D array first, since the string example above rebound it to a 2-D array):

>>> a = np.arange(5)
>>> %timeit column_perms_along(a, 5)
1000 loops, best of 3: 272 µs per loop
>>> %timeit column_perms_argsort(a, 5)
10000 loops, best of 3: 23.7 µs per loop
>>> %timeit column_perms_lc(a, 5)
1000 loops, best of 3: 165 µs per loop

For small arrays and many columns:

>>> %timeit column_perms_along(a, 500)
100 loops, best of 3: 29.8 ms per loop
>>> %timeit column_perms_argsort(a, 500)
10000 loops, best of 3: 185 µs per loop
>>> %timeit column_perms_lc(a, 500)
100 loops, best of 3: 11.7 ms per loop

For big arrays and few columns:

>>> A = numpy.arange(1000)
>>> %timeit column_perms_along(A, 5)
1000 loops, best of 3: 2.97 ms per loop
>>> %timeit column_perms_argsort(A, 5)
1000 loops, best of 3: 447 µs per loop
>>> %timeit column_perms_lc(A, 5)
100 loops, best of 3: 2.27 ms per loop

And for big arrays and many columns:

>>> %timeit column_perms_along(A, 500)
1 loops, best of 3: 281 ms per loop
>>> %timeit column_perms_argsort(A, 500)
10 loops, best of 3: 71.5 ms per loop
>>> %timeit column_perms_lc(A, 500)
1 loops, best of 3: 269 ms per loop

The moral of the story: always test! I imagine that for extremely large arrays, the disadvantage of an O(n log n) solution like sorting might become apparent here. But numpy's implementation of sorting is extremely well-tuned in my experience; I bet you could go up several orders of magnitude before noticing an effect.
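If you want to reproduce roughly the same comparison outside IPython, the stdlib timeit module works too; a minimal harness sketch (the helper name is mine, and the sizes mirror the runs above):

import timeit

def best_usec(func, arr, cols, number=100, repeat=5):
    # best-of-`repeat` per-call time in microseconds, similar to %timeit's "best of"
    t = min(timeit.repeat(lambda: func(arr, cols), number=number, repeat=repeat))
    return t / number * 1e6

for size, cols in [(5, 5), (5, 500), (1000, 5), (1000, 500)]:
    arr = np.arange(size)
    for f in (column_perms_along, column_perms_argsort, column_perms_lc):
        print(f.__name__, size, cols, '%.1f us' % best_usec(f, arr, cols))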

1 Comment

@AshwiniChaudhary, I didn't mean to induce you to delete -- only to correct the logic! I think all you'll have to do is change it to apply_along_axis...

Assuming you are ultimately intending to loop over multiple 1D input arrays, you might be able to cache your permutation indices and then just take rather than permute at the point of use. This can work even if the length of the 1D arrays varies: you just need to discard the permutation indices that are too large.

Rough (partially tested) code for implementation:

import numpy as np

def permute_multi(X, k, _cache={}):
    """For 1D input `X` of length `n`, generate a `(k, n)` array
    giving `k` permutations of `X`."""
    n = len(X)
    cached_inds = _cache.get('inds', np.array([[]]))

    # make sure that cached_inds has shape >= (k, n)
    if cached_inds.shape[1] < n:
        # (re)build the cache: k rows, each a permutation of range(n)
        _cache['inds'] = cached_inds = np.empty(shape=(k, n), dtype=int)
        for i in range(k):
            cached_inds[i, :] = np.random.permutation(n)
    elif cached_inds.shape[0] < k:
        pass  # TODO: need to generate more rows

    inds = cached_inds[:k, :]  # dispose of excess rows

    if n < cached_inds.shape[1]:
        # dispose of indices that are too large for this (shorter) input
        inds = inds.compress(inds.ravel() < n).reshape((k, n))

    return X[inds]

Depending on your usage you might want to provide some way of clearing the cache, or at least some heuristic that can spot when the cached n and k have grown much larger than most of the common inputs. Note that the above function gives (k, n), not (n, k): numpy defaults to C order, where rows are contiguous, and we want the n-dimension to be contiguous. You could force Fortran order if you wish, or simply transpose the output (which flips a flag inside the array rather than actually moving data).
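As a usage sketch (assuming the function above, and that you still want the original array as the untouched first column), the final (n, k) array could be assembled like this:

>>> X = np.arange(1000)
>>> out = np.vstack((X[None, :], permute_multi(X, 3))).T   # transpose to get (n, k)
>>> out.shape
(1000, 4)
>>> (out[:, 0] == X).all()
True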

In terms of whether this caching concept is statistically valid, I believe that in most cases it is probably fine, since it is roughly equivalent to resetting the seed to a fixed constant at the start of the function. But if you are doing anything particularly fancy with the returned array, you might need to think carefully before using this approach.

A quick benchmark says that (once warmed up) for n=1000 and k=1000 this takes about 2.2 ms, compared to 150 ms for the full k-loop over np.random.permutation, i.e. roughly 70 times faster. But that is the simplest case, where we don't call compress. For n=999 and k=1000, having warmed up with n=1000, it takes an extra few milliseconds, about 8 ms in total, which is still roughly 19 times faster than the k-loop.

