This is a version of Ashwini Chaudhary's solution:
>>> import numpy
>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> a = numpy.tile(a[:,None], 5)
>>> a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
>>> a
array([['a', 'c', 'a', 'd', 'c'],
       ['b', 'd', 'b', 'e', 'a'],
       ['c', 'e', 'd', 'a', 'e'],
       ['d', 'a', 'e', 'b', 'd'],
       ['e', 'b', 'c', 'c', 'b']],
      dtype='|S1')
I think it's well-conceived and pedagogically useful (and I hope he undeletes it). But somewhat surprisingly, it's consistently the slowest of the three approaches in the tests I've performed. Definitions:
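>>> def column_perms_along(a, cols):
...     # Tile a into `cols` identical columns, then shuffle every column but the first.
...     a = numpy.tile(a[:,None], cols)
...     a[:,1:] = numpy.apply_along_axis(numpy.random.permutation, 0, a[:,1:])
...     return a
...
>>> def column_perms_argsort(a, cols):
...     # Argsorting random floats column-wise yields one random permutation per column.
...     perms = numpy.argsort(numpy.random.rand(a.shape[0], cols - 1), axis=0)
...     return numpy.hstack((a[:,None], a[perms]))
...
>>> def column_perms_lc(a, cols):
...     # Build each permuted column separately in a list comprehension.
...     z = numpy.array([a] + [numpy.random.permutation(a) for _ in range(cols - 1)])
...     return z.T
...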
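The argsort version is the least obvious of the three, so here is a minimal standalone sketch of why it works: sorting a column of independent random floats puts the row indices in a random order, and indexing `a` with those indices gives a shuffled copy. The `rng` parameter and the use of `numpy.random.default_rng` (NumPy 1.17 or later) are my additions for reproducibility, not part of the timed function above.

```python
import numpy

def column_perms_argsort(a, cols, rng=None):
    """Return an (n, cols) array: column 0 is `a`, the rest are random permutations of `a`."""
    # `rng` is an assumption of this sketch, added so results can be reproduced.
    rng = numpy.random.default_rng() if rng is None else rng
    # Each column of random floats, once argsorted, is a random ordering of 0..n-1.
    perms = numpy.argsort(rng.random((a.shape[0], cols - 1)), axis=0)
    # Fancy indexing with an (n, cols-1) index array returns an (n, cols-1) result.
    return numpy.hstack((a[:, None], a[perms]))

a = numpy.array(['a', 'b', 'c', 'd', 'e'])
result = column_perms_argsort(a, 5, rng=numpy.random.default_rng(0))

# Sanity check: every column must contain exactly the original elements.
all_perms = all(sorted(result[:, j]) == sorted(a) for j in range(result.shape[1]))
print(result.shape, all_perms)
```

A check like this is cheap to run alongside any of the three variants before timing them.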
For small arrays and few columns:
>>> %timeit column_perms_along(a, 5)
1000 loops, best of 3: 272 µs per loop
>>> %timeit column_perms_argsort(a, 5)
10000 loops, best of 3: 23.7 µs per loop
>>> %timeit column_perms_lc(a, 5)
1000 loops, best of 3: 165 µs per loop
For small arrays and many columns:
>>> %timeit column_perms_along(a, 500)
100 loops, best of 3: 29.8 ms per loop
>>> %timeit column_perms_argsort(a, 500)
10000 loops, best of 3: 185 µs per loop
>>> %timeit column_perms_lc(a, 500)
100 loops, best of 3: 11.7 ms per loop
For big arrays and few columns:
>>> A = numpy.arange(1000)
>>> %timeit column_perms_along(A, 5)
1000 loops, best of 3: 2.97 ms per loop
>>> %timeit column_perms_argsort(A, 5)
1000 loops, best of 3: 447 µs per loop
>>> %timeit column_perms_lc(A, 5)
100 loops, best of 3: 2.27 ms per loop
And for big arrays and many columns:
>>> %timeit column_perms_along(A, 500)
1 loops, best of 3: 281 ms per loop
>>> %timeit column_perms_argsort(A, 500)
10 loops, best of 3: 71.5 ms per loop
>>> %timeit column_perms_lc(A, 500)
1 loops, best of 3: 269 ms per loop
The moral of the story: always test! I imagine that for extremely large arrays, the extra cost of sorting (n log n per column, versus linear time for a plain shuffle) might become apparent. But numpy's implementation of sorting is extremely well-tuned in my experience. I bet you could go up several orders of magnitude before noticing an effect.
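If you want to repeat the comparison outside IPython, something along these lines should do. It is only a rough sketch using the standard `timeit` module; the `best_time` helper and the chosen sizes are illustrative assumptions, and the numbers will differ from the ones above depending on machine and NumPy version.

```python
import timeit
import numpy

def column_perms_argsort(a, cols):
    perms = numpy.argsort(numpy.random.rand(a.shape[0], cols - 1), axis=0)
    return numpy.hstack((a[:, None], a[perms]))

def column_perms_lc(a, cols):
    z = numpy.array([a] + [numpy.random.permutation(a) for _ in range(cols - 1)])
    return z.T

def best_time(func, a, cols, repeat=3, number=10):
    """Hypothetical helper: best per-call time in seconds over `repeat` runs."""
    runs = timeit.repeat(lambda: func(a, cols), repeat=repeat, number=number)
    return min(runs) / number

A = numpy.arange(1000)
timings = {f.__name__: best_time(f, A, 50)
           for f in (column_perms_argsort, column_perms_lc)}
for name, seconds in timings.items():
    print("%s: %.1f us per call" % (name, seconds * 1e6))
```

Increasing the size of `A` by a few orders of magnitude is the quickest way to see whether the sorting cost ever overtakes the per-column shuffle on your setup.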