I have a 2D array, say a = [ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], ... [21, 22, 23, 24] ], and I would like to pick N elements from each row at random, based on a probability distribution p which can be different for each row.

So basically, I'd like to do something like [ np.random.choice(a[i], N, p=p_arr[i]) for i in range(a.shape[0]) ] without using a loop, where p_arr is a 2D array with the same shape as a that stores the probability distribution for each row of a.

The reason I want to avoid a for loop is that a line profiler shows it slows down my code considerably (I have large arrays to work with).

Is there a more Pythonic way of doing this?

I checked out these links (here and here) but they don't answer my question.

Thank you!

An example of what I'd like to do without the loop:

>>> a = np.ones([500, 500])
>>> p_arr = np.identity(a.shape[0])
>>> for i in range(a.shape[0]):
...     a[i] = a[i] * np.arange(a.shape[0])
...
>>> [print(np.random.choice(a[i], p=p_arr[i])) for i in range(a.shape[0])]
8
  • So p_arr is 2D? Can you share a minimal example? Commented Apr 26, 2020 at 16:52
  • Is there a reason for using a[i] and i in range(a.shape[0]), instead of just doing (np.random.choice(x) for x in a)? Commented Apr 26, 2020 at 17:03
  • @yatu yes, p_arr is a 2d array. I've added an example. thanks! Commented Apr 26, 2020 at 17:12
  • @match, you're right, I can do it the way you suggested, but I still won't be avoiding the for loop, which is my main concern. Commented Apr 26, 2020 at 17:14
  • It's not really a loop - you're just looking at each item in a once. I'm not sure you can get any more efficient than that without looking at running in parallel? Commented Apr 26, 2020 at 17:58

2 Answers

Perhaps using a list comprehension instead of a loop would address the issue:

import numpy as np

shape = (10,10)
N     = 4
distributions = np.random.rand(*shape)
distributions = distributions/(np.sum(distributions,axis=1)[:,None])
values        = np.arange(shape[0]*shape[1]).reshape(shape)

sample        = np.array([np.random.choice(v,N,p=r) for v,r in zip(values,distributions)])

output:

print(np.round(distributions,2))
[[0.03 0.22 0.1  0.09 0.2  0.1  0.11 0.05 0.08 0.01]
 [0.04 0.12 0.13 0.03 0.16 0.22 0.16 0.05 0.   0.09]
 [0.15 0.04 0.08 0.07 0.17 0.13 0.01 0.15 0.1  0.1 ]
 [0.06 0.13 0.16 0.03 0.17 0.09 0.08 0.11 0.05 0.12]
 [0.07 0.08 0.09 0.08 0.13 0.18 0.12 0.13 0.07 0.07]
 [0.1  0.04 0.11 0.06 0.04 0.16 0.18 0.15 0.01 0.15]
 [0.06 0.09 0.17 0.08 0.14 0.15 0.09 0.01 0.06 0.15]
 [0.03 0.1  0.11 0.07 0.14 0.14 0.15 0.1  0.04 0.11]
 [0.05 0.1  0.18 0.1  0.03 0.18 0.12 0.05 0.05 0.13]
 [0.13 0.1  0.08 0.11 0.06 0.14 0.11 0.   0.14 0.14]]

print(sample)
[[ 6  4  8  5]
 [16 19 15 10]
 [25 20 24 23]
 [37 34 30 31]
 [41 44 46 45]
 [59 55 53 57]
 [64 63 65 61]
 [79 75 76 77]
 [85 81 83 88]
 [99 96 93 90]]

If you want non-repeating samples in each row, there is another kind of optimization you could try. By flattening the values and the distributions, you can draw a non-repeating shuffle of the indices of the whole matrix according to the respective distribution of each row. With the flattened distributions, each group of values that belongs to the same row has (as a group) an equivalent distribution. This means that, if you reassemble the shuffled indices onto their original rows while keeping their shuffled order stable, you can then take a slice of the shuffle matrix to obtain your sample:

flatDist    = distributions.reshape((distributions.size,))
flatDist    = flatDist/np.sum(flatDist)
randomIdx   = np.random.choice(np.arange(values.size),flatDist.size,replace=False,p=flatDist)
shuffleIdx  = np.array([randomIdx//shape[1],randomIdx%shape[1]])
shuffleIdx  = shuffleIdx[:,np.argsort(shuffleIdx[0,:],kind="stable")]
sample      = values[tuple(shuffleIdx)].reshape(shape)[:,:N]

output:

print(sample)
[[ 3  7  2  5]
 [13 12 14 16]
 [27 23 25 29]
 [37 31 33 36]
 [47 45 48 49]
 [59 50 52 54]
 [62 61 60 66]
 [72 78 70 77]
 [87 82 83 86]
 [92 98 95 93]]
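Another fully vectorized route to per-row sampling without replacement is the exponential-keys trick (weighted reservoir sampling): draw an Exp(1) key for every entry, divide it by that entry's probability, and keep the N smallest keys in each row. This is a sketch under the assumption that every row has at least N entries with nonzero probability; the names keys and idx are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

a = np.arange(12, dtype=float).reshape(3, 4)
p_arr = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.4, 0.3, 0.0, 0.3]])
N = 3

# Exp(1)/p_i keys: a smaller key means the entry is drawn earlier.
# Zero-probability entries get an infinite key and are never selected.
with np.errstate(divide="ignore"):
    keys = rng.exponential(size=p_arr.shape) / p_arr
idx = np.argsort(keys, axis=1)[:, :N]        # N smallest keys per row
sample = np.take_along_axis(a, idx, axis=1)  # shape (rows, N)
print(sample)
```

Unlike the flatten-and-slice approach above, each row is sampled independently here, at the cost of one argsort over the full row width.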

This can be used instead of the loop.

a = np.ones([500, 500])
p_arr = np.identity(a.shape[0])
a2 = a.flatten()
a3 = a2 * np.full(shape=a.shape, fill_value=np.arange(a.shape[0])).flatten()
p_arr3 = p_arr.flatten() / a.shape[1]
print(np.random.choice(a3, a.shape[1], p=p_arr3))

I had to use flatten() a few times to convert the 2D arrays into 1D arrays. We can then avoid the loop by performing the operations on the 1D arrays.

1 Comment

flatten() is still performing a loop internally to return the flattened data - this is just moving the problem, and adding an extra step of creating a flat object which may be less performant/memory-efficient.
