I have a 2D array, say a = [ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], ... [21, 22, 23, 24] ], and I would like to pick N elements from each row at random, based on a probability distribution p which can be different for each row.

So basically, I'd like to do something like [ np.random.choice(a[i], N, p=p_arr[i]) for i in range(a.shape[0]) ] without using a loop, where p_arr is a 2D array with the same shape as a that stores the probability distribution for each row of a.

The reason I want to avoid a for loop is that a line profiler shows it slows down my code considerably (I have large arrays to work with).

Is there a more Pythonic way of doing this?

I checked out these links (here and here) but they don't answer my question.

Thank you!

An example of what I'd like to do without the loop:

>>> a = np.ones([500, 500])
>>> p_arr = np.identity(a.shape[0])
>>> for i in range(a.shape[0]):
...     a[i] = a[i] * np.arange(a.shape[0])
...
>>> [print(np.random.choice(a[i], p=p_arr[i])) for i in range(a.shape[0])]
8
  • So p_arr is 2D? Can you share a minimal example? Commented Apr 26, 2020 at 16:52
  • Is there a reason for using a[i] and i in range(a.shape[0]), instead of just doing (np.random.choice(x) for x in a)? Commented Apr 26, 2020 at 17:03
  • @yatu yes, p_arr is a 2d array. I've added an example. thanks! Commented Apr 26, 2020 at 17:12
  • @match, you're right, I can do it the way you suggested, but I still won't be avoiding the for loop, which is my main concern. Commented Apr 26, 2020 at 17:14
  • It's not really a loop - you're just looking at each item in a once. I'm not sure you can get any more efficient than that without looking at running in parallel? Commented Apr 26, 2020 at 17:58

2 Answers

Perhaps using a list comprehension instead of a loop would address the issue:

import numpy as np

shape = (10,10)
N     = 4
distributions = np.random.rand(*shape)
distributions = distributions/(np.sum(distributions,axis=1)[:,None])
values        = np.arange(shape[0]*shape[1]).reshape(shape)

sample        = np.array([np.random.choice(v,N,p=r) for v,r in zip(values,distributions)])

output:

print(np.round(distributions,2))
[[0.03 0.22 0.1  0.09 0.2  0.1  0.11 0.05 0.08 0.01]
 [0.04 0.12 0.13 0.03 0.16 0.22 0.16 0.05 0.   0.09]
 [0.15 0.04 0.08 0.07 0.17 0.13 0.01 0.15 0.1  0.1 ]
 [0.06 0.13 0.16 0.03 0.17 0.09 0.08 0.11 0.05 0.12]
 [0.07 0.08 0.09 0.08 0.13 0.18 0.12 0.13 0.07 0.07]
 [0.1  0.04 0.11 0.06 0.04 0.16 0.18 0.15 0.01 0.15]
 [0.06 0.09 0.17 0.08 0.14 0.15 0.09 0.01 0.06 0.15]
 [0.03 0.1  0.11 0.07 0.14 0.14 0.15 0.1  0.04 0.11]
 [0.05 0.1  0.18 0.1  0.03 0.18 0.12 0.05 0.05 0.13]
 [0.13 0.1  0.08 0.11 0.06 0.14 0.11 0.   0.14 0.14]]

print(sample)
[[ 6  4  8  5]
 [16 19 15 10]
 [25 20 24 23]
 [37 34 30 31]
 [41 44 46 45]
 [59 55 53 57]
 [64 63 65 61]
 [79 75 76 77]
 [85 81 83 88]
 [99 96 93 90]]

If you want non-repeating samples in each row, there is another kind of optimization you could try. By flattening the values and the distributions, you can draw a non-repeating shuffle of the indices of the whole matrix according to the respective distribution of each row. With the flattened distributions, each group of values that belongs to the same row has (as a group) an equivalent distribution. This means that, if you reassemble the shuffled indices onto their original rows while keeping their shuffled order stable, you can then take a slice of the shuffle matrix to obtain your sample:

flatDist    = distributions.reshape((distributions.size,))
flatDist    = flatDist/np.sum(flatDist)
randomIdx   = np.random.choice(np.arange(values.size),flatDist.size,replace=False,p=flatDist)
shuffleIdx  = np.array([randomIdx//shape[1],randomIdx%shape[1]])
shuffleIdx  = shuffleIdx[:,np.argsort(shuffleIdx[0,:],kind="stable")]
sample      = values[tuple(shuffleIdx)].reshape(shape)[:,:N]

output:

print(sample)
[[ 3  7  2  5]
 [13 12 14 16]
 [27 23 25 29]
 [37 31 33 36]
 [47 45 48 49]
 [59 50 52 54]
 [62 61 60 66]
 [72 78 70 77]
 [87 82 83 86]
 [92 98 95 93]]
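Another fully vectorized route to per-row sampling without replacement is the exponential-keys trick (weighted reservoir sampling): draw an Exp(1) key for every entry, divide it by that entry's probability, and keep the N smallest keys in each row. This is a sketch under the assumption that every row has at least N entries with nonzero probability; the names keys and idx are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

a = np.arange(12, dtype=float).reshape(3, 4)
p_arr = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.4, 0.3, 0.0, 0.3]])
N = 3

# Exp(1)/p_i keys: a smaller key means the entry is drawn earlier.
# Zero-probability entries get an infinite key and are never selected.
with np.errstate(divide="ignore"):
    keys = rng.exponential(size=p_arr.shape) / p_arr
idx = np.argsort(keys, axis=1)[:, :N]        # N smallest keys per row
sample = np.take_along_axis(a, idx, axis=1)  # shape (rows, N)
print(sample)
```

Unlike the flatten-and-slice approach above, each row is sampled independently here, at the cost of one argsort over the full row width.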

This can be used instead of the loop.

a = np.ones([500, 500])
p_arr = np.identity(a.shape[0])
a2 = a.flatten()
a3 = a2 * np.full(shape=a.shape, fill_value=np.arange(a.shape[0])).flatten()
p_arr3 = p_arr.flatten() / a.shape[1]
print(np.random.choice(a3, a.shape[1], p=p_arr3))

I had to use flatten() a few times to convert the 2D arrays into 1D arrays. We can then avoid the loop by performing the operations on the 1D arrays.

1 Comment

flatten() is still performing a loop internally to return the flattened data - this is just moving the problem, and adding an extra step of creating a flat object which may be less performant/memory-efficient.
