I am wondering how to sample rows efficiently from high-dimensional numpy arrays.
At the moment, this is what I do:
```python
import time
import numpy as np

n = 11000000
d = 28
X = np.random.randn(n, d)
# passing n directly avoids materializing range(n) as an 11M-element list
idx = np.random.choice(n, 10000000, replace=False)
time_l = []
for i in range(15):
    t_0 = time.clock()
    _X = X[idx, :]
    t_1 = time.clock()
    time_l.append(t_1 - t_0)
print 'avg= ', sum(time_l) / 15
print 'sd= ', np.std(time_l)
```
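As an aside on the measurement itself: `time.clock()` measures CPU time on Unix but wall time on Windows, which can make timings hard to compare. A sketch of the same measurement with `timeit` (using scaled-down sizes so it runs quickly; the sizes here are illustrative, not the post's):

```python
import timeit
import numpy as np

# Scaled-down stand-ins for the post's n=11M, d=28, 10M samples.
n, d, k = 100000, 28, 90000
X = np.random.randn(n, d)
idx = np.random.choice(n, k, replace=False)

# timeit runs the statement repeatedly with a wall-clock timer and
# returns the per-run times, so outliers are easy to spot.
times = timeit.repeat(lambda: X[idx, :], number=1, repeat=5)
print('min = %.4fs  avg = %.4fs' % (min(times), sum(times) / len(times)))
```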
But the performance of `X[idx, :]` varies substantially. For example, with n = 11 million, no_samples = 10 million and d = 50, it takes roughly 32 seconds on average with a standard deviation of 25 seconds.
So sometimes it finishes in 4 seconds, but other times it takes more than 50 seconds. How can this be? (The same happens with `np.take()`.)
Also, I get a memory error if I try `X.T[:, idx]` instead, which surprises me too.
Thanks for your thoughts!
**Update:** I upgraded from numpy 1.10 to 1.12 and it behaves much better now: avg = 6 s, sd = 2 s. If anyone knows a more stable or faster way to subsample rows, I'm glad to hear it anyway!
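One trick that is often suggested for this kind of gather (a sketch with scaled-down sizes, not the post's 11M-row array) is to sort the indices before indexing: sorted indices walk the array in memory order, which tends to be faster and more consistent than a random-order gather. When the sample is a large fraction of the rows, a boolean mask is an alternative worth comparing:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 100000, 28, 90000  # illustrative sizes, smaller than the post's
X = rng.randn(n, d)

# Sorted indices give sequential memory access during the copy.
idx = np.sort(rng.choice(n, k, replace=False))
X_sub = X[idx]           # fancy indexing with sorted indices

# Equivalent boolean-mask selection (rows come out in index order,
# so it matches the sorted-index result exactly).
mask = np.zeros(n, dtype=bool)
mask[idx] = True
X_sub2 = X[mask]
```

Whether sorting or masking wins depends on the sample fraction and array size, so it is worth timing both on the real data.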
no_samples? I am assuming you were using the same value across those tests.