
I am wondering how to sample rows efficiently from high-dimensional numpy arrays.

At the moment, this is what I do:

import time
import numpy as np

n = 11000000
d = 28
X = np.random.randn(n, d)
idx = np.random.choice(range(n), 10000000, replace=False)

time_l = []
for i in range(15):
    t_0 = time.clock()
    _X = X[idx, :]
    t_1 = time.clock()
    time_l.append(t_1 - t_0)
print 'avg= ', sum(time_l) / 15
print 'sd= ', np.std(time_l)

But the performance of X[idx, :] varies substantially. For example, with n = 11 million, no_samples = 10 million and d = 50, it takes roughly 32 seconds on average with a standard deviation of 25 seconds.

So at times it's done in 4 seconds, but there are also times when it takes more than 50 seconds. How can this be? (The same happens with np.take().)

Also, I'm getting a memory error if I try X.T[:, idx] instead, which surprises me too.

Thanks for your thoughts!

**Update:** I upgraded from numpy 1.10 to 1.12 and it behaves much better now: avg = 6, sd = 2. If any of you knows a more stable/faster way to subsample rows, I'm glad to hear it anyway!
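One thing that is sometimes reported to help with large fancy-indexing gathers (an assumption here, not something verified on the poster's machine or numpy version) is sorting the indices first, so rows are read in increasing memory order; np.take is an equivalent gather along axis 0. A minimal sketch with smaller sizes than in the question:

```python
import numpy as np

# Smaller sizes than in the question so the sketch runs quickly;
# scale n up to reproduce the original setting.
n, d, n_samples = 100000, 28, 50000

X = np.random.randn(n, d)
idx = np.random.choice(n, n_samples, replace=False)

# Sorting the indices can make the gather more cache-friendly,
# since the sampled rows are then read in increasing memory order.
idx_sorted = np.sort(idx)
X_sub = X[idx_sorted]  # same rows as X[np.sort(idx)], sequential access

# np.take performs the same gather along axis 0.
X_sub2 = np.take(X, idx_sorted, axis=0)
assert np.array_equal(X_sub, X_sub2)
```

Note that sorting changes the order of the sampled rows, which is fine when the sample is used as an unordered subset.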

  • What is no_samples? Commented Mar 7, 2017 at 13:53
  • Sorry! Number of rows I sample. For my example I sampled all of them so no_samples=n=11 million Commented Mar 7, 2017 at 14:01
  • "no_samples" must be less than "n" if "replace=False". Otherwise you are just shuffling your original array, not sampling it. Commented Mar 7, 2017 at 14:02
  • What value were you using for no_samples? I am assuming you were using the same value across those tests. Commented Mar 7, 2017 at 14:03
  • Yes, I went over this with a simple for loop, so no_samples=const. This is part of an algorithm that asymptotically looks at all data points; that's why I used no_samples=n, but for any value <n you get the same high variance in the computation time! Commented Mar 7, 2017 at 14:05

1 Answer


This only partly answers your question.

X is an n-by-m array of random values. idx is an array of row indices to sample, with values ranging from 0 to n-1. If you try to use:

X = X.T[idx, :]

and n is not equal to m, then you may be trying to access a row index higher than the number of rows in the transposed array (X.T has only m rows). The only way this code is guaranteed to work is when X is a square matrix, i.e. n = m.

This code will transpose X on the same line if that is what you are trying to achieve:

X = X.T[:, idx]
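A small sketch of both cases on a deliberately non-square array (illustrative values, not the poster's sizes):

```python
import numpy as np

n, m = 6, 3                  # deliberately non-square
X = np.random.randn(n, m)
idx = np.array([0, 4, 5])    # valid row indices for X, i.e. in range(n)

# X.T has shape (m, n), so indexing its *rows* with idx fails when any
# index >= m -- here idx contains 4 and 5, but X.T has only 3 rows.
try:
    _ = X.T[idx, :]
except IndexError:
    pass  # raises, as described above

# Indexing the *columns* of X.T selects the same rows as X[idx, :]:
assert np.array_equal(X.T[:, idx], X[idx, :].T)
```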

With regard to the timing of the code: long-running Python code can have widely varying execution times, because the computer's processor can be doing other tasks. But I have never seen or heard of a 4 s to 50 s spread for the same task. Are you sure the array you used for the 4 s run was the same as for the 50 s run?

--

In response to your update: the two-second standard deviation could definitely be attributable to the computer's processor doing other tasks while the program is executing. It would be almost impossible to get the same time for each execution of the code using an OS like Windows or Linux. You would nearly have to write your own OS to get an exact timing for each execution. (Which I doubt you will want to do! :D)
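To dampen that scheduler noise when benchmarking, one common approach (a sketch, using the standard library's timeit rather than the manual time.clock loop in the question) is to take the minimum over several repeats, since the larger values mostly reflect the OS doing other work:

```python
import timeit
import numpy as np

# Smaller sizes than in the question so the sketch runs quickly.
n, d = 100000, 28
X = np.random.randn(n, d)
idx = np.random.choice(n, n // 2, replace=False)

# timeit.repeat runs the statement `number` times per repeat and returns
# the total wall time of each repeat; the minimum is usually the most
# stable estimate, since larger values reflect background OS activity.
times = timeit.repeat(lambda: X[idx, :], number=3, repeat=5)
print('best of 5 repeats: %.4f s' % min(times))
```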


1 Comment

Yes, you're right about the indices regarding the transpose. Yes, everything stays the same and I'm not running anything else in parallel. I posted the entire code above. It's surprising to me too; that's why I asked the question!
