
I am wondering how to sample rows efficiently from high-dimensional numpy arrays.

At the moment, this is what I do:

import time
import numpy as np

n = 11000000
d = 28
X = np.random.randn(n, d)
idx = np.random.choice(range(n), 10000000, replace=False)

time_l = []
for i in range(15):
    t_0 = time.clock()
    _X = X[idx, :]
    t_1 = time.clock()
    time_l.append(t_1 - t_0)
print 'avg= ', sum(time_l) / 15
print 'sd= ', np.std(time_l)

But the performance of X[idx, :] varies substantially. For example, with n = 11 million, no_samples = 10 million and d = 50, it takes roughly 32 seconds on average with a standard deviation of 25 seconds.

So at times it's done in 4 seconds, but there are also times when it takes more than 50 seconds. How can this be? (The same happens with np.take().)

Also, I'm getting a memory error if I try X.T[:, idx] instead, which surprises me too.

Thanks for your thoughts!

**Update:** I upgraded from numpy 1.10 to 1.12 and it behaves much better now: avg = 6, sd = 2. If any of you knows a more stable/faster way to subsample rows, I'm glad to hear it anyway!
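One thing that is sometimes reported to help with large fancy-indexing gathers (an assumption here, not something verified on the poster's machine or numpy version) is sorting the indices first, so rows are read in increasing memory order; np.take is an equivalent gather along axis 0. A minimal sketch with smaller sizes than in the question:

```python
import numpy as np

# Smaller sizes than in the question so the sketch runs quickly;
# scale n up to reproduce the original setting.
n, d, n_samples = 100000, 28, 50000

X = np.random.randn(n, d)
idx = np.random.choice(n, n_samples, replace=False)

# Sorting the indices can make the gather more cache-friendly,
# since the sampled rows are then read in increasing memory order.
idx_sorted = np.sort(idx)
X_sub = X[idx_sorted]  # same rows as X[np.sort(idx)], sequential access

# np.take performs the same gather along axis 0.
X_sub2 = np.take(X, idx_sorted, axis=0)
assert np.array_equal(X_sub, X_sub2)
```

Note that sorting changes the order of the sampled rows, which is fine when the sample is used as an unordered subset.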

  • What is no_samples? Commented Mar 7, 2017 at 13:53
  • Sorry! Number of rows I sample. For my example I sampled all of them so no_samples=n=11 million Commented Mar 7, 2017 at 14:01
  • "no_samples" must be less than "n" if "replace=False". Otherwise you are just shuffling your original array, not sampling it. Commented Mar 7, 2017 at 14:02
  • What value were you using for no_samples? I am assuming you were using the same value across those tests. Commented Mar 7, 2017 at 14:03
  • Yes, I went over this with a simple for loop, so no_samples=const. This is part of an algorithm that asymptotically looks at all data points; that's why I used no_samples=n, but for any value <n you get the same high variance in the computation time! Commented Mar 7, 2017 at 14:05

1 Answer


This only partly answers your question.

X is an n-by-m array of random values. idx is an array of row indices to sample, with values ranging from 0 to n-1. If you try to use:

X = X.T[idx, :]

and n is not equal to m, then you may be trying to access a row index higher than the number of rows in the transposed array (X.T has only m rows). The only way this code is guaranteed to work is when X is a square matrix, i.e. n = m.

This code will transpose X on the same line if that is what you are trying to achieve:

X = X.T[:, idx]
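A small sketch of both cases on a deliberately non-square array (illustrative values, not the poster's sizes):

```python
import numpy as np

n, m = 6, 3                  # deliberately non-square
X = np.random.randn(n, m)
idx = np.array([0, 4, 5])    # valid row indices for X, i.e. in range(n)

# X.T has shape (m, n), so indexing its *rows* with idx fails when any
# index >= m -- here idx contains 4 and 5, but X.T has only 3 rows.
try:
    _ = X.T[idx, :]
except IndexError:
    pass  # raises, as described above

# Indexing the *columns* of X.T selects the same rows as X[idx, :]:
assert np.array_equal(X.T[:, idx], X[idx, :].T)
```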

With regard to the timing of the code: long-running Python code can have widely varying execution times, because the computer's processor can be doing other tasks. But I have never seen or heard of a 4 s to 50 s spread for the same task. Are you sure the array you used for the 4 s run was the same as for the 50 s run?

--

In response to your update: the two-second standard deviation could definitely be attributable to the computer's processor doing other tasks while the program is executing. It would be almost impossible to get the same time for each execution of the code using an OS like Windows or Linux. You would nearly have to write your own OS to get an exact timing for each execution. (Which I doubt you will want to do! :D)
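To dampen that scheduler noise when benchmarking, one common approach (a sketch, using the standard library's timeit rather than the manual time.clock loop in the question) is to take the minimum over several repeats, since the larger values mostly reflect the OS doing other work:

```python
import timeit
import numpy as np

# Smaller sizes than in the question so the sketch runs quickly.
n, d = 100000, 28
X = np.random.randn(n, d)
idx = np.random.choice(n, n // 2, replace=False)

# timeit.repeat runs the statement `number` times per repeat and returns
# the total wall time of each repeat; the minimum is usually the most
# stable estimate, since larger values reflect background OS activity.
times = timeit.repeat(lambda: X[idx, :], number=3, repeat=5)
print('best of 5 repeats: %.4f s' % min(times))
```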


1 Comment

Yes, you're right about the indices regarding the transpose. Yes, everything stays the same and I'm not running anything else in parallel. I posted the entire code above. It's surprising to me too; that's why I asked the question!
