I have a Numpy array that looks like

>>> a
array([[ 3. ,  2. , -1. ],
       [-1. ,  0.1,  3. ],
       [-1. ,  2. ,  3.5]])

I would like to select a value from each row at random, but I would like to exclude the -1 values from the random sampling.

What I do currently is:

import numpy
import random

x = []
for i in range(a.shape[0]):
    # Column indices of this row's usable values (everything except the -1 entries).
    idx = numpy.where(a[i, :] > 0)[0]
    # Draw one of those indices at random.
    idxr = random.sample(list(idx), 1)[0]
    xi = a[i, idxr]
    x.append(xi)

and get

>>> x
[3.0, 3.0, 2.0]

This is becoming a bit slow for large arrays and I would like to know if there is a way to conditionally select random values from the original a matrix without dealing with each row individually.

  • I don't have any experience with NumPy but I would have guessed that generating a random number would take longer than accessing the value from the array. The same is true of appending to a list. Have you profiled your program to make sure you're optimizing the right thing? Commented Jun 30, 2010 at 16:24
  • I've profiled the program and the idx and idxr lines are the slowest, with an almost equal amount of time spent on each. Commented Jun 30, 2010 at 17:11
  • Do you always expect to have the same number of excluded values in each row? If so, you can vectorize the whole thing and do it in two lines of code with no python loops... Commented Jun 30, 2010 at 22:18
  • @Joe Kington: not necessarily. For all intents and purposes, the rows belong to independent samples. Commented Jul 1, 2010 at 2:11

1 Answer

I really don't think you will find anything in NumPy that does exactly what you are asking out of the box, so I'll offer the optimizations I could think of.

There are several things that could make this slow. First, numpy.where() is rather slow because it has to check every value in the row (and a slice is generated for each row as well) and then build an array of indices. If you plan on repeating this process on the same matrix, the best thing you could do is sort each row first. Then you can use a binary search to find where the positive values start and draw a single random number to select one of them. Of course, you could also just store, for each row, the index where the positive values start after finding it once with a binary search.
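A minimal sketch of the sort-and-binary-search idea, assuming (as in the example) that the excluded entries are all -1 and every valid value is positive; it uses numpy.searchsorted for the binary search and the modern numpy.random.Generator API, which postdates this answer:

```python
import numpy as np

rng = np.random.default_rng()

a = np.array([[ 3. ,  2. , -1. ],
              [-1. ,  0.1,  3. ],
              [-1. ,  2. ,  3.5]])

# Sort each row once; the -1 entries collect at the front of each row.
s = np.sort(a, axis=1)

# Binary-search each sorted row for where the positive values begin.
starts = np.array([np.searchsorted(row, 0.0, side='right') for row in s])

# Draw one uniform random offset per row into that row's positive region.
n_rows, n_cols = a.shape
offsets = rng.integers(0, n_cols - starts)  # high is exclusive, per-row
x = s[np.arange(n_rows), starts + offsets]
```

If the matrix is reused, `s` and `starts` can be computed once and only the last two lines repeated per draw.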

If you don't plan on repeating this process many times, then I would recommend using Cython to speed up the numpy.where line. Cython lets you avoid slicing out each row and speeds up the process overall.

My last suggestion is to use random.choice rather than random.sample unless you really do plan on choosing sample sizes that are larger than 1.
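For the single-element case, random.choice picks one item directly, avoiding the sample-size bookkeeping that random.sample does; a small sketch with a made-up index list standing in for one row's valid indices:

```python
import random

# Hypothetical candidate (non-excluded) column indices for one row.
idx = [0, 1]

# random.choice returns a single element directly,
# equivalent to random.sample(idx, 1)[0] but cheaper.
value = random.choice(idx)
```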


1 Comment

I'll be doing this process on similar but newly generated arrays many times over, so I'll look into Cython. Thanks!
