
Main goal: Suppose I have a multi-dimensional array, and a 0-1 index set with one entry per element. For example, if my array is [[3,6,7,8], [1,32,45,7]], my index set might be [[1,0,1,1], [0,0,1,1]]. I would like to copy each row of my array n times, and then randomly increase each element whose corresponding index is 1.

import time
import random
import numpy as np

def foo(arr, upper_bound, index_set, first_set_size, sec_set_size, limit):
    my_array = np.zeros((first_set_size*sec_set_size, limit))  # each row is copied |sec_set_size| times
    it = 0
    for i in range(first_set_size):
        for j in range(sec_set_size):
            my_array[it] = arr[i]  # copy the elements from the corresponding row
            for k in range(limit):
                if index_set[i][k] == 1:  # update the elements whose indices are one
                    temp = arr[i][k]      # get the current value
                    my_array[it][k] = temp + random.randint(1, upper_bound - temp)  # I use fastrand.pcg32bounded here. Update the value.
            it += 1
    return my_array


upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set= np.array([[random.randint(0,1) for j in range(limit)] for i in range(first_set_size)]) #each elements has an index which is either 1 or 0

start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound,index_set, first_set_size, sec_set_size, limit)
print("time taken: %s " % (time.time() - start_time))

Once I increase limit and the set sizes, the code takes several minutes. Is there any way I can perform this operation faster / more efficiently? I've spent quite a bit of time on this, but could not improve the speed of my implementation.

EDIT: Suppose my initial array is:

[[11 23 24 17  0]
 [ 1 23 12 19  5]
 [20 15  1 17 17]
 [ 3  8  7  0 24]]

Also, my index set is given as;

[[1 0 0 0 1]
 [1 0 1 0 0]
 [1 1 1 1 0]
 [0 1 0 1 1]]

If sec_set_size = 5, I would like to take 5 copies of each row and increase the values of the elements whose indices are one.

The final result should be like this;

[[39. 23. 24. 17. 44.]
 [50. 23. 24. 17. 27.]
 [42. 23. 24. 17. 24.]
 [45. 23. 24. 17. 11.]
 [49. 23. 24. 17. 43.]
 [23. 23. 44. 19.  5.]
 [10. 23. 37. 19.  5.]
 [14. 23. 29. 19.  5.]
 [12. 23. 22. 19.  5.]
 [ 5. 23. 15. 19.  5.]
 [36. 45. 26. 37. 17.]
 [24. 40. 35. 38. 17.]
 [34. 20. 24. 31. 17.]
 [27. 16.  9. 20. 17.]
 [37. 37.  6. 37. 17.]
 [ 3. 50.  7. 46. 47.]
 [ 3. 13.  7. 37. 44.]
 [ 3. 23.  7. 32. 29.]
 [ 3. 10.  7. 22. 41.]
 [ 3. 22.  7. 32. 41.]]
  • Don't use iter as a variable name, it's a built-in function. Commented Sep 10, 2021 at 18:29
  • Your example isn't clear. Given your example array please show your expected output. Commented Sep 10, 2021 at 18:30
  • @blorgon I updated my post accordingly. Commented Sep 10, 2021 at 18:44
  • The output certainly helps, but the reference implementation is what's important Commented Sep 10, 2021 at 18:49

1 Answer


Numpy is all about vectorization. If you're using python loops, you're probably doing it wrong.

First off, NumPy's random number generators are vectorized, so the whole mask can be generated in one call:

index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)

You already did this correctly for arr; only the list comprehension that builds index_set needed replacing.
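To get a feel for the difference, here is a quick, self-contained comparison (sizes taken from the question) of the list-comprehension mask against the single vectorized call:

```python
import time
import random
import numpy as np

first_set_size, limit = 100, 1000  # sizes from the question

# pure-Python list comprehension (the original approach)
t0 = time.time()
slow_mask = np.array([[random.randint(0, 1) for _ in range(limit)]
                      for _ in range(first_set_size)])
t_slow = time.time() - t0

# one vectorized call
t0 = time.time()
fast_mask = np.random.randint(2, size=(first_set_size, limit), dtype=bool)
t_fast = time.time() - t0

# the vectorized call is typically orders of magnitude faster
print(slow_mask.shape, fast_mask.shape, fast_mask.dtype)
```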

Next, to copy rows multiple times, you can use np.repeat:

my_array = np.repeat(arr, sec_set_size, axis=0)

Notice that you don't need first_set_size at all. It's redundant with arr.shape[0]. You can do the same with your boolean mask to make the shapes match:

index_set = np.repeat(index_set, sec_set_size, axis=0)
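If np.repeat is unfamiliar, a tiny example using the small array from the question shows how axis=0 duplicates each row in place (the copies stay adjacent, matching the loop order in the original code):

```python
import numpy as np

a = np.array([[3, 6, 7, 8],
              [1, 32, 45, 7]])  # the small array from the question

repeated = np.repeat(a, 2, axis=0)  # each row copied twice, consecutively
print(repeated)
# [[ 3  6  7  8]
#  [ 3  6  7  8]
#  [ 1 32 45  7]
#  [ 1 32 45  7]]
```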

Now you can update the portion of my_array selected by index_set with an appropriately sized batch of random increments. Note that np.random.randint excludes its high endpoint while random.randint includes it, so add 1 to reproduce your original bounds exactly:

my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set] + 1)
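Putting the pieces together on the small example from the edit (seeded only so the demo is repeatable; the individual increments are random):

```python
import numpy as np

np.random.seed(0)  # only so reruns produce the same numbers
upper_bound = 50
sec_set_size = 3

arr = np.array([[11, 23, 24, 17, 0],
                [1, 23, 12, 19, 5]])
index_set = np.array([[1, 0, 0, 0, 1],
                      [1, 0, 1, 0, 0]], dtype=bool)

my_array = np.repeat(arr, sec_set_size, axis=0)       # each row copied 3 times
mask = np.repeat(index_set, sec_set_size, axis=0)     # mask repeated to match
# +1 so the increments span [1, upper_bound - value], like random.randint
my_array[mask] += np.random.randint(1, upper_bound - my_array[mask] + 1)

# masked entries grew but never exceed upper_bound; the rest are untouched
print(my_array)
```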

Your entire program reduces to about four (very fast) lines, plus some initialization:

def foo(arr, upper_bound, index_set, sec_set_size, limit):
    my_array = np.repeat(arr, sec_set_size, axis=0)
    index_set = np.repeat(index_set, sec_set_size, axis=0)
    my_array[index_set] += np.random.randint(1, upper_bound - my_array[index_set] + 1)
    return my_array

upper_bound = 50
limit = 1000
first_set_size= 100
sec_set_size = 50
arr = np.random.randint(25, size=(first_set_size, limit)) #create an array containing integer numbers
index_set = np.random.randint(2, size=(first_set_size, limit), dtype=bool)

start_time = time.time() #measure the time taken by the function
result = foo(arr, upper_bound, index_set, sec_set_size, limit)
print(f"time taken: {time.time() - start_time}")

You may want to experiment with using indices instead of a boolean mask. It makes the indexing itself more efficient, since the number of non-zero elements doesn't need to be recomputed twice, but on the other hand the setup is a bit more expensive:

def foo(arr, upper_bound, index_set, sec_set_size, limit):
    my_array = np.repeat(arr, sec_set_size, axis=0)
    r, c = np.where(index_set)
    r = (sec_set_size * r[:, None] + np.arange(sec_set_size)).ravel()
    c = np.repeat(c, sec_set_size)
    my_array[r, c] += np.random.randint(1, upper_bound - my_array[r, c] + 1)
    return my_array
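For reference, np.where on a boolean mask returns the row and column indices of every True entry, in row-major order (a small made-up mask):

```python
import numpy as np

mask = np.array([[True, False, True],
                 [False, True, False]])
r, c = np.where(mask)  # True entries at (0,0), (0,2), (1,1)
print(r)  # [0 0 1]  row of each True entry
print(c)  # [0 2 1]  column of each True entry
```

Those index pairs are what the snippet above expands into one (row, column) pair per copied row.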

17 Comments

@ball_jan. Glad it worked out. How much faster is it now? I'm on mobile, so no way to test anything.
When limit = 1000, first_set_size = 200, sec_set_size = 50, your method takes 1.07 seconds while mine takes 21.09876 :)
@ball_jan. Thanks, could you also let me know how the updated version compares? I'm hoping it's marginally faster, but no guarantee.
I'm getting an invalid syntax error, but could not figure it out. I'll keep you updated.
What does np.where(index_set) exactly do?