
I have a nested loop that has to loop through a huge amount of data.

Assume a data frame of random values with 1,000,000 rows, each of which is an (X, Y) location in 2D space. A window of length 10 moves through all 1M rows one by one until all the calculations are done.

Explaining what the code is supposed to do:

  • Each row represents a coordinate in the X-Y plane.
  • r_test contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
  • For each window of 10 points/rows, and for every diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the distance is less than that diameter we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter.
  • Once the loop has gone through all the diameters in r_test for these first 10 points, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, so they are set to np.inf to be distinguished later.
  • Then the window moves one point down the rows and repeats this process until S_wind is complete.
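
The steps in the list can be sketched for a single window as follows. This is only a minimal illustration under stated assumptions, not the code in question: `window_slope` is a hypothetical helper, the pair count is assumed to restart for each diameter, and every ordered pair is visited once (so each unordered pair contributes 2, and a point paired with itself contributes 1).

```python
import numpy as np

def window_slope(xs, ys, r_test):
    """Slope of log10(fraction of close pairs) vs log10(diameter) for one window."""
    n = len(xs)
    fractions = np.empty(len(r_test))
    for k, r in enumerate(r_test):
        close_pairs = 0  # assumption: the count restarts for every diameter
        for i in range(n):
            for j in range(n):
                # every ordered pair is visited, including i == j
                if np.hypot(xs[i] - xs[j], ys[i] - ys[j]) < r:
                    close_pairs += 1
        fractions[k] = close_pairs / n**2
    # straight-line fit in log-log space; only the slope is kept
    slope, _intercept = np.polyfit(np.log10(r_test), np.log10(fractions), 1)
    return slope

# made-up window of 10 points in the same value range as the question
rng = np.random.default_rng(0)
xs = rng.integers(2000, 6000, 10)
ys = rng.integers(2000, 6000, 10)
r_test = np.arange(80, 800, 5)
example_slope = window_slope(xs, ys, r_test)
```

Because each point is always within range of itself, the fraction is never zero, so the logarithm stays finite.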

What's a potentially better algorithm for this in Python 3.x than the one I'm using?

Many thanks in advance!

import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']


####====creating upper and lower bound for the diameter of the investigation circles    
x_range =max(df['X']) - min(df['X']) 
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)  
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800  
r_test = np.arange(r1, r2, 5) 

S_wind = np.zeros(len(df['X'])) + np.inf

for ii in range (10,len(df['X'])): #### maybe the code run slower because of using len() function instead of a number
        c_10 = np.zeros(len(r_test)) +np.inf
        H = 0
        C = 0
        N = 10 ##### maybe I should also remove this
        for ind in range(len(r_test)):
            for i in range (ii-10,ii):
                for j in range(ii-10,ii):
                    dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2+ (df['Y'][i] - df['Y'][j])**2)
                    if dd > 0:
                        H += 1
            c_10[ind] = (H/(N**2))

        S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]   
  • It would help a lot if you explained what your code was supposed to be doing to all those datapoints. I'm perplexed by a whole lot of stuff and I haven't even gotten to the nested loop you're asking about. Why are you taking the fifth root of 1/10 and using a multiple of it as a lower bound on a range? Better variable names might be a start to making the code more comprehensible. Commented Apr 11, 2021 at 20:26
  • @Blckknght Thanks for your comment. My apologies, I will try to clear out the code now. Commented Apr 11, 2021 at 20:28
  • @Blckknght I tried to explain what the code is supposed to do. Does it help you understand the purpose of the code? Please let me know if I need to clarify more. Many thanks! Commented Apr 11, 2021 at 20:47
  • So I'm trying to test out a way of using numpy broadcasting to eliminate the three inner loops, but I've found that r_test is an empty array for my random data. Is that something that should be possible, or is the computation incorrect somehow? Commented Apr 11, 2021 at 22:09
  • @Blckknght Thanks for the comment. No, actually it should not be empty. I changed the code to avoid any empty r_test. This code is similar to my actual code, except that the data are generated randomly here and the window length is only 10, whereas in my case it is 200. Commented Apr 11, 2021 at 22:20

1 Answer

You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
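
As a small illustration of the broadcasting idea, here is a sketch on three made-up points. The point values and the `radius` threshold are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # three made-up 2D points

# (1, 3, 2) minus (3, 1, 2) broadcasts to (3, 3, 2): one difference vector per pair
differences = points[None, :, :] - points[:, None, :]
distances = np.sqrt((differences ** 2).sum(axis=2))  # (3, 3) distance matrix

radius = 6.0  # hypothetical threshold
# counts ordered pairs, diagonal (a point with itself) included
pairs_within = int((distances < radius).sum())
```

No Python-level loop over pairs is needed: the full distance matrix comes out of one subtraction and one reduction.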

The inner loops are comparing ten 2D points against each other in pairs. That's crying out for a 10x10x2 numpy array:

# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :],  points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None]  # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)

S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
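
The outermost loop can possibly be batched as well. The following is a sketch rather than a drop-in replacement: `windowed_slopes`, `chunk` and `cumulative` are hypothetical names, it assumes NumPy 1.20 or newer (for `sliding_window_view`), more rows than the window length, and an `(n_rows, 2)` array `xy` of coordinates. It processes many windows per step and relies on `np.polyfit` fitting one line per column when `y` is 2D.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # needs NumPy >= 1.20

def windowed_slopes(xy, r_test, window=10, chunk=2048, cumulative=True):
    """Hypothetical helper: per-window log-log slopes, several windows per step."""
    n_rows = len(xy)
    slopes = np.full(n_rows, np.inf)  # rows without a preceding full window stay inf
    # views[k] holds rows k .. k+window-1, laid out as (2, window); no data is copied.
    # The last view is dropped because the result for row ii uses rows ii-window .. ii-1.
    views = sliding_window_view(xy, window, axis=0)[:-1]
    log_r = np.log10(r_test)
    r_squared = (r_test * r_test)[None, :, None, None]
    for start in range(0, len(views), chunk):
        block = views[start:start + chunk]                    # (B, 2, w)
        diff = block[:, :, :, None] - block[:, :, None, :]    # (B, 2, w, w)
        sq_dist = (diff * diff).sum(axis=1)                   # (B, w, w)
        counts = (sq_dist[:, None, :, :] < r_squared).sum(axis=(2, 3))  # (B, R)
        if cumulative:
            counts = counts.cumsum(axis=1)  # mimic H being carried across diameters
        fractions = counts / window**2
        # one straight-line fit per window; row 0 of the result holds the slopes
        block_slopes = np.polyfit(log_r, np.log10(fractions).T, 1)[0]
        slopes[start + window:start + window + len(block)] = block_slopes
    return slopes
```

The `chunk` size trades memory for speed: the boolean comparison array has `chunk * len(r_test) * window**2` elements, so a window of 200 would call for a much smaller chunk than a window of 10. With the question's data frame, a call might look like `S_wind = windowed_slopes(df[['X', 'Y']].to_numpy(), r_test)`.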

I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
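
One common way to do that conversion is to select both columns and call `DataFrame.to_numpy()` once, outside the loop, then slice the resulting array per window. A minimal sketch, using a smaller made-up frame than the one in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(2000, 6000, (1000, 2)), columns=['X', 'Y'])

xy = df[['X', 'Y']].to_numpy()   # (n_rows, 2) array, converted once outside the loop
ii = 10                          # example window end, as in the loop above
points = xy[ii - 10:ii]          # (10, 2) slice for the current window
```

Slicing the plain array avoids repeated pandas indexing inside the loop.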

Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
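
To make the role of the cumulative sum concrete, here is a sketch comparing explicit loops with the vectorized counts for one window. The helper names and the `reset_per_radius` flag are hypothetical, and the points are made up.

```python
import numpy as np

def loop_counts(points, r_test, reset_per_radius):
    """Pair counts per radius using explicit loops (slow, for comparison only)."""
    counts = np.zeros(len(r_test), dtype=int)
    H = 0
    for ind, r in enumerate(r_test):
        if reset_per_radius:
            H = 0  # the alternative placement of `H = 0`
        for p in points:
            for q in points:
                if r - np.sqrt(((p - q) ** 2).sum()) > 0:
                    H += 1
        counts[ind] = H
    return counts

def vectorized_counts(points, r_test, reset_per_radius):
    """Same counts via broadcasting; cumsum reproduces the carried-over H."""
    diff = points[None, :, :] - points[:, None, :]
    sq = (diff * diff).sum(axis=2)
    per_radius = (sq[None, :, :] < (r_test * r_test)[:, None, None]).sum(axis=(1, 2))
    return per_radius if reset_per_radius else per_radius.cumsum()

rng = np.random.default_rng(42)
points = rng.integers(2000, 6000, (10, 2))
r_test = np.arange(80, 800, 40)
```

Without the reset, each entry also contains all the counts from the smaller radii, which is what `cumsum` reproduces. A constant factor applied to the counts only shifts the intercept of the log-log fit, so it does not change the slope stored in S_wind.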
