I have a nested loop that has to loop through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each holding an X,Y location in 2D space. A window of length 10 moves through all 1M rows one by one until all the calculations are done.
What the code is supposed to do:
- Each row represents a coordinate in the X-Y plane.
- r_test contains the diameters of the different circles of investigation in our 2D plane (X-Y plane).
- For each 10 points/rows, for every single diameter in r_test, we compare the distance between every point and the other points in the window, and if the value is less than that diameter we add 1 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
- For these first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. So the first 10 data points will have no value calculated for them and keep np.inf, so they can be distinguished later.
- Then the window moves one point down the rows and repeats this process until S_wind is completed.
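The steps above, for a single window, can be sketched with NumPy broadcasting instead of explicit loops. This is only a sketch under assumptions: the helper names pair_counts and window_slope are hypothetical, all ordered pairs are counted (self-pairs included, as the double loop in the code does), and the count is taken separately for each diameter rather than carried over from one diameter to the next.

```python
import numpy as np

def pair_counts(x, y, r_test):
    """Count ordered pairs (i, j) whose distance is below each value in r_test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_test = np.asarray(r_test, dtype=float)
    # (n, n) matrix of pairwise distances inside the window
    dist = np.sqrt((x[:, None] - x[None, :]) ** 2 + (y[:, None] - y[None, :]) ** 2)
    # compare every distance with every diameter: shape (len(r_test), n, n)
    return (dist[None, :, :] < r_test[:, None, None]).sum(axis=(1, 2))

def window_slope(x, y, r_test):
    """Slope of log10(c) against log10(r) for one window (hypothetical helper)."""
    n = len(x)
    c = pair_counts(x, y, r_test) / n**2
    return np.polyfit(np.log10(r_test), np.log10(c), 1)[0]

# three points forming a 3-4-5 triangle
counts = pair_counts([0, 3, 0], [0, 0, 4], [3.5, 4.5, 5.5])
slope = window_slope([0, 3, 0], [0, 0, 4], [3.5, 4.5, 5.5])
```

Because self-pairs always count, every entry of c is positive and the logarithm stays finite.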
What would be a better algorithm for this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
#### generating the input data frame
df = pd.DataFrame(data=np.random.randint(2000, 6000, (1000000, 2)))
df.columns = ['X', 'Y']
#### creating upper and lower bounds for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range) / 20
d = 2
N = 10  #### number of points in each window
# r1 = 2*R*(1/N)**(1/d)
# r2 = (R)/(1+d)
# r_test = np.arange(r1, r2, 0.05)
## avoiding generation of an empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):  #### maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0  # note: H is set once per window, so it keeps accumulating across diameters
    C = 0
    N = 10  ##### maybe I should also remove this
    for ind in range(len(r_test)):
        # compare every ordered pair of points in the window rows ii-10 .. ii-1
        for i in range(ii - 10, ii):
            for j in range(ii - 10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H / (N**2)
    # slope of the line fitted to log10(c_10) against log10(r_test)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
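One possible way to speed up the loop above is to pull the columns out of the data frame as NumPy arrays and process many windows at once. The following is a sketch, not a drop-in replacement: rolling_slopes and its chunk parameter are hypothetical names, it needs NumPy 1.20 or later for sliding_window_view, and it assumes the count should be taken separately for each diameter (the loop above keeps adding to the same H across diameters, so its numbers will differ).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_slopes(x, y, r_test, window=10, chunk=1000):
    """Slope per window, stored at the index just after the window (np.inf elsewhere)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_test = np.asarray(r_test, dtype=float)
    log_r = np.log10(r_test)
    out = np.full(len(x), np.inf)
    xw = sliding_window_view(x, window)   # row k holds x[k : k + window]
    yw = sliding_window_view(y, window)
    n_win = len(x) - window               # window k feeds out[k + window]
    for start in range(0, n_win, chunk):
        stop = min(start + chunk, n_win)
        xs, ys = xw[start:stop], yw[start:stop]
        # pairwise distances inside each window: shape (m, window, window)
        dist = np.sqrt((xs[:, :, None] - xs[:, None, :]) ** 2
                       + (ys[:, :, None] - ys[:, None, :]) ** 2)
        # ordered pairs closer than each diameter: shape (m, len(r_test))
        counts = (dist[:, None, :, :] < r_test[None, :, None, None]).sum(axis=(2, 3))
        c = counts / window**2
        # np.polyfit accepts a 2D y and fits every column; row 0 holds the slopes
        out[start + window:stop + window] = np.polyfit(log_r, np.log10(c).T, 1)[0]
    return out
```

With df as defined earlier, a call could look like rolling_slopes(df['X'].to_numpy(), df['Y'].to_numpy(), r_test). The intermediate boolean array has chunk * len(r_test) * window**2 entries, so chunk would need to be much smaller for a window of 200.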
r_test is an empty array for my random data. Is that something that should be possible, or is the computation incorrect somehow?
This code is similar to my actual code, except that the data are generated randomly here and the window length is only 10; in my case it is 200.
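On the empty r_test: with the commented-out bounds, the lower bound comes out larger than the upper bound, so np.arange has nothing to return. A short check, assuming the coordinate range is 3999 (which is what np.random.randint(2000, 6000, ...) almost certainly gives over a million rows):

```python
import numpy as np

coord_range = 3999            # assumed: max - min of values drawn from 2000..5999
R = coord_range / 20          # about 199.95
d = 2
N = 10

r1 = 2 * R * (1 / N) ** (1 / d)   # about 126.5
r2 = R / (1 + d)                  # about 66.7
r_test = np.arange(r1, r2, 0.05)  # start is above stop, so the result is empty
print(r1, r2, r_test.size)
```

For these formulas r1 is below r2 only when 2*(1/N)**(1/d) is less than 1/(1+d), which with d = 2 requires N greater than 36.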