How to optimize a double for loop with condition on pandas dataframe?

Question

I have these two dataframes :

df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df
   Points  Axis1  Axis2  ClusterId
0       0      1      4          1
1       1      2      2          2
2       2      2      3          2
3       3      3      0          3

Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan

Neighbour
   Points  Closest  Distance
0       0      NaN       NaN
1       1      NaN       NaN
2       2      NaN       NaN
3       3      NaN       NaN

I would like that the Closest column contains the closest point which is NOT in the same cluster (ClusterId in df), based on the following distance function, applied to Axis1 and Axis2 :

def distance(x1,y1,x2,y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist

And I would like that the Distance column contains the distance between the point and its closest point.

The following script works but I think it is really not the best way to do in Python :

for i in range(len(Neighbour['Points'])): 
    bestD = -1 #best distance
    #bestP for best point
    for ii in range(len(Neighbour['Points'])): 
        if df.loc[i,"ClusterId"] != df.loc[ii,"ClusterId"]: #if not share the same cluster
            dist = distance(df.iloc[i,1],df.iloc[i,2],df.iloc[ii,1],df.iloc[ii,2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii,0]
    Neighbour.loc[i,'Closest'] = bestP
    Neighbour.loc[i,'Distance'] = bestD

Neighbour
   Points  Closest  Distance
0       0      2.0  1.414214
1       1      0.0  2.236068
2       2      0.0  1.414214
3       3      1.0  2.236068

Is there a more effective way to fill the Closest and Distance columns (especially, without the for loops)? It might be an appropriate occasion to use map and reduce but I don't really see how.

The real dataframe has currently around 1 000 rows, but it could have around 50 000 latter — Ewdlam
– Ewdlam, Commented Jan 23, 2020 at 11:37
Have you tried with docs.scipy.org/doc/scipy-0.14.0/reference/generated/… instead of your distance function? Also, you could use df.iterrows() for that iteration. That would clean your code quite a bit. — alec_djinn
– alec_djinn, Commented Jan 23, 2020 at 11:58
@alec_djinn no need to use iterrows because scipy does provide a function computing pairwise distance (euclidean, or another one) between array. See cdist or pdist — politinsa
– politinsa, Commented Jan 23, 2020 at 12:05
@politinsa Indeed, apply() would be perfect for this case. — alec_djinn
– alec_djinn, Commented Jan 23, 2020 at 12:06

politinsa · Accepted Answer · 2020-01-23 12:27:05Z

2

To compute the distance you can use scipy.spatial.distance.cdist on the underlying ndarray of your DataFrame. This might be faster that your double loop.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist

>>> distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3], 'euclidean')
>>> distance_matrix
array([[0.        , 2.23606798, 1.41421356, 4.47213595],
       [2.23606798, 0.        , 1.        , 2.23606798],
       [1.41421356, 1.        , 0.        , 3.16227766],
       [4.47213595, 2.23606798, 3.16227766, 0.        ]])
>>> np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
>>> distance_matrix
array([[       inf, 2.23606798, 1.41421356, 4.47213595],
       [2.23606798,        inf, 1.        , 2.23606798],
       [1.41421356, 1.        ,        inf, 3.16227766],
       [4.47213595, 2.23606798, 3.16227766,        inf]])

To speed things up you could also check the pdist function instead of the cdist, it's taking way less memory for when you'll have 50_000 rows.
There is also KDTree aiming to find the nearest neighbours of a point.

Then you can use np.argmin to get the closest distance, and check if the closest point is in the cluster, like this (I didn't try though):

for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i,"ClusterId"] == df.loc[index_min,"ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i,'Closest'] = index_min
    Neighbour.loc[i,'Distance'] = distance_matrix[i][index_min]

edited Jan 23, 2020 at 12:27

answered Jan 23, 2020 at 11:54

politinsa

3,7701 gold badge14 silver badges38 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Ewdlam Over a year ago

Good job ! I tested it and the result is clear : with 1000 rows, mine needs 71 seconds whereas yours needs only 0.6 second... For those who are interested, I used timeit package with start = timeit.default_timer(), stop = timeit.default_timer() and then stop - start

Ewdlam · Accepted Answer · 2020-01-23 13:47:11Z

To complete @politinsa answer, the following script allows to compare the performance of both methods :

from sklearn.datasets import make_moons
from sklearn.utils import check_random_state
import numpy as np
import timeit
import pandas as pd
from math import sqrt
from scipy.spatial.distance import cdist

def distance(x1,y1,x2,y2):
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist 

X,y = make_moons(n_samples=1000, noise=0.1)
W = list(range(1000))
rs = check_random_state(0)
Z = rs.randint(0, 10, size=(1000,))
df = pd.DataFrame(dict(Points=W, Axis1=X[:,0], Axis2=X[:,1],ClusterId=Z))
Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan

start = timeit.default_timer()

for i in range(len(Neighbour['Points'])): 
    bestD = -1 #best distance
    for ii in range(len(Neighbour['Points'])): 
        if df.loc[i,"ClusterId"] != df.loc[ii,"ClusterId"]: #if not share the same cluster
            dist = distance(df.iloc[i,1],df.iloc[i,2],df.iloc[ii,1],df.iloc[ii,2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii,0]
    Neighbour.loc[i,'Closest'] = int(bestP)
    Neighbour.loc[i,'Distance'] = bestD

stop = timeit.default_timer()
print('Time initial script: ', stop - start)

start = timeit.default_timer()

distance_matrix = cdist(df.values[:, 1:3], df.values[:, 1:3])
np.fill_diagonal(distance_matrix, np.inf) # set diagonal to inf so minimum isn't distance(x, x) = 0
for i in range(len(Neighbour['Points'])):
    same_cluster = True
    while same_cluster:
        index_min = np.argmin(distance_matrix[i])
        same_cluster = (df.loc[i,"ClusterId"] == df.loc[index_min,"ClusterId"])
        if same_cluster:
            distance_matrix[i][index_min] = np.inf
    Neighbour.loc[i,'Closest'] = index_min
    Neighbour.loc[i,'Distance'] = distance_matrix[i][index_min]
stop = timeit.default_timer()
print('Time @politinsa\'s script: ', stop - start)

Out (in seconds):

Time initial script:  70.62462342600003
Time @politinsa's script:  0.6489833670000235

baranbartu · Accepted Answer · 2020-01-23 11:28:18Z

0

You can create cartesian product first and and apply new column as distance accordingly using below distance function

def distance(row):
    x1 = row.Axis1_x
    y1 = row.Axis2_x
    x2 = row.Axis1_y
    y2 = row.Axis2_y
    dist = math.sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist


df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df['join_key'] = '12345'
df = df.merge(df, how='outer', on='join_key')
df['distance'] = df.apply(distance, axis=1)
df = df.drop(columns=['join_key'])

So you will see a cartesian df like below

from now on, you ll see each point to each point distance. I assume that the hardest part is this. Please let me know if it helps.

answered Jan 23, 2020 at 11:28

baranbartu

631 gold badge1 silver badge6 bronze badges

3 Comments

Ewdlam Over a year ago

Thanks @baranbartu. Interesting approach, I'm testing what we can do from it

Ewdlam Over a year ago

I'm afraid that this approach is not very effective with a large number of rows :/ These dataframes are just simple examples. But the real ones could have tens of thousands rows

baranbartu Over a year ago

You are definitely right but complexity is same O(n2), you just use pandas features instead of multiple for loops. Also, as far as i know, df runs distributed according to the CPU power. Basically, In this case, if you want to know closest distance, first you have to calculate all distances unfortunately.

Collectives™ on Stack Overflow

How to optimize a double for loop with condition on pandas dataframe?

3 Answers 3

1 Comment

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related