I have these two DataFrames:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Points':[0,1,2,3],'Axis1':[1,2,2,3], 'Axis2':[4,2,3,0],'ClusterId':[1,2,2,3]})
df
   Points  Axis1  Axis2  ClusterId
0       0      1      4          1
1       1      2      2          2
2       2      2      3          2
3       3      3      0          3
Neighbour = pd.DataFrame()
Neighbour['Points'] = df['Points']
Neighbour['Closest'] = np.nan
Neighbour['Distance'] = np.nan
Neighbour
   Points  Closest  Distance
0       0      NaN       NaN
1       1      NaN       NaN
2       2      NaN       NaN
3       3      NaN       NaN
I would like the Closest column to contain, for each point, the closest point that is NOT in the same cluster (ClusterId in df), using the following distance function applied to Axis1 and Axis2:
from math import sqrt

def distance(x1, y1, x2, y2):
    # Euclidean distance between (x1, y1) and (x2, y2)
    dist = sqrt((x1-x2)**2 + (y1-y2)**2)
    return dist
The Distance column should contain the distance between the point and that closest point.
The following script works, but I don't think it is the best way to do this in Python:
for i in range(len(Neighbour['Points'])):
    bestD = -1  # best distance found so far (-1 means none yet)
    # bestP holds the best point
    for ii in range(len(Neighbour['Points'])):
        if df.loc[i,"ClusterId"] != df.loc[ii,"ClusterId"]:  # only compare points in different clusters
            dist = distance(df.iloc[i,1],df.iloc[i,2],df.iloc[ii,1],df.iloc[ii,2])
            if dist < bestD or bestD == -1:
                bestD = dist
                bestP = Neighbour.iloc[ii,0]
    Neighbour.loc[i,'Closest'] = bestP
    Neighbour.loc[i,'Distance'] = bestD
Neighbour
   Points  Closest  Distance
0       0      2.0  1.414214
1       1      0.0  2.236068
2       2      0.0  1.414214
3       3      1.0  2.236068
Is there a more efficient way to fill the Closest and Distance columns (in particular, without the for loops)? This might be a good occasion to use map and reduce, but I don't really see how.
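For reference, one loop-free possibility is sketched below. It is not taken from the question: it assumes NumPy broadcasting is acceptable and that the full n x n distance matrix fits in memory. The names coords, clusters, dists and closest_pos are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Points': [0, 1, 2, 3], 'Axis1': [1, 2, 2, 3],
                   'Axis2': [4, 2, 3, 0], 'ClusterId': [1, 2, 2, 3]})

# Hypothetical helper arrays (names are illustrative only)
coords = df[['Axis1', 'Axis2']].to_numpy(dtype=float)
clusters = df['ClusterId'].to_numpy()

# Pairwise Euclidean distances via broadcasting, shape (n, n)
diff = coords[:, None, :] - coords[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))

# Exclude pairs in the same cluster (this also excludes each point itself)
dists[clusters[:, None] == clusters[None, :]] = np.inf

# Position of the nearest allowed point in each row; ties go to the first one
closest_pos = dists.argmin(axis=1)

Neighbour = pd.DataFrame({
    'Points': df['Points'].to_numpy(),
    'Closest': df['Points'].to_numpy()[closest_pos],
    'Distance': dists[np.arange(len(df)), closest_pos],
})
print(Neighbour)
```

On the sample data this should give the same Closest and Distance values as the loop, except that Closest comes out as integers rather than floats. If every point were in the same cluster, each row would be all inf, so Distance would be inf and Closest would be meaningless; that case would need separate handling.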

df.iterrows() for that iteration. That would clean your code quite a bit.
cdist or pdist
apply() would be perfect for this case.