Pandas: vectorization with function on two dataframes

Question

I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.

Let's say I've got two pandas dataframes.

Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.

>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1=pd.DataFrame(data=data1)
>>>
>>> df_1
   ID  x   y   R
   1   1   1   4
   2   10  10  5

Dataframe two describes the x,y coordinates of some points, also with unique IDs.

>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2=pd.DataFrame(data=data2)
>>>
>>> df_2
   ID  x  y
   3   1  2
   4   3  5
   5   9  9

Now, imagine plotting the circles and the points on a 2D plane. Some of the points will reside inside the circles. See the image below.

All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If the particle does not reside in a circle, the value should be "None".

My desired output would be

>>> df_2
   ID  x  y   host_circle
   3   1  2   1 
   4   3  5   None 
   5   9  9   2

First, define a function that checks if a given particle (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.

>>> def func(x1,y1,R1,ID_1,x2,y2):
...     dist = np.sqrt( (x1-x2)**2 + (y1-y2)**2 )
...     if dist < R1:
...         return ID_1
...     else:
...         return None

Next, the actual vectorization. I'm sorta lost here. I think it should be something like

df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])

but that just throws errors. Can someone help me?

One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.

Can you post your desired output? What you have now is not vectorized.
– user3483203, Commented Aug 24, 2018 at 3:54
@user3483203 With the approach I used, the last matching circle is the one assigned as host. I could change that by reversing the arrays when I assign. If we wanted the closest circle instead, I'd have to sort along an axis, track the argsort positions, and unwind them.
– piRSquared, Commented Aug 24, 2018 at 4:31

piRSquared · Accepted Answer · 2018-08-24 05:13:48Z

5

Numba v1

You might have to install numba with

pip install numba

Then use Numba's JIT compiler via the njit function decorator

import numpy as np
from numba import njit

@njit
def distances(point, points):
  # Euclidean distance from one point to every row of `points`
  return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
  points = circles[:, :2]
  radii = circles[:, 2]
  dist = distances(point, points)
  mask = dist < radii
  # argmax gives the position of the first True; if there is none, mask[i] is False
  i = mask.argmax()
  return i if mask[i] else -1

@njit
def find_my_circles(points, circles):
  n = len(points)
  out = np.zeros(n, np.int64)
  for i in range(n):
    out[i] = find_my_circle(points[i], circles)
  return out

# Trailing NaN so that position -1 (no host) maps to NaN
ids = np.append(df_1.ID.values, np.nan)

points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]

df_2

   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0

This iterates row by row: for one point at a time, it tries to find the host circle. The search over the circles for each point is still vectorized, and the compiled loop over the points should be very fast. The massive benefit is that you never build the full points-by-circles array, so you don't occupy tons of memory.
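The final lookup `ids[i]` in the code above relies on a small trick that is easy to miss: a NaN is appended to the ID array, so the "not found" position -1 indexes that last element. A minimal sketch of just that step, with the positions written out by hand instead of computed by `find_my_circles`:

```python
import numpy as np

# Circle IDs from the question, plus a trailing NaN meaning "no host".
# Appending NaN turns the integer IDs into floats.
ids = np.append(np.array([1, 2]), np.nan)

# Hand-written positions of the kind find_my_circles returns:
# 0 and 1 are circle positions, -1 means no circle matched.
i = np.array([0, -1, 1])

# Negative indexing makes -1 select the last element, i.e. the NaN.
host = ids[i]
print(host)
```

This is also why `host_circle` shows up as 1.0 and 2.0 rather than 1 and 2: a NumPy integer array cannot hold NaN, so the IDs become floats.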

Numba v2

This one is more loopy, but it short-circuits as soon as it finds a host

import numpy as np
from numba import njit

@njit
def distance(a, b):
  # Euclidean distance between two points
  return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
  n = len(points)
  m = len(circles)

  # -1 marks "no host found"
  out = -np.ones(n, np.int64)

  centers = circles[:, :2]
  radii = circles[:, 2]

  for i in range(n):
    for j in range(m):
      if distance(points[i], centers[j]) < radii[j]:
        out[i] = j
        # Stop at the first circle that contains the point
        break

  return out

# Trailing NaN so that position -1 (no host) maps to NaN
ids = np.append(df_1.ID.values, np.nan)

points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]

df_2
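To follow the short-circuit logic without installing Numba, here is a rough plain-Python sketch of the same double loop. It is not the code above: the function name `find_first_host` is made up for illustration, and it compares squared distances against squared radii (equivalent for non-negative radii) so no square root is needed. Without compilation it will be slow on large inputs; it is only meant for checking the logic on small data.

```python
import numpy as np

def find_first_host(points, circles):
    """Per point, the position of the first circle containing it, or -1."""
    centers = circles[:, :2]
    radii_sq = circles[:, 2] ** 2
    out = np.full(len(points), -1, dtype=np.int64)
    for i, p in enumerate(points):
        for j, c in enumerate(centers):
            dx, dy = p - c
            # dist < R is equivalent to dist**2 < R**2 when R >= 0
            if dx * dx + dy * dy < radii_sq[j]:
                out[i] = j
                break  # stop at the first circle that contains the point
    return out

# Data from the question: columns are x, y, R for circles and x, y for points
circles = np.array([[1, 1, 4], [10, 10, 5]])
points = np.array([[1, 2], [3, 5], [9, 9]])

pos = find_first_host(points, circles)
print(pos)  # [ 0 -1  1]
```

Because of the `break`, a point that lies in several circles gets the one that comes first in `circles`.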

Vectorized

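But still problematic: the broadcast below builds an array with one entry per point-circle pair, which runs out of memory on large inputs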

c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values

i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)

df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values

df_2

   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
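One possible middle ground, sketched here as an assumption rather than as part of the original answer, is to keep the broadcasting but apply it to one chunk of points at a time, so only a chunk-by-circles array exists at any moment. The function name `host_positions_chunked` and the `chunk_size` parameter are hypothetical; the temporary array holds roughly `chunk_size` × number of circles × 2 values, so `chunk_size` has to be chosen to fit in memory. This version picks the first matching circle for each point.

```python
import numpy as np

def host_positions_chunked(points, centers, radii, chunk_size=1024):
    """For each point, position of the first circle containing it, or -1."""
    out = np.full(len(points), -1, dtype=np.int64)
    for start in range(0, len(points), chunk_size):
        block = points[start:start + chunk_size]
        # (k, 1, 2) - (m, 2) broadcasts to (k, m, 2) for this chunk only
        dist = np.sqrt(((block[:, None] - centers) ** 2).sum(axis=2))
        mask = dist < radii                 # shape (k, m)
        first = mask.argmax(axis=1)         # first True per row (0 if none)
        found = mask.any(axis=1)            # rows with at least one True
        out[start:start + len(block)] = np.where(found, first, -1)
    return out

# Data from the question
centers = np.array([[1, 1], [10, 10]])
radii = np.array([4, 5])
circle_ids = np.array([1, 2])
points = np.array([[1, 2], [3, 5], [9, 9]])

# chunk_size=2 forces two chunks on this tiny example
pos = host_positions_chunked(points, centers, radii, chunk_size=2)
host = np.append(circle_ids, np.nan)[pos]   # -1 picks the trailing NaN
print(pos)
print(host)
```

With the DataFrames from the question, the inputs would be `df_2[['x', 'y']].values`, `df_1[['x', 'y']].values` and `df_1['R'].values`, and `host` can be assigned to `df_2['host_circle']`.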

Explanation

The distance from any point to the center of a circle is

((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5

I can use broadcasting if I extend one of my arrays into a third dimension

points[:, None] - centers

array([[[ 0,  1],
        [-9, -8]],

       [[ 2,  4],
        [-7, -5]],

       [[ 8,  8],
        [-1, -1]]])

That is all six combinations of vector differences. Now to calculate the distances.

((points[:, None] - centers) ** 2).sum(2) ** .5

array([[ 1.        , 12.04159458],
       [ 4.47213595,  8.60232527],
       [11.3137085 ,  1.41421356]])

That's all six combinations of distances, and I can compare them against the radii to see which points are within which circles

((points[:, None] - centers) ** 2).sum(2) ** .5 < radii

array([[ True, False],
       [False, False],
       [False,  True]])
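If the broadcasting in the three steps above feels opaque, it can help to check the array shapes at each stage. A small self-contained sketch using the data from the question:

```python
import numpy as np

points = np.array([[1, 2], [3, 5], [9, 9]])    # 3 points
centers = np.array([[1, 1], [10, 10]])         # 2 circle centers
radii = np.array([4, 5])

diff = points[:, None] - centers               # (3, 1, 2) - (2, 2)
dist = (diff ** 2).sum(2) ** .5                # sum over the x/y axis
mask = dist < radii                            # (3, 2) compared with (2,)

print(points[:, None].shape)  # (3, 1, 2)
print(diff.shape)             # (3, 2, 2)
print(dist.shape)             # (3, 2)
print(mask.shape)             # (3, 2)
```

In `dist` and `mask`, the row is the point and the column is the circle, so `mask[p, c]` says whether point `p` lies inside circle `c`.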

Ok, I want to find where the True values are. That is a perfect use case for np.where. It gives me two arrays: the first holds the row positions and the second the column positions of the True values. The row positions are the points and the column positions are the circles.

i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
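To make the two returned arrays concrete, here is the same call applied to the boolean array shown earlier, typed in by hand:

```python
import numpy as np

mask = np.array([[True, False],
                 [False, False],
                 [False, True]])

i, j = np.where(mask)
print(i)  # [0 2] -> the points at positions 0 and 2 have a host
print(j)  # [0 1] -> their hosts are the circles at positions 0 and 1
```

The point at position 1 has no True in its row, so it never appears in `i` and its `host_circle` stays NaN after the assignment.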

Now I just have to select the rows of df_2 at positions i and assign them the IDs taken from df_1 at positions j. That is the .loc assignment I showed above.

edited Aug 24, 2018 at 5:13

answered Aug 24, 2018 at 4:14

piRSquared


10 Comments

ALollz Over a year ago

Seems appropriate you'd answer the circle question :D

piRSquared Over a year ago

I've got this area covered (-:

Programmer Over a year ago

This works for this small dataset. However, the actual dataset I'm working with is extremely large; df_1 is 200,000 rows and df_2 is 75 million rows. I run into memory errors on "np.where"

user3483203 Over a year ago

@Programmer it's probably the broadcasting more than np.where

piRSquared Over a year ago

I'm not surprised. However, you said vectorized and that is what this is. Turns out, you want something that works and won't break your machine but also finishes sometime this year. That I can give you, but it will not be vectorized. Give me a few minutes. I'm supposed to be going to sleep, but...


Madh · Accepted Answer · 2018-08-24 06:23:47Z

0

Try this. I have modified your function a bit: it returns a list, on the assumption that several circles may contain the same point. You can modify it if that's not the case. The list is empty when the particle does not reside in any circle.

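import numpy as np

def func(df, x2,y2):
    # Boolean Series: True for each circle (row of df) that contains the point (x2, y2)
    val = df.apply(lambda row: np.sqrt((row['x']-x2)**2 + (row['y']-y2)**2) < row['R'], axis=1)
    # Index labels of the matching rows of df (not the ID column); empty list if none match
    return list(val.index[val])

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'],row['y']), axis=1)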

answered Aug 24, 2018 at 6:23

Madh
