
I have written a UDF in PySpark and use it like below:

df1 = df.where(point_inside_polygon(latitude, longitude, polygonArr))

df1 and df are Spark DataFrames.

The function is given below:

import math
import shapely as sh
from shapely.geometry import MultiPoint

def point_inside_polygon(x, y, poly):
    latt = float(x)
    long = float(y)
    if not (math.isnan(latt) or math.isnan(long)):
        point = sh.geometry.Point(latt, long)
        polygon = MultiPoint(poly).convex_hull
        return polygon.contains(point)
    else:
        return False

But when I checked the data type of latitude and longitude inside the function, each one is a Column object, not a float.

Is there a way to go through each row and use the actual values, instead of receiving Column objects? I don't want to use a for loop, because I have a huge record set and that would defeat the purpose of using Spark.

Is there a way to pass the column values as floats, or to convert them inside the function?

1 Answer


Wrap it using udf:

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

point_inside_polygon_ = udf(point_inside_polygon, BooleanType())
df1 = df.where(point_inside_polygon_(latitude, longitude, polygonArr))
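
For context, here is a minimal end-to-end sketch of the same approach. The column names latitude/longitude, the sample rows, and the sample polygon are assumptions for illustration, not from the original question; substitute your own. One common pattern, used here, is to capture the polygon in a closure so that only the two coordinate columns are passed to the UDF.

import math
import shapely as sh
from shapely.geometry import MultiPoint
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Assumed sample data; replace with your own DataFrame and polygon.
# Note: shapely must be installed on the executors for the UDF to run.
df = spark.createDataFrame(
    [(10.0, 20.0), (50.0, 60.0), (float("nan"), 20.0)],
    ["latitude", "longitude"],
)
polygonArr = [(9.0, 19.0), (11.0, 19.0), (11.0, 21.0), (9.0, 21.0)]

def point_inside_polygon(x, y, poly):
    latt = float(x)
    long = float(y)
    if not (math.isnan(latt) or math.isnan(long)):
        point = sh.geometry.Point(latt, long)
        polygon = MultiPoint(poly).convex_hull
        return polygon.contains(point)
    return False

# Bind the polygon via a closure (functools.partial works too), then wrap
# the remaining two-argument function as a UDF returning a boolean.
point_inside_polygon_ = udf(
    lambda x, y: point_inside_polygon(x, y, polygonArr), BooleanType()
)

# The wrapped function receives plain Python floats per row, so the
# float()/isnan() calls inside it work as intended.
df1 = df.where(point_inside_polygon_(col("latitude"), col("longitude")))
df1.show()

If the polygon varies per row, it would instead have to be passed as a column (for example an array column) rather than captured as a plain Python object.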

1 Comment

I haven't done this before, so just a small doubt: in the second line, should it use the new function or the old one? df1 = df.where(point_inside_polygon(args)) or df1 = df.where(point_inside_polygon_(args))?
