
I have a PySpark df which has many columns, but a subset looks like this:

datetime   eventid   sessionid   lat       lon       filtertype
someval    someval   someval     someval   someval   someval
someval    someval   someval     someval   someval   someval

I want to map a function some_func(), which only makes use of the columns 'lat', 'lon' and 'event_id', to return a Boolean value that would be added to the df as a separate column named 'verified'. Basically I need to retrieve the columns of interest inside the function separately and do my operations on them. I know I can use UDFs or df.withColumn(), but those map to a single column; for that I would need to concatenate the columns of interest into one column, which would make the code a bit messy.

Is there a way to retrieve the column values inside the function separately and map that function over the entire dataframe (similar to what we can do with a Pandas df using map/lambda and df.apply())?
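For reference, this is roughly what I have in mind on the Pandas side (some_func and the sample values are just placeholders):

import pandas as pd

def some_func(lat, lon, event_id):
    # placeholder logic: return a Boolean per row
    return lat is not None and lon is not None and event_id is not None

pdf = pd.DataFrame({
    "lat": [12.9, 48.2],
    "lon": [77.6, 16.4],
    "event_id": ["e1", "e2"],
})

# row-wise apply, pulling out the columns of interest inside the function
pdf["verified"] = pdf.apply(
    lambda row: some_func(row["lat"], row["lon"], row["event_id"]), axis=1
)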


1 Answer


You can create a UDF that takes multiple columns as parameters.

ex:

import pyspark.sql.functions as f
from pyspark.sql.types import BooleanType

def your_function(p1, p2, p3):
    # your logic goes here; return a bool
    return True  # placeholder: replace with your actual check

udf_func = f.udf(your_function, BooleanType())


df = spark.read.....

df2 = df.withColumn("verified", udf_func(f.col("lat"), f.col("lon"), f.col("event_id")))

df2.show(truncate=False)
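For instance, if the verification were just a sanity check on the coordinates (the rule and the name is_verified below are only illustrative placeholders, not your actual logic), the same pattern would look like:

# illustrative example: verify lat/lon are in valid ranges and event_id is present
def is_verified(lat, lon, event_id):
    if lat is None or lon is None or event_id is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

verify_udf = f.udf(is_verified, BooleanType())

df2 = df.withColumn("verified", verify_udf(f.col("lat"), f.col("lon"), f.col("event_id")))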