
I have a PySpark df which has many columns, but a subset looks like this:

datetime   eventid   sessionid   lat       lon       filtertype
someval    someval   someval     someval   someval   someval
someval    someval   someval     someval   someval   someval

I want to map a function some_func(), which only makes use of the columns 'lat', 'lon' and 'event_id', to return a Boolean value that would be added to the df as a separate column named 'verified'. Basically I need to retrieve the columns of interest inside the function separately and do my operations on them. I know I can use UDFs or df.withColumn(), but those map to a single column; for that I would need to concatenate the columns of interest into one column, which would make the code a bit messy.

Is there a way to retrieve the column values inside the function separately and map that function over the entire dataframe (similar to what we can do with a Pandas df using map/lambda and df.apply())?
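For reference, this is roughly what I have in mind on the Pandas side (some_func and the sample values are just placeholders):

import pandas as pd

def some_func(lat, lon, event_id):
    # placeholder logic: return a Boolean per row
    return lat is not None and lon is not None and event_id is not None

pdf = pd.DataFrame({
    "lat": [12.9, 48.2],
    "lon": [77.6, 16.4],
    "event_id": ["e1", "e2"],
})

# row-wise apply, pulling out the columns of interest inside the function
pdf["verified"] = pdf.apply(
    lambda row: some_func(row["lat"], row["lon"], row["event_id"]), axis=1
)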


1 Answer


You can create a UDF that takes multiple columns as parameters.

ex:

import pyspark.sql.functions as f
from pyspark.sql.types import BooleanType

def your_function(p1, p2, p3):
    # your logic goes here; return a bool
    return True  # placeholder: replace with your actual check

udf_func = f.udf(your_function, BooleanType())


df = spark.read.....

df2 = df.withColumn("verified", udf_func(f.col("lat"), f.col("lon"), f.col("event_id")))

df2.show(truncate=False)
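For instance, if the verification were just a sanity check on the coordinates (the rule and the name is_verified below are only illustrative placeholders, not your actual logic), the same pattern would look like:

# illustrative example: verify lat/lon are in valid ranges and event_id is present
def is_verified(lat, lon, event_id):
    if lat is None or lon is None or event_id is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

verify_udf = f.udf(is_verified, BooleanType())

df2 = df.withColumn("verified", verify_udf(f.col("lat"), f.col("lon"), f.col("event_id")))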