I have a function in PySpark:
def sum(a, b):
    c = a + b
    return c
It has to be run on each record of a very large dataframe using Spark SQL:
x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])
But this runs it only for the first record of the df, not for all rows. I understand it could be done using a lambda, but I am not able to code it in the way I need.
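For the sum analogy, the closest I can get is a udf-based sketch like the one below (sum_udf is just a name I made up here, and I am not sure this is the right direction for my real case):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the plain Python function so Spark can apply it to every row
sum_udf = udf(sum, IntegerType())

# compute a VALUE column from NUM1 and NUM2 for each row
df = df.withColumn("VALUE", sum_udf(df["NUM1"], df["NUM2"]))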
In my real case, though, c would be a dataframe: the function does a lot of spark.sql work and returns it, and I would have to call that function for each row.
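Since, as far as I know, a udf cannot call spark.sql or return a dataframe (the SparkSession is not available on the executors), the only pattern I can think of for the real case is to collect the rows and loop on the driver. A rough sketch, where real_func is just a placeholder for my actual function:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder for the real function, which runs many spark.sql queries
# and returns a dataframe
def real_func(a, b):
    return spark.sql("SELECT {0} + {1} AS VALUE".format(a, b))

# collect() brings every row to the driver; the function is then called
# once per row, so the per-row work is not distributed
results = [real_func(row["NUM1"], row["NUM2"]) for row in df.collect()]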
I guess I can work it out from this sum(a, b) analogy. Given this input dataframe:
+----------+----------+-----------+
|      NUM1|      NUM2|        XYZ|
+----------+----------+-----------+
|        10|        20|      HELLO|
|        90|        60|      WORLD|
|        50|        45|      SPARK|
+----------+----------+-----------+
The expected output is:

+----------+----------+-----------+------+
|      NUM1|      NUM2|        XYZ| VALUE|
+----------+----------+-----------+------+
|        10|        20|      HELLO|    30|
|        90|        60|      WORLD|   150|
|        50|        45|      SPARK|    95|
+----------+----------+-----------+------+
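I realise that for the analogy alone a plain column expression would already produce VALUE, e.g.:

df = df.withColumn("VALUE", df["NUM1"] + df["NUM2"])

but my real function cannot be written as a column expression, which is why I need a way to call a Python function for each row.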
Python: 3.7.4
Spark: 2.2