
There is a function in my PySpark code:

def sum(a, b):
    c = a + b
    return c

It has to be run on each record of a very large dataframe using Spark SQL:

x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])

But this would run it only for the first record of the df and not for all rows. I understand it could be done using a lambda, but I am not able to code it in the desired way.

In reality, c would be a dataframe, and the function would be doing a lot of spark.sql stuff and returning it. I would have to call that function for each row. I guess I will try to pick it up using this sum(a,b) as an analogy.

Input:

+------+------+-------+
| NUM1 | NUM2 |  XYZ  |
+------+------+-------+
|   10 |   20 | HELLO |
|   90 |   60 | WORLD |
|   50 |   45 | SPARK |
+------+------+-------+

Expected output:

+------+------+-------+-------+
| NUM1 | NUM2 |  XYZ  | VALUE |
+------+------+-------+-------+
|   10 |   20 | HELLO |    30 |
|   90 |   60 | WORLD |   150 |
|   50 |   45 | SPARK |    95 |
+------+------+-------+-------+

Python: 3.7.4
Spark: 2.2

3 Answers


You can use the .withColumn function:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

df.show()
+----+----+-----+
|NUM1|NUM2|  XYZ|
+----+----+-----+
|  10|  20|HELLO|
|  90|  60|WORLD|
|  50|  45|SPARK|
+----+----+-----+

def mysum(a, b):
    return a + b

# Register the function so it can also be called by name from Spark SQL
spark.udf.register("mysumudf", mysum, LongType())

# For a simple +, calling the plain function on Column objects builds a
# native column expression, so this works without going through the UDF
df2 = df.withColumn("VALUE", mysum(col("NUM1"), col("NUM2")))

df2.show()
+----+----+-----+-----+
|NUM1|NUM2|  XYZ|VALUE|
+----+----+-----+-----+
|  10|  20|HELLO|   30|
|  90|  60|WORLD|  150|
|  50|  45|SPARK|   95|
+----+----+-----+-----+
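
As a side note, for a simple addition like this the UDF machinery is not strictly needed; built-in Column arithmetic gives the same result and avoids the Python UDF overhead. A minimal sketch using the same df:

from pyspark.sql.functions import col

# Native Column arithmetic; Spark evaluates this without calling into Python
df2 = df.withColumn("VALUE", col("NUM1") + col("NUM2"))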

4 Comments

Solution in Python please. $ is not there in Python and this is creating trouble for me.
I will try sum(df['NUM1'], df['NUM2']). In reality, I have to call this function in a select statement itself. That should not be a problem; it would be select sum(df['NUM1'], df['NUM2']) from abc. That should work, I guess.
In a select statement it throws an error. df['NUM1'] is printed as Column<b'NUM1'>, which is wrong. I want this to be the actual value of the column, and the same needs to be applied to all the records of the df on which the select is running.
Yeah, that's exactly what the withColumn function is for: it applies a function to every row on the selected columns. I corrected the use of $; see the sketch below for calling it from a select.
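
To illustrate the point from this comment thread: once the function is registered (mysumudf above), it can be called by name inside a select or a plain SQL statement. A minimal sketch, assuming the df and the registration from this answer (abc is the temp view name from the comments):

from pyspark.sql.functions import expr

# Call the registered UDF by name inside a select
df.select("NUM1", "NUM2", "XYZ", expr("mysumudf(NUM1, NUM2) AS VALUE")).show()

# Or register the dataframe as a temp view and use spark.sql
df.createOrReplaceTempView("abc")
spark.sql("SELECT NUM1, NUM2, XYZ, mysumudf(NUM1, NUM2) AS VALUE FROM abc").show()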

We can do it in the ways below. Note that when registering a UDF, the third argument, the return type, is not mandatory.

from pyspark.sql import functions as F

df1 = spark.createDataFrame([(10, 20, 'HELLO'), (90, 60, 'WORLD'), (50, 45, 'SPARK')], ['NUM1', 'NUM2', 'XYZ'])
df1.show()

# Built-in SQL expression, no UDF needed
df2 = df1.withColumn('VALUE', F.expr('NUM1 + NUM2'))
df2.show(3, False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+


(or)

def sum(c1, c2):
    return c1 + c2

# Register without a return type; the result column defaults to StringType
spark.udf.register('sum_udf1', sum)
df2 = df1.withColumn('VALUE', F.expr("sum_udf1(NUM1, NUM2)"))
df2.show(3, False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+
(or)

# A lambda UDF works the same way
sum_udf2 = F.udf(lambda x, y: x + y)
df2 = df1.withColumn('VALUE', sum_udf2('NUM1', 'NUM2'))
df2.show(3, False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ  |VALUE|
+----+----+-----+-----+
|10  |20  |HELLO|30   |
|90  |60  |WORLD|150  |
|50  |45  |SPARK|95   |
+----+----+-----+-----+
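
One caveat worth checking: when no return type is declared, the resulting column is StringType, so declare a numeric type if you need one. A minimal sketch continuing from df1 and df2 above (sum_udf3 is an illustrative name):

from pyspark.sql.types import LongType

# With no declared return type, VALUE comes back as a string
df2.select('VALUE').printSchema()

# Declare LongType to get a numeric column instead
sum_udf3 = F.udf(lambda x, y: x + y, LongType())
df1.withColumn('VALUE', sum_udf3('NUM1', 'NUM2')).select('VALUE').printSchema()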



Use this simple approach:

1. Import pyspark.sql functions

from pyspark.sql import functions as F

2. Use the F.expr() function

df = df.withColumn("VALUE", F.expr("NUM1 + NUM2"))

