I have a function in PySpark:
def sum(a, b):
    c = a + b
    return c
It has to be run on each record of a very large dataframe using Spark SQL:
x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])
But this runs it only for the first record of the df, not for all rows. I understand it could be done using a lambda, but I am not able to code it in the way I need.
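For the sum analogy, the closest I can get is a udf-based sketch like the one below (sum_udf is just a name I made up here, and I am not sure this is the right direction for my real case):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the plain Python function so Spark can apply it to every row
sum_udf = udf(sum, IntegerType())

# compute a VALUE column from NUM1 and NUM2 for each row
df = df.withColumn("VALUE", sum_udf(df["NUM1"], df["NUM2"]))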
In my real case, though, c would be a dataframe: the function does a lot of spark.sql work and returns it, and I would have to call that function for each row.
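Since, as far as I know, a udf cannot call spark.sql or return a dataframe (the SparkSession is not available on the executors), the only pattern I can think of for the real case is to collect the rows and loop on the driver. A rough sketch, where real_func is just a placeholder for my actual function:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder for the real function, which runs many spark.sql queries
# and returns a dataframe
def real_func(a, b):
    return spark.sql("SELECT {0} + {1} AS VALUE".format(a, b))

# collect() brings every row to the driver; the function is then called
# once per row, so the per-row work is not distributed
results = [real_func(row["NUM1"], row["NUM2"]) for row in df.collect()]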
I guess I can work it out from this sum(a, b) analogy. Given this input dataframe:
+----------+----------+-----------+
|      NUM1|      NUM2|        XYZ|
+----------+----------+-----------+
|        10|        20|      HELLO|
|        90|        60|      WORLD|
|        50|        45|      SPARK|
+----------+----------+-----------+
The expected output is:

+----------+----------+-----------+------+
|      NUM1|      NUM2|        XYZ| VALUE|
+----------+----------+-----------+------+
|        10|        20|      HELLO|    30|
|        90|        60|      WORLD|   150|
|        50|        45|      SPARK|    95|
+----------+----------+-----------+------+
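I realise that for the analogy alone a plain column expression would already produce VALUE, e.g.:

df = df.withColumn("VALUE", df["NUM1"] + df["NUM2"])

but my real function cannot be written as a column expression, which is why I need a way to call a Python function for each row.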
Python: 3.7.4
Spark: 2.2