I am trying to deploy a simple if-else function specifically using pandas_udf. Here is the code:
    from pyspark.sql import *
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    import pandas as pd

    @pandas_udf("string", PandasUDFType.SCALAR)
    def seq_sum1(col1, col2):
        if col1 + col2 <= 6:
            v = "low"
        elif ((col1 + col2 > 6) & (col1 + col2 <= 10)):
            v = "medium"
        else:
            v = "High"
        return (v)

    # Deploy
    df.select("*", seq_sum1('c1', 'c2').alias('new_col')).show(10)
This results in an error:
PythonException: An exception was thrown from a UDF: 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', from <command-1220380192863042>, line 13. Full traceback below:
If I run the same code with @udf instead of @pandas_udf, it produces the expected results. With pandas_udf, however, it fails with the error above.
I know this kind of functionality can be achieved by other means in Spark (case when, etc.), so the point here is that I want to understand how pandas_udf works with this kind of logic.
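For reference, here is a minimal sketch that reproduces the same error in plain pandas, without Spark. It relies on the fact that a scalar pandas_udf receives each column as a whole pandas Series (one batch at a time) rather than one value per row, which is why the `if` fails. The sample values and the name `label_sum` are made up for illustration, and using `numpy.select` for the vectorized branch logic is just one possible approach:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the c1 and c2 columns.
c1 = pd.Series([1, 4, 8])
c2 = pd.Series([2, 5, 9])

# The comparison yields a boolean Series, and `if` cannot reduce
# a multi-element Series to a single True/False.
error_message = None
try:
    if c1 + c2 <= 6:
        pass
except ValueError as exc:
    error_message = str(exc)
print(error_message)

def label_sum(col1: pd.Series, col2: pd.Series) -> pd.Series:
    """Vectorized version of the if/elif/else logic (hypothetical helper)."""
    total = col1 + col2
    # Conditions are checked in order; the first match wins for each row.
    conditions = [(total <= 6).to_numpy(), (total <= 10).to_numpy()]
    labels = np.select(conditions, ["low", "medium"], default="High")
    return pd.Series(labels, index=total.index)

result = label_sum(c1, c2)
print(result.tolist())
```

Presumably a function shaped like `label_sum` (Series in, Series of the same length out) is what the pandas_udf decorator expects, though this sketch only exercises it with plain pandas and not inside Spark.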
Thanks