Spark exception error using pandas_udf with logical statement

Question

I am trying to deploy a simple if-else function specifically using pandas_udf. Here is the code:

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

@pandas_udf("string", PandasUDFType.SCALAR )
def seq_sum1(col1,col2):
  if col1 + col2 <= 6:
    v = "low"
  elif ((col1 + col2 > 6) & (col1 + col2 <=10)) :
    v = "medium"
  else:
    v = "High"
  return (v)

# Deploy 
df.select("*",seq_sum1('c1','c2').alias('new_col')).show(10)

This results in an error:

PythonException: An exception was thrown from a UDF: 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', from <command-1220380192863042>, line 13. Full traceback below:

If I deploy the same code using @udf instead of @pandas_udf, it produces the expected results. However, pandas_udf doesn't seem to work.

I know that this kind of functionality can be achieved through other means in Spark (case when, etc.), so the point here is that I want to understand how pandas_udf works when dealing with this kind of logic.

Thanks

mck · Accepted Answer · 2021-01-13 06:58:33Z

1

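A scalar pandas UDF receives each column as a pandas Series (one batch of rows) and must return a pandas Series of the same length; it does not take and return single values the way a regular @udf does. In your function, `if col1 + col2 <= 6` is evaluated on a whole Series, which has no single truth value, hence the ValueError. Use vectorized operations instead: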

import pandas as pd
import numpy as np
import pyspark.sql.functions as F

@F.pandas_udf("string", F.PandasUDFType.SCALAR)
def seq_sum1(col1, col2):
    # col1 and col2 are pandas Series; np.where evaluates each condition element-wise
    return pd.Series(
        np.where(
            col1 + col2 <= 6, "low",
            np.where(
                (col1 + col2 > 6) & (col1 + col2 <= 10), "medium",
                "high"
            )
        )
    )

df.select("*", seq_sum1('c1','c2').alias('new_col')).show()
+---+---+-------+
| c1| c2|new_col|
+---+---+-------+
|  1|  2|    low|
|  3|  4| medium|
|  5|  6|   high|
+---+---+-------+
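
To see where the original ValueError comes from, the behaviour can be reproduced in plain pandas, without Spark. This is only a minimal sketch: `classify` is a hypothetical helper name, the sample values are made up to match the output above, and `np.select` is used here as an alternative to nested `np.where` (it picks the first matching condition per element).

```python
import numpy as np
import pandas as pd

# Sample batches standing in for what Spark passes to a scalar pandas UDF.
c1 = pd.Series([1, 3, 5])
c2 = pd.Series([2, 4, 6])

# Using a Series in an `if` raises the error from the question,
# because a multi-element Series has no single truth value.
try:
    if c1 + c2 <= 6:
        pass
    raised = False
except ValueError:
    raised = True

def classify(col1: pd.Series, col2: pd.Series) -> pd.Series:
    """Hypothetical helper: vectorized version of the if/elif/else logic."""
    total = col1 + col2
    # Conditions are checked in order, so the second one only applies when total > 6.
    labels = np.select([total <= 6, total <= 10], ["low", "medium"], default="high")
    return pd.Series(labels, index=total.index)

result = classify(c1, c2)
```

The body of `classify` could be dropped into the pandas UDF unchanged, since it takes Series in and returns a Series of the same length.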


1 Comment

Fahadakbar Over a year ago

Thank you! It opened my eyes. I actually ended up using the map function to make it work; I will post my code.

Fahadakbar · Accepted Answer · 2021-01-13 19:08:08Z

0

@mck provided the insight, and I ended up using the map function to solve it.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("string", PandasUDFType.SCALAR)
def seq_sum(col1):

  # actual function/calculation goes here; it sees one scalar value at a time
  def main(x):
    if x < 6:
      v = "low"
    else:
      v = "high"
    return v

  # apply main to each element and return the result as a pandas Series
  result = pd.Series(map(main, col1))

  return result

df.select("*", seq_sum('column_name').alias('new_col')).show(10)
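
The map-based version calls the Python function once per element, which keeps an ordinary if/else usable inside a pandas UDF but gives up the vectorized evaluation. As a rough sketch in plain pandas (no Spark; `label_elementwise` and `label_vectorized` are hypothetical names and the sample batch is invented), both styles return the same labels:

```python
import numpy as np
import pandas as pd

def main(x):
    # Same scalar logic as in the answer above.
    return "low" if x < 6 else "high"

def label_elementwise(col: pd.Series) -> pd.Series:
    # Series.map applies main to each element and preserves the index.
    return col.map(main)

def label_vectorized(col: pd.Series) -> pd.Series:
    # np.where evaluates the condition on the whole batch at once.
    return pd.Series(np.where(col < 6, "low", "high"), index=col.index)

batch = pd.Series([1, 6, 10])
elementwise = label_elementwise(batch)
vectorized = label_vectorized(batch)
```

Which one to prefer probably depends on how complex the per-row logic is; the element-wise form is easier to write, while the vectorized form avoids a Python call per row.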
