I want to parse the date columns in a DataFrame, and for each date column the resolution may differ (e.g. 2011/01/10 => 2011/01 if the resolution is set to "Month").

I wrote the following code:

def convertDataFrame(dataframe: DataFrame, schema : Array[FieldDataType], resolution: Array[DateResolutionType]) : DataFrame =
{
  import org.apache.spark.sql.functions._
  val convertDateFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDate(x, resolution)}
  val convertDateTimeFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDateTime(x, resolution)}

  val allColNames = dataframe.columns
  val allCols = allColNames.map(name => dataframe.col(name))

  val mappedCols =
  {
    for(i <- allCols.indices) yield
    {
      schema(i) match
      {
        case FieldDataType.Date => convertDateFunc(allCols(i), resolution(i))
        case FieldDataType.DateTime => convertDateTimeFunc(allCols(i), resolution(i))
        case _ => allCols(i)
      }
    }
  }

  dataframe.select(mappedCols:_*)

}

However, it doesn't work: it seems that I can only pass Columns to UDFs. I also wonder whether it would be very slow if I converted the DataFrame to an RDD and applied the function to each row.

Does anyone know the correct solution? Thank you!

2 Answers

Just use a little bit of currying:

def convertDateFunc(resolution: DateResolutionType) = udf((x:String) => 
  SparkDateTimeConverter.convertDate(x, resolution))

and use it as follows:

case FieldDataType.Date => convertDateFunc(resolution(i))(allCols(i))
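
This works because resolution(i) is captured in the closure when the curried UDF is created; at invocation time Spark only sees Column arguments, and each distinct resolution simply gets its own UDF instance.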

On a side note, you should take a look at sql.functions.trunc and sql.functions.date_format. These should handle at least part of the job without using UDFs at all.
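
For example, a minimal sketch using only built-ins, assuming a hypothetical DataFrame df with a date column eventDate (the names are illustrative, not from the question) and spark.implicits._ in scope for the $-syntax:

import org.apache.spark.sql.functions.{trunc, date_format}

// trunc keeps the date type but resets everything below the requested
// unit, e.g. 2011-01-10 becomes 2011-01-01 with "month":
df.select(trunc($"eventDate", "month"))

// date_format renders the value as a string at the desired resolution,
// e.g. 2011-01-10 becomes "2011/01":
df.select(date_format($"eventDate", "yyyy/MM"))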

Note:

In Spark 2.2 or later you can use the typedLit function:

import org.apache.spark.sql.functions.typedLit

which supports a wider range of literals, such as Seq or Map.
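
A minimal sketch of what that enables (the values and the lookup UDF are purely illustrative):

import org.apache.spark.sql.functions.{typedLit, udf}

// lit cannot build a Column from a Scala collection, but typedLit
// carries the type information needed to construct the literal:
val mapCol = typedLit(Map("birthday" -> "Day", "signup" -> "Month"))
val seqCol = typedLit(Seq(1, 2, 3))

// e.g. passing the whole map into a UDF in a single call:
val resolve = udf((field: String, res: Map[String, String]) => res.getOrElse(field, "Day"))
df.select(resolve($"fieldName", mapCol))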

5 Comments

Thank you for your answer and the intuition of currying!
I wrote a tutorial on how to use currying to create a Spark UDF that accepts extra parameters at invocation time: gist.github.com/andrearota/5910b5c5ac65845f23856b2415474c38
Bravo, quite an insight into Spark.
Is it possible to register a curried UDF with spark.udf.register in order to make it available in SQL?
Someone should put this in the documentation!

You can create a literal Column to pass to a UDF using the lit(...) function defined in org.apache.spark.sql.functions.

For example:

import org.apache.spark.sql.functions.{lit, udf}

val takeRight = udf((s: String, i: Int) => s.takeRight(i))
df.select(takeRight($"stringCol", lit(1)))
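
Here lit(1) wraps the Scala value in a constant Column, so it is passed into the UDF on every row; that per-row argument is one plausible reason the curried approach in the other answer can perform better, as the comment below suggests.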

1 Comment

Thank you, I initially used lit as well, but it turns out that its performance is not as good as that of the approach in the other answer...
