
I am trying to check a column of a Scala DataFrame against a regular expression, using a UDF with an additional argument representing the regular expression. However, passing the regular expression through a lit() statement does not seem to be allowed and throws the following error:

java.lang.RuntimeException: Unsupported literal type class scala.util.matching.Regex

when running the example code below. I'd expect an additional column "DMY" with Boolean entries.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import scala.util.matching._


def dateDMY_regex(): Regex = """^[0-3]?[0-9][-/.][0-1]?[0-9][-/.](19|20)?\d{2}$""".r

def date_match(value: String, dateEx: Regex): Boolean = {
  return dateEx.unapplySeq(value).isDefined
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // needed for the $"col_1" column syntax

var df = spark.createDataFrame(Seq(
  (0, "31/10/2018"),
  (1, "01/11/2018"),
  (2, "02/11/2018"))).toDF("Id", "col_1")

// to test the function
// print(date_match("31/10/2018", dateDMY_regex()))

val date_match_udf = udf(date_match _)
df = df.withColumn( "DMY", date_match_udf( $"col_1", lit(dateDMY_regex()) ) )

df.show()

2 Answers


You can supply your non-Column parameter (i.e. value of dateEx) via currying in a UDF, as shown below:

import org.apache.spark.sql.functions._
import scala.util.matching._
import spark.implicits._   // assumes an existing SparkSession named `spark`

val df = spark.createDataFrame(Seq(
  (0, "31/10/2018"),
  (1, "999/11/2018"),
  (2, "02/11/2018"))
).toDF("Id", "col_1")

val dateEx = """^[0-3]?[0-9][-/.][0-1]?[0-9][-/.](19|20)?\d{2}$""".r

def date_match_udf(dateEx: Regex) = udf(
  (value: String) => dateEx.unapplySeq(value).isDefined
)

df.withColumn("DMY", date_match_udf(dateEx)($"col_1")).
  show
// +---+-----------+-----+
// | Id|      col_1|  DMY|
// +---+-----------+-----+
// |  0| 31/10/2018| true|
// |  1|999/11/2018|false|
// |  2| 02/11/2018| true|
// +---+-----------+-----+

However, for what you need, rather than a UDF I would recommend using Spark's built-in functions, which generally perform better. Below is one approach using regexp_extract:

val dateExStr = """^([0-3]?[0-9][-/.][0-1]?[0-9][-/.](19|20)?\d{2})$"""

df.withColumn("DMY", $"col_1" === regexp_extract($"col_1", dateExStr, 1)).
  show
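If all you need is the Boolean match itself, the Column method rlike is arguably even more direct (a sketch, not from the original answer; note that rlike succeeds on a partial match, so the ^ and $ anchors already present in dateExStr are what keep it equivalent to the full-match check above):

```scala
// rlike tests the column against a regex pattern string; the ^...$ anchors
// in dateExStr make it behave like a full-string match
df.withColumn("DMY", $"col_1".rlike(dateExStr)).show
```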



I still haven't managed to pass a Regex through directly. My workaround is now to pass a String and compile it inside the function, as follows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import scala.util.matching._


def dateDMY_regex(): String = """^[0-3]?[0-9][-/.][0-1]?[0-9][-/.](19|20)?\d{2}$"""

def date_match(value: String, dateEx: String): Boolean = {
  return dateEx.r.unapplySeq(value).isDefined
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // needed for the $"col_1" column syntax

var df = spark.createDataFrame(Seq(
  (0, "31/10/2018"),
  (1, "01/11/2018"),
  (2, "02/11/2018"))).toDF("Id", "col_1")

// to test the function
// print(date_match("31/10/2018", dateDMY_regex()))

val date_match_udf = udf(date_match _)
df = df.withColumn( "DMY", date_match_udf( $"col_1", lit(dateDMY_regex()) ) )

df.show()
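One caveat with this workaround: dateEx.r recompiles the pattern on every row. A variant (a hypothetical sketch, not from the original answer) compiles the regex once on the driver and lets the UDF close over it, which is essentially the currying idea from the first answer applied here:

```scala
// compile the pattern once; the closure capturing it is serialized to executors
val compiledDateEx: Regex = dateDMY_regex().r

val date_match_udf2 = udf((value: String) => compiledDateEx.unapplySeq(value).isDefined)

df.withColumn("DMY", date_match_udf2($"col_1")).show()
```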

