I want to solve the following problem: filter out all rows of a DataFrame in which the string contained in one column is contained in a blacklist, where the blacklist is given as a (potentially empty) array of strings.
For example, if the blacklist contains "forty two" and "twenty three", all rows whose respective column contains either "forty two" or "twenty three" are removed from the DataFrame.
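For concreteness, here is a minimal setup (the session, sample values and the column name are just chosen for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("forty two", "seven", "twenty three").toDF("theStringColumn")
val blacklist = Array("forty two", "twenty three")
// expected result after filtering: only the row containing "seven" remains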
The following code executes successfully if the blacklist is not empty (for example Array("forty two")), but fails when it is empty (Array.empty[String]):
// Helpers
import scala.collection.mutable
import org.apache.spark.sql.functions.{array, lit, udf}

def containsString(array: mutable.WrappedArray[String], value: String) = array.contains(value)
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
// turns a driver-side Array into an array column of literal values
def arrayCol[T](arr: Array[T]) = array(arr map lit: _*)
df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
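The only workaround I can think of is to branch on the driver side (a sketch reusing the helpers above), but that feels clunky:

val filtered =
  if (blacklist.isEmpty) df // nothing to blacklist, keep everything
  else df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))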