I want to solve the following problem: I want to filter out all rows of a DataFrame in which the string in a given column is contained in a blacklist, which is given as a (potentially empty) array of strings.

For example: if the blacklist contains "fourty two" and "twenty three", all rows in which the respective column contains either "fourty two" or "twenty three" are filtered out of the DataFrame.

The following code runs successfully if the blacklist is not empty (for example, Array("fourty two")) and fails otherwise (Array.empty[String]):

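import scala.collection.mutable
import org.apache.spark.sql.functions.{array, lit, udf}

// Helpers
def containsString(array: mutable.WrappedArray[String], value: String): Boolean = array.contains(value)
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))

// Builds an array column of literals; an empty input yields array(), which Spark types as array<null>
def arrayCol[T](arr: Array[T]) = array(arr.map(lit(_)): _*)

// Fails at analysis time when blacklist is empty
df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))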

The error message is:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type

It seems that empty arrays appear typeless to Spark. Is there a clean way to deal with this?

  • So blacklist is an array? Actually, I'm not sure what you are trying to do. Commented Dec 2, 2016 at 10:40
  • Yes, eliasah. It contains strings which I do not want to appear in column "theStringColumn". Commented Dec 2, 2016 at 11:00
  • Can you give an example, please? (input, blacklist, and expected output) Commented Dec 2, 2016 at 11:01
  • Sure, I added an example. Commented Dec 2, 2016 at 12:09

1 Answer

You are overthinking the problem. What you really need here is isin:

val blacklist = Seq("foo", "bar")

$"theStringColumn".isin(blacklist: _*)
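To connect this back to the question, here is a minimal sketch of the complete filter, written for spark-shell or a script. The local session, the sample rows, and the helper name withoutBlacklisted are assumptions made up for illustration; only isin and the column name come from the posts above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session purely for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[1]").appName("isin-blacklist").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq("fourty two", "twenty three", "seven").toDF("theStringColumn")

// Hypothetical helper: keeps only rows whose value is NOT in the blacklist.
def withoutBlacklisted(data: DataFrame, blacklist: Seq[String]): DataFrame =
  data.filter(!$"theStringColumn".isin(blacklist: _*))

// Non-empty blacklist: the two blacklisted rows are dropped.
val kept = withoutBlacklisted(df, Seq("fourty two", "twenty three")).as[String].collect()

// Empty blacklist: no array column is built, so there is no array<null> to resolve.
val allRows = withoutBlacklisted(df, Seq.empty[String]).as[String].collect()
```

Because isin takes the values as varargs instead of an array column, the empty case never produces an untyped array() expression in the first place.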

Moreover, don't depend on the local representation of ArrayType being a WrappedArray. Just use Seq.

Finally, to answer your question, you can either use:

array().cast("array<string>")

or:

import org.apache.spark.sql.types.{ArrayType, StringType}

array().cast(ArrayType(StringType))
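If you do want to keep the UDF approach, the cast and the Seq advice can be combined roughly as follows. This is only a sketch under assumptions: the helper name stringArrayCol, the local session, and the sample rows are hypothetical, and the UDF is rewritten as a lambda over Seq[String] rather than the original WrappedArray version.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array, lit, udf}

// Local session purely for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[1]").appName("typed-empty-array").getOrCreate()
import spark.implicits._

// Hypothetical helper: always casts, so an empty input is typed array<string>.
def stringArrayCol(values: Seq[String]): Column =
  array(values.map(lit(_)): _*).cast("array<string>")

// The UDF takes Seq[String] instead of depending on WrappedArray.
val containsStringUDF = udf((values: Seq[String], value: String) => values.contains(value))

// Hypothetical sample data and an empty blacklist, the case that failed in the question.
val df = Seq("fourty two", "seven").toDF("theStringColumn")
val blacklist = Seq.empty[String]

val result = df
  .filter(!containsStringUDF(stringArrayCol(blacklist), $"theStringColumn"))
  .as[String]
  .collect()
```

With the cast in place the first UDF argument has the type the UDF expects, so the filter should resolve even when the blacklist is empty, and every row is kept.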

1 Comment

Thank you! Casting was the kind of solution I was looking for. Somehow I missed that. Also, thank you for the tip regarding the use of Seq. I will try that.