I want to solve the following problem: filter out all rows of a DataFrame in which the string contained in one column is contained in a blacklist, where the blacklist is given as a (potentially empty) array of strings.
For example, if the blacklist contains "forty two" and "twenty three", all rows whose respective column contains either "forty two" or "twenty three" are removed from the DataFrame.
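For concreteness, here is a minimal setup (the session, sample values and the column name are just chosen for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("forty two", "seven", "twenty three").toDF("theStringColumn")
val blacklist = Array("forty two", "twenty three")
// expected result after filtering: only the row containing "seven" remains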
The following code executes successfully if the blacklist is not empty (for example Array("forty two")), but fails when it is empty (Array.empty[String]):
// Helpers
import scala.collection.mutable
import org.apache.spark.sql.functions.{array, lit, udf}

def containsString(array: mutable.WrappedArray[String], value: String) = array.contains(value)
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
// turns a driver-side Array into an array column of literal values
def arrayCol[T](arr: Array[T]) = array(arr map lit: _*)
df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
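The only workaround I can think of is to branch on the driver side (a sketch reusing the helpers above), but that feels clunky:

val filtered =
  if (blacklist.isEmpty) df // nothing to blacklist, keep everything
  else df.filter(!containsStringUDF(arrayCol[String](blacklist), $"theStringColumn"))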