
Given a DataFrame:

    import scala.collection.mutable.ArrayBuffer

    val df = sc.parallelize(Seq(
      ("foo", ArrayBuffer(null, "bar", null)),
      ("bar", ArrayBuffer("one", "two", null))
    )).toDF("key", "value")
    df.show

    +---+----------------------------+
    |key|                       value|
    +---+----------------------------+
    |foo|ArrayBuffer(null, bar, null)|
    |bar| ArrayBuffer(one, two, null)|
    +---+----------------------------+

I'd like to drop the nulls from the value column. After removal the DataFrame should look like this:

    +---+---------------------+
    |key|                value|
    +---+---------------------+
    |foo|     ArrayBuffer(bar)|
    |bar|ArrayBuffer(one, two)|
    +---+---------------------+

Any suggestions welcome. Thanks!

3 Answers


You'll need a UDF here, for example one using flatMap:

    val filterOutNull = udf((xs: Seq[String]) =>
      Option(xs).map(_.flatMap(Option(_))))

    df.withColumn("value", filterOutNull($"value"))

where the outer Option combined with map handles NULL columns:

    Option(null: Seq[String]).map(identity)
    // Option[Seq[String]] = None

    Option(Seq("foo", null, "bar")).map(identity)
    // Option[Seq[String]] = Some(List(foo, null, bar))

and ensures we don't fail with an NPE when the input is NULL / null, by mapping:

    NULL -> null -> None -> None -> NULL

where null is the Scala null and NULL is the SQL NULL.

The inner flatMap flattens a sequence of Options, effectively filtering out the nulls:

    Seq("foo", null, "bar").flatMap(Option(_))
    // Seq[String] = List(foo, bar)

A more imperative equivalent could be something like this:

    val imperativeFilterOutNull = udf((xs: Seq[String]) =>
      if (xs == null) xs
      else for {
        x <- xs
        if x != null
      } yield x)
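The core of the UDF can be exercised as plain Scala, outside Spark. This is a sketch with a hypothetical helper name, `filterOutNullSeq`, extracted from the lambda above:

```scala
// Null-safe filter: the outer Option guards against a null sequence,
// the inner flatMap(Option(_)) drops null elements.
def filterOutNullSeq(xs: Seq[String]): Option[Seq[String]] =
  Option(xs).map(_.flatMap(Option(_)))

// A null input maps to None (which Spark turns back into SQL NULL):
assert(filterOutNullSeq(null) == None)

// Null elements are dropped; non-null elements keep their order:
assert(filterOutNullSeq(Seq("foo", null, "bar")) == Some(Seq("foo", "bar")))
```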



Option 1: using a UDF:

    val filterNull = udf((arr: Seq[String]) => arr.filter(_ != null))
    df.withColumn("value", filterNull($"value")).show()

(Note that this UDF does not guard against a null column value, so it will throw a NullPointerException if the array itself is null.)

Option 2: no UDF:

    df.withColumn("value", explode($"value"))
      .filter($"value".isNotNull)
      .groupBy("key")
      .agg(collect_list($"value"))
      .show()

Note that this is less efficient: the explode / groupBy combination requires a shuffle, and keys whose arrays contain only nulls are dropped entirely.
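As a further UDF-free alternative (not mentioned in the original answers, and assuming Spark 2.4 or later, where SQL higher-order functions are available), the built-in `filter` function can drop nulls directly without the explode / groupBy shuffle:

```scala
// Assumes Spark 2.4+: "filter" is a SQL higher-order function that keeps
// only the array elements matching the predicate, preserving order.
import org.apache.spark.sql.functions.expr

df.withColumn("value", expr("filter(value, x -> x is not null)")).show()
```

A null array column passes through as NULL, matching the accepted answer's behavior.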



Also, you can use spark-daria; it has com.github.mrpowers.spark.daria.sql.functions.arrayExNull.

From the documentation:

Like array but doesn't include null element

