Replace empty array with null in Spark DataFrame

Question

Consider a dataframe like the following:

+---+----+--------+----+
| c1|  c2|      c3|  c4|
+---+----+--------+----+
|  x|  n1|    [m1]|  []|
|  y|  n3|[m2, m3]|[z3]|
|  x|  n2|      []|  []|
+---+----+--------+----+

I want to replace each empty array with null:

+---+----+--------+----+
| c1|  c2|      c3|  c4|
+---+----+--------+----+
|  x|  n1|    [m1]|null|
|  y|  n3|[m2, m3]|[z3]|
|  x|  n2|    null|null|
+---+----+--------+----+

What is an efficient way to achieve this?

shuvalov · Accepted Answer · 2019-11-18 03:54:53Z

You can check the array size and return null for empty arrays using the when...otherwise functions:

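// Outside spark-shell these imports are needed (spark is the active SparkSession)
import org.apache.spark.sql.functions.{lit, size, when}
import spark.implicits._

// Sample data: c3 and c4 are array<string> columns, some of them empty
val df = Seq(
    ("x", "n1", Seq("m1"), Seq.empty[String]),
    ("y", "n3", Seq("m2", "m3"), Seq("z3")),
    ("x", "n2", Seq.empty[String], Seq.empty[String])
).toDF("c1", "c2", "c3", "c4")
df.show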

df.select($"c1", $"c2", 
    when(size($"c3") > 0, $"c3").otherwise(lit(null)) as "c3",
    when(size($"c4") > 0, $"c4").otherwise(lit(null)) as "c4"
).show

It returns:

df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
+---+---+--------+----+
| c1| c2|      c3|  c4|
+---+---+--------+----+
|  x| n1|    [m1]|  []|
|  y| n3|[m2, m3]|[z3]|
|  x| n2|      []|  []|
+---+---+--------+----+
+---+---+--------+----+
| c1| c2|      c3|  c4|
+---+---+--------+----+
|  x| n1|    [m1]|null|
|  y| n3|[m2, m3]|[z3]|
|  x| n2|    null|null|
+---+---+--------+----+
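
If there are many array columns, the same when/size check can be applied without naming each one, by reading the column types from the schema. The following is only a sketch: emptyArraysToNull is a hypothetical helper name (not part of Spark), and it assumes top-level columns whose names contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, size, when}
import org.apache.spark.sql.types.ArrayType

// Hypothetical helper: replaces empty arrays with null in every
// top-level array column and leaves all other columns unchanged.
def emptyArraysToNull(df: DataFrame): DataFrame = {
  val projected = df.schema.fields.map { field =>
    field.dataType match {
      // Array column: keep the value only when it has at least one element
      case _: ArrayType =>
        when(size(col(field.name)) > 0, col(field.name))
          .otherwise(lit(null))
          .as(field.name)
      // Any other type: pass the column through as-is
      case _ => col(field.name)
    }
  }
  df.select(projected: _*)
}
```

With the df defined above, emptyArraysToNull(df).show should print the same table as the explicit select. The column order and names are preserved because the projection is built from the schema in order.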
