Convert null values to empty array in Spark DataFrame

Question

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.

I thought I could do it like so:

val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )

However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?

In case it is relevant, here is the schema for this column:

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)

Take a look at the coalesce SQL function: docs.oracle.com/database/121/SQLRF/functions033.htm#SQLRF00617
– gasparms, Commented Jan 7, 2016 at 17:20

10465355 · Accepted Answer · 2019-11-15 13:13:36Z

34

You can use a UDF:

import org.apache.spark.sql.functions.{coalesce, udf, when}

val array_ = udf(() => Array.empty[Int])

combined with WHEN or COALESCE:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
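As a mental model only (plain Python, not Spark), both forms pick the column value when it is present and fall back to the empty array otherwise. The helper names below are made up for illustration and are not Spark APIs:

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


def when_null_otherwise(value, default):
    """Model of when(col.isNull, default).otherwise(col) for a single row."""
    return default if value is None else value


# One entry per row of the hypothetical myCol column.
rows = [[1, 2, 3], None, [7, 8, 9]]

via_coalesce = [coalesce(value, []) for value in rows]
via_when = [when_null_otherwise(value, []) for value in rows]
```

For a single fallback value the two give the same result, so coalesce is usually the shorter way to write it.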

In recent versions you can use the array function:

import org.apache.spark.sql.functions.{array, coalesce, when}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

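Note that this works only if a cast from string to the desired element type is allowed: in Spark 2.4 and earlier, array() called without arguments produces an array of strings, which the cast then converts.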

The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:

from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    # t is a DataType instance such as IntegerType(), not the class itself
    return udf(lambda: [], ArrayType(t))()

coalesce(myCol, empty_array(IntegerType()))

and in recent versions just use array:

from pyspark.sql.functions import array, coalesce

coalesce(myCol, array().cast("array<integer>"))

edited Nov 15, 2019 at 13:13

10465355

answered Jan 7, 2016 at 18:01

zero323

3 Comments

Daniel Siegmann Over a year ago

Thanks for your help. I had actually tried a UDF before but didn't think to actually call apply on it (i.e. I was doing array_ instead of array_()).

iambdot Over a year ago

@zero323 how would you do this in pyspark?

kelloti Over a year ago

@harppu This answers it for pyspark for me: stackoverflow.com/a/57198009/503826

Jeremy · Accepted Answer · 2018-09-12 16:25:34Z

17

With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|     null|
|  c|[7, 8, 9]|
+---+---------+

val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|       []|
|  c|[7, 8, 9]|
+---+---------+

answered Sep 12, 2018 at 16:25

Jeremy

1 Comment

Josh Herzberg Over a year ago

In PySpark you can do the second approach, just do df2 = df.withColumn("numbers", coalesce(col("numbers"), array()))

harppu · Accepted Answer · 2019-07-25 08:55:29Z

5

A UDF-free alternative, for cases where the desired element type cannot be cast from StringType, is the following:

import pyspark.sql.types as T
import pyspark.sql.functions as F

df.withColumn(
    "myCol",
    F.coalesce(
        F.col("myCol"),
        F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
    )
)

You can replace IntegerType() with any other data type, including complex ones.

answered Jul 25, 2019 at 8:55

harppu
