I have a dataframe with the following schema:

    root
     |-- e: array (nullable = true)
     |    |-- element: string (containsNull = true)

For example, create a dataframe:

val df = Seq(Seq("73","73"), null, null, null, Seq("51"), null, null, null, Seq("52", "53", "53", "73", "84"), Seq("73", "72", "51", "73")).toDF("e")

df.show()

+--------------------+
|                   e|
+--------------------+
|            [73, 73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|[52, 53, 53, 73, 84]|
|    [73, 72, 51, 73]|
+--------------------+

I'd like the output to be:

+--------------------+
|                   e|
+--------------------+
|                [73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|    [52, 53, 73, 84]|
|        [73, 72, 51]|
+--------------------+

I am trying the following udf:

def distinct(arr: TraversableOnce[String]) = arr.toList.distinct
val distinctUDF = udf(distinct(_: Traversable[String]))

But it only works when the rows aren't null, i.e.

df.filter($"e".isNotNull).select(distinctUDF($"e")) 

gives me

+----------------+
|          UDF(e)|
+----------------+
|            [73]|
|            [51]|
|[52, 53, 73, 84]|
|    [73, 72, 51]|
+----------------+

but

df.select(distinctUDF($"e")) 

fails. How do I make the udf handle null in this case? Alternatively, if there's a simpler way of getting the unique values, I'd like to try that.

2 Answers

You can use when().otherwise() to apply your UDF only when the column value is not null. Here, .otherwise(null) can be skipped, since the result defaults to null when no otherwise clause is specified.

val distinctUDF = udf( (s: Seq[String]) => s.distinct )

df.select(when($"e".isNotNull, distinctUDF($"e")).as("e"))
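Alternatively, the null handling can be pushed into the UDF itself by wrapping the input in Option, since Option(null) is None and Spark maps None back to null. A minimal sketch of the core logic in plain Scala (the udf registration is assumed):

```scala
object NullSafeDistinct {
  // Option(null) yields None, so null rows pass through untouched;
  // non-null rows get their duplicates removed.
  def distinctOrNull(arr: Seq[String]): Option[Seq[String]] =
    Option(arr).map(_.distinct)

  // In Spark this would be registered as:
  //   val distinctUDF = udf(distinctOrNull _)

  def main(args: Array[String]): Unit = {
    println(distinctOrNull(Seq("73", "73"))) // Some(List(73))
    println(distinctOrNull(null))            // None
  }
}
```

Returning an Option from a Scala UDF is the idiomatic way to produce nullable results without an explicit when() guard.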

Two months after you asked the question, Spark 2.4.0 was released, introducing the built-in function array_distinct, which does exactly this without a UDF.
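In Spark it would be used as df.select(array_distinct($"e").as("e")), and it handles null rows (a null input array yields null). The per-row semantics, sketched in plain Scala for illustration:

```scala
object ArrayDistinctSketch {
  // Mirrors array_distinct's per-row behavior: null stays null,
  // otherwise duplicates are removed, keeping first occurrences in order.
  def arrayDistinct(a: Seq[String]): Seq[String] =
    if (a == null) null else a.distinct

  def main(args: Array[String]): Unit = {
    println(arrayDistinct(Seq("52", "53", "53", "73", "84"))) // List(52, 53, 73, 84)
    println(arrayDistinct(null))                              // null
  }
}
```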
