I have two columns with arrays of strings:

| ColA | ColB |
|------|------|
| ["a"]| ["b"]|

I want to create a single column containing the values from both arrays:

| ColAplusB |
|-----------|
|["a", "b"] |

I tried `array(ColA, ColB)`, which left me with:

| ColAplusBnested |
|-----------------|
| [["a"], ["b"]]  |

How can I get the desired result (the array of arrays flattened into a single array of the values from the initial arrays)?

  • Search for the term flatten array/collection. I don't know Spark, but I believe it should be doable without custom code. Commented Sep 14, 2017 at 15:38
  • I believe flattening reduces it to a single value per row, which is not exactly what I am looking for. `explode` does that, but then I am not sure how to collect all the values back into a single array. Commented Sep 14, 2017 at 15:57
  • I mean something like `flatten(array(ColA, ColB))` Commented Sep 14, 2017 at 15:58
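
For reference, a minimal sketch of the last comment's suggestion, assuming Spark 2.4+ (where the built-in `flatten` function exists; this question predates it) and a DataFrame `df` shaped like the table above:

import org.apache.spark.sql.functions.{array, col, flatten}

// Spark 2.4+: array() nests the two columns into an array of arrays,
// flatten() then collapses it into a single array of strings.
val combined = df.select(flatten(array(col("ColA"), col("ColB"))).as("ColAplusB"))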

2 Answers

Let's suppose your data is like this:

val df = spark.createDataFrame(Seq(
  (Array("a"), Array("b"))
)).toDF("ColA", "ColB")
df.printSchema()
df.show()

root
 |-- ColA: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ColB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+

At the time of writing, the built-in Spark SQL functions don't include a concatenation function for arrays (or sequences); the `concat` functions only work on strings. But you can create a simple user-defined function (UDF):

import org.apache.spark.sql.functions.udf
import spark.implicits._ // provides the 'ColA symbol-to-Column conversion

val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()

root
 |-- ColAplusB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|ColAplusB|
+---------+
|   [a, b]|
+---------+

Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:

val df = spark.createDataFrame(Seq(
  (Array("b", "a", "c"), Array("a", "b"))
)).toDF("ColA", "ColB")

df.show()

+---------+------+
|     ColA|  ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+

val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
  (x ++ y).distinct.sorted
}

df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()

+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+

---

As of Spark 2.4.0, the `array_union` function allows two arrays to be concatenated. Note that it deduplicates any values that exist in both arrays.
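
A minimal sketch, assuming Spark 2.4+ and a DataFrame `df` with the two array columns from the question:

import org.apache.spark.sql.functions.{array_union, col}

// Spark 2.4+: the union of the two arrays, with duplicates removed.
val combined = df.select(array_union(col("ColA"), col("ColB")).as("ColAplusB"))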

If you want to combine multiple arrays together, with the arrays broken out across rows rather than columns, I use a two-step process (sketched after this list):

  1. Use `explode_outer` to unnest the arrays.
  2. Use `collect_set` to aggregate the values into a single deduplicated array. If you do not wish to deduplicate your results, use `collect_list` instead.
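
A minimal sketch of that two-step process, assuming a DataFrame `df` as in the question; the `id` column is a hypothetical key added here so the exploded rows can be regrouped:

import org.apache.spark.sql.functions._

val result = df
  .withColumn("id", monotonically_increasing_id())
  // Step 1: unnest the arrays: one row per inner array, then one row per element.
  .select(col("id"), explode_outer(array(col("ColA"), col("ColB"))).as("inner"))
  .select(col("id"), explode_outer(col("inner")).as("value"))
  // Step 2: aggregate the elements back into one deduplicated array per row.
  .groupBy("id")
  .agg(collect_set("value").as("ColAplusB"))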
