I have two columns with arrays of strings:

| ColA | ColB |
|------|------|
| ["a"]| ["b"]|

I want to create a single column containing the values from both arrays:

| ColAplusB |
|-----------|
|["a", "b"] |

I tried `array(ColA, ColB)`, which left me with:

| ColAplusBnested |
|-----------------|
| [["a"], ["b"]]  |

How can I get the desired result (the array of arrays flattened into a single array of the values from the initial arrays)?

  • Search for the term flatten array/collection. I don't know Spark, but I believe it should be doable without custom code. Commented Sep 14, 2017 at 15:38
  • I believe flattening reduces it to a single value per row, which is not exactly what I am looking for. `explode` does that, but then I am not sure how to collect all the values back into a single array. Commented Sep 14, 2017 at 15:57
  • I mean something like `flatten(array(ColA, ColB))` Commented Sep 14, 2017 at 15:58
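
For reference, a minimal sketch of the last comment's suggestion, assuming Spark 2.4+ (where the built-in `flatten` function exists; this question predates it) and a DataFrame `df` shaped like the table above:

import org.apache.spark.sql.functions.{array, col, flatten}

// Spark 2.4+: array() nests the two columns into an array of arrays,
// flatten() then collapses it into a single array of strings.
val combined = df.select(flatten(array(col("ColA"), col("ColB"))).as("ColAplusB"))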

2 Answers

Let's suppose your data is like this:

val df = spark.createDataFrame(Seq(
  (Array("a"), Array("b"))
)).toDF("ColA", "ColB")
df.printSchema()
df.show()

root
 |-- ColA: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ColB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+

At the time of writing, the built-in Spark SQL functions don't include a concatenation function for arrays (or sequences); the `concat` functions only work on strings. But you can create a simple user-defined function (UDF):

import org.apache.spark.sql.functions.udf
import spark.implicits._ // provides the 'ColA symbol-to-Column conversion

val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()

root
 |-- ColAplusB: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|ColAplusB|
+---------+
|   [a, b]|
+---------+

Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:

val df = spark.createDataFrame(Seq(
  (Array("b", "a", "c"), Array("a", "b"))
)).toDF("ColA", "ColB")

df.show()

+---------+------+
|     ColA|  ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+

val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
  (x ++ y).distinct.sorted
}

df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()

+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+

---

As of Spark 2.4.0, the `array_union` function allows two arrays to be concatenated. Note that it deduplicates any values that exist in both arrays.
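
A minimal sketch, assuming Spark 2.4+ and a DataFrame `df` with the two array columns from the question:

import org.apache.spark.sql.functions.{array_union, col}

// Spark 2.4+: the union of the two arrays, with duplicates removed.
val combined = df.select(array_union(col("ColA"), col("ColB")).as("ColAplusB"))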

If you want to combine multiple arrays together, with the arrays broken out across rows rather than columns, I use a two-step process (sketched after this list):

  1. Use `explode_outer` to unnest the arrays.
  2. Use `collect_set` to aggregate the values into a single deduplicated array. If you do not wish to deduplicate your results, use `collect_list` instead.
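
A minimal sketch of that two-step process, assuming a DataFrame `df` as in the question; the `id` column is a hypothetical key added here so the exploded rows can be regrouped:

import org.apache.spark.sql.functions._

val result = df
  .withColumn("id", monotonically_increasing_id())
  // Step 1: unnest the arrays: one row per inner array, then one row per element.
  .select(col("id"), explode_outer(array(col("ColA"), col("ColB"))).as("inner"))
  .select(col("id"), explode_outer(col("inner")).as("value"))
  // Step 2: aggregate the elements back into one deduplicated array per row.
  .groupBy("id")
  .agg(collect_set("value").as("ColAplusB"))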
