Let's suppose your data is like this:
val df = spark.createDataFrame(Seq(
  (Array("a"), Array("b"))
)).toDF("ColA", "ColB")
df.printSchema()
df.show()
root
 |-- ColA: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ColB: array (nullable = true)
 |    |-- element: string (containsNull = true)
+----+----+
|ColA|ColB|
+----+----+
| [a]| [b]|
+----+----+
Spark's built-in SQL functions don't appear to include a concatenation function for arrays (or sequences); the concat functions only operate on strings. But you can write a simple user-defined function (UDF):
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the 'ColA symbol-to-Column syntax

val concatSeq = udf { (x: Seq[String], y: Seq[String]) => x ++ y }
val df2 = df.select(concatSeq('ColA, 'ColB).as("ColAplusB"))
df2.printSchema()
df2.show()
root
 |-- ColAplusB: array (nullable = true)
 |    |-- element: string (containsNull = true)
+---------+
|ColAplusB|
+---------+
| [a, b]|
+---------+
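If you'd rather call this from a SQL string, you can also register the same function as a named UDF. A minimal sketch (the UDF name concat_seq and the view name t are arbitrary choices here):

spark.udf.register("concat_seq", (x: Seq[String], y: Seq[String]) => x ++ y)

df.createOrReplaceTempView("t")
spark.sql("SELECT concat_seq(ColA, ColB) AS ColAplusB FROM t").show()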
Any extra logic you want to perform (e.g. sorting, removing duplicates) can be done in your UDF:
val df = spark.createDataFrame(Seq(
  (Array("b", "a", "c"), Array("a", "b"))
)).toDF("ColA", "ColB")
df.show()
+---------+------+
| ColA| ColB|
+---------+------+
|[b, a, c]|[a, b]|
+---------+------+
val concatSeq = udf { (x: Seq[String], y: Seq[String]) =>
(x ++ y).distinct.sorted
}
df.select(concatSeq('ColA, 'ColB).as("ColAplusB")).show()
+---------+
|ColAplusB|
+---------+
|[a, b, c]|
+---------+
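One caveat: if either array column can be null, the Scala UDF receives null for that argument, and x ++ y will throw a NullPointerException. A defensive sketch, wrapping each input in Option:

val concatSeqSafe = udf { (x: Seq[String], y: Seq[String]) =>
  // Treat a null column value as an empty sequence
  (Option(x).getOrElse(Seq.empty) ++ Option(y).getOrElse(Seq.empty)).distinct.sorted
}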
Update, from the comments: on newer Spark versions (2.4+) this is doable without any custom code, using the built-in flatten function: flatten(array(ColA, ColB)).
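A minimal sketch of that built-in approach, assuming Spark 2.4 or later (where flatten, array_distinct, and array_sort were added):

import org.apache.spark.sql.functions.{array, array_distinct, array_sort, col, flatten}

// Plain concatenation, equivalent to the first UDF
df.select(flatten(array(col("ColA"), col("ColB"))).as("ColAplusB")).show()

// Deduplicated and sorted, mirroring the distinct/sorted UDF above
df.select(
  array_sort(array_distinct(flatten(array(col("ColA"), col("ColB"))))).as("ColAplusB")
).show()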