Combine DataFrames with an array column

Question

There are two DataFrames. df1 is defined as

 +----+---------+---------+
 |id  |value1   |value2   | 
 +----+---------+---------+
 |1   |["J","W"]|      0.3|
 |2   |         |      0.6|
 |3   |["n"]    |      0.7|
 +----+---------+---------+

df2 is defined as

 +----+---------+
 |id  |value1   |
 +----+---------+
 | 1  | "t"     |
 | 2  | "m"     |
 +----+---------+

Is there an easy way to combine the two DataFrames into df3?

 +----+--------------+---------+
 |id  |value1        |value2   | 
 +----+--------------+---------+
 |1   |["J","W", "t"]|      0.3|
 |2   |["m"]         |      0.6|
 |3   |["n"]         |      0.7|
 +----+--------------+---------+

Anahcolus · Accepted Answer · 2018-02-26 05:00:34Z


First, left-join the two DataFrames on id, renaming df2's value1 column to value21 so that it does not clash with df1's value1:

val joineddf = df1.join(df2.withColumnRenamed("value1", "value21"), Seq("id"), "left")
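The snippets in this answer assume that df1, df2, and a SparkSession already exist. A minimal sketch for recreating the inputs locally follows. The session settings and the object name `CombineSetup` are illustrative, and the blank value1 cell for id 2 is assumed to be an empty array rather than null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CombineSetup {
  // Local session for experimentation only (assumed settings)
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("combine-array-column")
    .getOrCreate()
  import spark.implicits._

  // df1 from the question; id 2 is assumed to hold an empty array
  val df1: DataFrame = Seq(
    (1, Seq("J", "W"), 0.3),
    (2, Seq.empty[String], 0.6),
    (3, Seq("n"), 0.7)
  ).toDF("id", "value1", "value2")

  // df2 from the question; there is no row for id 3
  val df2: DataFrame = Seq((1, "t"), (2, "m")).toDF("id", "value1")

  // Same left join as in the answer: id 3 ends up with a null value21
  val joineddf: DataFrame =
    df1.join(df2.withColumnRenamed("value1", "value21"), Seq("id"), "left")

  def main(args: Array[String]): Unit = {
    joineddf.orderBy("id").show(false)
    spark.stop()
  }
}
```

Because the join is a left join, every row of df1 is kept, and rows without a match in df2 carry null in value21.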

Then define a udf that appends the renamed value21 value to the value1 array, leaving the array unchanged when value21 is null (as for id 3, which has no match in df2):

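import org.apache.spark.sql.functions._

// Seq[String] avoids the missing scala.collection.mutable import and accepts Spark's array wrapper
val mergeUdf = udf((array: Seq[String], str: String) => {
  // Treat a null array as empty so the append cannot throw a NullPointerException
  val existing: Seq[String] = if (array == null) Seq.empty else array
  // A null str means there was no matching row in df2, so nothing is appended
  if (str == null) existing else existing :+ str
})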

Finally, call the udf and drop the helper column (the $"..." syntax requires import spark.implicits._, where spark is your SparkSession):

joineddf.withColumn("value1", mergeUdf($"value1", $"value21"))
    .drop("value21")

This gives the desired output:

+---+---------+------+
|id |value1   |value2|
+---+---------+------+
|1  |[J, W, t]|0.3   |
|2  |[m]      |0.6   |
|3  |[n]      |0.7   |
+---+---------+------+
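On Spark 2.4 or later, where `concat` also accepts array columns, the same result may be reachable without a udf. The sketch below is an alternative to the approach above, not part of the original answer. The object and function names are hypothetical, and it expects the joined DataFrame that still contains the value21 column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, coalesce, col, concat, typedLit, when}

object AppendWithBuiltins {
  // Expects columns value1 (array<string>) and value21 (string), as in joineddf
  def append(joined: DataFrame): DataFrame = {
    // Treat a null array as empty so concat does not return null
    val existing = coalesce(col("value1"), typedLit(Seq.empty[String]))
    joined
      .withColumn(
        "value1",
        // Skip the append when there was no matching row in df2
        when(col("value21").isNull, existing)
          .otherwise(concat(existing, array(col("value21"))))
      )
      .drop("value21")
  }
}
```

Usage would look like `AppendWithBuiltins.append(joineddf).show(false)`. Built-in column functions stay visible to Spark's optimizer, which is the usual reason to prefer them over a udf when the Spark version allows it.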
