
I have two DataFrames. df1 is defined as

 +----+---------+---------+
 |id  |value1   |value2   | 
 +----+---------+---------+
 |1   |["J","W"]|      0.3|
 |2   |         |      0.6|
 |3   |["n"]    |      0.7|
 +----+---------+---------+

df2 is defined as

 +----+---------+
 |id  |value1   |
 +----+---------+
 | 1  | "t"     |
 | 2  | "m"     |
 +----+---------+

Is there an easy way to combine the two DataFrames into df3?

 +----+--------------+---------+
 |id  |value1        |value2   | 
 +----+--------------+---------+
 |1   |["J","W", "t"]|      0.3|
 |2   |["m"]         |      0.6|
 |3   |["n"]         |      0.7|
 +----+--------------+---------+

1 Answer

First, left-join the two DataFrames on id, renaming the value1 column of df2 to value21 so that it does not clash with the value1 column of df1:

val joineddf = df1.join(df2.withColumnRenamed("value1", "value21"), Seq("id"), "left")

Then define a UDF that appends the renamed value21 string to the value1 array:

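import org.apache.spark.sql.functions._
// Appends str to array; either argument may be null after the left join
val mergeUdf = udf((array: Seq[String], str: String) => {
  val base = Option(array).getOrElse(Seq.empty[String]) // null array becomes empty
  if (str == null) base else base :+ str                // no match in df2: keep as is
})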
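
The merge rule can be checked without a Spark session. The following is a minimal sketch in plain Scala, assuming that both the array and the string may be null after the left join; `mergeValues` is a hypothetical helper name used only for illustration and is not part of Spark.

```scala
// Hypothetical helper: the same append rule as the UDF, on plain Scala values.
// Both inputs may be null, as they can be after a left join.
def mergeValues(values: Seq[String], extra: String): Seq[String] = {
  val base = Option(values).getOrElse(Seq.empty[String]) // null array becomes empty
  Option(extra).fold(base)(base :+ _)                    // null string leaves base unchanged
}
```

With the rows from the question, `mergeValues(Seq("J", "W"), "t")` yields `Seq("J", "W", "t")`, `mergeValues(null, "m")` yields `Seq("m")`, and `mergeValues(Seq("n"), null)` yields `Seq("n")`.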

Finally, apply the UDF and drop the value21 column, which is no longer needed:

// The $"..." column syntax requires import spark.implicits._ (spark being the SparkSession)
joineddf.withColumn("value1", mergeUdf($"value1", $"value21"))
    .drop("value21")

This gives the desired output:

+---+---------+------+
|id |value1   |value2|
+---+---------+------+
|1  |[J, W, t]|0.3   |
|2  |[m]      |0.6   |
|3  |[n]      |0.7   |
+---+---------+------+