I have two dataframes (Scala Spark) A and B. When A("id") == B("a_id") I want to update A("value") to B("value"). Since DataFrames have to be recreated I'm assuming I have to do some joins and withColumn calls but I'm not sure how to do this. In SQL it would be a simple update call on a natural join but for some reason this seems difficult in Spark?
Add a comment
|
1 Answer
Indeed, a left join and a select call would do the trick:
// assuming "spark" is an active SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._
// some sample data; Notice it's convenient to NAME the dataframes using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")
// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
.select($"id", coalesce($"B.value", $"A.value") as "value")
// result:
// +---+-----+
// | id|value|
// +---+-----+
// | 1| b1|
// | 2| b2|
// | 3| a3|
// +---+-----+
Notice that there's no real "update" here - result is a new DataFrame which you can use (write / count / ...) but the original DataFrames remain unchanged.
5 Comments
Tzach Zohar
First, you can replace these with the
col function, e.g. col("A.id") if they give you trouble; Second - you need that import spark.implicits._ in every scope where you'd want to use $.nobody
like the left join winds up having like a full outer join for some reason or something
Tzach Zohar
This might happen if some records in A have more than one matching record in B (i.e. where A.id == B.a_id) - is that the case? If so, by what logic would you "choose" the right value from B? Try grouping B before joining so that B has unique a_id values.
nobody
even if i do a left outer join on the same data frame (ie. i take A and then join with B and make C (not full outer) then take the original A and join with C on the same id for the same columns) the left outer join does not seem to work properly as i would expect it coming from a sql background. it seems like it does a full outer each time
Tzach Zohar
In the example I supplied above you can see left join working as expected. You'd have to present specific code and specific data, preferably as a separate post - I won't be able to help via comments without more info.