I have two DataFrames (Scala Spark), A and B. Wherever A("id") == B("a_id"), I want to set A("value") to B("value"). Since DataFrames are immutable and have to be recreated, I assume this takes a join plus some withColumn calls, but I'm not sure how to put it together. In SQL this would be a simple UPDATE over a join, so why does it seem so difficult in Spark?

1 Answer

Indeed, a left join and a select call would do the trick:

// assuming "spark" is an active SparkSession: 
import org.apache.spark.sql.functions._
import spark.implicits._

// some sample data; Notice it's convenient to NAME the dataframes using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")

// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
  .select($"id", coalesce($"B.value", $"A.value") as "value")

// result:
// +---+-----+
// | id|value|
// +---+-----+
// |  1|   b1|
// |  2|   b2|
// |  3|   a3|
// +---+-----+

Notice that there's no real "update" here - result is a new DataFrame which you can use (write / count / ...) but the original DataFrames remain unchanged.
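
For scopes where the $ interpolator is not available, the same join can be written with col. The following is only a sketch of that variant, meant to be pasted into spark-shell or run as a Scala script; the local-mode session and the app name are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

// Assumption: a local session built just for this sketch
// (in spark-shell, getOrCreate() simply returns the existing one).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("update-by-join-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._ // still required for .toDF on a Seq

// Same sample data as above, aliased so columns can be qualified as "A.x" / "B.x"
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")

// Left join keeps every row of A; coalesce falls back to A.value when B has no match
val result = A.join(B, col("A.id") === col("B.a_id"), "left")
  .select(col("A.id"), coalesce(col("B.value"), col("A.value")).as("value"))

result.orderBy("id").show()
```

A itself still holds a1, a2 and a3 afterwards; only result carries the replaced values.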

5 Comments

First, you can replace the $"..." column references with the col function, e.g. col("A.id"), if they give you trouble. Second, you need import spark.implicits._ in every scope where you want to use $.
The left join seems to end up behaving like a full outer join for some reason.
This might happen if some records in A have more than one matching record in B (i.e. where A.id == B.a_id) - is that the case? If so, by what logic would you "choose" the right value from B? Try grouping B before joining so that B has unique a_id values.
Even if I do a left outer join on the same DataFrame (i.e. I join A with B to make C, not as a full outer join, and then join the original A with C on the same id for the same columns), the left outer join does not behave as I would expect coming from a SQL background. It seems to do a full outer join each time.
In the example I supplied above you can see left join working as expected. You'd have to present specific code and specific data, preferably as a separate post - I won't be able to help via comments without more info.
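
The duplicate-match case raised in the comments can be reproduced and worked around roughly as follows. This is a sketch under assumptions: the extra B row ("b1x") is made-up data, and keeping the max value per a_id is an arbitrary rule chosen only for illustration; the right rule depends on your data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, max}

// Assumption: a local session built just for this sketch
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dedupe-before-join-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
// Hypothetical data: B has two candidate rows for a_id = 1
val B = Seq((1, "b1"), (1, "b1x"), (2, "b2")).toDF("a_id", "value")

// A plain left join repeats A's row once per matching B row: 4 rows instead of 3
val naiveCount = A.join(B.as("B"), col("A.id") === col("B.a_id"), "left").count()

// Collapse B to one row per a_id first (max is only an illustrative choice)
val uniqueB = B.groupBy("a_id").agg(max("value").as("value")).as("B")

val result = A.join(uniqueB, col("A.id") === col("B.a_id"), "left")
  .select(col("A.id"), coalesce(col("B.value"), col("A.value")).as("value"))

result.orderBy("id").show()
```

With a_id unique in B, the result has exactly one row per row of A, which is the behaviour a SQL left join on a unique key would give.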
