
I would like to join two Spark Scala DataFrames on multiple columns dynamically. I would like to avoid hard-coding the column-name comparisons as in the following statement:

val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))

A solution already exists for the PySpark version of this question, provided in the following link: PySpark DataFrame - Join on multiple columns dynamically.

I would like to write the same logic in Spark Scala.

1 Answer

In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

// Each element must be a tuple so that toDF can create two columns.
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")

val columnsDf1 = df1.columns
val columnsDf2 = df2.columns

// Pair up the columns, build one equality test per pair, then AND them together.
val joinExprs = columnsDf1
  .zip(columnsDf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)
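The zip/map/reduce pattern that builds joinExprs can be illustrated with plain Scala collections, using strings in place of Spark Column objects (this sketch only shows how the combined condition is assembled, not an actual Spark join):

```scala
// Column names of the two (hypothetical) DataFrames.
val cols1 = Seq("col1", "col2")
val cols2 = Seq("col1", "col2")

val exprs = cols1
  .zip(cols2)                                     // Seq(("col1","col1"), ("col2","col2"))
  .map { case (c1, c2) => s"df1.$c1 = df2.$c2" }  // one equality per column pair
  .reduce(_ + " AND " + _)                        // fold into a single condition

println(exprs) // df1.col1 = df2.col1 AND df1.col2 = df2.col2
```

In the real code, map produces Column expressions instead of strings and reduce combines them with `&&` instead of string concatenation, but the shape is identical.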

2 Comments

val dfJoinRes = df1.join(df2, df1.columns.toSet.intersect(df2.columns.toSet).toSeq, "left") // This code works as well in my case
Yes, it works. I wanted to post that answer first, but I think it would not be complete: what if the columns in df1 and df2 have different names?
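The intersection trick from the first comment can also be illustrated with plain Scala collections (the column names here are made up for illustration). It computes the shared column names, which is exactly the Seq that Spark's join-on-column-names overload expects; the zip-based answer above is needed precisely when this intersection is empty because the names differ:

```scala
// Hypothetical schemas of two DataFrames with partially overlapping columns.
val columnsDf1 = Seq("id", "name", "extra1")
val columnsDf2 = Seq("id", "name", "extra2")

// Shared column names, suitable for df1.join(df2, common, "left").
val common = columnsDf1.toSet.intersect(columnsDf2.toSet).toSeq.sorted

println(common) // List(id, name)
```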
