How can I join two DataFrames using multiple columns as the key? For example, DF1 and DF2 are the two DataFrames.

With the key columns written out explicitly, the join looks like this:

JoinDF = DF1.join(DF2, DF1("column1") === DF2("column11") && DF1("column2") === DF2("column22"), "outer") 

My problem is how to reference the key columns when their names are stored in arrays, like this:

val DF1KeyArray = Array("column1", "column2")
val DF2KeyArray = Array("column11", "column22")

Passing the arrays directly does not work:

JoinDF = DF1.join(DF2, DF1(DF1KeyArray)=== DF2(DF2KeyArray), "outer")

This fails with the following error:

<console>:128: error: type mismatch;
found   : Array[String]
required: String

Is there a way to use multiple key columns whose names are stored in an Array when computing the join?

  • Please format your question! This is not readable. Add the programming language tag too! Commented Feb 2, 2016 at 10:14
  • @eliasah Scala is the programming language. Commented Feb 2, 2016 at 10:28

1 Answer

10

You can simply create joinExprs programmatically:

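import org.apache.spark.sql.DataFrame

// Key column names in matching order: df1KeyArray(i) pairs with df2KeyArray(i).
val df1KeyArray: Array[String] = ???
val df2KeyArray: Array[String] = ???

val df1: DataFrame = ???
val df2: DataFrame = ???

// Build one equality per key pair, then AND them into a single join condition.
val joinExprs = df1KeyArray
  .zip(df2KeyArray)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

df1.join(df2, joinExprs, "outer")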

See also Including null values in an Apache Spark Join

2 Comments

nice use of .reduce(_ && _)
Best way to join using multiple columns. @zero323 Please add that if someone wants to return true (rather than NULL) if both inputs are NULL, then === must be replaced with <=> (EqualNullSafe)
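
To illustrate the null-safe variant mentioned in the comment above, here is a minimal sketch written for spark-shell or a Scala script. It assumes Spark 2.0 or later (for SparkSession); the helper name buildJoinExpr, the toy rows, and the value columns v1 and v2 are hypothetical and only serve the example.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Assumption: a local session for the example; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("multi-key-join").getOrCreate()
import spark.implicits._

// Hypothetical helper (not part of Spark): pairs the key columns by position
// and ANDs the per-pair comparisons into one join condition.
def buildJoinExpr(left: DataFrame, leftKeys: Seq[String],
                  right: DataFrame, rightKeys: Seq[String],
                  nullSafe: Boolean): Column = {
  require(leftKeys.nonEmpty && leftKeys.length == rightKeys.length,
    "key lists must be non-empty and of equal length")
  leftKeys.zip(rightKeys)
    .map { case (l, r) =>
      // <=> treats NULL <=> NULL as true; === yields NULL, so the rows do not match.
      if (nullSafe) left(l) <=> right(r) else left(l) === right(r)
    }
    .reduce(_ && _)
}

// Toy data: the second row on each side has a NULL in its first key column.
val df1 = Seq[(String, String, String)](("a", "x", "left1"), (null, "y", "left2"))
  .toDF("column1", "column2", "v1")
val df2 = Seq[(String, String, String)](("a", "x", "right1"), (null, "y", "right2"))
  .toDF("column11", "column22", "v2")

val df1Keys = Seq("column1", "column2")
val df2Keys = Seq("column11", "column22")

val strictJoin   = df1.join(df2, buildJoinExpr(df1, df1Keys, df2, df2Keys, nullSafe = false), "outer")
val nullSafeJoin = df1.join(df2, buildJoinExpr(df1, df1Keys, df2, df2Keys, nullSafe = true), "outer")

// With ===, the NULL-keyed rows stay unmatched, so the outer join keeps them as separate rows.
val strictCount = strictJoin.count()
// With <=>, the NULL-keyed rows match each other.
val nullSafeCount = nullSafeJoin.count()
```

On this toy data the outer join with === should contain three rows (one match plus one unmatched row from each side), while the <=> version should contain two, because the rows with a NULL key are paired up.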
