DataFrame 1:
+---------+---------+
|login_Id1|login_Id2|
+---------+---------+
|  1234567|  1234568|
|  1234567|     null|
|     null|  1234568|
|  1234567|  1000000|
|  1000000|  1234568|
|  1000000|  1000000|
+---------+---------+
DataFrame 2:
+--------+---------+-----------+
|login_Id|user_name| user_Email|
+--------+---------+-----------+
| 1234567|TestUser1|user1_Email|
| 1234568|TestUser2|user2_Email|
| 1234569|TestUser3|user3_Email|
| 1234570|TestUser4|user4_Email|
+--------+---------+-----------+
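For reproduction, the two DataFrames above could be built roughly as follows. This is a minimal sketch: the Int column type, the app name and the local master setting are assumptions, not details of the actual job.

```scala
import org.apache.spark.sql.SparkSession

// Local session for reproduction only; app name and master are placeholders.
val spark = SparkSession.builder()
  .appName("login-join-repro")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Option[Int] lets the null cells be written as None (Int type is assumed).
val DataFrame1 = Seq[(Option[Int], Option[Int])](
  (Some(1234567), Some(1234568)),
  (Some(1234567), None),
  (None, Some(1234568)),
  (Some(1234567), Some(1000000)),
  (Some(1000000), Some(1234568)),
  (Some(1000000), Some(1000000))
).toDF("login_Id1", "login_Id2")

val DataFrame2 = Seq(
  (1234567, "TestUser1", "user1_Email"),
  (1234568, "TestUser2", "user2_Email"),
  (1234569, "TestUser3", "user3_Email"),
  (1234570, "TestUser4", "user4_Email")
).toDF("login_Id", "user_name", "user_Email")
```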
Expected output:
+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|  1234567|  1234568| 1234567|TestUser1|user1_Email|
|  1234567|     null| 1234567|TestUser1|user1_Email|
|     null|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|  1000000| 1234567|TestUser1|user1_Email|
|  1000000|  1234568| 1234568|TestUser2|user2_Email|
|  1000000|  1000000|    null|     null|       null|
+---------+---------+--------+---------+-----------+
I need to join the two DataFrames to get the additional information for each login ID from DataFrame 2. In most cases only one of login_Id1 and login_Id2 has data, but at times both columns do. In that case I want to use login_Id1 to perform the join. When neither column matches, I want nulls as the result.
I followed this link
Join in spark dataframe (scala) based on not null values
I tried with
DataFrame1.join(broadcast(DataFrame2), DataFrame1("login_Id1") === DataFrame2("login_Id") || DataFrame1("login_Id2") === DataFrame2("login_Id") )
The output that I get is
+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|  1234567|  1234568| 1234567|TestUser1|user1_Email|
|  1234567|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|     null| 1234567|TestUser1|user1_Email|
|     null|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|  1000000| 1234567|TestUser1|user1_Email|
|  1000000|  1234568| 1234568|TestUser2|user2_Email|
|  1000000|  1000000|    null|     null|       null|
+---------+---------+--------+---------+-----------+
I get the expected behavior when only one of the columns has a value. When both have values, the join matches on both columns, so the same input row appears twice (rows 1 and 2 of the output above). Does || not short-circuit in this case?
Is there a way to get the expected DataFrame?
For now, I have a udf that returns login_Id1 if it has a value and login_Id2 otherwise, so login_Id1 wins when both have values. I add the result of the udf to DataFrame1 as another column (FilteredId).
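The udf step described above can be sketched roughly like this. The helper name addFilteredId and the Int column type are assumptions for illustration; java.lang.Integer is used so that null inputs reach the function instead of being skipped.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Prefer login_Id1 when it is non-null, otherwise fall back to login_Id2.
// Note: this only checks for null, not whether the id exists in DataFrame2.
val pickLoginId = udf(
  (id1: java.lang.Integer, id2: java.lang.Integer) => if (id1 != null) id1 else id2
)

// Hypothetical helper: adds the FilteredId column used for the join.
def addFilteredId(df: DataFrame): DataFrame =
  df.withColumn("FilteredId", pickLoginId(col("login_Id1"), col("login_Id2")))
```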
DataFrame1 after adding the FilteredId column with the udf:
+--------+---------+-----------+
|loginId1|loginId2 | FilteredId|
+--------+---------+-----------+
| 1234567|1234568 |1234567 |
| 1234567|null |1234567 |
| null |1234568 |1234568 |
| 1234567|1000000 |1234567 |
| 1000000|1234568 |1000000 |
| 1000000|1000000 |1000000 |
+--------+---------+-----------+
Then I perform the join on FilteredId === login_Id to get the result:
DataFrame1.join(broadcast(DataFrame2), DataFrame1("FilteredId") === DataFrame2("login_Id"),"left_outer" )
Is there a better way to achieve this result without a udf, using just a join that behaves like a short-circuiting OR operator?
I have included the use case pointed out by Leo, which my udf approach misses. My exact requirement is: if either of the two input column values (login_Id1, login_Id2) matches a login_Id in DataFrame2, the data for that login_Id should be fetched. If neither column matches, nulls should be added (something like a left outer join).
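To make the requirement concrete, the behavior I am after is equivalent to two left outer lookups with login_Id1 taking priority. The following is only a sketch of that semantics, not necessarily the best way to do it. It assumes login_Id is unique in DataFrame2 (otherwise rows would multiply), and the function name joinPreferringFirst and the aliases u1 and u2 are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{broadcast, col, when}

// Hypothetical helper: look up login_Id1 first, fall back to login_Id2.
def joinPreferringFirst(logins: DataFrame, users: DataFrame): DataFrame = {
  // Two aliased copies of the lookup table, one per candidate id column.
  val u1 = broadcast(users.alias("u1"))
  val u2 = broadcast(users.alias("u2"))

  // True when login_Id1 found a match in the lookup table.
  val hit1 = col("u1.login_Id").isNotNull

  // Take the value from the login_Id1 match if there is one, else from login_Id2.
  def pick(name: String): Column =
    when(hit1, col(s"u1.$name")).otherwise(col(s"u2.$name")).alias(name)

  logins
    .join(u1, col("login_Id1") === col("u1.login_Id"), "left_outer")
    .join(u2, col("login_Id2") === col("u2.login_Id"), "left_outer")
    .select(
      col("login_Id1"),
      col("login_Id2"),
      pick("login_Id"),
      pick("user_name"),
      pick("user_Email")
    )
}
```

Under the uniqueness assumption, each input row yields exactly one output row, and rows where neither id matches keep nulls in the DataFrame2 columns.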