
Suppose I have two DataFrames like the following:

First:

A    | B    | C    | D
1a   | 1b   | 1c   | 1d
2a   | null | 2c   | 2d
3a   | null | null | 3d
4a   | 4b   | null | null
5a   | null | null | null
6a   | 6b   | 6c   | null

Second:

P    | B    | C    | D
1p   | 1b   | 1c   | 1d
2p   | 2b   | 2c   | 2d
3p   | 3b   | 3c   | 3d
4p   | 4b   | 4c   | 4d 
5p   | 5b   | 5c   | 5d
6p   | 6b   | 6c   | 6d 

The join is performed on the columns {"B", "C", "D"}. If any of these columns is null in a row, the match should fall back to the remaining non-null columns.

So the result should look like this:

P    | B    | C    | D    | A
1p   | 1b   | 1c   | 1d   | 1a
2p   | null | 2c   | 2d   | 2a
3p   | null | null | 3d   | 3a
4p   | 4b   | null | null | 4a // First(C) & First(D) was null so we take only B
6p   | 6b   | 6c   | null | 6a

Can anyone suggest a solution for this? Currently I am separating out the rows that have null values in one, two, or three columns, and then joining each subset with Second on the remaining columns. For example, I first filter out the rows of First where only B is null, then join that subset with Second on "C" and "D". This way I end up with many DataFrames, which I finally union together.
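The matching rule described above can be modeled as a plain Scala predicate (hypothetical helper, not the Spark join itself): a row of First matches a row of Second when every non-null key column agrees, and a row whose keys are all null matches nothing (which is why 5a is absent from the expected result).

```scala
// Hypothetical model of the matching rule: key columns from First may be
// null (None); key columns from Second are always present in the example.
def matches(firstKeys: Seq[Option[String]], secondKeys: Seq[String]): Boolean = {
  val pairs = firstKeys.zip(secondKeys)
  // every non-null column must agree with Second ...
  val nonNullAgree = pairs.forall { case (f, s) => f.forall(_ == s) }
  // ... and at least one key column must actually be non-null
  val hasAnyKey = firstKeys.exists(_.isDefined)
  nonNullAgree && hasAnyKey
}

// Row 2a of First: B is null, but C and D match row 2 of Second
assert(matches(Seq(None, Some("2c"), Some("2d")), Seq("2b", "2c", "2d")))
// Row 5a of First: all keys null, so it matches nothing
assert(!matches(Seq(None, None, None), Seq("5b", "5c", "5d")))
```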

2 Answers


Here's what you can do:

import org.apache.spark.sql.functions._

// Join when any one of the key columns matches. A null never satisfies ===,
// so a null column simply cannot be the column that makes the match.
df1.join(broadcast(df2), df1("B") === df2("B") || df1("C") === df2("C") || df1("D") === df2("D"))
  .drop(df2("B"))   // drop df2's copies of the key columns,
  .drop(df2("C"))   // keeping df1's (possibly null) values
  .drop(df2("D"))
  .show(false)

To be safer, broadcast whichever DataFrame is smaller.


3 Comments

Your solution is correct to some extent, so I extended it to this: `df1.join(broadcast(df2), (((df2("B")===null) || (df1("B") === df2("B"))) && ((df2("C")===null) || (df1("C") === df2("C"))) && ((df2("D")===null) || (df1("D") === df2("D"))))).drop(df2("B")).drop(df2("C")).drop(df2("D")).show(false)`. The intent is that each df1 column should either be null or equal to the corresponding df2 column. But I am still getting only a single row, `1a | 1p | 1b | 1c | 1d`, and not the others. Any idea why?
Isn't that the answer to your question? :) hehehe. That happens because you are using `&&`. And why are you using `df2("B")===null`? The null values are in df1, aren't they?
I don't know exactly what you are trying to do, but you could try `df1.join(broadcast(df2), (((df2("B")==="null") || (df1("B") === df2("B"))) && ((df2("C")==="null") || (df1("C") === df2("C"))) && ((df2("D")==="null") || (df1("D") === df2("D"))))).drop(df2("B")).drop(df2("C")).drop(df2("D")).show(false)`. But you should be aware of which DataFrame actually has the null values. According to your question, df1 has the nulls, not df2.
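The confusion in this thread comes from SQL's three-valued logic: `null === anything` evaluates to null, which acts as false in a join condition, so the null check has to be explicit (e.g. `isNull`) and applied to the side that actually holds the nulls. A minimal plain-Scala model of the two guard styles (hypothetical names, with df1's possibly-null column modeled as `Option`):

```scala
// Plain equality, as in `df1("B") === df2("B")`: a null never matches.
def plainEq(a: Option[String], b: String): Boolean = a.contains(b)

// Guarded equality, as in `df1("B").isNull || df1("B") === df2("B")`:
// a null column is treated as "no constraint" instead of "no match".
def guardedEq(a: Option[String], b: String): Boolean = a.forall(_ == b)

assert(!plainEq(None, "2b"))       // null key kills the match
assert(guardedEq(None, "2b"))      // null key is ignored
assert(guardedEq(Some("2b"), "2b"))
assert(!guardedEq(Some("xb"), "2b"))
```

With guards on all three columns combined by `&&`, a row like 2a (B null, C and D matching) passes, which is the behavior the question asks for.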

I think a left join should do the job; try the following code:

import org.apache.spark.sql.functions.udf

// Pick the first non-null match, in priority order P1, P2, P3
val group = udf((p1: String, p2: String, p3: String) =>
  if (p1 != null) p1 else if (p2 != null) p2 else if (p3 != null) p3 else null)

// Left-join once per key column, keeping Second's P under a distinct name each time
val joined = first.join(second.select("B", "P"), Seq("B"), "left")
                  .withColumnRenamed("P", "P1")
                  .join(second.select("C", "P"), Seq("C"), "left")
                  .withColumnRenamed("P", "P2")
                  .join(second.select("D", "P"), Seq("D"), "left")
                  .withColumnRenamed("P", "P3")
                  .select($"A", $"B", $"C", $"D", group($"P1", $"P2", $"P3") as "P")
                  .where($"P".isNotNull)  // drop rows that matched on no key at all

Hope this helps; if not, describe the problem in a comment.
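The `group` UDF above is a hand-rolled coalesce: it returns the first non-null argument, which is why the final `where($"P".isNotNull)` drops rows (like 5a) that matched on none of the three joins. A plain-Scala sketch of that pick-first-non-null logic:

```scala
// Equivalent of the group UDF: first non-null argument, else null.
def pickFirst(p1: String, p2: String, p3: String): String =
  if (p1 != null) p1 else if (p2 != null) p2 else if (p3 != null) p3 else null

assert(pickFirst("4p", null, null) == "4p")   // matched on B only
assert(pickFirst(null, null, "3p") == "3p")   // matched on D only
assert(pickFirst(null, null, null) == null)   // no match at all, later filtered out
```

In Spark itself, the built-in `coalesce($"P1", $"P2", $"P3")` from `org.apache.spark.sql.functions` would do the same job without a UDF.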

6 Comments

Thanks for your response. I've edited my question; there was a mistake. I tried your solution, but I am getting null instead of "4a" (see the commented line).
What if you try to join just on B, @Ishan?
I have to join on B, C, and D first. If any one column is null, I have to join on the other two; similarly, if two columns are null at the same time, I join on the third one.
@Ishan I've edited the answer; try it and share the results.
I've displayed the result I'm getting from your solution. See the updated question.
