I have created the method below, which takes two DataFrames, lhs and rhs, plus a pair of join columns from each as input. The method should return the result of a left join between the two frames on those two column pairs, comparing values case-insensitively.

The problem I am facing is that it does not behave like the left join I expected. It returns 3 times the number of rows in the lhs DataFrame (due to duplicate values in rhs), but since it is a left join, I assumed the duplication and number of rows in the rhs DataFrame should not matter.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.upper
  def leftJoinCaseInsensitive(lhs: DataFrame, rhs: DataFrame, leftTableColumn: String, rightTableColumn: String, leftTableColumn1: String, rightTableColumn1: String): DataFrame = {
    // Left join on both column pairs, comparing upper-cased values so that case is ignored.
    val joinCondition = upper(lhs.col(leftTableColumn)) === upper(rhs.col(rightTableColumn)) &&
      upper(lhs.col(leftTableColumn1)) === upper(rhs.col(rightTableColumn1))
    lhs.join(rhs, joinCondition, "left")
  }
    I would suggest creating dummy data sets of 10-20 rows and testing your code against them. Preferably post your sample/dummy data here on SO. Commented Nov 1, 2017 at 13:07

2 Answers

If there are duplicate values in rhs, it is normal for lhs rows to be replicated. If the values in the joining columns of an lhs row match multiple rhs rows, the joined DataFrame contains that lhs row once for each matching rhs row.

For example:

lhs dataframe
+--------+--------+--------+
|col1left|col2left|col3left|
+--------+--------+--------+
|a       |1       |leftside|
+--------+--------+--------+

And

rhs dataframe
+---------+---------+---------+
|col1right|col2right|col3right|
+---------+---------+---------+
|a        |1        |rightside|
|a        |1        |rightside|
+---------+---------+---------+

Then the left join is expected to produce:

left joined lhs with rhs
+--------+--------+--------+---------+---------+---------+
|col1left|col2left|col3left|col1right|col2right|col3right|
+--------+--------+--------+---------+---------+---------+
|a       |1       |leftside|a        |1        |rightside|
|a       |1       |leftside|a        |1        |rightside|
+--------+--------+--------+---------+---------+---------+

You can find more information here
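
To see why the row count grows, here is a minimal sketch of left-join semantics using plain Scala collections (no Spark involved). The `LeftJoinRowCount` object and its `leftJoin` helper are hypothetical names used only for illustration, and the second rhs key is written as `A` purely to exercise the case-insensitive comparison:

```scala
// Plain-Scala model of a left join; illustrative only, not Spark's implementation.
object LeftJoinRowCount {
  type Row = (String, String, String)

  // Each lhs row is emitted once per matching rhs row,
  // or once paired with None when no rhs row matches.
  def leftJoin(lhs: Seq[Row], rhs: Seq[Row]): Seq[(Row, Option[Row])] =
    lhs.flatMap { l =>
      val matches = rhs.filter { r =>
        l._1.toUpperCase == r._1.toUpperCase && l._2.toUpperCase == r._2.toUpperCase
      }
      if (matches.isEmpty) Seq((l, Option.empty[Row]))
      else matches.map(r => (l, Option(r)))
    }

  def main(args: Array[String]): Unit = {
    val lhs = Seq(("a", "1", "leftside"))
    val rhs = Seq(("a", "1", "rightside"), ("A", "1", "rightside"))
    val joined = leftJoin(lhs, rhs)
    joined.foreach(println)
    // One lhs row with two matching rhs rows gives two joined rows.
    println(s"lhs rows: ${lhs.size}, joined rows: ${joined.size}")
  }
}
```

The output size is the sum, over lhs rows, of the number of matching rhs rows (counting at least one per lhs row), so it only equals the lhs row count when every key matches at most one rhs row.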


but as it is a left join the duplication and number of rows in rhs dataframe should not matter

Not true. Your leftJoinCaseInsensitive method looks good to me. A left join still produces more rows than the left table has if the right table contains duplicated values in the key column(s), as shown below:

val dfR = Seq(
  (1, "a", "x"),
  (1, "a", "y"),
  (2, "b", "z")
).toDF("k1", "k2", "val")

val dfL = Seq(
  (1, "a", "u"),
  (2, "b", "v"),
  (3, "c", "w")
).toDF("k1", "k2", "val")

leftJoinCaseInsensitive(dfL, dfR, "k1", "k1", "k2", "k2").show
+---+---+---+----+----+----+
| k1| k2|val|  k1|  k2| val|
+---+---+---+----+----+----+
|  1|  a|  u|   1|   a|   y|
|  1|  a|  u|   1|   a|   x|
|  2|  b|  v|   2|   b|   z|
|  3|  c|  w|null|null|null|
+---+---+---+----+----+----+
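
If what you actually want is one output row per lhs row, one option is to deduplicate the right table on the upper-cased keys before joining. The following is only a sketch under that assumption: `dedupeOnUpperKeys` and the temporary `_k1_norm` / `_k2_norm` column names are hypothetical, and which of the duplicate rhs rows survives is not specified, so pick the row deliberately if that matters.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object DedupeBeforeLeftJoin {
  // Hypothetical helper: keep one rhs row per case-insensitive key pair.
  def dedupeOnUpperKeys(rhs: DataFrame, key1: String, key2: String): DataFrame =
    rhs
      .withColumn("_k1_norm", upper(col(key1)))
      .withColumn("_k2_norm", upper(col(key2)))
      .dropDuplicates("_k1_norm", "_k2_norm") // which duplicate is kept is not specified
      .drop("_k1_norm", "_k2_norm")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dedupe-left-join").getOrCreate()
    import spark.implicits._

    val dfR = Seq((1, "a", "x"), (1, "a", "y"), (2, "b", "z")).toDF("k1", "k2", "val")
    val dfL = Seq((1, "a", "u"), (2, "b", "v"), (3, "c", "w")).toDF("k1", "k2", "val")

    val rhsUnique = dedupeOnUpperKeys(dfR, "k1", "k2")
    // Same join condition as leftJoinCaseInsensitive, applied to the deduplicated rhs.
    val joined = dfL.join(
      rhsUnique,
      upper(dfL.col("k1")) === upper(rhsUnique.col("k1")) &&
        upper(dfL.col("k2")) === upper(rhsUnique.col("k2")),
      "left")

    // Each key pair now matches at most one rhs row, so the counts should agree.
    println(s"lhs rows: ${dfL.count()}, joined rows: ${joined.count()}")
    spark.stop()
  }
}
```

Deduplicating changes the result, not the join type: the join is a left join in both cases, and the extra rows only disappear because the right side no longer has repeated keys.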
