57

How do I specify multiple column conditions when joining two dataframes? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. But the above syntax is not valid, as columns only takes one string. So how do I get what I want?

9 Answers

103

There is a Spark column/expression API join for such a case:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)

The <=> operator in the example means "equality test that is safe for null values".

The main difference from the simple equality test (===) is that <=> is safe to use when one of the columns may have null values.
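
For instance, here is a minimal sketch of the difference (my own example, assuming an sqlContext with implicits imported, as elsewhere on this page):

import sqlContext.implicits._

// Two small frames whose join key contains a null on each side.
val left  = Seq((Some("web"), 1), (None, 2)).toDF("LeadSource", "id")
val right = Seq((Some("web"), 10), (None, 20)).toDF("LeadSource", "score")

// ===: null == null evaluates to null, so the null rows never match.
left.join(right, left("LeadSource") === right("LeadSource")).show()

// <=>: null-safe equality, so the two null rows match each other.
left.join(right, left("LeadSource") <=> right("LeadSource")).show()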


7 Comments

Could you explain what's the difference between === and <=>?
Updated with more information about the difference between those equality tests.
Aha, I couldn't find this in the documentation. How did you know about it?
@user568109 I am using Java API, and there are some cases when Column/Expression API is the only option. Also, Column/Expression API is mostly implemented as a Builder, so it is easier to discover new methods on each version of Spark.
This gave me duplicated columns, so I used the Seq method I added in another answer.
22

As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

Python

Leads.join(
    Utm_Master, 
    ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
    "left_outer"
)

Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess:

Leads.join(
    Utm_Master,
    Seq("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
    "left_outer"
)

1 Comment

How do we make the join ignore case (i.e. make it case-insensitive)? I tried the following, and it did not work: sqlContext.sql("set spark.sql.caseSensitive=false")
9

The === option gives me duplicated columns, so I use Seq instead.

val Lead_all = Leads.join(Utm_Master,
    Seq("Utm_Source","Utm_Medium","Utm_Campaign"),"left")

Of course, this only works when the names of the join columns are the same; if they differ, you can rename one side first, as sketched below.
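
A sketch of that workaround (the column name "Source" on the Utm_Master side is hypothetical):

val Utm_Master_renamed = Utm_Master.withColumnRenamed("Source", "Utm_Source")

val Lead_all = Leads.join(Utm_Master_renamed,
    Seq("Utm_Source","Utm_Medium","Utm_Campaign"), "left")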


8

One thing you can do is to use raw SQL:

case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

val bar = sqlContext.createDataFrame(sc.parallelize(
    Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
    Bar(3, 1, 2, "bar") :: Nil))

val foo = sqlContext.createDataFrame(sc.parallelize(
    Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
    Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

foo.registerTempTable("foo")
bar.registerTempTable("bar")

sqlContext.sql(
    "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")

2 Comments

This is the method I use right now. I was hoping I could do it without registering them as temp tables. If there is no way to do this with the DataFrame API, I will accept the answer.
If so, @rchukh's answer is much better.
7

Scala:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)

To make it case-insensitive,

import org.apache.spark.sql.functions.{lower, upper}

then just use lower(value) in the join condition.

E.g.: dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
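
Applied to a join rather than a filter, that could look like this (my own adaptation, not code from the original answer):

import org.apache.spark.sql.functions.lower

Leaddetails.join(
    Utm_Master,
    lower(Leaddetails("LeadSource")) <=> lower(Utm_Master("LeadSource"))
        && lower(Leaddetails("Utm_Source")) <=> lower(Utm_Master("Utm_Source")),
    "left"
)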


6

In PySpark you can simply specify each condition separately:

Lead_all = Leads.join(Utm_Master,
    (Leads.LeadSource == Utm_Master.LeadSource) &
    (Leads.Utm_Source == Utm_Master.Utm_Source) &
    (Leads.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leads.Utm_Campaign == Utm_Master.Utm_Campaign))

Just be sure to use the operators and parentheses correctly: in Python, & binds more tightly than ==, so each comparison has to be wrapped in parentheses.


2

In PySpark, wrapping each condition in parentheses is the key to using multiple column names in the join condition.

joined_df = df1.join(df2, 
    (df1['name'] == df2['name']) &
    (df1['phone'] == df2['phone'])
)


0

Spark SQL supports joining on a tuple of columns when they are wrapped in parentheses, like

... WHERE (list_of_columns1) = (list_of_columns2)

which is much shorter than specifying an equality expression (=) for each pair of columns combined by a set of "AND"s.

For example:

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
   )

instead of

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE t1.a=e.a AND t1.b=e.b AND t1.c=e.c
   )

which is also less readable, especially when the list of columns is big and you want to deal with NULLs easily.

1 Comment

Is it really working? Is this supported in version 1.6?
0

Try this:

import org.apache.spark.sql.functions.col

val rccJoin = dfRccDeuda.as("dfdeuda")
    .join(dfRccCliente.as("dfcliente"),
        col("dfdeuda.etarcid") === col("dfcliente.etarcid"),
        "inner")

