
I have two DataFrames, each with two columns:

  • df1 with schema (key1: Long, Value)

  • df2 with schema (key2: Array[Long], Value)

I need to join these DataFrames on the key columns, i.e. match rows where key1 appears among the values of key2. The problem is that the keys do not have the same type. Is there a way to do this?

2 Comments
  • key2 from df2 must contain key1 from df1? Commented Jan 11, 2017 at 15:53
  • One way is to explode the Array[Long] and then join with the df1 DataFrame, as sketched below. Commented Jan 11, 2017 at 17:49
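
A minimal sketch of that explode-based approach, assuming the df1 and df2 defined in the answers below (the variable names here are illustrative):

import org.apache.spark.sql.functions.{col, explode}

// Explode key2 into one row per array element, then do a plain equi-join on the Long keys.
// Arrays containing duplicate values will produce duplicate matches,
// so deduplicate afterwards if that matters.
val exploded = df2.withColumn("key", explode(col("key2")))
val joinedByExplode = df1.join(exploded, col("key1") === col("key"))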

2 Answers


The best way to do this (and the one that doesn't require any casting or exploding of DataFrames) is to use the array_contains Spark SQL expression, as shown below.

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df1 = Seq((1L, "one.df1"), (2L, "two.df1"), (3L, "three.df1")).toDF("key1", "Value")

val df2 = Seq((Array(1L, 1L), "one.df2"), (Array(2L, 2L), "two.df2"), (Array(3L, 3L), "three.df2")).toDF("key2", "Value")

// Keep rows where key1 appears in the key2 array
val joined = df1.join(df2, expr("array_contains(key2, key1)"))
joined.show()

+----+---------+------+---------+
|key1|    Value|  key2|    Value|
+----+---------+------+---------+
|   1|  one.df1|[1, 1]|  one.df2|
|   2|  two.df1|[2, 2]|  two.df2|
|   3|three.df1|[3, 3]|three.df2|
+----+---------+------+---------+

Please note that you cannot use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
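
If you prefer plain SQL, the same array_contains expression also works through temp views; a minimal sketch reusing the df1 and df2 from above (the view names are illustrative). As an aside, newer Spark releases have relaxed the function's restriction so it accepts a Column as the second argument, but check your version before relying on that.

df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

// Same join condition, expressed in SQL
val joinedSql = spark.sql(
  "SELECT * FROM t1 JOIN t2 ON array_contains(t2.key2, t1.key1)")
joinedSql.show()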


2 Comments

Thanks. The code worked in PySpark. But what is the purpose of import spark.implicits._? I am not able to find this module in PySpark.
import spark.implicits._ is used in Scala; you don't need it in PySpark.
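
For context, a minimal sketch of what that import does in Scala (assuming a SparkSession named spark): it brings the session's implicit encoders and conversions into scope, which is what makes Seq(...).toDF(...) and the $"column" shorthand compile.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._ // enables Seq(...).toDF(...) and $"col" syntax

val df = Seq((1L, "a"), (2L, "b")).toDF("key", "value")
df.select($"key").show()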

You can cast key1 and key2 to strings and then use the contains function, as follows.

val df1 = sc.parallelize(Seq((1L, "one.df1"),
                             (2L, "two.df1"),
                             (3L, "three.df1"))).toDF("key1", "Value")

DF1:
+----+---------+
|key1|Value    |
+----+---------+
|1   |one.df1  |
|2   |two.df1  |
|3   |three.df1|
+----+---------+

val df2 = sc.parallelize(Seq((Array(1L, 1L), "one.df2"),
                             (Array(2L, 2L), "two.df2"),
                             (Array(3L, 3L), "three.df2"))).toDF("key2", "Value")

DF2:
+------+---------+
|key2  |Value    |
+------+---------+
|[1, 1]|one.df2  |
|[2, 2]|two.df2  |
|[3, 3]|three.df2|
+------+---------+

import org.apache.spark.sql.functions.col

val joined = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))
joined.show()

JOIN:
+----+---------+------+---------+
|key1|Value    |key2  |Value    |
+----+---------+------+---------+
|1   |one.df1  |[1, 1]|one.df2  |
|2   |two.df1  |[2, 2]|two.df2  |
|3   |three.df1|[3, 3]|three.df2|
+----+---------+------+---------+

1 Comment

The string "123" contains the strings "23", "12", "1", etc. Casting to strings is going to join things that shouldn't be joined.
