
I have two DataFrames: a and b. This is what they look like:

a
-------
v1 string
v2 string

roughly hundreds of millions of rows


b
-------
v2 string

roughly tens of millions of rows

I would like to keep rows from DataFrame a where v2 is not in b("v2").

I know I could use a left join and filter where the right side is null, or SparkSQL with a "not in" construction. I bet there is a better approach though.

  • I've posted an answer, but join+filter should work quite well too! I think most of the work from join+filter is unavoidable in any solution. Commented Feb 15, 2016 at 0:01
  • Yeah, actually SparkSQL worked very fast. Also - it's not duplicate - I needed negative filter. Commented Feb 15, 2016 at 22:31
  • see stackoverflow.com/questions/29537564/… Commented Oct 14, 2016 at 10:13

3 Answers


You can achieve that using the except method of Dataset, which "Returns a new Dataset containing rows in this Dataset but not in another Dataset".
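Note that except requires both Datasets to share the same schema and behaves like SQL's EXCEPT DISTINCT, so duplicates are dropped. A plain-Python sketch of those set semantics (hypothetical data, no Spark required):

```python
# Plain-Python sketch of Dataset.except semantics (EXCEPT DISTINCT):
# distinct rows present on the left but not on the right.
def except_distinct(left, right):
    right_set = set(right)
    seen = set()
    out = []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

a = [("v2a",), ("v2b",), ("v2b",)]
b = [("v2b",)]
print(except_distinct(a, b))  # [('v2a',)]
```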




Use PairRDDFunctions.subtractByKey:

def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]

Return an RDD with the pairs from this whose keys are not in other.

(There are variants that offer control over the partitioning. See the docs.)

So you would do something like a.rdd.map { case Row(v1: String, v2: String) => (v2, v1) }.subtractByKey(b.rdd.map { case Row(v2: String) => (v2, ()) }).toDF("v2", "v1"). Note that the argument must itself be a pair RDD, so b is keyed with a dummy value, and the Row extractor needs import org.apache.spark.sql.Row.
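The subtractByKey semantics can be sketched with plain Python (hypothetical data, no Spark required): keep the (key, value) pairs from the left whose key does not occur among the right's keys.

```python
# Plain-Python sketch of PairRDDFunctions.subtractByKey:
# keep (k, v) pairs from the left whose key is absent from the right.
def subtract_by_key(left, right):
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

# After mapping a to (v2, v1) pairs, as in the answer above:
a_pairs = [("v2a", "v1a"), ("v2b", "v1b")]
b_pairs = [("v2b", None)]  # b keyed with a dummy value
print(subtract_by_key(a_pairs, b_pairs))  # [('v2a', 'v1a')]
```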

4 Comments

  • Perhaps a pure DataFrame-based solution exists as well? I don't use DataFrames much, sorry. But it shouldn't be too painful to jump back to RDDs, use subtractByKey and go back to DataFrames.
  • You could use except
  • Ah, except is the perfect answer! Want to post it as a separate answer?
  • sure, I've posted it as an answer

Suppose your DataFrame a looks like this:

+----+
|col1|
+----+
|  v1|
|  v2|
+----+

Suppose your DataFrame b looks like this:

+----+
|col1|
+----+
|  v2|
+----+



APPROACH 1:
-------------------

You can use the DataFrame join method with the join type left_anti to find the values that are in DataFrame a but not in DataFrame b. The code is given below:

a.as('a).join(b.as('b), $"a.col1" === $"b.col1", "left_anti").show()

Please find the result below:

+----+
|col1|
+----+
|  v1|
+----+
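A left_anti join keeps only the rows of a whose join key has no match in b. A plain-Python sketch of that behavior (hypothetical data, no Spark required):

```python
# Plain-Python sketch of a left_anti join on col1:
# keep rows of a whose key has no match in b.
a_rows = [{"col1": "v1"}, {"col1": "v2"}]
b_rows = [{"col1": "v2"}]

b_keys = {row["col1"] for row in b_rows}
left_anti = [row for row in a_rows if row["col1"] not in b_keys]
print(left_anti)  # [{'col1': 'v1'}]
```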



APPROACH 2:
-------------------

You can use SQL, much as you would in SQL Server, Oracle, etc. For this, first register your DataFrames as temporary views (which reside in Spark's memory) and then write SQL on top of those views.

a.createOrReplaceTempView("table_a")
b.createOrReplaceTempView("table_b")
spark.sql("select * from table_a a where not exists(select 1 from table_b b where a.col1=b.col1)").show()

Please find the result below:

+----+
|col1|
+----+
|  v1|
+----+
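The NOT EXISTS filter keeps a row of table_a only when no row of table_b matches it on col1. A plain-Python sketch of that query (hypothetical data, no Spark required):

```python
# Plain-Python sketch of the NOT EXISTS anti-join above:
# select * from table_a a
# where not exists (select 1 from table_b b where a.col1 = b.col1)
table_a = [{"col1": "v1"}, {"col1": "v2"}]
table_b = [{"col1": "v2"}]

result = [
    row_a for row_a in table_a
    if not any(row_a["col1"] == row_b["col1"] for row_b in table_b)
]
print(result)  # [{'col1': 'v1'}]
```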

