
I have two DataFrames: a and b. This is what they look like:

a
-------
v1 string
v2 string

roughly hundreds of millions of rows


b
-------
v2 string

roughly tens of millions of rows

I would like to keep rows from DataFrame a where v2 is not in b("v2").

I know I could use a left join and filter where the right side is null, or SparkSQL with a "not in" construction. I bet there is a better approach though.

  • I've posted an answer, but join+filter should work quite well too! I think most of the work from join+filter is unavoidable in any solution. Commented Feb 15, 2016 at 0:01
  • Yeah, actually SparkSQL worked very fast. Also - it's not duplicate - I needed negative filter. Commented Feb 15, 2016 at 22:31
  • see stackoverflow.com/questions/29537564/… Commented Oct 14, 2016 at 10:13

3 Answers


You can achieve that using the except method of Dataset, which "Returns a new Dataset containing rows in this Dataset but not in another Dataset".
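Note that except requires both Datasets to share the same schema and behaves like SQL's EXCEPT DISTINCT, so duplicates are dropped. A plain-Python sketch of those set semantics (hypothetical data, no Spark required):

```python
# Plain-Python sketch of Dataset.except semantics (EXCEPT DISTINCT):
# distinct rows present on the left but not on the right.
def except_distinct(left, right):
    right_set = set(right)
    seen = set()
    out = []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

a = [("v2a",), ("v2b",), ("v2b",)]
b = [("v2b",)]
print(except_distinct(a, b))  # [('v2a',)]
```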




Use PairRDDFunctions.subtractByKey:

def subtractByKey[W](other: RDD[(K, W)])(implicit arg0: ClassTag[W]): RDD[(K, V)]

Return an RDD with the pairs from this whose keys are not in other.

(There are variants that offer control over the partitioning. See the docs.)

So you would do something like a.rdd.map { case Row(v1: String, v2: String) => (v2, v1) }.subtractByKey(b.rdd.map { case Row(v2: String) => (v2, ()) }).toDF("v2", "v1"). Note that the argument must itself be a pair RDD, so b is keyed with a dummy value, and the Row extractor needs import org.apache.spark.sql.Row.
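The subtractByKey semantics can be sketched with plain Python (hypothetical data, no Spark required): keep the (key, value) pairs from the left whose key does not occur among the right's keys.

```python
# Plain-Python sketch of PairRDDFunctions.subtractByKey:
# keep (k, v) pairs from the left whose key is absent from the right.
def subtract_by_key(left, right):
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

# After mapping a to (v2, v1) pairs, as in the answer above:
a_pairs = [("v2a", "v1a"), ("v2b", "v1b")]
b_pairs = [("v2b", None)]  # b keyed with a dummy value
print(subtract_by_key(a_pairs, b_pairs))  # [('v2a', 'v1a')]
```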

4 Comments

  • Perhaps a pure DataFrame-based solution exists as well? I don't use DataFrames much, sorry. But it shouldn't be too painful to jump back to RDDs, use subtractByKey and go back to DataFrames.
  • You could use except
  • Ah, except is the perfect answer! Want to post it as a separate answer?
  • sure, I've posted it as an answer

Suppose your DataFrame a looks like this:

+----+
|col1|
+----+
|  v1|
|  v2|
+----+

Suppose your DataFrame b looks like this:

+----+
|col1|
+----+
|  v2|
+----+



APPROACH 1:
-------------------

You can use the DataFrame join method with the join type left_anti to find the values that are in DataFrame a but not in DataFrame b. The code is given below:

a.as('a).join(b.as('b), $"a.col1" === $"b.col1", "left_anti").show()

Please find the result below:

+----+
|col1|
+----+
|  v1|
+----+
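A left_anti join keeps only the rows of a whose join key has no match in b. A plain-Python sketch of that behavior (hypothetical data, no Spark required):

```python
# Plain-Python sketch of a left_anti join on col1:
# keep rows of a whose key has no match in b.
a_rows = [{"col1": "v1"}, {"col1": "v2"}]
b_rows = [{"col1": "v2"}]

b_keys = {row["col1"] for row in b_rows}
left_anti = [row for row in a_rows if row["col1"] not in b_keys]
print(left_anti)  # [{'col1': 'v1'}]
```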



APPROACH 2:
-------------------

You can use SQL, much as you would in SQL Server, Oracle, etc. For this, first register your DataFrames as temporary views (which reside in Spark's memory) and then write SQL on top of those views.

a.createOrReplaceTempView("table_a")
b.createOrReplaceTempView("table_b")
spark.sql("select * from table_a a where not exists(select 1 from table_b b where a.col1=b.col1)").show()

Please find the result below:

+----+
|col1|
+----+
|  v1|
+----+
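The NOT EXISTS filter keeps a row of table_a only when no row of table_b matches it on col1. A plain-Python sketch of that query (hypothetical data, no Spark required):

```python
# Plain-Python sketch of the NOT EXISTS anti-join above:
# select * from table_a a
# where not exists (select 1 from table_b b where a.col1 = b.col1)
table_a = [{"col1": "v1"}, {"col1": "v2"}]
table_b = [{"col1": "v2"}]

result = [
    row_a for row_a in table_a
    if not any(row_a["col1"] == row_b["col1"] for row_b in table_b)
]
print(result)  # [{'col1': 'v1'}]
```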

