spark filter (delete) rows based on values from another dataframe [duplicate]

Question

I have a 'big' dataset (huge_df) with >20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).

Using some criteria I generate a second dataframe (filter_df), consisting of id values I want to filter later on from huge_df.

Currently I am using SQL syntax to do this:

filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')

Question 1: Is there a way to do this using Python only, i.e. without the need to register the TempView?

Question 2: Is there a completely different way to accomplish the same thing?

Sigrist · Accepted Answer · 2018-09-27 11:34:48Z

14

You can use JOIN

huge_df = huge_df.join(filter_df, huge_df.id == filter_df.id, "left_outer")
                 .where(filter_df.id.isNull())
                 .select([col(c) for c in huge_df.columns]

However it will cause expensive shuffle.

Logic is simple: do left join with filter_df on id fields and check if filter_df is null - if it is null, that means that there's no such row in filter_df

edited Sep 27, 2018 at 11:34

Sigrist

1,4712 gold badges14 silver badges18 bronze badges

answered Apr 19, 2017 at 14:23

T. Gawęda

16.1k5 gold badges51 silver badges62 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Pushkr Over a year ago

instead of .select([col(c) for c in huge_df.columns] you can just say .select(huge_df.columns), no?

akoeltringer Over a year ago

thanks! this helped a lot. I tried to avoid join, but it turns out this is way more efficient than id NOT IN (SELECT ...)

ChoNuff Over a year ago

I think you can use a left_anti join instead of left_outer

Pushkr · Accepted Answer · 2017-04-19 15:10:09Z

1

Here is another way to do it-

# Sample data
hugedf = spark.createDataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']],schema=(['k1','v1']))
fildf = spark.createDataFrame([[1,'b'],[3,'c']],schema=(['k2','v2']))


from pyspark.sql.functions import col
hugedf\
.select('k1')\     
.subtract(fildf.select('k2'))\ 
.toDF('d1')\
.join(hugedf,col('d1')==hugedf.k1)\
.drop('d1')\
.show()

logic is simple, subtract the id values found in filteredDf from id values found in hugeDF leaving ID values which are not in filterDF,

I marked the subtracted values as column 'd1' just for clarity purpose and then join hugeDF table on the d1 values and dropping d1 to give final result.

answered Apr 19, 2017 at 15:10

Pushkr

3,62921 silver badges32 bronze badges

Collectives™ on Stack Overflow

spark filter (delete) rows based on values from another dataframe [duplicate]

2 Answers 2

3 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Linked

Related