11

I have a 'big' dataset (huge_df) with >20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).

Using some criteria I generate a second dataframe (filter_df), consisting of id values I want to filter later on from huge_df.

Currently I am using SQL syntax to do this:

filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')

Question 1: Is there a way to do this using Python only, i.e. without the need to register the TempView?

Question 2: Is there a completely different way to accomplish the same thing?

2 Answers

14

You can use a join:

huge_df = (
    huge_df.join(filter_df, huge_df.id == filter_df.id, "left_outer")
    .where(filter_df.id.isNull())  # keep rows with no match in filter_df
    .select([huge_df[c] for c in huge_df.columns])  # keep only huge_df's columns
)

However, it will cause an expensive shuffle.

The logic is simple: left-join huge_df with filter_df on the id fields and check whether filter_df.id is null. If it is null, there is no matching row in filter_df, so the row from huge_df is kept.
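As a comment on this answer points out, a left_anti join can express the same filter more directly. The sketch below is a minimal illustration rather than a definitive implementation: `exclude_ids` and its `broadcast_filter` flag are hypothetical names, and it assumes both dataframes share a column called `id`. The optional broadcast hint is an assumption that filter_df is small enough to fit in executor memory, in which case Spark may be able to avoid shuffling huge_df.

```python
def exclude_ids(huge_df, filter_df, key="id", broadcast_filter=False):
    """Return the rows of huge_df whose `key` value does not occur in filter_df.

    Both arguments are expected to be pyspark.sql.DataFrame objects that
    share a column named `key`. The function name is hypothetical.
    """
    # Only the key column is needed from the right-hand side.
    right = filter_df.select(key)

    if broadcast_filter:
        # Assumption: filter_df is small. The hint asks Spark to ship it
        # to every executor instead of shuffling both sides of the join.
        from pyspark.sql.functions import broadcast
        right = broadcast(right)

    # left_anti keeps only left-side rows that have no match on the right
    # and returns only the left side's columns, so no isNull() filter or
    # select() is needed afterwards.
    return huge_df.join(right, on=key, how="left_anti")
```

It would be called as huge_df = exclude_ids(huge_df, filter_df). Note that a left_anti join treats null keys differently from NOT IN; with ids generated by monotonically_increasing_id() there are no nulls, so the two should agree here.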

3 Comments

instead of .select([col(c) for c in huge_df.columns] you can just say .select(huge_df.columns), no?
thanks! this helped a lot. I tried to avoid join, but it turns out this is way more efficient than id NOT IN (SELECT ...)
I think you can use a left_anti join instead of left_outer
1

Here is another way to do it:

# Sample data
hugedf = spark.createDataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']],schema=(['k1','v1']))
fildf = spark.createDataFrame([[1,'b'],[3,'c']],schema=(['k2','v2']))


from pyspark.sql.functions import col

(
    hugedf.select('k1')
    .subtract(fildf.select('k2'))  # k1 values that are not present in fildf
    .toDF('d1')                    # rename the remaining key column to d1
    .join(hugedf, col('d1') == hugedf.k1)
    .drop('d1')
    .show()
)

The logic is simple: subtract the id values found in fildf from the id values found in hugedf, which leaves the id values that are not in fildf.

I named the column of remaining values 'd1' just for clarity, then joined hugedf on the d1 values and dropped d1 to give the final result.
