I have a large DataFrame (huge_df) with more than 20 columns. One of the columns is an id field (generated with pyspark.sql.functions.monotonically_increasing_id()).
Based on some criteria, I generate a second DataFrame (filter_df) containing the id values I later want to filter out of huge_df.
Currently I am using SQL syntax to do this:
filter_df.createOrReplaceTempView('filter_view')
huge_df = huge_df.where('id NOT IN (SELECT id FROM filter_view)')
Question 1: Is there a way to do this using the DataFrame API only, i.e. without having to register the temp view?
Question 2: Is there a completely different way to accomplish the same thing?