
The drop-duplicates method of Spark DataFrames isn't working, and I think it's because the index column, which was part of my dataset, is being treated as a column of data. There definitely are duplicates in there: I checked by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames, but if I were using pandas, at this point I would call pandas.DataFrame.set_index on that column.
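Roughly, the check I ran looked like this (a sketch; df and the column names are placeholders for my actual data):

data_cols = [c for c in df.columns if c != 'p_index']  # everything except the index

total = df.count()
distinct = df.select(data_cols).distinct().count()
# total > distinct, so duplicate rows exist once the index is excluded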

Does anyone know how to handle this situation?

Secondly, there appear to be two methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?

  • Share some of your code, which will help us understand the question better. Commented Sep 13, 2017 at 17:37

1 Answer


If you don't want the index column to be considered when checking for distinct records, you can drop the column with the command below, or select only the required columns.

df = df.drop('p_index')  # pass the name of the column to be dropped

df = df.select('name', 'age')  # pass only the required columns

drop_duplicates() is an alias for dropDuplicates().

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
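Alternatively, dropDuplicates accepts a subset argument, so you can keep the index column and simply exclude it from the comparison. A sketch, reusing the hypothetical column names from above:

df = df.dropDuplicates(subset=['name', 'age'])  # duplicates judged on name/age only; p_index stays in the result

This avoids losing the index column altogether while still ignoring it for deduplication.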


1 Comment

Thanks, and also for the link. What annoys me about the Spark docs is that half the time, when you put something into a search engine, you land on the source code rather than the docs, which is useless. Okay, I'll prefer the method without the underscore then; why does it need an alias? The key is the subset argument in the docs: it removes the worry about doing something to the index column.
