
The drop-duplicates method of Spark DataFrames isn't working, and I think it's because the index column, which was part of my dataset, is being treated as a column of data. There definitely are duplicates in there: I checked by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames, but if I were using pandas, at this point I would call pandas.DataFrame.set_index on that column.
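Roughly, the check I ran looked like this (a sketch; df and the column names are placeholders for my actual data):

data_cols = [c for c in df.columns if c != 'p_index']  # everything except the index

total = df.count()
distinct = df.select(data_cols).distinct().count()
# total > distinct, so duplicate rows exist once the index is excluded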

Does anyone know how to handle this situation?

Secondly, there appear to be two methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?

  • Share some of your code, which will help us understand the question better. Commented Sep 13, 2017 at 17:37

1 Answer


If you don't want the index column to be considered when checking for distinct records, you can drop the column with the command below, or select only the required columns.

df = df.drop('p_index')  # pass the name of the column to be dropped

df = df.select('name', 'age')  # pass only the required columns

drop_duplicates() is an alias for dropDuplicates().

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
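Alternatively, dropDuplicates accepts a subset argument, so you can keep the index column and simply exclude it from the comparison. A sketch, reusing the hypothetical column names from above:

df = df.dropDuplicates(subset=['name', 'age'])  # duplicates judged on name/age only; p_index stays in the result

This avoids losing the index column altogether while still ignoring it for deduplication.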


1 Comment

Thanks, and also for the link. What annoys me about the Spark docs is that half the time, when you put something into a search engine, you land on the source code rather than the docs, which is useless. Okay, I'll prefer the method without the underscore then; why does it need an alias? The key is the subset argument in the docs: it removes the worry about doing something to the index column.
