The drop duplicates methods of Spark DataFrames is not working and I think it is because the index column which was part of my dataset is being treated as a column of data. There definitely are duplicates in there, I checked it by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames but if I was using Pandas, at this point I would do pandas.DataFrame.set_index on that column.
Does anyone know how to handle this situation?
Secondly, there appears to be 2 methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?