How to drop duplicates in a CSV with the pandas library in Python?

Question

I've been looking around and trying examples, but I can't get it to work the way I want.

I want to dedupe by 'Order ID' and extract the duplicates to a separate CSV. The main thing is that I need to be able to change the column I dedupe by; in this case it's 'Order ID'.

Example Data set:

ID    Fruit   Order ID    Quantity    Price
1     apple      1111        11       £2.00
2     banana     2222        22       £3.00
3     orange     3333        33       £5.00
4     mango      4444        44       £7.00
5     Kiwi       3333        55       £5.00

Output:

ID    Fruit   Order ID    Quantity    Price
5     Kiwi       3333        55       £5.00

I've tried this:

import pandas as pd

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate example.csv')

new_df = df[['ID','Fruit','Order ID','Quantity','Price']].drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

The issue I have is that it doesn't remove any duplicates.

All the example code I have tried either returns no data or just returns the exact same data set.

We can't do much with a description, can you share an actual minimal reproducible example? Also, please do not share information as images unless absolutely necessary. See: meta.stackoverflow.com/questions/303812/…, idownvotedbecau.se/imageofcode, idownvotedbecau.se/imageofanexception.
– AMC, Commented Apr 15, 2020 at 19:04

"Real" or not, it's the same idea. Can you share the data in a convenient format?
– AMC, Commented Apr 15, 2020 at 19:39

mrbTT · Accepted Answer · 2020-04-15 19:41:31Z

1

You can achieve this by creating a new DataFrame with value_counts(), merging, and then filtering.

# value_counts() returns a Series; to_frame() turns it into a DataFrame
# ('OrderID' must match the CSV header exactly; the sample data above spells it 'Order ID')
df_counts = df['OrderID'].value_counts().to_frame()
# rename the single column to something descriptive
df_counts.columns = ['order_counts']

# merge the original on column "OrderID" with the counts, which are indexed by OrderID
df_merged = pd.merge(df, df_counts, left_on='OrderID', right_index=True)

# the duplicates are the rows whose count is higher than 1
df_filtered = df_merged[df_merged['order_counts'] > 1]

# everything else, i.e. rows whose OrderID appears only once
df_not_duplicates = df_merged[df_merged['order_counts'] == 1]
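For reference, here is a minimal runnable sketch of the same value_counts() and merge approach. It assumes the sample data from the question, rebuilt inline rather than read from the CSV, and the header 'Order ID' with a space; the variable name column is only there to show where the dedupe column can be swapped.

```python
import pandas as pd

# Sample data copied from the question (built inline instead of read from the CSV)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Fruit": ["apple", "banana", "orange", "mango", "Kiwi"],
    "Order ID": [1111, 2222, 3333, 4444, 3333],
    "Quantity": [11, 22, 33, 44, 55],
    "Price": ["£2.00", "£3.00", "£5.00", "£7.00", "£5.00"],
})

column = "Order ID"  # change this to dedupe by another column

# Count how often each value occurs; the values become the index
df_counts = df[column].value_counts().to_frame()
df_counts.columns = ["order_counts"]

# Attach the count to every original row
df_merged = pd.merge(df, df_counts, left_on=column, right_index=True)

# Rows whose value occurs more than once (every copy is kept)
df_filtered = df_merged[df_merged["order_counts"] > 1]

# Rows whose value occurs exactly once
df_not_duplicates = df_merged[df_merged["order_counts"] == 1]

print(df_filtered)
print(df_not_duplicates)
```

With this data, df_filtered holds both rows that share Order ID 3333 (orange and Kiwi), not just the second one.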

edit: drop_duplicates() keeps only unique values; when it finds duplicates it removes all but one of them. Which one is kept is set by the argument "keep", which can be 'first' or 'last' (or False to drop every duplicated row).
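To make the effect of "keep" concrete, here is a small sketch on the question's sample data (again rebuilt inline, with the header assumed to be 'Order ID'). It also shows why calling drop_duplicates() without subset leaves this data unchanged.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Fruit": ["apple", "banana", "orange", "mango", "Kiwi"],
    "Order ID": [1111, 2222, 3333, 4444, 3333],
    "Quantity": [11, 22, 33, 44, 55],
    "Price": ["£2.00", "£3.00", "£5.00", "£7.00", "£5.00"],
})

column = "Order ID"

# No subset: rows are compared on all columns, so nothing is dropped here
no_subset = df.drop_duplicates()

# keep='first' keeps the first row of each Order ID (orange stays, Kiwi goes)
keep_first = df.drop_duplicates(subset=column, keep="first")

# keep='last' keeps the last row of each Order ID (Kiwi stays, orange goes)
keep_last = df.drop_duplicates(subset=column, keep="last")

# keep=False drops every row whose Order ID is repeated
keep_none = df.drop_duplicates(subset=column, keep=False)

print(keep_first["ID"].tolist())  # [1, 2, 3, 4]
print(keep_last["ID"].tolist())   # [1, 2, 4, 5]
print(keep_none["ID"].tolist())   # [1, 2, 4]
```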

edit2: From your comment, you want to export the result to CSV. Remember that above I've separated the data into 2 DataFrames:

a) Rows whose OrderID appears only once (df_not_duplicates)

b) Rows whose OrderID appears more than once, with every copy kept (df_filtered)

# Type 1: save all rows whose OrderID is duplicated, keeping every copy
df_filtered.to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")

# Type 2: the same rows, but only 1 line per OrderID
df_filtered.drop_duplicates(subset="OrderID", keep='last').to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")
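If the goal is exactly the output shown in the question (the deduplicated rows in one CSV and only the extra copies in another), one possible sketch uses DataFrame.duplicated() instead of the count column. The helper name split_duplicates, the file names and the temporary output folder are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd


def split_duplicates(df, column, keep="first"):
    """Hypothetical helper: return (deduplicated rows, removed duplicate rows)."""
    # True for every repeated row except the one selected by `keep`
    is_dup = df.duplicated(subset=column, keep=keep)
    return df[~is_dup], df[is_dup]


# Sample data copied from the question
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Fruit": ["apple", "banana", "orange", "mango", "Kiwi"],
    "Order ID": [1111, 2222, 3333, 4444, 3333],
    "Quantity": [11, 22, 33, 44, 55],
    "Price": ["£2.00", "£3.00", "£5.00", "£7.00", "£5.00"],
})

# The column to dedupe by is just an argument, so it is easy to change
deduped, duplicates = split_duplicates(sample, "Order ID")

# Write both results to a temporary folder (replace with your own paths)
out_dir = tempfile.mkdtemp()
deduped_path = os.path.join(out_dir, "deduped.csv")
duplicates_path = os.path.join(out_dir, "duplicates.csv")
deduped.to_csv(deduped_path, index=False)
duplicates.to_csv(duplicates_path, index=False)

print(duplicates)
```

With keep="first", the duplicates frame contains only the Kiwi row, which matches the expected output in the question.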

edited Apr 15, 2020 at 19:41

answered Apr 15, 2020 at 19:29

mrbTT


8 Comments

shaneo Over a year ago

Hi, thank you. Will this export the duplicates to a CSV as well?

mrbTT Over a year ago

You can export the new DataFrame to a .csv; I'll edit the answer with an example.

mrbTT Over a year ago

@shaneo, done. I believe you want the "Type 2" export.

mrbTT Over a year ago

By the way, the reason your drop_duplicates() wasn't working is that you didn't set the argument subset to the column you want, so it looked for duplicates considering all columns.

shaneo Over a year ago

Thank you so much, it pulled the dupes to the CSV file. What I wanted to achieve was a function like deduplicate in Excel, deduping on whichever column you want. Hopefully I wasn't too confusing. Thank you.

techPirate99 · Accepted Answer · 2020-04-15 20:04:30Z

0

The error is in the second line of code. If you want to use the drop_duplicates method, call it on a DataFrame (for example, one built with pd.DataFrame).

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicateexample.csv')

# Create dataframe with duplicates (Kiwi repeats Order ID 3333, as in the question's data)
raw_data = {'ID': [1, 2, 3, 4, 5],
            'Fruit': ['apple', 'Banana', 'Orange', 'Mango', 'Kiwi'],
            'Order ID': [1111, 2222, 3333, 4444, 3333],
            'Quantity': [11, 22, 33, 44, 55],
            'Price': [2, 3, 5, 7, 5]}

# drop_duplicates() with no subset only drops rows that match in every column
new_df = pd.DataFrame(raw_data, columns=['ID', 'Fruit', 'Order ID', 'Quantity', 'Price']).drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

Hope it helps.

answered Apr 15, 2020 at 20:04

techPirate99

2 Comments

shaneo Over a year ago

Hi, thank you. The issue is that it returns the same data, not only the duplicates. The drop method was my error, I see now. I've been trying to figure this out for a while, working through different examples; I'm still new to pandas and Python.

techPirate99 Over a year ago

@shaneo, I gather that you are new to Python and pandas. Keep working on it. Kudos!
