0

I've been looking around tried to get examples but can't get it work the way i want to.

I want to dedupe by 'OrderID' and extract duplicates to seperate CSV. Main thing is I need to be able to change the column which I want to dedupe by, in this case its 'Order ID'.

Example Data set:

ID    Fruit   Order ID    Quantity    Price
1     apple      1111        11       £2.00
2     banana     2222        22       £3.00
3     orange     3333        33       £5.00
4     mango      4444        44       £7.00
5     Kiwi       3333        55       £5.00

Output:

ID    Fruit   Order ID    Quantity    Price
5     Kiwi       3333        55       £5.00

I've tried this:

import pandas as pd

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate example.csv')

new_df = df[['ID','Fruit','Order ID','Quantity','Price']].drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

Issue i have is it doesn't remove any duplicates.

3
  • All the examples code I have tried either return no data set or just return the exact same data set. We can't do much with a description, can you share an actual minimal reproducible example? Also, please do not share information as images unless absolutely necessary. See: meta.stackoverflow.com/questions/303812/…, idownvotedbecau.se/imageofcode, idownvotedbecau.se/imageofanexception. Commented Apr 15, 2020 at 19:04
  • Hi, the image before was an example data set nothing real Commented Apr 15, 2020 at 19:29
  • "Real" or not, it's the same idea. Can you share the data in a convenient format? Commented Apr 15, 2020 at 19:39

2 Answers 2

1

You can achieve this by creating a new dataframe with value_counts(), merging and than filtering.

# value_counts returns a Series, to_frame() makes it into DataFrame
df_counts = df['OrderID'].value_counts().to_frame()
# rename the column
df_counts.columns = ['order_counts']

# merging original on column "OrderID" and the counts by it's index
df_merged = pd.merge(df, df_counts, left_on='OrderID', right_index=True)

# Then to get the ones which are duplicate is just the ones that count is higher than 1
df_filtered = df_merged[df_merged['order_counts']>1]

# if you want everything else that isn't a duplicate
df_not_duplicates = df_merged[df_merged['order_counts']==1]

edit: the drop_duplicates() keeps only unique values, but if it finds duplicates it will remove all values but one. Which one to keep you set it by the argument "keep" which can be 'first' or 'last'

edit2: From your comment you want to export the result to csv. Remember, the way I did above I've separated in 2 DataFrames:

a) All items that had a duplicate removed (df_not_duplicates)

b) Only items that had a duplicate still duplicated (df_filtered)

# Type 1 saving all OrderIds that had duplicates but still with duplicates:
df_filtered.to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")

# Type 2, all OrderIDs that had duplicate values, but only 1 line per OrderID
df_filtered.drop_duplicates(subset="OrderID", keep='last').to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")
Sign up to request clarification or add additional context in comments.

8 Comments

Hi, thank you, will this export the duplicates to a csv as well
you can export the new to a .csv, i'll edit the answer with an example
@shaneo, done. I believe you want the "Type 2" export.
by the way, the reason your drop_duplicates() wasn't working is because you didn't set the argument subset to the column you want. So it tried to find duplicates considering all columns.
thank you so much is pulled the dupes to the csv file. The aim i wanted to achieve a function like deduplicate in excel, by deduping on what column you want. Hopefully i wasn't to confusing. Thank you
|
0

The error is in second line of code (you should use pd.DataFrame), if you want to use drop_duplicates method.

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicateexample.csv')

# Create dataframe with duplicates
raw_data = {'ID': [1,2,3,4,5], 
            'Fruit': ['apple', 'Banana', 'Orange','Mango', 'Kiwi'], 
            'Order ID': [1111, 2222, 3333, 4444, 5555], 
        'Quantity': [11, 22, 33, 44, 55],
        'Price': [ 2, 3, 5, 7, 5]}

new_df = pd.DataFrame(raw_data, columns = ['ID','Fruit','Order ID','Quantity','Price']).drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

Hope it helps.

2 Comments

Hi, thank you. Issue is, it returns the same data not the duplicate only. Drop method was error by me i see now. I've been trying to figure this out for a while and working with different example still new to the pandas and python.
@shaneo, I got the insight that you are new to python and pandas. Keep working on it. kudos!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.