I have a .csv file with multiple duplicate data entries.

Example of entries when viewed in notepad:

"Tom    1234"
"Andrew    4321"

I would like to extract the duplicate entries into another .csv along with line numbers. An expected output would look something like this.

[image: expected output]

Using

import pandas as pd
df = pd.read_csv('sample_dup.csv')
df[df.duplicated(subset=None, keep=False)].to_csv('dups.csv')

I managed to export this,

[image: actual output]

But my expected result is supposed to be this,

[image: expected output]

This is the data file in question

[image: input data file]

Why does the first entry keep appearing at the top of the list, and why is the numbering incorrect?

  • Where specifically are you stuck in doing this? Commented Nov 5, 2021 at 14:12

1 Answer

Use DataFrame.duplicated:

df[df.duplicated()].to_csv('dups.csv')
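
Note that the default keep='first' differs from the keep=False used in the question. A minimal sketch of the difference, using a made-up DataFrame (the column names and values here are only for illustration):

```python
import pandas as pd

# Hypothetical data: the first and third rows are identical.
df = pd.DataFrame({"name": ["Tom", "Andrew", "Tom"],
                   "value": [1234, 4321, 1234]})

# Default keep="first": the first occurrence is NOT flagged, only later repeats.
repeats_only = df[df.duplicated()]

# keep=False: every member of a duplicate group is flagged.
all_copies = df[df.duplicated(keep=False)]
```

Here repeats_only holds only the third row, while all_copies holds both Tom rows, so keep=False is probably what you want if the original and its copies should all be exported.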

To check for duplicates on only some columns, pass them via the subset parameter.
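
As for the first entry and the numbering: a likely cause, assuming the file has no header row, is that read_csv uses the first line as the column names, so it is written at the top of every export and never counted as data. The index is also 0-based, while line numbers in a file start at 1. A rough sketch with in-memory sample data standing in for the real file (the column names "name" and "value" and the index label "line" are my own, not from the question):

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; assumes comma-separated, no header row.
sample = io.StringIO("Tom,1234\nAndrew,4321\nTom,9999\nSam,1111\n")

# header=None stops pandas from consuming the first data row as column names.
df = pd.read_csv(sample, header=None, names=["name", "value"])

# Shift the default 0-based index so it matches 1-based file line numbers.
df.index = df.index + 1
df.index.name = "line"

# subset limits the comparison to the listed columns;
# keep=False flags every row in each duplicate group.
dups = df[df.duplicated(subset=["name"], keep=False)]

# The index is written as the first column, giving the line numbers.
csv_text = dups.to_csv()
```

With a real file you would pass the path instead of the StringIO object and call dups.to_csv('dups.csv').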
