Extracting duplicate data from a .csv into another .csv file using Python

Question

I have a .csv file with multiple duplicate data entries.

Example of entries when viewed in Notepad:

"Tom    1234"
"Andrew    4321"

I would like to extract the duplicate entries into another .csv along with line numbers. An expected output would look something like this.

Using

import pandas as pd
df = pd.read_csv('sample_dup.csv')
df[df.duplicated(subset=None, keep=False)].to_csv('dups.csv')

I managed to export this,

But my expected result is supposed to be this,

This is the data file in question

Why does the first entry keep appearing at the top of the list, and why is the numbering incorrect?

Vivek Kalyanarangan · Accepted Answer · 2021-11-05 14:19:05Z


df[df.duplicated()].to_csv('dups.csv')
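As a rough illustration of how this differs from the code in the question, here is a small sketch of the `keep` argument of `duplicated()`. The sample rows and column names are made up for the example and are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical sample data, not the asker's file.
df = pd.DataFrame({
    "name": ["Tom", "Andrew", "Tom", "Tom"],
    "num": [1234, 4321, 1234, 1234],
})

# Default keep='first': the first occurrence of each group is NOT flagged,
# so only the later copies are returned.
later_only = df[df.duplicated()]

# keep=False: every member of a duplicate group is flagged,
# including the first occurrence.
all_dups = df[df.duplicated(keep=False)]

print(later_only.index.tolist())
print(all_dups.index.tolist())
```

With the default, the output contains only the repeated rows; with `keep=False`, the original row is included as well, which may be what is wanted when every copy should be listed with its position.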

If you want to compare only some of the columns, pass them via the subset parameter...
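A possible explanation for the symptoms in the question, offered as an assumption because the data file is not shown here: `read_csv` treats the first line as a header by default, so the first entry becomes the column name and the index starts at 0 on the second line. The sketch below assumes the file has no header row and a single column; the inline text, the column name `entry`, and the label `line` are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical contents standing in for sample_dup.csv
# (assumed: no header row, one entry per line).
raw = "Tom    1234\nAndrew    4321\nTom    1234\n"

# header=None keeps the first line as data instead of using it as column names.
df = pd.read_csv(io.StringIO(raw), header=None, names=["entry"])

# Shift the 0-based index so it matches 1-based line numbers in the file.
df.index = df.index + 1

# keep=False flags every copy; subset limits the comparison to the named columns.
dups = df[df.duplicated(subset=["entry"], keep=False)]

# index_label names the line-number column; with no path, to_csv returns a string.
csv_text = dups.to_csv(index_label="line")
print(csv_text)
```

To write a file instead, pass a path such as `'dups.csv'` as the first argument to `to_csv`. If the real file is tab- or comma-separated with several columns, the `sep` and `names` arguments would need adjusting.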

