
I have a data set that is more than 100 MB in size and spread across many files. Each file has more than 20 columns and about 1 million rows.

The main problem with data is:

  1. Headers are repeating -- duplicate header rows
  2. Rows are duplicated in full, i.e. the data in every column of that particular row is duplicated.

Without bothering about which column or how many columns, I only need to keep the first occurrence and remove the rest.

I did find many examples, but they all use a separate input file and output file. The only reason I am seeking help is that I want the same file to be edited in place.

Sample input: https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0

I appreciate the help; thanks in advance.

  • If the header is repeated, i.e. the same, you could just store the first line, loop over each consecutive line, and add it to a new array if it differs from the first line? Commented Sep 22, 2017 at 8:07
  • @OptimusCrime Actually it is web-scraped data that is downloaded using several loops and conditions, and for every condition a new header is generated. I tried to fix this on the download side but could not, so I am trying to write a separate program that removes the duplication and updates the same file. Commented Sep 22, 2017 at 8:10
  • I still don't see the problem. Loop and check for identical headers. You can also loop and check for identical lines/rows. A Google search should return a million results for finding and removing a duplicate line in a file using Python. Commented Sep 22, 2017 at 8:13
  • Can you post a small reproducible input data set and the desired data set? Commented Sep 22, 2017 at 8:14
  • Yes, they do, but they all use a different input file and output file. I already have too many files, so I can't have another set of as many files as outputs. The only reason to ask for a solution is that I want to update the same file. Commented Sep 22, 2017 at 8:15

1 Answer


If the number of duplicate headers is known and constant, skip those rows:

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)

Alternatively, with the bonus of removing all duplicates based on all columns, do this:

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()

Now you still have one header line left in the data; just skip it:

csv = csv.iloc[1:]

You can then overwrite the input file with pandas.DataFrame.to_csv.
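Putting the steps above together as one function: since the repeated headers can also occur mid-file rather than only at the top, a sketch can treat any data row whose values equal the column names as a stray header, drop full-row duplicates keeping the first occurrence, and write back to the same path. The function name and the comparison trick are my own, not from the answer:

```python
import pandas as pd

def dedupe_in_place(path):
    """Remove repeated header rows and fully duplicated data rows,
    keeping the first occurrence, then overwrite the same file."""
    df = pd.read_csv(path)
    # A repeated header shows up as a data row whose values all equal
    # the column names, wherever it occurs in the file.
    df = df[~(df == df.columns).all(axis=1)]
    # Drop rows that are duplicates across all columns.
    df = df.drop_duplicates(keep='first')
    # Write back over the original file, as the question asks.
    df.to_csv(path, index=False)
```

Note that a mid-file header row forces the affected columns to object (string) dtype, which is exactly what makes the row-vs-column-names comparison work.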


4 Comments

Thank you for the solution, but the repeated headers are not only at the beginning; they also appear in the middle of the file, one row in some places and two or three in others.
It did not work in my case, but it should surely be useful in other situations. Thank you again.
In which way is it not working? What do you get, and what do you expect?
Maybe I was not able to fit it properly into my complete code structure.
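Given that the files run past 100 MB each, a streaming alternative using only the stdlib csv module may also be worth sketching: a repeated header is just another duplicate row, so a single pass that keeps only the first occurrence of every row handles both problems and then replaces the original file. This is my own sketch, not from the thread; it still holds a set of all unique rows in memory:

```python
import csv
import os
import tempfile

def dedupe_stream(path):
    """Keep the first occurrence of every row (the header included),
    writing to a temp file, then atomically replace the original."""
    seen = set()
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with open(path, newline='') as src, \
         os.fdopen(fd, 'w', newline='') as dst:
        writer = csv.writer(dst, lineterminator='\n')
        for row in csv.reader(src):
            key = tuple(row)
            if key not in seen:          # first occurrence only
                seen.add(key)
                writer.writerow(row)
    os.replace(tmp, path)  # overwrite the same file, as requested
```

Using os.replace means the original file is never left half-written if the process dies mid-run.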
