
I have a data set that is more than 100 MB in size and spread across many files. Each file has more than 20 columns and about 1 million rows.

The main problem with data is:

  1. Headers are repeating -- duplicate header rows
  2. Rows are duplicated in full, i.e. the data in every column of that particular row is duplicated.

Without bothering about which column or how many columns, I only need to keep the first occurrence and remove the rest.

I did find many examples, but they all use a separate input file and output file. The only reason I am seeking help is that I want the same file to be edited in place.

Sample input: https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0

I appreciate the help; thanks in advance.

  • If the header is repeated, i.e. the same, you could just store the first line, loop over each consecutive line, and add it to a new array if it differs from the first line? Commented Sep 22, 2017 at 8:07
  • @OptimusCrime Actually it is web-scraped data that is downloaded using several loops and conditions, and for every condition a new header is generated. I tried to fix this on the download side but could not, so I am trying to write a separate program that removes the duplication and updates the same file. Commented Sep 22, 2017 at 8:10
  • I still don't see the problem. Loop and check for identical headers. You can also loop and check for identical lines/rows. A Google search should return a million results for finding and removing a duplicate line in a file using Python. Commented Sep 22, 2017 at 8:13
  • Can you post a small reproducible input data set and the desired data set? Commented Sep 22, 2017 at 8:14
  • Yes, they do, but they all use a different input file and output file. I already have too many files, so I can't have another set of as many files as outputs. The only reason to ask for a solution is that I want to update the same file. Commented Sep 22, 2017 at 8:15

1 Answer


If the number of duplicate headers is known and constant, skip those rows:

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)

Alternatively, with the bonus of removing all duplicates based on all columns, do this:

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()

Now you still have one header line left in the data; just skip it:

csv = csv.iloc[1:]

You can then overwrite the input file with pandas.DataFrame.to_csv.
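Putting the steps above together as one function: since the repeated headers can also occur mid-file rather than only at the top, a sketch can treat any data row whose values equal the column names as a stray header, drop full-row duplicates keeping the first occurrence, and write back to the same path. The function name and the comparison trick are my own, not from the answer:

```python
import pandas as pd

def dedupe_in_place(path):
    """Remove repeated header rows and fully duplicated data rows,
    keeping the first occurrence, then overwrite the same file."""
    df = pd.read_csv(path)
    # A repeated header shows up as a data row whose values all equal
    # the column names, wherever it occurs in the file.
    df = df[~(df == df.columns).all(axis=1)]
    # Drop rows that are duplicates across all columns.
    df = df.drop_duplicates(keep='first')
    # Write back over the original file, as the question asks.
    df.to_csv(path, index=False)
```

Note that a mid-file header row forces the affected columns to object (string) dtype, which is exactly what makes the row-vs-column-names comparison work.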


4 Comments

Thank you for the solution, but the repeated headers are not only at the beginning; they also appear in the middle of the file, one row in some places and two or three in others.
It did not work in my case, but it should surely be useful in other situations. Thank you again.
In which way is it not working? What do you get, and what do you expect?
Maybe I was not able to fit it properly into my complete code structure.
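Given that the files run past 100 MB each, a streaming alternative using only the stdlib csv module may also be worth sketching: a repeated header is just another duplicate row, so a single pass that keeps only the first occurrence of every row handles both problems and then replaces the original file. This is my own sketch, not from the thread; it still holds a set of all unique rows in memory:

```python
import csv
import os
import tempfile

def dedupe_stream(path):
    """Keep the first occurrence of every row (the header included),
    writing to a temp file, then atomically replace the original."""
    seen = set()
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with open(path, newline='') as src, \
         os.fdopen(fd, 'w', newline='') as dst:
        writer = csv.writer(dst, lineterminator='\n')
        for row in csv.reader(src):
            key = tuple(row)
            if key not in seen:          # first occurrence only
                seen.add(key)
                writer.writerow(row)
    os.replace(tmp, path)  # overwrite the same file, as requested
```

Using os.replace means the original file is never left half-written if the process dies mid-run.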
