I have a data set of more than 100 MB spread across many files. Each file has more than 20 columns and over 1 million rows.
The main problems with the data are:
- Repeated headers, i.e. the header row appears more than once in the file
- Fully duplicated rows, i.e. rows where the data in every column matches an earlier row
Regardless of which columns or how many columns are involved, I only need to keep the first occurrence of each row and remove the rest.
I found plenty of examples, but what I am looking for is for the input and the output to be the same file. The only reason I am asking for help is that I want the file edited in place.
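To make the goal concrete, here is a minimal sketch of the kind of in-place dedup I have in mind (the function name and the temp-file-then-replace approach are my own assumptions, not taken from any existing example):

```python
import os
import tempfile

def dedupe_in_place(path):
    """Keep the first occurrence of every line (header included) and
    drop all later exact duplicates, writing back to the same file."""
    seen = set()
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write survivors to a temp file in the same directory, then
    # atomically replace the original so the edit is effectively in place.
    with open(path, "r", newline="") as src, \
         tempfile.NamedTemporaryFile("w", dir=dir_name,
                                     delete=False, newline="") as tmp:
        for line in src:
            if line not in seen:
                seen.add(line)
                tmp.write(line)
        tmp_name = tmp.name
    os.replace(tmp_name, path)
```

This treats each full line as the dedup key, so a repeated header is removed the same way as any other duplicate row; the trade-off is that the `seen` set holds one copy of every distinct line in memory, which may matter for very large files.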
Sample input: https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0
I appreciate the help, thanks in advance.