
I am scraping the web with Python and writing the data to a .csv file that looks like the one below. If I append to the file, I may end up with repeated/duplicate rows. What can I use to avoid that? I am not sure whether I should open the file in pandas and then drop duplicates there. I tried some approaches of my own but was unable to come up with a solution, so I was considering pandas as a last option.

Date,Time,Status,School,GPA,GRE,GMAT,Round,Location,Post-MBA Career,via,on,Details,Note
2021-05-18,13:59:00,Accepted from Waitlist,Yale SOM,3.8,No data provided,740,Round 2 ,NYC,Non Profit / Social Impact,phone,2021-05-18,GPA: 3.8 GMAT: 740 Round: Round 2 | NYC,Interviewed and was waitlisted in R2. Just received the call this afternoon. Good luck everyone!
2021-05-18,13:51:00,Accepted from Waitlist,Yale SOM,3.8,323,No data provided,Round 2 ,Austin,Marketing,phone,2021-05-18,GPA: 3.8 GRE: 323 Round: Round 2 | Austin,Keep your head up! It all works out how it is supposed to.
  • Do the duplicates correspond to exactly identical lines? And are those duplicates consecutive in the file? Commented May 20, 2021 at 12:33
  • Yes, and no, they're scattered. Commented May 20, 2021 at 12:37
  • Then pandas and drop_duplicates is probably your best option if you intend to later use pandas on the data. If you do not, and if the file can fit in memory, then using a set of lines should do the job. Commented May 20, 2021 at 12:47

2 Answers


If you want to do it with pandas:

import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")

# 2(a). Drop complete-row duplicates (rows identical in every column)
df.drop_duplicates(inplace=True)

# 2(b). Drop partial duplicates (rows matching on a subset of columns)
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)

# 3. Save back, without writing the index as an extra column
df.to_csv("data.csv", index=False)

2 Comments

Thanks. When you say "complete row duplicate", does that mean it checks every field for duplicate values? And for partials, it checks only the given n fields?
Yes, for complete it must be an exact row (all fields), and for partial it checks just the n fields, e.g. for subset=['Date', 'Time'] it removes all rows with the same Date and Time (i.e. 4 rows with the same Date and Time are reduced to 1).
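The distinction in the comments above can be checked with a small self-contained example (the frame and values here are illustrative, modeled on the columns in the question):

```python
import pandas as pd

# Three rows: rows 0 and 1 are identical in every column;
# row 2 matches them on Date and Time but differs in School.
df = pd.DataFrame({
    "Date":   ["2021-05-18", "2021-05-18", "2021-05-18"],
    "Time":   ["13:59:00", "13:59:00", "13:59:00"],
    "School": ["Yale SOM", "Yale SOM", "Booth"],
})

# Complete-row duplicates: only fully identical rows are dropped
full = df.drop_duplicates()
print(len(full))  # 2

# Partial duplicates: rows matching on the subset columns collapse to one
partial = df.drop_duplicates(subset=["Date", "Time"])
print(len(partial))  # 1
```

So the subset version is stricter about what counts as a duplicate: any rows agreeing on just those columns are collapsed, keeping the first by default.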

Maybe read through the lines one at a time, store them in a set (so there's no duplicates), and then write them back?

# Collect each line in a set, which discards duplicates automatically.
# Note: a set does not preserve order, so the header line may move.
lines = set()
file = 'foo.txt'
with open(file) as fd:
    for line in fd:
        lines.add(line)

# Rewrite the file with only the unique lines
with open(file, 'w') as fd:
    fd.write(''.join(lines))

4 Comments

The file has about 60,000 entries. Do you think that will be feasible?
Should be fine; only one way to find out ;)
This makes sense, but I have a question: how do you write back? That is, how do I add more data?
Can you show an example? Just add extra elements to lines; make sure they end with \n
