I am appending to a .csv file with Python. The data is scraped from the web, and I am through with almost everything related to the scraping itself.
The problem comes when I try to append to the file: it writes hundreds of duplicate entries of the same data, so I am sure there is a problem with the for/if statements in the loop that I am not able to identify and solve.
The condition checks whether the data scraped from the web matches data already in the file. If the data doesn't match, the program writes a new row; otherwise it continues to the next entry.
Note: csvFileArray is a list of dictionaries holding the data already present in the existing file. For example, print(csvFileArray[0]) gives:
{'Date': '19/05/21', 'Time': '14:51:00', 'Status': 'Waitlisted', 'School': 'MIT Sloan', 'Details': 'GPA: 3.4 Round: Round 2 | Texas'}
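For context, a minimal sketch of how a list like csvFileArray can be built (this part is not the problem; csv.DictReader and the header row are assumptions, the real reading code may differ):

import csv

# Read every existing row into a list of dicts keyed by the header row,
# producing dicts like the example above. Assumes file.csv has a header
# line with Date, Time, Status, School, Details.
with open('file.csv', newline='') as f:
    csvFileArray = list(csv.DictReader(f))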
Below is the code that has a problem.
file = open('file.csv', 'a')
writer = csv.writer(file)

# loop over page numbers
for page in range(15, 17):
    print("Getting page {}..".format(page))
    params["paged"] = page
    data = requests.post(url, data=params).json()
    soup = BeautifulSoup(data["markup"], "html.parser")

    for entry in soup.select(".livewire-entry"):
        datime = entry.select_one(".adate")
        status = entry.select_one(".status")
        name = status.find_next("strong")
        details = entry.select_one(".lw-details")

        datime = datime.get_text(strip=True)
        datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
        time = datime.time()  # returns time
        date = datime.date()  # returns date

        for firstentry in csvFileArray:
            condition = (firstentry['Date'] == date
                         and firstentry['Time'] == time
                         and firstentry['Status'] == status.get_text(strip=True)
                         and firstentry['School'] == name.get_text(strip=True)
                         and firstentry['Details'] == details.get_text(strip=True))
            if condition:
                continue
            else:
                writer.writerow([date, time, status.get_text(strip=True),
                                 name.get_text(strip=True), details.get_text(strip=True)])
                # print('ok')

    print("-" * 80)

file.close()
Use a set to ensure uniqueness instead of iterating over each element in csvFileArray for each new entry, for each new page. Now, a dict isn't hashable, so you'll need to do some transformation, but it will make the code clearer and more efficient.
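A minimal sketch of that idea, reusing the variables from the code above (csvFileArray, writer, and the scraped date, time, status, name, details); the date/time format strings are assumptions inferred from the example row ('19/05/21', '14:51:00'), since the existing file stores plain strings:

# Build a set of hashable keys (tuples of strings) from the rows already in the file.
seen = {
    (row['Date'], row['Time'], row['Status'], row['School'], row['Details'])
    for row in csvFileArray
}

# Inside "for entry in soup.select(...)", build the same kind of key for the
# scraped entry. The file stores strings, so format date and time the same way
# they appear in the file (format strings assumed from the example row).
key = (date.strftime('%d/%m/%y'),
       time.strftime('%H:%M:%S'),
       status.get_text(strip=True),
       name.get_text(strip=True),
       details.get_text(strip=True))

# Write the row only if it is not already present, and remember it so the same
# entry is not written twice within this run either.
if key not in seen:
    writer.writerow(list(key))
    seen.add(key)

Checking membership once per scraped entry also avoids the behaviour of the original inner loop, which writes a row for every existing row that fails to match instead of once after all existing rows have been checked.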