
For this project, I am scraping data from a database and attempting to export this data to a spreadsheet for further analysis.

While my code mostly seems to work, I am having no luck with the last step: exporting to CSV. This question has been asked a few times, but the answers were geared towards different approaches, and I had no luck adapting them.

My code is below:

from bs4 import BeautifulSoup
import requests
import re
url1 = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno="
url2 = "&totalpages=55&totalcount=1368&secondaryaction=prev25"

date1 = []
date2 = []
date3 = []
party=[]
riding=[]
candidate=[]
winning=[]
number=[]

for i in range(1, 56):
    r  = requests.get(url1 + str(i) + url2)
    data = r.text
    cat = BeautifulSoup(data)
    links = []
    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))  
    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data)
        date1.append(cat.find_all('span')[2].contents)
        date2.append(cat.find_all('span')[3].contents)
        date3.append(cat.find_all('span')[5].contents)
        party.append(re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip())
        riding.append(re.sub("[\n\r/]", "",  cat.find_all('div', class_="group")[2].contents[2]).strip())  
        cs= cat.find_all("table")[0].find_all("td", headers="name/1")        
        elected=[]
        for c in cs:
            elected.append(c.contents[0].strip())
        number.append(len(elected))
        candidate.append(elected)
        winning.append(cs[0].contents[0].strip())


import csv

file = ""

for i in range(0,len(date1)):
    file = [file,date1[i],date2[i],date3[i],party[i],riding[i],"\n"]

with open ('filename.csv','rb') as file:
   writer=csv.writer(file)
   for row in file:
       writer.writerow(row)

Really--any tips would be GREATLY appreciated. Thanks a lot.

PART 2: Another question: I previously thought that finding the winning candidate in the table could be simplified by always selecting the first name that appears, as I assumed the winners always appeared first. However, this is not the case. Whether or not a candidate was elected is stored in the form of a picture in the first column. How would I scrape this and store it in a spreadsheet? It is located in a <td headers> cell as:

    <img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest">

I had an idea for attempting some sort of Boolean check, but I am unsure how to implement it. Thanks a lot.

UPDATE: This question is now a separate post here.
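For reference, a rough, untested sketch of the Boolean check I had in mind. It assumes the image sits in its own cell alongside each candidate's name; the `headers="selected/1"` attribute and the sample markup here are guesses, not taken from the actual page:

```python
from bs4 import BeautifulSoup

def is_winner(td):
    """Return True if this cell contains the selected_box.gif winner marker."""
    return td.find("img", src=lambda s: s and "selected_box.gif" in s) is not None

# Hypothetical markup mimicking one winner row and one non-winner row
html = """
<table>
  <tr><td headers="selected/1"><img src="/WPAPPS/WPR/Content/Images/selected_box.gif"
       alt="contestant won this nomination contest"></td><td headers="name/1">A. Winner</td></tr>
  <tr><td headers="selected/1"></td><td headers="name/1">B. Candidate</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
flags = [is_winner(td) for td in soup.find_all("td", headers="selected/1")]
print(flags)  # → [True, False]
```

The resulting True/False list could then be written to the CSV as an extra column alongside each candidate.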

  • Note, you currently have open('filename.csv','rb'), you should open the file for writing as open('filename.csv','wb'). Commented Sep 29, 2016 at 6:21

1 Answer


The following should correctly export your data to a CSV file:

from bs4 import BeautifulSoup
import requests
import re
import csv


url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
rows = []

for i in range(1, 56):
    print(i)
    r  = requests.get(url.format(i))
    data = r.text
    cat = BeautifulSoup(data, "html.parser")
    links = []

    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))  

    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        lspans = cat.find_all('span')
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")        
        elected = []

        for c in cs:
            elected.append(c.contents[0].strip())

        rows.append([
            lspans[2].contents[0], 
            lspans[3].contents[0], 
            lspans[5].contents[0],
            re.sub(r"[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            re.sub(r"[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
            len(elected),
            cs[0].contents[0].strip()
            ])

with open('filename.csv', 'w', newline='', encoding='utf-8') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(rows)

Giving you the following kind of output in your CSV file:

"September 17, 2016","September 13, 2016","September 17, 2016",Liberal,Medicine Hat--Cardston--Warner,1,Stanley Sakamoto
"June 25, 2016","May 12, 2016","June 25, 2016",Conservative,Medicine Hat--Cardston--Warner,6,Brian Benoit
"September 28, 2015","September 28, 2015","September 28, 2015",Liberal,Cowichan--Malahat--Langford,1,Luke Krayenhoff

There is no need to build up a separate list for each column of your data; it is easier to build a list of rows directly. That list can then be written to the CSV in one go (or a row at a time as you are gathering the data).
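The row-list approach can be sketched in miniature with just the standard-library csv module (the names and data here are stand-ins, not from the scrape):

```python
import csv
import io

# Collect each record as a list of fields, appending one row per record
rows = []
for name, votes in [("Alice", 3), ("Bob", 5)]:  # stand-in data
    rows.append([name, votes])

# Write every row in one go; StringIO stands in for a real file opened
# with open('filename.csv', 'w', newline='')
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())  # rows come out as "Alice,3" and "Bob,5"
```

Because each element of `rows` is already a list of fields, `writerows` needs no further bookkeeping, unlike the parallel-lists version, which has to be re-zipped by index before writing.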


4 Comments

Thank you! Worked very well. I should have mentioned that I am working with 3.5, so I changed open('filename.csv', 'wb') to open('filename.csv', 'w') to remedy a type error I was getting. Really appreciate the response!
Any ideas on Part 2, anyone? Would that be more appropriate for a separate thread?
For Python 3, you should also use newline='' to open the file, I have updated the script. A separate question might work better, I suggest you reduce the problem to the bare minimum number of lines of code.
Thanks again! Update: added my other question as a separate post here
