
I have to scrape all the info on former US governors from this site. To read out the results and then follow the links, I need to page through the different results pages, or, preferably, simply set the number of results shown per page to the maximum of 100 (I don't think there are more than 100 results for each state). However, the page-size control seems to be driven by JavaScript: it is not part of a form, and it seems I cannot access it as a mechanize control.

Any advice on how to proceed? I am pretty new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form.

import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv','wb'))
br = mechanize.Browser()

response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items
for pos, item in enumerate(states[1:2]): 
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # TODO: set the page limit to 100, get links and descriptions,
    # and follow each link to get the information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url
1 Comment

Change the pagesize to 2500 and save the HTML, then parse the saved HTML however you want. (Jun 23, 2013)

5 Answers


I solved this problem with Selenium. It drives a complete Firefox (or other) browser, which you can control from code.
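A minimal sketch of that approach, assuming Selenium and Firefox are installed (the import is kept inside the function since Selenium is a heavy optional dependency):

```python
def fetch_rendered_html(url):
    """Open `url` in a real Firefox instance and return the HTML
    after the page's JavaScript has run."""
    from selenium import webdriver  # optional heavy dependency
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source   # the DOM as JavaScript left it
    finally:
        driver.quit()               # always close the browser
```

You would then feed the returned HTML to lxml or BeautifulSoup exactly as with a static page.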




You can use PySide, which provides bindings for QtWebKit. With QtWebKit you can retrieve a page that uses JavaScript and parse it once the JavaScript has populated the HTML, so you don't need to know any JavaScript yourself. Other alternatives are Selenium and PhantomJS.
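A rough sketch of the QtWebKit route, assuming PySide 1.x (which shipped with QtWebKit); the imports are deferred since PySide is a large optional dependency, and the event loop quits once the page, scripts included, has finished loading:

```python
def render_with_qtwebkit(url):
    """Load `url` in an off-screen QtWebKit page and return the HTML
    serialized after JavaScript has populated it."""
    from PySide.QtCore import QUrl           # optional heavy
    from PySide.QtGui import QApplication    # dependencies,
    from PySide.QtWebKit import QWebPage     # imported lazily
    app = QApplication([])
    page = QWebPage()
    result = {}

    def on_load_finished(ok):
        # toHtml() gives the DOM *after* scripts ran, not the raw source
        result['html'] = page.mainFrame().toHtml()
        app.quit()

    page.loadFinished.connect(on_load_finished)
    page.mainFrame().load(QUrl(url))
    app.exec_()                              # block until loadFinished
    return result.get('html')
```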



OK, this is a screwball approach. Playing around with the different search settings, I found that the number of results to display is part of the URL. So I changed it to 3000 per page, and everything fits on one page:

http://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=0&higherOfficesServed=&lastName=&sex=Any&honors=&submit=Search&state=Any&college=&party=&inOffice=Any&biography=&race=Any&birthState=Any&religion=&militaryService=&firstName=&nbrterms=Any&warsServed=&&pagesizecac77e09-db17-41cb-9de0-687b843338d0=3000

After it loads, which does take a while, I'd right-click and go to "View Page Source", then copy that into a text file on my computer. Then I can scrape the info I need from the file without going back to the server and having to process the JavaScript.

May I recommend BeautifulSoup for getting around in the HTML file.
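Once the big results page is saved locally, a sketch along these lines pulls out the governor links; note the `'FormerGovBios'` substring used to filter the links is an assumption about the site's URL scheme, not something taken from the actual markup:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def governor_links(html):
    """Return (link text, href) pairs for links into the bios section.

    Filtering on 'FormerGovBios' in the href is an assumption about
    how the site names its bio URLs; adjust after inspecting the page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)
            if 'FormerGovBios' in a['href']]
```

Usage would be `governor_links(open('saved_results.html').read())`, with the filename being whatever you saved the page source as.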

2 Comments

Somehow I missed this; this was the easiest way. Thanks, getting the data just now.
I'm glad I could help. If you need anything else clarified, feel free to comment.

I would do that with PhantomJS (http://phantomjs.org/), a scriptable headless browser; see https://github.com/ariya/phantomjs/wiki/Page-Automation.

1 Comment

I know virtually nothing about JavaScript. How would I go about doing this?

Note that the select element on that page changes the window.location.

I think you can construct an appropriate URL to load the page simply by replacing the value of $('#pageSizeSelector....-..-..-..-....').val() with the value you need.
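As a sketch, that URL can be built from Python with the standard library; the pagesize parameter name, GUID suffix included, is copied from the working URL shown in an earlier answer, and the other parameters are the defaults that URL uses:

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def results_url(pagesize):
    """Build the search URL with an explicit per-page result count.

    The 'pagesizecac77e09-...' parameter name comes from the URL in the
    earlier answer on this page; it is site-specific, not documented.
    """
    base = "http://www.nga.org/cms/FormerGovBios"
    params = {
        "submit": "Search",
        "state": "Any",
        "pagesizecac77e09-db17-41cb-9de0-687b843338d0": str(pagesize),
    }
    return base + "?" + urlencode(params)
```

The result can then be fetched with mechanize or urllib directly, skipping the JavaScript-driven selector entirely.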

