
I have to scrape all the info on former US governors from this site. To read out the results and then follow the links, I need to page through the different results pages, or, preferably, simply set the number of results shown per page to the maximum of 100 (I don't think there are more than 100 results for each state). However, the page-size control seems to be driven by JavaScript: it is not part of a form, and it seems I cannot access it as a mechanize control.

Any advice on how to proceed? I am pretty new to Python and only use it for tasks like this from time to time. Here is some simple code that iterates through the main form.

import mechanize
import lxml.html
import csv

site = "http://www.nga.org/cms/FormerGovBios"
output = csv.writer(open(r'output.csv','wb'))
br = mechanize.Browser()

response = br.open(site)
br.select_form(name="governorsSearchForm")
states = br.find_control(id="states-field", type="select").items
for pos, item in enumerate(states[1:2]): 
    statename = str([label.text for label in item.get_labels()])
    print pos, item.name, statename, len(states)
    br.select_form(name="governorsSearchForm")
    br["state"] = [item.name]
    response = br.submit(name="submit", type="submit")
    # TODO: set the page limit to 100, get links and descriptions,
    # and follow each link to get the information
    for form in br.forms():
        print "Form name:", form.name
        print form, "\n"
    for link in br.links():
        print link.text, link.url
1 Comment

Change the pagesize to 2500 and save the HTML, then parse the saved HTML however you want. (Jun 23, 2013)

5 Answers


I solved this problem with Selenium. It drives a complete Firefox (or other) browser, which you can control from code.
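A minimal sketch of that approach, assuming Selenium and Firefox are installed (the import is kept inside the function since Selenium is a heavy optional dependency):

```python
def fetch_rendered_html(url):
    """Open `url` in a real Firefox instance and return the HTML
    after the page's JavaScript has run."""
    from selenium import webdriver  # optional heavy dependency
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source   # the DOM as JavaScript left it
    finally:
        driver.quit()               # always close the browser
```

You would then feed the returned HTML to lxml or BeautifulSoup exactly as with a static page.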




You can use PySide, which provides bindings for QtWebKit. With QtWebKit you can retrieve a page that uses JavaScript and parse it once the JavaScript has populated the HTML, so you don't need to know any JavaScript yourself. Other alternatives are Selenium and PhantomJS.
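A rough sketch of the QtWebKit route, assuming PySide 1.x (which shipped with QtWebKit); the imports are deferred since PySide is a large optional dependency, and the event loop quits once the page, scripts included, has finished loading:

```python
def render_with_qtwebkit(url):
    """Load `url` in an off-screen QtWebKit page and return the HTML
    serialized after JavaScript has populated it."""
    from PySide.QtCore import QUrl           # optional heavy
    from PySide.QtGui import QApplication    # dependencies,
    from PySide.QtWebKit import QWebPage     # imported lazily
    app = QApplication([])
    page = QWebPage()
    result = {}

    def on_load_finished(ok):
        # toHtml() gives the DOM *after* scripts ran, not the raw source
        result['html'] = page.mainFrame().toHtml()
        app.quit()

    page.loadFinished.connect(on_load_finished)
    page.mainFrame().load(QUrl(url))
    app.exec_()                              # block until loadFinished
    return result.get('html')
```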



OK, this is a screwball approach. Playing around with the different search settings, I found that the number of results to display is part of the URL. So I changed it to 3000 per page, and everything fits on one page:

http://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=0&higherOfficesServed=&lastName=&sex=Any&honors=&submit=Search&state=Any&college=&party=&inOffice=Any&biography=&race=Any&birthState=Any&religion=&militaryService=&firstName=&nbrterms=Any&warsServed=&&pagesizecac77e09-db17-41cb-9de0-687b843338d0=3000

After it loads, which does take a while, I'd right-click and go to "View Page Source", then copy that into a text file on my computer. Then I can scrape the info I need from the file without going back to the server and having to process the JavaScript.

May I recommend BeautifulSoup for getting around in the HTML file.
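Once the big results page is saved locally, a sketch along these lines pulls out the governor links; note the `'FormerGovBios'` substring used to filter the links is an assumption about the site's URL scheme, not something taken from the actual markup:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def governor_links(html):
    """Return (link text, href) pairs for links into the bios section.

    Filtering on 'FormerGovBios' in the href is an assumption about
    how the site names its bio URLs; adjust after inspecting the page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)
            if 'FormerGovBios' in a['href']]
```

Usage would be `governor_links(open('saved_results.html').read())`, with the filename being whatever you saved the page source as.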

2 Comments

Somehow I missed this; this was the easiest way. Thanks, getting the data just now.
I'm glad I could help. If you need anything else clarified, feel free to comment.

I would do that with PhantomJS (http://phantomjs.org/), a scriptable headless browser; see https://github.com/ariya/phantomjs/wiki/Page-Automation.

1 Comment

I know virtually nothing about JavaScript. How would I go about doing this?

Note that the select element on that page changes the window.location.

I think you can construct an appropriate URL to load the page simply by replacing the value of $('#pageSizeSelector....-..-..-..-....').val() with the value you need.
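As a sketch, that URL can be built from Python with the standard library; the pagesize parameter name, GUID suffix included, is copied from the working URL shown in an earlier answer, and the other parameters are the defaults that URL uses:

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def results_url(pagesize):
    """Build the search URL with an explicit per-page result count.

    The 'pagesizecac77e09-...' parameter name comes from the URL in the
    earlier answer on this page; it is site-specific, not documented.
    """
    base = "http://www.nga.org/cms/FormerGovBios"
    params = {
        "submit": "Search",
        "state": "Any",
        "pagesizecac77e09-db17-41cb-9de0-687b843338d0": str(pagesize),
    }
    return base + "?" + urlencode(params)
```

The result can then be fetched with mechanize or urllib directly, skipping the JavaScript-driven selector entirely.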

