
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.

However, it's just not working.

Here's my code so far:

import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())

divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs  # prints an empty list: []

What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.

  • What does the list contain? Also, please don't use the variable name list, as it shadows the Python builtin of the same name. Scrapy makes scraping each page trivial, but it involves using/learning the Scrapy framework. Commented Jul 26, 2013 at 16:13
  • Just to note: 1) it doesn't look like the site's AUP allows that, and 2) even if you did do a simple loop over next page after next page, you'll probably end up blocked, as you're going to be making a hell of a lot of requests. Why not just email them and ask whether the information you'd like is available? Commented Jul 26, 2013 at 16:17
  • It contains nothing. I'll update a bit then. I'll try emailing them as well now, but I'd still like to try this problem. Commented Jul 26, 2013 at 16:23

3 Answers


You could try something like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')

# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except AttributeError:
        # Skip signatures with missing name fields
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break

    # Get the URL of the next page from the prev_signature tag
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")   
print("= Printing Results =")
print("====================\n")
print(results)

Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you should check that before running it.



In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on it in a short amount of time, slowing down legitimate users' requests - not to mention stealing all of their data.

Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).

Or if you do absolutely need to scrape:

  1. Space your requests using a timer (see the sketch below)
  2. Scrape smartly
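For point 1, a minimal sketch of what spacing out requests might look like, in the same urllib2 style used above (the URLs here are made-up placeholders):

import time
import urllib2

# Placeholder URLs -- substitute the real pages you need to fetch
urls = ['http://www.example.com/page/%d' % i for i in range(1, 4)]

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... process html here ...
    time.sleep(2)  # pause between requests to avoid hammering the server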

I took a quick glance at that page, and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and the data will also be easier to process because it will come back in a nice format.
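To sketch the idea (the endpoint here is hypothetical - you would find the real request in your browser's developer tools, in the Network tab, while the page loads more signatures):

import json
import urllib2

# Hypothetical REST endpoint -- copy the real one from the Network tab
url = 'http://www.example.com/api/signatures?page=1'

request = urllib2.Request(url, headers={'Accept': 'application/json'})
data = json.load(urllib2.urlopen(request))
print(data)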

Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
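You can check a URL against robots.txt programmatically with the standard library's robotparser module before fetching anything:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.thepetitionsite.com/robots.txt')
rp.read()

# Expect False for anything under /xml/ if it is disallowed
print(rp.can_fetch('*', 'http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml'))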

1 Comment

I'm amenable to using another method, but I don't know how. Can you help me form a request to wherever the signatures are located? I've sent an email to no avail.

What do you mean by "not working" - an empty list or an error?

If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
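One quick way to confirm is to search the raw HTML for the class name, bypassing BeautifulSoup entirely (a minimal sketch using the URL from the question):

import urllib2

html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()

# If this prints False, the class is added later by JavaScript and will
# never appear in the HTML that urllib2 downloads
print('name_location' in html)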

1 Comment

It's an empty list. The class seems to exist when I inspect element in Chrome, which is odd because it doesn't when I view the source, now that you mention it.
