
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.

However, it's just not working.

Here's my code so far:

import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())

divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs  # prints an empty list: []

What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.

  • What does the list contain? Also, please don't use the variable name list, as it shadows the Python builtin of the same name. Scrapy makes scraping each page trivial, but it involves using/learning the Scrapy framework. Commented Jul 26, 2013 at 16:13
  • Just to note: 1) it doesn't look like the site's AUP allows that, and 2) even if you did do a simple loop over next page after next page, you'll probably end up blocked, as you're going to be making a hell of a lot of requests. Why not just email them and ask whether the information you'd like is available? Commented Jul 26, 2013 at 16:17
  • It contains nothing. I'll update a bit then. I'll try emailing them as well now, but I'd still like to try this problem. Commented Jul 26, 2013 at 16:23

3 Answers


You could try something like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')

# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except AttributeError:
        # Skip signatures with missing name fields
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break

    # Get the URL of the next page from the prev_signature tag
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")   
print("= Printing Results =")
print("====================\n")
print(results)

Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you should check that before running it.



In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on it in a short amount of time, slowing down legitimate users' requests - not to mention stealing all of their data.

Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).

Or if you do absolutely need to scrape:

  1. Space your requests using a timer (see the sketch below)
  2. Scrape smartly
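For point 1, a minimal sketch of what spacing out requests might look like, in the same urllib2 style used above (the URLs here are made-up placeholders):

import time
import urllib2

# Placeholder URLs -- substitute the real pages you need to fetch
urls = ['http://www.example.com/page/%d' % i for i in range(1, 4)]

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... process html here ...
    time.sleep(2)  # pause between requests to avoid hammering the server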

I took a quick glance at that page, and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and the data will also be easier to process because it will come back in a nice format.
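To sketch the idea (the endpoint here is hypothetical - you would find the real request in your browser's developer tools, in the Network tab, while the page loads more signatures):

import json
import urllib2

# Hypothetical REST endpoint -- copy the real one from the Network tab
url = 'http://www.example.com/api/signatures?page=1'

request = urllib2.Request(url, headers={'Accept': 'application/json'})
data = json.load(urllib2.urlopen(request))
print(data)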

Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
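You can check a URL against robots.txt programmatically with the standard library's robotparser module before fetching anything:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.thepetitionsite.com/robots.txt')
rp.read()

# Expect False for anything under /xml/ if it is disallowed
print(rp.can_fetch('*', 'http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml'))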

1 Comment

I'm amenable to using another method, but I don't know how. Can you help me form a request to wherever the signatures are located? I've sent an email to no avail.

What do you mean by "not working" - an empty list or an error?

If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
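One quick way to confirm is to search the raw HTML for the class name, bypassing BeautifulSoup entirely (a minimal sketch using the URL from the question):

import urllib2

html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()

# If this prints False, the class is added later by JavaScript and will
# never appear in the HTML that urllib2 downloads
print('name_location' in html)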

1 Comment

It's an empty list. The class seems to exist when I inspect element in Chrome, which is odd because it doesn't when I view the source, now that you mention it.
