Scrape with BeautifulSoup from site that uses AJAX pagination using Python

Question

I'm fairly new to coding and Python so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the urls. I've got all of the scrapping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop with the url to capture each search result but that's not possible. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report

This is the script I have so far:

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write ("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"

Obviously, it scrapes the first page of results and nothing more.

And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.

Each returned page presumably includes a link to "go to next page" (perhaps dressed up as a Javascript snippet) with another URL. If you can identify that link, then said URL is where you might keep scraping. — Alex Martelli
– Alex Martelli, Commented Dec 28, 2014 at 16:53
Does pagination work for you? For me, nothing happens when I click "Next".. — alecxe
– alecxe, Commented Dec 28, 2014 at 17:41
like @AlexMartelli said, you need to dive into the POST request and find out the url source. Or, since you're rather new to Python & scraping, perhaps consider using Selenium as it will cater the ajax/javascript issues you have by simulating an actual button click to go to next page. — Anzel
– Anzel, Commented Dec 28, 2014 at 18:38
pagination works for me but I can't find the javascript snippet but that's probably because I don't know javascript at all so I'll keep looking — Jolijt Tamanaha
– Jolijt Tamanaha, Commented Dec 28, 2014 at 20:58

Anzel · Accepted Answer · 2014-12-28 23:06:14Z

For the sake of completeness, while you may try accessing the POST request and to find a way round to access to next page, like I suggested in my comment, if an alternative is possible, using Selenium will be quite easy to achieve what you want.

Here is a simple solution using Selenium for your question:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()

# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href')+'\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)

Hope this helps.

Collectives™ on Stack Overflow

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related