
I have the following code to gather the number of words in each chapter of a book. In a nutshell, it opens the URL of each book, then the URLs of each chapter associated with that book.

import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all chapter links
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # open the chapter to find its word count
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp)

                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            if u'\u5b57\u6570' in content:  # u'\u5b57\u6570' is "word count" in Chinese
                                word = re.sub("[^0-9]", "", content)
                                words.append(word)
        except: pass

        return words

    except:
        print 'Book ' + str(bookId) + ' does not exist'

Below is a sample run

words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']

Without a doubt, the code is very slow. One major reason is that I need to open the URL of each book, and for each book I need to open the URL of each chapter. Is there a way to make the process faster?

Here is another bookId that does not return an empty result: 3052409. It has hundreds of chapters, and the code runs forever.

  • You've got to multithread: tutorialspoint.com/python/python_multithreading.htm Commented Jul 28, 2015 at 5:10
  • I like bs4 because it is really easy to use, but if performance matters, you should use lxml, which is much faster. Commented Jul 28, 2015 at 5:30
  • @Delgan it is possible to instruct bs4 to use lxml (see the sketch after these comments). Commented Jul 28, 2015 at 5:35
  • @Delgan as the folks at bs4 recommend. Commented Jul 28, 2015 at 5:39
  • @RishavKundu I was not aware of this, good to know, thank you! Commented Jul 28, 2015 at 5:42
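
Following up on the parser suggestion in the comments, here is a minimal sketch of how BeautifulSoup can be told to use lxml; it assumes lxml is installed (pip install lxml), and the URL is just the example book page from the question. The rest of the scraping logic does not change.

import urllib2
from bs4 import BeautifulSoup

# Fetch a page once and parse it with the lxml parser instead of the default one.
html = urllib2.urlopen('http://www.qidian.com/BookReader/3501537.aspx').read()

soup_default = BeautifulSoup(html)         # default parser
soup_lxml = BeautifulSoup(html, 'lxml')    # explicitly ask bs4 to use lxml, which is usually faster

# Both objects expose the same API, so the find_all() calls stay exactly the same.
chapters = soup_lxml.find_all('a', rel='nofollow')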

1 Answer


The fact that you need to open each book and each chapter is dictated by the views exposed on the server. What you could do is implement parallel clients. Create a thread pool where you offload HTTP requests as jobs to the workers, or do something similar with coroutines.
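
As an illustration of the thread-pool idea, here is a minimal sketch using the standard library's multiprocessing.dummy thread pool. The scrapeBookParallel and countWords names are hypothetical, and the pool size of 20 is just an example value to tune.

import urllib2
import re
from multiprocessing.dummy import Pool   # a thread pool behind the multiprocessing API
from bs4 import BeautifulSoup

def countWords(link):
    # Open a single chapter page and pull out its word count.
    htmlTemp = urllib2.urlopen(link).read()
    soupTemp = BeautifulSoup(htmlTemp)
    for span in soupTemp.find_all('span'):
        content = span.string
        if content is not None and u'\u5b57\u6570' in content:
            return re.sub("[^0-9]", "", content)
    return None

def scrapeBookParallel(bookId, workers=20):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    links = [a['href'] for a in soup.find_all('a', rel='nofollow') if 'title' in a.attrs]

    # Each chapter URL becomes a job; the pool keeps `workers` requests in flight at once.
    pool = Pool(workers)
    try:
        words = pool.map(countWords, links)
    finally:
        pool.close()
        pool.join()
    return [w for w in words if w is not None]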

Then there's the choice of the HTTP client library. I found libcurl and geventhttpclient more CPU-efficient than urllib or anything else in the Python standard library.
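
For the libcurl route, a minimal sketch of a fetch helper built on pycurl (the Python binding for libcurl) could look like the following; the helper name and the timeout value are assumptions for illustration, not part of the original answer.

import pycurl
from io import BytesIO

def fetch(url, timeout=30):
    # Download a URL with libcurl and return the raw body as bytes.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buf.write)   # libcurl streams the body into the buffer
    curl.setopt(pycurl.TIMEOUT, timeout)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    try:
        curl.perform()
    finally:
        curl.close()
    return buf.getvalue()

# Drop-in replacement for urllib2.urlopen(link).read() in the chapter loop:
# htmlTemp = fetch(link)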


2 Comments

Is implementing parallel clients the same as running the same code (sequentially) on different terminals?
Essentially it's the same: you run a bunch of I/O-bound jobs in parallel to make it faster. The difference is that a thread pool / coroutine implementation gives you the ease of a single point of entry.
