
I have the following code to gather the number of words in each chapter of a book. In a nutshell, it opens the URL of each book, then the URLs of each chapter associated with that book.

import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all chapter links
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # open the chapter to find its word count
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp)

                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            if u'\u5b57\u6570' in content:  # u'\u5b57\u6570' is "word count" in Chinese
                                word = re.sub("[^0-9]", "", content)
                                words.append(word)
        except: pass

        return words

    except:
        print 'Book ' + str(bookId) + ' does not exist'

Below is a sample run

words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']

Without a doubt, the code is very slow. One major reason is that I need to open the URL of each book, and for each book I need to open the URL of each chapter. Is there a way to make the process faster?

Here is another bookId that does not return an empty result: 3052409. It has hundreds of chapters, and the code runs forever.

  • You've got to multithread: tutorialspoint.com/python/python_multithreading.htm Commented Jul 28, 2015 at 5:10
  • I like bs4 because it is really easy to use, but if performance matters, you should use lxml, which is much faster. Commented Jul 28, 2015 at 5:30
  • @Delgan it is possible to instruct bs4 to use lxml (see the sketch after these comments). Commented Jul 28, 2015 at 5:35
  • @Delgan as the folks at bs4 recommend. Commented Jul 28, 2015 at 5:39
  • @RishavKundu I was not aware of this, good to know, thank you! Commented Jul 28, 2015 at 5:42
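
Following up on the parser suggestion in the comments, here is a minimal sketch of how BeautifulSoup can be told to use lxml; it assumes lxml is installed (pip install lxml), and the URL is just the example book page from the question. The rest of the scraping logic does not change.

import urllib2
from bs4 import BeautifulSoup

# Fetch a page once and parse it with the lxml parser instead of the default one.
html = urllib2.urlopen('http://www.qidian.com/BookReader/3501537.aspx').read()

soup_default = BeautifulSoup(html)         # default parser
soup_lxml = BeautifulSoup(html, 'lxml')    # explicitly ask bs4 to use lxml, which is usually faster

# Both objects expose the same API, so the find_all() calls stay exactly the same.
chapters = soup_lxml.find_all('a', rel='nofollow')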

1 Answer


The fact that you need to open each book and each chapter is dictated by the views exposed on the server. What you could do is implement parallel clients. Create a thread pool where you offload HTTP requests as jobs to the workers, or do something similar with coroutines.
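
As an illustration of the thread-pool idea, here is a minimal sketch using the standard library's multiprocessing.dummy thread pool. The scrapeBookParallel and countWords names are hypothetical, and the pool size of 20 is just an example value to tune.

import urllib2
import re
from multiprocessing.dummy import Pool   # a thread pool behind the multiprocessing API
from bs4 import BeautifulSoup

def countWords(link):
    # Open a single chapter page and pull out its word count.
    htmlTemp = urllib2.urlopen(link).read()
    soupTemp = BeautifulSoup(htmlTemp)
    for span in soupTemp.find_all('span'):
        content = span.string
        if content is not None and u'\u5b57\u6570' in content:
            return re.sub("[^0-9]", "", content)
    return None

def scrapeBookParallel(bookId, workers=20):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    links = [a['href'] for a in soup.find_all('a', rel='nofollow') if 'title' in a.attrs]

    # Each chapter URL becomes a job; the pool keeps `workers` requests in flight at once.
    pool = Pool(workers)
    try:
        words = pool.map(countWords, links)
    finally:
        pool.close()
        pool.join()
    return [w for w in words if w is not None]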

Then there's the choice of the HTTP client library. I found libcurl and geventhttpclient more CPU-efficient than urllib or anything else in the Python standard library.
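
For the libcurl route, a minimal sketch of a fetch helper built on pycurl (the Python binding for libcurl) could look like the following; the helper name and the timeout value are assumptions for illustration, not part of the original answer.

import pycurl
from io import BytesIO

def fetch(url, timeout=30):
    # Download a URL with libcurl and return the raw body as bytes.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buf.write)   # libcurl streams the body into the buffer
    curl.setopt(pycurl.TIMEOUT, timeout)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    try:
        curl.perform()
    finally:
        curl.close()
    return buf.getvalue()

# Drop-in replacement for urllib2.urlopen(link).read() in the chapter loop:
# htmlTemp = fetch(link)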


2 Comments

Is implementing parallel clients the same as running the same code (sequentially) on different terminals?
Essentially it's the same: you run a bunch of I/O-bound jobs in parallel to make it faster. The difference is that a thread pool / coroutine implementation gives you the ease of a single point of entry.
