I have the following code to gather the number of words in each chapter of a book. In a nutshell, it opens the URL of a book, then the URLs of each chapter associated with that book.
import urllib2
from bs4 import BeautifulSoup
import re

def scrapeBook(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    try:
        words = []
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        try:
            chapters = soup.find_all('a', rel='nofollow')  # find all relevant chapter links
            for chapter in chapters:                       # loop through chapters
                if 'title' in chapter.attrs:
                    link = chapter['href']                 # go to chapter to find words
                    htmlTemp = urllib2.urlopen(link).read()
                    soupTemp = BeautifulSoup(htmlTemp)
                    # find out how many words there are in each chapter
                    spans = soupTemp.find_all('span')
                    for span in spans:
                        content = span.string
                        if content is not None:
                            if u'\u5b57\u6570' in content:  # the label for "word count"
                                word = re.sub("[^0-9]", "", content)
                                words.append(word)
        except:
            pass
        return words
    except:
        print 'Book ' + str(bookId) + ' does not exist'
Below is a sample run:
words = scrapeBook(3501537)
print words
>> [u'2532', u'2486', u'2510', u'2223', u'2349', u'2169', u'2259', u'2194', u'2151', u'2422', u'2159', u'2217', u'2158', u'2134', u'2098', u'2139', u'2216', u'2282', u'2298', u'2124', u'2242', u'2224', u'178', u'2168', u'2334', u'2132', u'2176', u'2271', u'2237']
Without doubt the code is very slow. One major reason is that I need to open the URL of each book, and for each book I need to open the URL of every chapter, one after another. Is there a way to make the process faster?
Here is another bookId that does not give an empty return: 3052409. It has hundreds of chapters, and the code runs forever.
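One direction I was considering is downloading the chapter pages concurrently instead of sequentially, since almost all of the time is spent waiting on the network. Below is a rough sketch of what I have in mind, using a thread pool from multiprocessing.dummy; the pool size of 8 and the countWords helper are placeholders of mine, not part of my current code.

# Rough sketch only: fetch chapter pages in parallel with a thread pool.
# countWords and POOL_SIZE are made-up names, not from the code above.
import urllib2
import re
from multiprocessing.dummy import Pool  # thread pool, not separate processes
from bs4 import BeautifulSoup

POOL_SIZE = 8  # assumed number of simultaneous downloads

def countWords(link):
    # download one chapter page and pull out its word count
    htmlTemp = urllib2.urlopen(link).read()
    soupTemp = BeautifulSoup(htmlTemp)
    for span in soupTemp.find_all('span'):
        content = span.string
        if content is not None and u'\u5b57\u6570' in content:
            return re.sub("[^0-9]", "", content)
    return None

def scrapeBookParallel(bookId):
    url = 'http://www.qidian.com/BookReader/' + str(bookId) + '.aspx'
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # collect the chapter links first, then let the pool fetch them concurrently
    links = [c['href'] for c in soup.find_all('a', rel='nofollow')
             if 'title' in c.attrs]
    pool = Pool(POOL_SIZE)
    words = pool.map(countWords, links)
    pool.close()
    pool.join()
    return [w for w in words if w is not None]

My thinking is that threads should be fine here because the work is network-bound rather than CPU-bound, so the GIL would not be the bottleneck. Would something along these lines be a reasonable approach, or is there a better way to speed this up?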