Python Scraping Multithreading

Question

I'm attempting to scrape weather data from Weather Underground, using the multiprocessing.dummy library to run my requests in separate threads. I get an error when running the following code. Could someone walk me through what's going on and suggest a possible solution? Note: my code could be wildly off.

from bs4 import BeautifulSoup # HTML Text Parsing Package
from urllib2 import urlopen # Package to read URLs
import requests # Package to actually request URL
import nltk
import re
import itertools as ite
import pandas as pd
def scrape(urls):
    actual_temp = []
    string = requests.get(URL)
    soup = BeautifulSoup(string)
    actual_temp_tag = soup.find_all(class_ = "wx-value")[0]
    actual_temp.append(actual_temp_tag.string)
    return actual_temp

URLs = []
for j in range(1,2):
    for i in range(1,32):
        SUB_URL = 'http://www.wunderground.com/history/airport/KBOS/2014/' + str(j) + '/' + str(i) + '/' + '/DailyHistory.html'
        URLs.append(SUB_URL)


from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(8)
results = pool.map(scrape, URLs)

pool.close()
pool.join()

The following is the error message I'm getting:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\bwei\Downloads\WinPython-64bit-2.7.9.4\python-2.7.9.amd64\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Users\bwei\Downloads\WinPython-64bit-2.7.9.4\python-2.7.9.amd64\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
TypeError: object of type 'Response' has no len()

In addition, once my program has executed, how do I close all of the threads? I noticed that the percentage of memory in use goes up while the program runs but doesn't go back down after it finishes.

What's the error you are getting? Post the traceback message.
– Iron Fist, Commented Jun 30, 2015 at 13:48
string = requests.get(URL): requests.get returns a Response object, not a string.
– Colonel Thirty Two, Commented Jun 30, 2015 at 14:04

unutbu · Accepted Answer · 2015-06-30 14:29:29Z

Don't pass string, which is a requests.models.Response object, to BeautifulSoup. Pass string.content instead:

In [124]: type(string)
Out[124]: requests.models.Response

In [120]: BeautifulSoup(string)
TypeError: object of type 'Response' has no len()

In [126]: soup = BeautifulSoup(string.content)

Also, your scrape function refers to URL, which should have raised a NameError since it is not defined in the posted code. Instead, name the function's parameter url and pass it to requests.get:

def scrape(url):
    actual_temp = []
    string = requests.get(url)
    soup = BeautifulSoup(string.content)
    actual_temp_tag = soup.find_all(class_ = "wx-value")[0]
    actual_temp.append(actual_temp_tag.string)
    return actual_temp
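
As a side note on the traceback in the question: pool.map re-raises a worker's exception in the calling thread, which is why the error surfaces inside pool.py rather than at the failing line of scrape. One way to keep a single bad URL from aborting the whole run is to catch errors per URL. The following is only a minimal sketch written for Python 3; safe_call, scrape_all, and fake_scrape are hypothetical names that do not appear in the original code, and fake_scrape is a stand-in that makes no network request.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool


def safe_call(func, arg):
    """Hypothetical helper: run func(arg) and return (arg, result, error)."""
    try:
        return arg, func(arg), None
    except Exception as exc:  # keep the failure instead of aborting the whole map
        return arg, None, exc


def fake_scrape(url):
    """Stand-in for scrape(url); fails on one URL to imitate a bad response."""
    if url.endswith("/bad"):
        raise TypeError("object of type 'Response' has no len()")
    return [url.rsplit("/", 1)[-1]]


def scrape_all(func, urls, workers=8):
    """Hypothetical driver: apply func to every URL using a thread pool."""
    pool = ThreadPool(workers)
    try:
        # map preserves input order, so results line up with urls
        return pool.map(lambda u: safe_call(func, u), urls)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker threads to exit


if __name__ == "__main__":
    for url, result, error in scrape_all(fake_scrape, ["x/1", "x/bad", "x/3"]):
        print(url, result, error)
```

In real use, the corrected scrape function above would be passed in place of fake_scrape. The lambda is acceptable here only because multiprocessing.dummy uses threads, so nothing has to be pickled.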

edited Jun 30, 2015 at 14:29

answered Jun 30, 2015 at 13:59


4 Comments

ben890 Over a year ago

I also have one more question. Naturally when I run this code my % of available memory goes up. However after I finish and close the pool it doesn't actually go down. Is there a way to close the threads after I'm done with the code?

unutbu Over a year ago

The memory may not be returned to the operating system until the Python process terminates. If you wrap all your present code in a function, you can use multiprocessing.Process to run that function in a separate process. The function can then return its result to the main process, and when the subprocess terminates, the memory it consumed is returned to the operating system.

ben890 Over a year ago

Would you have any tips on making this run even faster? Does increasing the size of the thread pool reduce the run time linearly, or are there diminishing returns?

unutbu Over a year ago

The main bottleneck is network communication speed. Increasing the number of threads may help, but be careful to respect wunderground's terms of service and robots policy, and be aware that if you pound a server too hard it may detect that and throttle you.
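
To get a feel for how thread count affects run time when the work is dominated by waiting, the effect can be simulated without touching any server. This is a rough sketch under stated assumptions: fake_request and timed_map are hypothetical names, and the fixed 0.05 second sleep stands in for network latency. Real requests will behave differently, especially if the server throttles.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_request(url, latency=0.05):
    """Stand-in for requests.get: sleeps to imitate waiting on the network."""
    time.sleep(latency)
    return url


def timed_map(urls, workers):
    """Map fake_request over urls with the given thread count; return (results, seconds)."""
    pool = ThreadPool(workers)
    try:
        start = time.time()
        results = pool.map(fake_request, urls)
        elapsed = time.time() - start
    finally:
        pool.close()
        pool.join()
    return results, elapsed


if __name__ == "__main__":
    urls = ["url-%d" % i for i in range(16)]
    for workers in (1, 2, 4, 8, 16, 32):
        _, elapsed = timed_map(urls, workers)
        print("%2d threads: %.2fs" % (workers, elapsed))
```

In this simulation the elapsed time should stop improving once there are at least as many threads as URLs, since every request is then already in flight at once. With a real server, bandwidth and rate limits would likely cap the gains earlier.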
