I'm trying to scrape weather data from Weather Underground, using the multiprocessing.dummy library to run my requests in several threads. The following code raises an error; could someone walk me through what is going on and suggest a possible solution? Note: my code could be wildly off.

from bs4 import BeautifulSoup # HTML Text Parsing Package
from urllib2 import urlopen # Package to read URLs
import requests # Package to actually request URL
import nltk
import re
import itertools as ite
import pandas as pd
def scrape(urls):
    actual_temp = []
    string = requests.get(URL)
    soup = BeautifulSoup(string)
    actual_temp_tag = soup.find_all(class_ = "wx-value")[0]
    actual_temp.append(actual_temp_tag.string)
    return actual_temp

URLs = []
for j in range(1,2):
    for i in range(1,32):
        SUB_URL = 'http://www.wunderground.com/history/airport/KBOS/2014/' + str(j) + '/' + str(i) + '/' + '/DailyHistory.html'
        URLs.append(SUB_URL)


from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(8)
results = pool.map(scrape, URLs)

pool.close()
pool.join()

The following is the error message I'm getting:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\bwei\Downloads\WinPython-64bit-2.7.9.4\python-2.7.9.amd64\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Users\bwei\Downloads\WinPython-64bit-2.7.9.4\python-2.7.9.amd64\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
TypeError: object of type 'Response' has no len()

In addition, once my program has finished, how do I close all of the threads? I noticed that my memory usage goes up while the code runs but doesn't go back down afterwards.

  • What's the error you are getting? Post the traceback message. Commented Jun 30, 2015 at 13:48
  • Just added the error message to the question. Commented Jun 30, 2015 at 13:55
  • string = requests.get(URL): requests.get returns a Response object, not a string. Commented Jun 30, 2015 at 14:04

1 Answer

Don't pass string, which is a requests.models.Response object, to BeautifulSoup. Pass string.content instead:

In [124]: type(string)
Out[124]: requests.models.Response

In [120]: BeautifulSoup(string)
TypeError: object of type 'Response' has no len()

In [126]: soup = BeautifulSoup(string.content)

Also, your scrape function refers to URL, which is not defined in the code shown and so should have raised a NameError. Rename the parameter to url and pass it to requests.get instead:

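def scrape(url):
    actual_temp = []
    # requests.get returns a Response object, not a string
    string = requests.get(url)
    # parse the response body, not the Response object itself
    soup = BeautifulSoup(string.content)
    # take the first element with class "wx-value" as the temperature
    actual_temp_tag = soup.find_all(class_="wx-value")[0]
    actual_temp.append(actual_temp_tag.string)
    return actual_temp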
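
The traceback in the question points into pool.py rather than at scrape because pool.map collects an exception raised in a worker thread and re-raises it in the calling thread. Below is a minimal sketch of that behaviour, written for Python 3. FakeResponse, broken_worker, fixed_worker and run are hypothetical names introduced only for illustration; FakeResponse is a stand-in for requests.models.Response so that the sketch runs without network access, and the URLs are placeholders that are never fetched.

```python
from multiprocessing.dummy import Pool as ThreadPool


class FakeResponse(object):
    """Hypothetical stand-in for requests.models.Response (not the real class)."""
    content = b"<span class='wx-value'>31</span>"


def broken_worker(url):
    # Mirrors the bug: len() is applied to the response object itself,
    # which defines no __len__, so a TypeError is raised in the worker thread.
    response = FakeResponse()
    return len(response)


def fixed_worker(url):
    # Mirrors the fix: work with the response body (bytes), which has a length.
    response = FakeResponse()
    return len(response.content)


def run(worker, urls):
    pool = ThreadPool(4)
    try:
        # pool.map re-raises a worker's exception here, in the calling thread.
        return pool.map(worker, urls)
    finally:
        # Always shut the worker threads down, even if a worker failed.
        pool.close()
        pool.join()


# Placeholder URLs; nothing is downloaded in this sketch.
urls = ["http://example.invalid/%d" % i for i in range(3)]

try:
    run(broken_worker, urls)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

body_lengths = run(fixed_worker, urls)
print(error_message)
print(body_lengths)
```

The try/finally in run also covers the other part of the question: pool.close() followed by pool.join() is what stops the worker threads, and placing the calls in finally means they still run when a worker raises.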

4 Comments

I also have one more question. Naturally, when I run this code my memory usage goes up. However, after I finish and close the pool, it doesn't go back down. Is there a way to close the threads after I'm done with the code?
The memory may not be returned to the operating system until the Python process terminates. If you wrap all your present code in a function, you can use multiprocessing.Process to run that function in a separate process. It can then return the result to the main process and exit, which returns the memory consumed by the subprocess to the operating system.
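
The subprocess approach described in the comment above could be sketched roughly as follows, for Python 3. This is an illustration under assumptions, not the commenter's own code: build_big_list is a hypothetical stand-in for the scraping job, and run_in_subprocess and _child are helper names invented here. Only the small result is sent back through a multiprocessing.Queue.

```python
import multiprocessing


def build_big_list(n):
    # Hypothetical stand-in for the scraping job: it allocates a lot of
    # memory but returns only a small result.
    data = [str(i) * 10 for i in range(n)]
    return len(data)


def _child(queue, func, args):
    # Runs in the subprocess. Send back either the result or the exception,
    # so the parent never waits forever on an empty queue.
    try:
        queue.put((True, func(*args)))
    except Exception as exc:
        queue.put((False, exc))


def run_in_subprocess(func, *args):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_child, args=(queue, func, args))
    proc.start()
    ok, payload = queue.get()  # read before join() so the child is not left blocked on the queue
    proc.join()                # once the child has exited, the OS reclaims its memory
    if not ok:
        raise payload
    return payload


if __name__ == "__main__":
    # The guard matters on Windows, where child processes re-import this module.
    print(run_in_subprocess(build_big_list, 100000))
```

The function passed to run_in_subprocess has to be defined at module level so that it can be pickled on platforms that start child processes by spawning a fresh interpreter.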
Would you have any tips on making this run even faster? Does increasing the size of the thread pool have a linear relationship with the time to run, or are there diminishing returns?
The main bottleneck is network communication speed. Increasing the number of threads may help, but be careful to respect wunderground's terms of service and robots policy, and be aware that if you pound a server too hard it may detect that and throttle such users.
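
One way to get a feel for how pool size affects run time, without sending any traffic to a real server, is to time a simulated I/O-bound task. The sketch below (Python 3) uses time.sleep as an assumed stand-in for one HTTP round trip; fake_fetch, time_pool and the 0.05-second latency are illustrative choices, not measurements of wunderground.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

SIMULATED_LATENCY = 0.05  # seconds; assumed stand-in for one HTTP round trip


def fake_fetch(url):
    # Threads wait here concurrently, much as they would on a socket.
    time.sleep(SIMULATED_LATENCY)
    return url


def time_pool(n_threads, urls):
    pool = ThreadPool(n_threads)
    start = time.time()
    try:
        results = pool.map(fake_fetch, urls)
    finally:
        pool.close()
        pool.join()
    return time.time() - start, results


# Placeholder names; nothing is downloaded in this sketch.
urls = ["page-%d" % i for i in range(16)]

timings = {}
for n in (1, 4, 16, 32):
    elapsed, results = time_pool(n, urls)
    timings[n] = elapsed
    print("%2d threads: %.2f s" % (n, elapsed))
```

In this idealised setup the time falls roughly in proportion to the thread count until there are as many threads as tasks, after which extra threads add nothing. Against a real server the curve would be expected to flatten sooner, since bandwidth, server response time and any rate limiting all cap the gain.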
