I have a Python 2.7 program which pulls data from websites and dumps the results to a database. It follows the producer-consumer model and is written using the threading module.
Just for fun, I would like to rewrite this program using the new asyncio module (from 3.4), but I cannot figure out how to do this properly.
The most crucial requirement is that the program must fetch data from the same website in a sequential order. For example, for a URL 'http://a-restaurant.com' it should first get 'http://a-restaurant.com/menu/0', then 'http://a-restaurant.com/menu/1', then 'http://a-restaurant.com/menu/2', ... If they are not fetched in order, the website stops delivering pages altogether and you have to start again from 0.
However, a fetch for another website ('http://another-restaurant.com') can (and should) run at the same time (the other sites also have the sequential restriction).
The threading module suits this well, as I can create a separate thread for each website, and each thread can wait until one page has finished loading before fetching the next one.
Here's a grossly simplified code snippet from the threading version (Python 2.7):
class FetchThread(threading.Thread):
    def __init__(self, queue, url):
        threading.Thread.__init__(self)
        self.queue = queue
        self.baseurl = url
        ...

    def run(self):
        # Get 10 menu pages in a sequential order
        for food in range(10):
            url = self.baseurl + '/' + str(food)
            text = urllib2.urlopen(url).read()
            self.queue.put(text)
        ...

def main():
    queue = Queue.Queue()
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    for url in urls:
        fetcher = FetchThread(queue, url)
        fetcher.start()
    ...
And here's how I tried to do it with asyncio (in 3.4.1):
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

l = []
urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
for url in urls:
    for food in range(10):
        menu_url = url + '/' + str(food)
        l.append(print_page(menu_url))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(l))
And it fetches and prints everything in a non-sequential order. Well, I guess that's the whole idea of those coroutines. Should I not use aiohttp and just fetch with urllib instead? But would the fetches for the first restaurant then block the fetches for the other restaurants? Am I just thinking about this completely wrong? (This is just a test to try fetching things in a sequential order; I haven't got to the queue part yet.)
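To illustrate the structure I suspect is needed, here is a rough sketch: one coroutine per restaurant walks that site's pages in order, while the per-restaurant coroutines run concurrently. It uses a simulated fetch (asyncio.sleep echoing the URL) instead of real HTTP so the ordering is easy to check, and it is written with the async/await syntax from 3.5+; on 3.4 these would be @asyncio.coroutine generators using yield from, driven by loop.run_until_complete instead of asyncio.run. Is something like this the right idea?

```python
import asyncio

results = []

# Stand-in for a real HTTP fetch: sleeps briefly and echoes the URL back.
async def fetch(url):
    await asyncio.sleep(0.01)
    return url

# One coroutine per restaurant: each page is awaited before the next
# one is requested, so within a single site the order is guaranteed.
async def fetch_restaurant(baseurl):
    for food in range(3):
        page = await fetch(baseurl + '/' + str(food))
        results.append(page)

async def main():
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    # gather runs the per-restaurant coroutines concurrently,
    # so the two sites' fetches interleave with each other.
    await asyncio.gather(*(fetch_restaurant(u) for u in urls))

asyncio.run(main())
print(results)
```

The results from the two sites may interleave, but each site's own pages should come out as /0, /1, /2.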