Can anyone tell me why this code populates the queue after starting the threads? The queue is filled after the for loop, but the ThreadUrl class already calls queue.get() in its run() method. How does this work? How can a thread get values from a queue that has not been populated yet?

for i in range(5):
    t = ThreadUrl(queue, out_queue)
    t.setDaemon(True)
    t.start()

# This is what confuses me! Shouldn't it be above the for loop??
for host in hosts:
    queue.put(host)

for i in range(5):
    dt = DatamineThread(out_queue)
    dt.setDaemon(True)
    dt.start()

#wait on the queue until everything has been processed
queue.join()
out_queue.join()

Here is the full source:

import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
        "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parser: extracts titles from fetched chunks"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs chunk from out queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

1 Answer

The line host = self.queue.get() blocks the calling thread until an item appears in the queue.

So

#spawn a pool of threads, and pass them queue instance
for i in range(5):
    t = ThreadUrl(queue, out_queue)
    t.setDaemon(True)
    t.start()

creates 5 threads, each of which blocks in queue.get() waiting for an item.

#populate queue with data
for host in hosts:
    queue.put(host)

fills the queue. As items arrive, the waiting threads wake up and start processing them.
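The blocking behaviour can be seen in isolation with a minimal sketch. It is written for Python 3, where the module is named queue rather than Queue, and the worker function, the waiting flag and the example.com URL are made up for illustration; they are not part of the original program.

```python
import queue      # this module is called Queue in Python 2
import threading
import time

q = queue.Queue()
results = []

def worker():
    # get() blocks here until the main thread puts something in the queue
    item = q.get()
    results.append(item)
    # tell the queue that this item has been fully processed
    q.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# Give the worker time to reach q.get(); the queue is still empty.
time.sleep(0.2)
waiting = t.is_alive() and not results   # still blocked, nothing consumed yet

q.put("http://example.com")   # hypothetical item; this wakes the worker
q.join()                      # returns once task_done() has been called

print(waiting, results)
```

The thread is started against an empty queue and simply sits in get() until put() is called, which is why the order of the two loops in the question does not matter.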

4 Comments

Thanks! Is there any difference between populating the queue before the loop and after the loop?
After your first loop (the one that creates the ThreadUrls), you have 6 threads: the main thread plus the 5 workers. Your main thread feeds the queue; the other threads consume from it, and if the queue is empty, they block until something appears. If you populate the queue BEFORE creating the threads, the first thread will see 5 jobs, the second will see 4, and so on, so each thread can consume from the queue immediately. If you populate the queue AFTER starting the threads, all threads initially block, since the queue is empty. Each time you add an element, one waiting thread wakes up and takes it from the queue.
@Kui Tang, thank you for the explanation! So basically there is no difference between populating the queue at the beginning and at the end, since the threads will always wait for an item to appear in the queue while the program is running, right?
@Shaokan You are correct; both forms are functionally the same in your program.
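
As a rough check of the point made in these comments, the sketch below runs the same worker pool twice, once filling the queue before the threads start and once after, and compares what was processed. It targets Python 3 (module queue instead of Queue); the run helper, its parameters and the single-letter items are hypothetical stand-ins for the hosts in the question.

```python
import queue
import threading

def run(populate_first, items, n_workers=3):
    """Process items with a worker pool; return the processed items, sorted."""
    q = queue.Queue()
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()          # blocks while the queue is empty
            with lock:
                done.append(item)
            q.task_done()

    if populate_first:
        for item in items:
            q.put(item)

    # Daemon threads do not keep the program alive once main finishes.
    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()

    if not populate_first:
        for item in items:
            q.put(item)

    q.join()                        # wait until every item is marked done
    return sorted(done)

hosts = ["a", "b", "c", "d", "e"]   # placeholder work items
before = run(True, hosts)
after = run(False, hosts)
print(before == after)
```

Both runs process the same set of items. Only the order in which individual threads pick them up can differ, which is why the results are sorted before comparing.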
