
In the code below, I am considering using multi-threading or multi-processing for fetching data from the URL. I think a pool would be ideal. Can anyone suggest a solution?

Idea: a pool of threads/processes that collects the data. My preference is processes over threads, but I'm not sure.

import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: a one-element tuple

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
#    print data_fp
if __name__ == '__main__':
    main()
3 Comments
  • Is there even anything else you want to do in parallel? Your code makes only a single request. Commented Sep 8, 2010 at 16:25
  • No; right now I'm learning Python, so I'm trying to keep everything really simple. Thanks. Commented Sep 8, 2010 at 16:29
  • I have seen the process method; can anyone show me the threading method? Please, thanks. Commented Sep 8, 2010 at 17:37

4 Answers


Your current code requests several quotes in a single HTTP call. Let's first fetch them one by one; your code becomes:

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()

So main() fetches each URL one by one to get the data. Now let's parallelize it with a multiprocessing Pool:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    pool = Pool(processes=5)
    # submit every request first so the workers run concurrently...
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    # ...then collect the answers in order (calling get() right after each
    # submit, as in result.get(timeout=1), would serialize the requests)
    for result in results:
        print result.get(timeout=10)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

In this main(), each symbol's URL is requested in its own pool worker process.

Note: in Python, because of the GIL, multithreading is usually considered the wrong solution.

For documentation, see the multiprocessing module in the Python standard library.
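
For what it's worth, the same pool can be driven more compactly with pool.map, which submits everything up front and returns the results in the original order. A minimal sketch, reusing the URL and symbols from the question (the pool size of 5 is arbitrary):

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    # same fetcher as above: one HTTP request per argument tuple
    fp = urllib.urlopen(URL % '+'.join(symbols))
    try:
        return fp.read()
    finally:
        fp.close()

def main():
    pool = Pool(processes=5)
    try:
        # pool.map blocks until all workers finish and keeps input order
        for data in pool.map(fetch_quote, [(s,) for s in symbols]):
            print data
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()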


4 Comments

GIL is not an issue here because this task is definitely IO-bound.
This method is much slower than no multiprocessing at all. With a list of 150 stocks (copy the list above until it holds 150 symbols) it errors out and is very slow. Would threading be better?
@user428862 One reason it gets slow as your list of symbols grows is that pool.apply_async serializes your arguments and passes them to the child process via pipes; as the size of the arguments passed to child processes increases, so does the overhead. On Windows there's not much we can do, but try this approach on UNIX: there, fork(2) is used to spawn processes, which essentially passes the entire parent process state to the child. So if there is a global variable in the parent process, a child process will be able to access it. Since symbols is already global, don't pass it in the args, and there is no serialization...
@movie, thanks. A single thread takes 2 sec, multiprocessing takes 18 sec. For comparison, how would I multithread this, or will the same problem arise?
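
For the threaded comparison asked about above, a minimal sketch, assuming Python 2.6+: multiprocessing.dummy exposes the same Pool API backed by threads in a single process, so nothing is serialized across pipes, and the GIL is released while each thread blocks on the network:

import urllib
from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, thread-backed

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    fp = urllib.urlopen(URL % '+'.join(symbols))
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = ThreadPool(4)  # 4 worker threads in this process; nothing is pickled
    # the requests overlap because each thread releases the GIL while it
    # waits on its socket, even though only one runs Python code at a time
    for data in pool.map(fetch_quote, [(s,) for s in symbols]):
        print data
    pool.close()
    pool.join()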

So here's a very simple example. It iterates over symbols, passing them one at a time to fetch_quote.

import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: a one-element tuple

def fetch_quote(symbol):
    # symbol is a single ticker string, so substitute it into the URL directly
    # ('+'.join(symbol) would join the individual characters: 'G+G+P')
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data


def main():

    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    # the args parameter must be an argument tuple; a bare string would be
    # unpacked character by character (the TypeError shown in the comments)
    results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ =='__main__':
    main()

3 Comments

There might be some issues if the retrieved pages are quite large: multiprocessing uses inter-process communication mechanisms to exchange information among processes.
True, the above was for simple illustrative purposes only. YMMV, but I wanted to show how simple it was to take his code and make it multiprocess.
I got this error:

    Creating pool with 4 processes
    pool = <multiprocessing.pool.Pool object at 0x031956D0>
    Ordered results using pool.apply_async():
    Traceback (most recent call last):
      File "C:\py\Raw\Yh_Mp.py", line 36, in <module>
        main()
      File "C:\py\Raw\Yh_Mp.py", line 30, in main
        print '\t', r.get()
      File "C:\Python26\lib\multiprocessing\pool.py", line 422, in get
        raise self._value
    TypeError: fetch_quote() takes exactly 1 argument (3 given)

Actually, it's possible to do this without either. You can get it done in one thread using asynchronous calls, for example twisted.web.client.getPage from Twisted Web.
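
A minimal sketch of that event-driven approach, assuming Twisted is installed (getPage was Twisted's standard one-shot fetch helper at the time; later releases deprecate it in favour of Agent):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def print_quote(data):
    print data

def main():
    # fire all requests at once; each Deferred fires with the response body
    deferreds = [getPage(URL % symbol).addCallback(print_quote)
                 for symbol in symbols]
    # stop the reactor once every request has succeeded or failed
    defer.DeferredList(deferreds).addBoth(lambda _: reactor.stop())

if __name__ == '__main__':
    main()
    reactor.run()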

6 Comments

@vartec No need to go for any third-party extra packages; Python 2.6+ has pretty good built-in packages for this kind of purpose.
Uh oh, someone mentioned Twisted, that means that all other answers are going to get downvoted. stackoverflow.com/questions/3490173/…
@movieyoda: well, for obvious reasons (GAE, Jython) I like to stay compatible with 2.5. Anyway, maybe I'm missing out on something: what support for asynchronous web calls was introduced in Python 2.6?
@Nick: unfortunately, because of the GIL, Python is poor at threading (I know, blocking calls run with the GIL released), so you gain nothing from using threads instead of deferred async calls. On the other hand, event-driven programming wins even in cases where you actually could use threads (see nginx, lighttpd), and obviously in the case of Python (Twisted, Tornado).
@vartec If I'm not wrong, the multiprocessing module was made available natively in Python from 2.6 onwards. I think it was called pyprocessing before that, as a separate third-party module.

As you may know, multi-threading in Python is not truly parallel, due to the GIL: essentially a single thread runs at any given time. So if you want multiple URLs fetched in parallel, multi-threading might not be the way to go. Also, after the crawl, do you store the data in a single file or in some persistent DB? That decision could affect your performance.

Multiple processes are more efficient in that respect, but they carry the time and memory overhead of spawning the extra processes. I have explored both of these options in Python recently. Here's the URL (with code):

python -> multiprocessing module

5 Comments

IO code runs with the GIL released. For IO-bound work like this, threading works well.
All I wanted to say was that, when considering multi-threading in Python, one needs to keep the GIL in mind. After getting the URL data, one may want to parse it (creating a DOM is CPU-intensive) or dump it directly into a file (an IO operation). In the latter case the effect of the GIL is downplayed, but in the former the GIL plays a prominent part in the efficiency of the program. Don't know why people find it so offensive that they have to downvote the post...
@user428862 threading and multiprocessing in Python have essentially the same interfaces/API calls. You could just take my example and import threading instead of import multiprocessing, as in the sketch after these comments. Give it a try, and if you run into problems I'll help you...
I am new to Python. I looked at your code; there are no imports. Thanks for the offer of help.
Oh in that case it should be import multiprocessing.
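
A minimal sketch of the swap described in the comments above: threading.Thread mirrors multiprocessing.Process almost exactly, and because threads share memory, the workers can collect their results into an ordinary list.

import threading
import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
results = []

def fetch_quote(symbol):
    fp = urllib.urlopen(URL % symbol)
    try:
        # list.append is atomic under the GIL, so no explicit lock is needed
        results.append(fp.read())
    finally:
        fp.close()

if __name__ == '__main__':
    # one thread per symbol; Thread takes the same target/args as Process
    threads = [threading.Thread(target=fetch_quote, args=(s,))
               for s in symbols]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for data in results:
        print data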
