
In the code below, I am considering using multi-threading or multi-processing for fetching data from the URL. I think a pool would be ideal. Can anyone suggest a solution?

Idea: a pool of threads/processes that collects the data. My preference is processes over threads, but I'm not sure.

import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: a one-element tuple

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
#    print data_fp
if __name__ == '__main__':
    main()
3 Comments
  • Is there even anything else you want to do in parallel? Your code makes only a single request. Commented Sep 8, 2010 at 16:25
  • No; right now I'm learning Python, so I'm trying to keep everything really simple. Thanks. Commented Sep 8, 2010 at 16:29
  • I have seen the process method; can anyone show me the threading method? Please, thanks. Commented Sep 8, 2010 at 17:37

4 Answers


Your current code requests several quotes in a single HTTP call. Let's first fetch them one by one; your code becomes:

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()

So main() fetches each URL one by one to get the data. Now let's parallelize it with a multiprocessing Pool:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    pool = Pool(processes=5)
    # submit every request first so the workers run concurrently...
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    # ...then collect the answers in order (calling get() right after each
    # submit, as in result.get(timeout=1), would serialize the requests)
    for result in results:
        print result.get(timeout=10)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

In this main(), each symbol's URL is requested in its own pool worker process.

Note: in Python, because of the GIL, multithreading is usually considered the wrong solution.

For documentation, see the multiprocessing module in the Python standard library.
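
For what it's worth, the same pool can be driven more compactly with pool.map, which submits everything up front and returns the results in the original order. A minimal sketch, reusing the URL and symbols from the question (the pool size of 5 is arbitrary):

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    # same fetcher as above: one HTTP request per argument tuple
    fp = urllib.urlopen(URL % '+'.join(symbols))
    try:
        return fp.read()
    finally:
        fp.close()

def main():
    pool = Pool(processes=5)
    try:
        # pool.map blocks until all workers finish and keeps input order
        for data in pool.map(fetch_quote, [(s,) for s in symbols]):
            print data
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()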


4 Comments

GIL is not an issue here because this task is definitely IO-bound.
This method is much slower than no multiprocessing at all. With a list of 150 stocks (copy the list above until it holds 150 symbols) it errors out and is very slow. Would threading be better?
@user428862 One reason it gets slow as your list of symbols grows is that pool.apply_async serializes your arguments and passes them to the child process via pipes; as the size of the arguments passed to child processes increases, so does the overhead. On Windows there's not much we can do, but try this approach on UNIX: there, fork(2) is used to spawn processes, which essentially passes the entire parent process state to the child. So if there is a global variable in the parent process, a child process will be able to access it. Since symbols is already global, don't pass it in the args, and there is no serialization...
@movie, thanks. A single thread takes 2 sec, multiprocessing takes 18 sec. For comparison, how would I multithread this, or will the same problem arise?
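
For the threaded comparison asked about above, a minimal sketch, assuming Python 2.6+: multiprocessing.dummy exposes the same Pool API backed by threads in a single process, so nothing is serialized across pipes, and the GIL is released while each thread blocks on the network:

import urllib
from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, thread-backed

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    fp = urllib.urlopen(URL % '+'.join(symbols))
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = ThreadPool(4)  # 4 worker threads in this process; nothing is pickled
    # the requests overlap because each thread releases the GIL while it
    # waits on its socket, even though only one runs Python code at a time
    for data in pool.map(fetch_quote, [(s,) for s in symbols]):
        print data
    pool.close()
    pool.join()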

So here's a very simple example. It iterates over symbols, passing them one at a time to fetch_quote.

import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN','GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: a one-element tuple

def fetch_quote(symbol):
    # symbol is a single ticker string, so substitute it into the URL directly
    # ('+'.join(symbol) would join the individual characters: 'G+G+P')
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data


def main():

    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    # the args parameter must be an argument tuple; a bare string would be
    # unpacked character by character (the TypeError shown in the comments)
    results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ =='__main__':
    main()

3 Comments

There might be some issues if the retrieved pages are quite large: multiprocessing uses inter-process communication mechanisms to exchange information among processes.
True, the above was for simple illustrative purposes only. YMMV, but I wanted to show how simple it was to take his code and make it multiprocess.
I got this error:

    Creating pool with 4 processes
    pool = <multiprocessing.pool.Pool object at 0x031956D0>
    Ordered results using pool.apply_async():
    Traceback (most recent call last):
      File "C:\py\Raw\Yh_Mp.py", line 36, in <module>
        main()
      File "C:\py\Raw\Yh_Mp.py", line 30, in main
        print '\t', r.get()
      File "C:\Python26\lib\multiprocessing\pool.py", line 422, in get
        raise self._value
    TypeError: fetch_quote() takes exactly 1 argument (3 given)

Actually, it's possible to do this without either. You can get it done in one thread using asynchronous calls, for example twisted.web.client.getPage from Twisted Web.
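
A minimal sketch of that event-driven approach, assuming Twisted is installed (getPage was Twisted's standard one-shot fetch helper at the time; later releases deprecate it in favour of Agent):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def print_quote(data):
    print data

def main():
    # fire all requests at once; each Deferred fires with the response body
    deferreds = [getPage(URL % symbol).addCallback(print_quote)
                 for symbol in symbols]
    # stop the reactor once every request has succeeded or failed
    defer.DeferredList(deferreds).addBoth(lambda _: reactor.stop())

if __name__ == '__main__':
    main()
    reactor.run()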

6 Comments

@vartec No need to go for any third-party extra packages; Python 2.6+ has pretty good built-in packages for this kind of purpose.
Uh oh, someone mentioned Twisted, that means that all other answers are going to get downvoted. stackoverflow.com/questions/3490173/…
@movieyoda: well, for obvious reasons (GAE, Jython) I like to stay compatible with 2.5. Anyway, maybe I'm missing out on something: what support for asynchronous web calls was introduced in Python 2.6?
@Nick: unfortunately, because of the GIL, Python is poor at threading (I know, blocking calls run with the GIL released), so you gain nothing from using threads instead of deferred async calls. On the other hand, event-driven programming wins even in cases where you actually could use threads (see nginx, lighttpd), and obviously in the case of Python (Twisted, Tornado).
@vartec If I'm not wrong, the multiprocessing module was made available natively in Python from 2.6 onwards. I think it was called pyprocessing before that, as a separate third-party module.

As you may know, multi-threading in Python is not truly parallel, due to the GIL: essentially a single thread runs at any given time. So if you want multiple URLs fetched in parallel, multi-threading might not be the way to go. Also, after the crawl, do you store the data in a single file or in some persistent DB? That decision could affect your performance.

Multiple processes are more efficient in that respect, but they carry the time and memory overhead of spawning the extra processes. I have explored both of these options in Python recently. Here's the URL (with code):

python -> multiprocessing module

5 Comments

IO code runs with the GIL released. For IO-bound work like this, threading works well.
All I wanted to say was that, when considering multi-threading in Python, one needs to keep the GIL in mind. After getting the URL data, one may want to parse it (creating a DOM is CPU-intensive) or dump it directly into a file (an IO operation). In the latter case the effect of the GIL is downplayed, but in the former the GIL plays a prominent part in the efficiency of the program. Don't know why people find it so offensive that they have to downvote the post...
@user428862 threading and multiprocessing in Python have essentially the same interfaces/API calls. You could just take my example and import threading instead of import multiprocessing, as in the sketch after these comments. Give it a try, and if you run into problems I'll help you...
I am new to Python. I looked at your code; there are no imports. Thanks for the offer of help.
Oh in that case it should be import multiprocessing.
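
A minimal sketch of the swap described in the comments above: threading.Thread mirrors multiprocessing.Process almost exactly, and because threads share memory, the workers can collect their results into an ordinary list.

import threading
import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
results = []

def fetch_quote(symbol):
    fp = urllib.urlopen(URL % symbol)
    try:
        # list.append is atomic under the GIL, so no explicit lock is needed
        results.append(fp.read())
    finally:
        fp.close()

if __name__ == '__main__':
    # one thread per symbol; Thread takes the same target/args as Process
    threads = [threading.Thread(target=fetch_quote, args=(s,))
               for s in symbols]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for data in results:
        print data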
