3

I have a Python program that reads a line from an input file, does some manipulation, and writes it to an output file. I have a quad-core machine, and I want to utilize all of the cores. I can think of three alternatives:

  1. Creating n Python processes, each handling records/n of the total input
  2. Creating one thread per input record in a single Python process, each thread processing one record
  3. Creating a pool of n threads in a single Python process, each processing one input record

I have never used Python's multiprocessing capabilities, so can anyone please tell me which of these is the best option?

3 Answers

4

The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.

The simplest way to overcome this limitation is to use the multiprocessing module instead. It has an API similar to threading's and is pretty straightforward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):

import multiprocessing

def process_line(line):
    # This function is executed in your worker processes.  Manipulate the
    # line and return the results.
    return manipulate(line)

if __name__ == '__main__':
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        # This creates a pool of N worker processes, where N is the number
        # of CPUs in your machine.
        pool = multiprocessing.Pool()

        # Let the workers do the manipulation and write the results to
        # the output file:
        for manipulated_line in pool.imap(process_line, fin):
            fout.write(manipulated_line)

3 Comments

Isn't multiprocessing more of an overhead than multithreading, due to context switching and scheduling among processes? Thanks.
Yes, it is, but you will not get any real parallelisation any other way in Python.
You can't parallelize at all without some overhead from context switching and scheduling, never mind the GIL.
0

Number one is the right answer.

First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like Pyro to take care of the details. Secondly, threading has to deal with Python's global interpreter lock, which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways of doing things, your intuition as to which way is fastest is probably wrong.

By the way, if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython instead of CPython.
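The "implement and measure" advice can be followed with a quick sketch like this one, which times the same function under a thread pool and a process pool. This is a rough benchmark, not a definitive comparison: busy_work is a made-up CPU-bound stand-in for the real per-record manipulation, and real numbers depend on your workload.

```python
import multiprocessing
import multiprocessing.pool
import time

def busy_work(n):
    # CPU-bound stand-in for the per-record manipulation.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    tasks = [200_000] * 8

    for label, pool_cls in [('threads', multiprocessing.pool.ThreadPool),
                            ('processes', multiprocessing.Pool)]:
        start = time.perf_counter()
        with pool_cls(4) as pool:
            pool.map(busy_work, tasks)
        print(f'{label}: {time.perf_counter() - start:.3f}s')
```

On CPython you would expect the process pool to win here, since the GIL serializes the thread-pool version of this CPU-bound loop.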


0

Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?

While CPython has the GIL, neither Jython nor IronPython has that limitation.

Also make sure that a single process doesn't already max out disk I/O. You will have a hard time gaining anything if it does.
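Splitting the file beforehand, as suggested above, can be done with a short helper so that each process gets its own input file. A sketch, assuming the input fits in memory; the `.partN` file-name suffix is made up for illustration:

```python
def split_file(path, n_chunks):
    # Split a text file into n_chunks contiguous slices of lines,
    # writing each slice to its own chunk file.
    with open(path) as f:
        lines = f.readlines()
    size = (len(lines) + n_chunks - 1) // n_chunks  # ceiling division
    names = []
    for i in range(n_chunks):
        name = f'{path}.part{i}'
        with open(name, 'w') as out:
            out.writelines(lines[i * size:(i + 1) * size])
        names.append(name)
    return names
```

Each chunk file can then be handed to a separate worker process, which sidesteps the problem of several processes reading the same file concurrently.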

2 Comments

I am surprised that the multithreading option is not suggested, because multiprocessing has an obvious performance overhead and more complicated code (splitting the input, etc.).
@Bala: This is a limitation of the Python interpreter; please read wiki.python.org/moin/GlobalInterpreterLock.
