3

I have a Python program that reads a line from an input file, does some manipulation, and writes it to an output file. I have a quad-core machine, and I want to utilize all of the cores. I can think of three alternatives:

  1. Creating n Python processes, each handling records/n of the total input
  2. Creating one thread per input record in a single Python process, each thread processing one record
  3. Creating a pool of n threads in a single Python process, each processing one input record

I have never used Python's multiprocessing capabilities, so can anyone please tell me which of these is the best option?

3 Answers

4

The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.

The simplest way to overcome this limitation is to use the multiprocessing module instead. It has an API similar to threading's and is pretty straightforward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):

import multiprocessing

def process_line(line):
    # This function is executed in your worker processes.  Manipulate the
    # line and return the results.
    return manipulate(line)

if __name__ == '__main__':
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        # This creates a pool of N worker processes, where N is the number
        # of CPUs in your machine.
        pool = multiprocessing.Pool()

        # Let the workers do the manipulation and write the results to
        # the output file:
        for manipulated_line in pool.imap(process_line, fin):
            fout.write(manipulated_line)

3 Comments

Isn't multiprocessing more of an overhead than multithreading, due to context switching and scheduling among processes? Thanks.
Yes, it is, but you will not get any real parallelisation any other way in Python.
You can't parallelize at all without some overhead from context switching and scheduling, never mind the GIL.
0

Number one is the right answer.

First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like Pyro to take care of the details. Secondly, threading has to deal with Python's global interpreter lock, which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways of doing things, your intuition as to which way is fastest is probably wrong.

By the way, if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython instead of CPython.
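The "implement and measure" advice can be followed with a quick sketch like this one, which times the same function under a thread pool and a process pool. This is a rough benchmark, not a definitive comparison: busy_work is a made-up CPU-bound stand-in for the real per-record manipulation, and real numbers depend on your workload.

```python
import multiprocessing
import multiprocessing.pool
import time

def busy_work(n):
    # CPU-bound stand-in for the per-record manipulation.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    tasks = [200_000] * 8

    for label, pool_cls in [('threads', multiprocessing.pool.ThreadPool),
                            ('processes', multiprocessing.Pool)]:
        start = time.perf_counter()
        with pool_cls(4) as pool:
            pool.map(busy_work, tasks)
        print(f'{label}: {time.perf_counter() - start:.3f}s')
```

On CPython you would expect the process pool to win here, since the GIL serializes the thread-pool version of this CPU-bound loop.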


0

Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?

While CPython has the GIL, neither Jython nor IronPython has that limitation.

Also make sure that a single process doesn't already max out disk I/O. You will have a hard time gaining anything if it does.
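Splitting the file beforehand, as suggested above, can be done with a short helper so that each process gets its own input file. A sketch, assuming the input fits in memory; the `.partN` file-name suffix is made up for illustration:

```python
def split_file(path, n_chunks):
    # Split a text file into n_chunks contiguous slices of lines,
    # writing each slice to its own chunk file.
    with open(path) as f:
        lines = f.readlines()
    size = (len(lines) + n_chunks - 1) // n_chunks  # ceiling division
    names = []
    for i in range(n_chunks):
        name = f'{path}.part{i}'
        with open(name, 'w') as out:
            out.writelines(lines[i * size:(i + 1) * size])
        names.append(name)
    return names
```

Each chunk file can then be handed to a separate worker process, which sidesteps the problem of several processes reading the same file concurrently.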

2 Comments

I am surprised that the multithreading option is not suggested, because multiprocessing has an obvious performance overhead and more complicated code (splitting the input, etc.).
@Bala: This is a limitation of the Python interpreter; please read wiki.python.org/moin/GlobalInterpreterLock.
