
I am testing out the multiprocessing module in Python using this example. It counts the length of each word in a corpus.

from multiprocessing import Pool

def open_file(file):
    with open(file) as f:
        f = f.read()
    return f

def split_words(file):
    f = open_file(file)
    return [[len(i), i] for i in f.split()]


def split_mult(file):
    #uses the multiprocessing module
    pool = Pool(processes = 4)  
    work = pool.apply_async(split_words, [file])
    return work.get()

print split_words("random.txt")  # about 90 seconds for a 110K file
print split_mult("random.txt")   # about 90 seconds for a 110K file

The *split_mult* function uses multiprocessing and *split_words* does not. I was under the impression that I would see faster processing time using the multiprocessing module but there is little to no difference in runtime. I've run each function about 5 times. Is there something I'm missing?

UPDATE:

I rewrote the code with a better understanding of multiprocessing and was able to get processing time down to ~ 12 seconds! It's quick and dirty code but hopefully helpful to others trying to understand this concept - https://github.com/surajkapoor/MultiProcessing-Test/blob/master/multi.py

  • Is that your whole code? I don't see anything here that would benefit from multiprocessing. Commented Mar 20, 2014 at 19:48
  • IPython has many more capabilities for high-level parallelization - check out ipython.org/ipython-doc/stable/parallel/parallel_intro.html Commented Mar 20, 2014 at 19:56
  • @dav1d yep, that's my entire code. I think I misunderstood the module's purpose :-/ Commented Mar 20, 2014 at 20:06

2 Answers


Python does not have the facilities to magically make your code work in parallel.

What you did here is create a pool of 4 processes and give it a single task, which runs in just one of those processes.

A process/thread pool is used to run a large number of tasks in parallel (at most 4 at a time, or however many you specify).
Splitting a task into many subtasks is the programmer's responsibility.
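
For illustration only, here is a minimal sketch of what splitting this particular job into subtasks could look like. The names chunk_lengths and split_words_parallel and the chunking scheme are made up for this example, and on Windows the call would need to sit under an if __name__ == '__main__': guard:

from multiprocessing import Pool

def chunk_lengths(words):
    # worker: compute [length, word] pairs for one chunk of the word list
    return [[len(w), w] for w in words]

def split_words_parallel(path, processes=4):
    with open(path) as f:
        words = f.read().split()
    # divide the word list into one roughly equal chunk per process
    size = len(words) // processes + 1
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    pool = Pool(processes=processes)
    try:
        # each chunk becomes a separate task, so all the workers get used
        results = pool.map(chunk_lengths, chunks)
    finally:
        pool.close()
        pool.join()
    # flatten the per-chunk results back into a single list
    return [pair for chunk in results for pair in chunk]

Whether this is actually faster depends on how expensive the per-word work is compared to the cost of pickling the chunks and results between processes.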


2 Comments

You're right, I seem to have misunderstood the module's purpose. So, effectively, if I gave split_mult several text files to process, it would handle them concurrently and faster than passing them one at a time to split_words?
Splitting 4 strings in parallel works; splitting one string in parallel (as in your code) won't. Python doesn't magically make non-parallel functions parallel (split in this case).
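
As an illustration of the multi-file idea from these comments, a minimal sketch reusing split_words from the question (split_many and the 4-process default are assumptions for this example):

from multiprocessing import Pool

def split_many(files, processes=4):
    # one task per file; up to `processes` files are handled at once
    pool = Pool(processes=processes)
    try:
        return pool.map(split_words, files)
    finally:
        pool.close()
        pool.join()

# e.g. split_many(["a.txt", "b.txt", "c.txt", "d.txt"])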

I/O-intensive tasks can actually be slowed down by making them more parallel. This is particularly true of mechanical hard drives.

Imagine you were able to divide the file into 4 parts and run 4 processes on them: they would cause the drive to seek more than reading the file once sequentially would.

The same situation occurs if you have 4 workers on 4 separate files, although in that case you at least don't have to think about how to split the file.

If len were a time-consuming operation, you might see a performance improvement by reading the file sequentially line by line and having the workers pull those lines from a Queue. However, unless you have very fast storage (or the file is cached), it will not make much difference.
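
A rough sketch of that Queue idea, assuming the same per-word work as in the question; worker, process_file, and the sentinel scheme are illustrative choices, not a fixed recipe:

from multiprocessing import Process, Queue

def worker(in_queue, out_queue):
    # consume lines until the None sentinel arrives, then exit
    for line in iter(in_queue.get, None):
        out_queue.put([[len(w), w] for w in line.split()])

def process_file(path, workers=4):
    in_queue, out_queue = Queue(), Queue()
    procs = [Process(target=worker, args=(in_queue, out_queue))
             for _ in range(workers)]
    for p in procs:
        p.start()
    lines = 0
    with open(path) as f:
        # the parent reads sequentially; workers pull lines as they arrive
        for line in f:
            in_queue.put(line)
            lines += 1
    for _ in procs:
        in_queue.put(None)  # one sentinel per worker
    # drain all results before joining, to avoid blocking on a full queue;
    # note that results arrive in completion order, not file order
    results = [out_queue.get() for _ in range(lines)]
    for p in procs:
        p.join()
    return results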

