
I am testing out the multiprocessing module in Python using this example. It counts the length of each word in a corpus.

from multiprocessing import Pool

def open_file(file):
    with open(file) as f:
        f = f.read()
    return f

def split_words(file):
    f = open_file(file)
    return [[len(i), i] for i in f.split()]


def split_mult(file):
    #uses the multiprocessing module
    pool = Pool(processes = 4)  
    work = pool.apply_async(split_words, [file])
    return work.get()

print split_words("random.txt")  # about 90 seconds for a 110K file
print split_mult("random.txt")   # about 90 seconds for a 110K file

The *split_mult* function uses multiprocessing and *split_words* does not. I was under the impression that I would see faster processing time using the multiprocessing module but there is little to no difference in runtime. I've run each function about 5 times. Is there something I'm missing?

UPDATE:

I rewrote the code with a better understanding of multiprocessing and was able to get processing time down to ~ 12 seconds! It's quick and dirty code but hopefully helpful to others trying to understand this concept - https://github.com/surajkapoor/MultiProcessing-Test/blob/master/multi.py

  • Is that your whole code? I don't see anything here that would benefit from multiprocessing. Commented Mar 20, 2014 at 19:48
  • IPython has many more capabilities for high-level parallelization - check out ipython.org/ipython-doc/stable/parallel/parallel_intro.html Commented Mar 20, 2014 at 19:56
  • @dav1d yep, that's my entire code. I think I misunderstood the module's purpose :-/ Commented Mar 20, 2014 at 20:06

2 Answers


Python does not have the facilities to magically make your code work in parallel.

What you did here is create a pool of 4 processes and give it a single task, which runs in just one of those processes.

A process/thread pool is used to run a large number of tasks in parallel (at most 4 at a time, or however many you specify).
Splitting a task into many subtasks is the programmer's responsibility.
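
For illustration only, here is a minimal sketch of what splitting this particular job into subtasks could look like. The names chunk_lengths and split_words_parallel and the chunking scheme are made up for this example, and on Windows the call would need to sit under an if __name__ == '__main__': guard:

from multiprocessing import Pool

def chunk_lengths(words):
    # worker: compute [length, word] pairs for one chunk of the word list
    return [[len(w), w] for w in words]

def split_words_parallel(path, processes=4):
    with open(path) as f:
        words = f.read().split()
    # divide the word list into one roughly equal chunk per process
    size = len(words) // processes + 1
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    pool = Pool(processes=processes)
    try:
        # each chunk becomes a separate task, so all the workers get used
        results = pool.map(chunk_lengths, chunks)
    finally:
        pool.close()
        pool.join()
    # flatten the per-chunk results back into a single list
    return [pair for chunk in results for pair in chunk]

Whether this is actually faster depends on how expensive the per-word work is compared to the cost of pickling the chunks and results between processes.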


2 Comments

You're right, I seem to have misunderstood the module's purpose. So, effectively, if I gave split_mult several text files to process, it would handle them concurrently and faster than passing them one at a time to split_words?
Splitting 4 strings in parallel works; splitting one string in parallel (as in your code) won't. Python doesn't magically make non-parallel functions parallel (split in this case).
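
As an illustration of the multi-file idea from these comments, a minimal sketch reusing split_words from the question (split_many and the 4-process default are assumptions for this example):

from multiprocessing import Pool

def split_many(files, processes=4):
    # one task per file; up to `processes` files are handled at once
    pool = Pool(processes=processes)
    try:
        return pool.map(split_words, files)
    finally:
        pool.close()
        pool.join()

# e.g. split_many(["a.txt", "b.txt", "c.txt", "d.txt"])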

I/O-intensive tasks can actually be slowed down by making them more parallel. This is particularly true of mechanical hard drives.

Imagine you were able to divide the file into 4 parts and run 4 processes on them: they would cause the drive to seek more than reading the file once sequentially would.

The same situation occurs if you have 4 workers on 4 separate files, although in that case you at least don't have to think about how to split the file.

If len were a time-consuming operation, you might see a performance improvement by reading the file sequentially line by line and having the workers pull those lines from a Queue. However, unless you have very fast storage (or the file is cached), it will not make much difference.
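
A rough sketch of that Queue idea, assuming the same per-word work as in the question; worker, process_file, and the sentinel scheme are illustrative choices, not a fixed recipe:

from multiprocessing import Process, Queue

def worker(in_queue, out_queue):
    # consume lines until the None sentinel arrives, then exit
    for line in iter(in_queue.get, None):
        out_queue.put([[len(w), w] for w in line.split()])

def process_file(path, workers=4):
    in_queue, out_queue = Queue(), Queue()
    procs = [Process(target=worker, args=(in_queue, out_queue))
             for _ in range(workers)]
    for p in procs:
        p.start()
    lines = 0
    with open(path) as f:
        # the parent reads sequentially; workers pull lines as they arrive
        for line in f:
            in_queue.put(line)
            lines += 1
    for _ in procs:
        in_queue.put(None)  # one sentinel per worker
    # drain all results before joining, to avoid blocking on a full queue;
    # note that results arrive in completion order, not file order
    results = [out_queue.get() for _ in range(lines)]
    for p in procs:
        p.join()
    return results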

