I am testing out the multiprocessing module in Python using this example. It counts the length of each word in a corpus.
from multiprocessing import Pool

def open_file(file):
    with open(file) as f:
        f = f.read()
    return f

def split_words(file):
    f = open_file(file)
    return [[len(i), i] for i in f.split()]

def split_mult(file):
    # uses the multiprocessing module
    pool = Pool(processes=4)
    work = pool.apply_async(split_words, [file])
    return work.get()
print split_words("random.txt") - about 90seconds for a 110K file
print split_mult("random.txt") - about 90seconds for a 110K file
The *split_mult* function uses multiprocessing and *split_words* does not. I was under the impression that I would see faster processing times with the multiprocessing module, but there is little to no difference in runtime. I've run each function about 5 times. Is there something I'm missing?
UPDATE:
I rewrote the code with a better understanding of multiprocessing and was able to get the processing time down to ~12 seconds! It's quick and dirty code, but hopefully it's helpful to others trying to understand this concept: https://github.com/surajkapoor/MultiProcessing-Test/blob/master/multi.py
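For anyone who doesn't want to follow the link, here is a rough sketch of the idea (my own illustration, not the exact code in the repo): read the file once, split the word list into chunks, and let `Pool.map` process the chunks in parallel. The key point is that `apply_async` with a single call just hands the whole job to one worker, so it can't be faster than the serial version.

from multiprocessing import Pool

def word_lengths(words):
    # Runs in a worker process on one chunk of the word list.
    return [[len(w), w] for w in words]

def split_mult(file, processes=4):
    # Read the file once in the parent process.
    with open(file) as f:
        words = f.read().split()

    # Roughly one chunk per worker, so the workers actually run in parallel.
    chunk_size = max(1, len(words) // processes)
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

    pool = Pool(processes=processes)
    results = pool.map(word_lengths, chunks)
    pool.close()
    pool.join()

    # Flatten the per-chunk results back into a single list.
    return [pair for chunk in results for pair in chunk]

if __name__ == "__main__":
    print(split_mult("random.txt"))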