
I have seen a number of questions on using Python's multiprocessing, but I have not quite been able to wrap my head around how to apply it to my code.

Suppose I have an NxM array and a function f that compares the value at (i,j) to every other pixel. So, in essence, I compute NxM values at every point on the grid.

My machine has four cores. I envision splitting the input locations into four quadrants and feeding each quadrant to a different process.

So, schematically, my current code would be:

def f(x, array):
    """
    For input location x, do some operation 
    with every other array value. As a simple example,
    this could be the product of array[x] with all other
    array values
    """
    return array[x]*array

if __name__ == '__main__':
    array = some_array
    for x in range(array.size):
        returnval = f(x,array)


What would be the best strategy for optimizing a problem like this using multiprocessing?

  • How do you want to store the answers? In another array? Commented Nov 7, 2015 at 20:46
  • If your problem is mathematical you may want to look at numpy. Although CPython is bound to single-core performance by the GIL, numpy code releases it, so your code can be even more effective with threading+numpy than with multiprocessing. Commented Nov 7, 2015 at 22:48
  • @bbayles yes, it would have to be another array of the same size, NxM. Retaining the order is thus important. Commented Nov 8, 2015 at 8:58
  • @myaut could you elaborate on how numpy specifically would help me here? In addition, as I understand it threading is particularly useful for I/O, not so much for heavy calculations? Thanks! Commented Nov 8, 2015 at 8:58
  • @John: Check this answer: stackoverflow.com/questions/6200437/… Commented Nov 9, 2015 at 11:49

2 Answers


The multiprocessing library can do most of the heavy lifting for you, with its Pool.map function.

The only limitation is that Pool.map expects a function that takes a single argument: the value you are iterating over. Based on this other question, the final implementation would be:

from multiprocessing import Pool
from functools import partial

# Make a version of f that only takes the x parameter
partial_f = partial(f, array=SOME_ARRAY)

if __name__ == '__main__':
    pool = Pool(processes=5)  # number of CPUs + 1
    returnval = pool.map(partial_f, range(SOME_ARRAY.size))
    pool.close()
    pool.join()
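
One follow-up note, not from the original answer: pool.map preserves the order of its input, so returnval[x] corresponds to f(x, SOME_ARRAY). If each call returns an NxM array, as with the schematic f from the question, the results can be stacked back into a single numpy array. A minimal sketch, assuming SOME_ARRAY is a numpy array of shape (N, M):

import numpy as np

# returnval[x] == f(x, SOME_ARRAY); pool.map keeps the order of range(SOME_ARRAY.size)
results = np.stack(returnval)                                      # shape (N*M, N, M)
per_point = results.reshape(SOME_ARRAY.shape + SOME_ARRAY.shape)   # shape (N, M, N, M)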

The strategy you've described - send the data set to several workers, have each compute values for part of the data set, and stitch them together at the end - is the classical way to parallelize computation for an algorithm like this where answers don't depend on each other. This is what MapReduce does.

That said, introducing parallelism makes programs harder to understand, and harder to adapt or improve later. Both the algorithm you've described, and the use of raw Python for large numerical computations, might be better things to change first.

Consider checking whether expressing your calculation with numpy arrays makes it run faster, and whether you can find or develop a more efficient algorithm for the problem you're trying to solve, before parallelizing your code.
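
As a concrete illustration of the numpy suggestion, here is a minimal sketch based only on the schematic f from the question (where each result is the product of one element with the whole array). The full NxM-by-NxM set of pairwise products is just an outer product, which numpy computes without any explicit Python loop:

import numpy as np

N, M = 100, 80
array = np.random.rand(N, M)

# All pairwise products at once: results[x] is array.flat[x] * array.ravel(),
# i.e. the flattened equivalent of f(x, array) from the question.
results = np.outer(array, array)   # shape (N*M, N*M)

# The result for grid point (i, j), reshaped back to NxM:
i, j = 3, 7
per_point = results[i * M + j].reshape(N, M)

For this particular f, the vectorized version will usually beat spawning processes by a wide margin; for a more expensive per-pixel operation, the Pool approach from the other answer may still pay off.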
