
I have seen a number of questions on using Python's multiprocessing, but I have not quite been able to wrap my head around how to apply it to my code.

Suppose I have an NxM array and a function f that compares the value at (i,j) to every other pixel. So, in essence, I compute NxM values at every point on the grid.

My machine has four cores. I envision splitting the input locations into four quadrants and feeding each quadrant to a different process.

So, schematically, my current code would be:

def f(x, array):
    """
    For input location x, do some operation 
    with every other array value. As a simple example,
    this could be the product of array[x] with all other
    array values
    """
    return array[x]*array

if __name__ == '__main__':
    array = some_array
    for x in range(array.size):
        returnval = f(x,array)


What would be the best strategy for optimizing a problem like this using multiprocessing?

  • How do you want to store the answers? In another array? Commented Nov 7, 2015 at 20:46
  • If your problem is mathematical you may want to look at numpy. Although CPython is bound to single-core performance by the GIL, numpy code releases it, so your code can be even more effective with threading+numpy than with multiprocessing. Commented Nov 7, 2015 at 22:48
  • @bbayles yes, it would have to be another array of the same size, NxM. Retaining the order is thus important. Commented Nov 8, 2015 at 8:58
  • @myaut could you elaborate on how numpy specifically would help me here? In addition, as I understand it threading is particularly useful for I/O, not so much for heavy calculations? Thanks! Commented Nov 8, 2015 at 8:58
  • @John: Check this answer: stackoverflow.com/questions/6200437/… Commented Nov 9, 2015 at 11:49

2 Answers


The multiprocessing library can do most of the heavy lifting for you, with its Pool.map function.

The only limitation is that Pool.map expects a function that takes a single argument: the value you are iterating over. Based on this other question, the final implementation would be:

from multiprocessing import Pool
from functools import partial

# Make a version of f that only takes the x parameter
partial_f = partial(f, array=SOME_ARRAY)

if __name__ == '__main__':
    pool = Pool(processes=5)  # number of CPUs + 1
    returnval = pool.map(partial_f, range(SOME_ARRAY.size))
    pool.close()
    pool.join()
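
One follow-up note, not from the original answer: pool.map preserves the order of its input, so returnval[x] corresponds to f(x, SOME_ARRAY). If each call returns an NxM array, as with the schematic f from the question, the results can be stacked back into a single numpy array. A minimal sketch, assuming SOME_ARRAY is a numpy array of shape (N, M):

import numpy as np

# returnval[x] == f(x, SOME_ARRAY); pool.map keeps the order of range(SOME_ARRAY.size)
results = np.stack(returnval)                                      # shape (N*M, N, M)
per_point = results.reshape(SOME_ARRAY.shape + SOME_ARRAY.shape)   # shape (N, M, N, M)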

The strategy you've described - send the data set to several workers, have each compute values for part of the data set, and stitch them together at the end - is the classical way to parallelize computation for an algorithm like this where answers don't depend on each other. This is what MapReduce does.

That said, introducing parallelism makes programs harder to understand, and harder to adapt or improve later. Both the algorithm you've described, and the use of raw Python for large numerical computations, might be better things to change first.

Consider checking whether expressing your calculation with numpy arrays makes it run faster, and whether you can find or develop a more efficient algorithm for the problem you're trying to solve, before parallelizing your code.
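
As a concrete illustration of the numpy suggestion, here is a minimal sketch based only on the schematic f from the question (where each result is the product of one element with the whole array). The full NxM-by-NxM set of pairwise products is just an outer product, which numpy computes without any explicit Python loop:

import numpy as np

N, M = 100, 80
array = np.random.rand(N, M)

# All pairwise products at once: results[x] is array.flat[x] * array.ravel(),
# i.e. the flattened equivalent of f(x, array) from the question.
results = np.outer(array, array)   # shape (N*M, N*M)

# The result for grid point (i, j), reshaped back to NxM:
i, j = 3, 7
per_point = results[i * M + j].reshape(N, M)

For this particular f, the vectorized version will usually beat spawning processes by a wide margin; for a more expensive per-pixel operation, the Pool approach from the other answer may still pay off.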
