How to parallelize a nested for loop in python?

Question

OK, here is my problem: I have a nested for loop in my program that runs on a single core. Since the program spends over 99% of its run time in this nested loop, I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop using the multiprocessing library, but I can only find very basic examples and cannot transfer them to my problem. Here are the nested loops with random data:

import numpy as np

dist_n = 100
nrm = np.linspace(1,10,dist_n)

data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

for t in range(data_Y):
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]

    print(dist)

Please give me some advice on how to make it parallel.

In the code, it looks like dist[i,p] is completely overwritten at every step of t. I don't see any dependency on previous steps of t, so you would only need to compute the last step, t = data_Y - 1. Or is this supposed to be dist[i,p] += rather than just =? I ask because it's important to understand which sections can run in parallel and which depend on each other and therefore must run serially.
– bivouac0, Commented Feb 11, 2018 at 18:00
@bivouac0 Yes, you are right. I forgot to mention that dist gets written to a file for every t, so it is filled for every t and not only for the last one. I hope that helps.
– Gilfoyle, Commented Feb 12, 2018 at 8:25

bivouac0 · Accepted Answer · 2018-02-12 14:40:34Z

There's a small overhead for starting a process (50 ms or more, depending on data size), so it's generally best to parallelize the largest block of code possible. From your comment it sounds like each iteration of t is independent, so we should be free to parallelize over t.

When Python creates a new process with fork (available on Linux and other POSIX systems), the child gets a copy of the main process, so all your global data is available, but when a process writes to that data, it writes to its own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which has some overhead). In your situation, if each process writes dist[i,p] to a file, you should be fine; just don't write to the same file from several processes unless you implement some kind of mutex access control.
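To make the "local copy" point concrete, here is a minimal sketch (not part of the original answer). It assumes the fork start method is available, so it will not run on Windows. The names run_demo and the tiny list standing in for dist are made up for illustration.

```python
import multiprocessing as mp

# Global list standing in for the dist array; each forked worker
# inherits its own private copy of it.
dist = [0.0, 0.0, 0.0, 0.0]

def worker(t):
    # This write only changes the copy that lives in the worker process.
    dist[t] = t + 1.0
    # Returning the value is what actually sends it back to the parent.
    return t, dist[t]

def run_demo():
    # Assumption: "fork" is available (Linux/macOS, not Windows).
    ctx = mp.get_context("fork")
    with ctx.Pool(2) as pool:
        returned = dict(pool.map(worker, range(4)))
    # The parent's copy of dist has not been touched by the workers.
    return list(dist), returned

if __name__ == "__main__":
    parent_dist, returned = run_demo()
    print("parent copy:    ", parent_dist)
    print("returned values:", returned)
```

The parent's list stays at its initial values, while the dictionary built from the return values contains what the workers computed.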

#!/usr/bin/env python3
import time
import multiprocessing as mp
import numpy as np

data_Y = 11 #11000
data_I = 90 #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

def worker(t):
    # Computes dist for one value of t; I, Y and nrm are read from the globals
    st = time.time()
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]
    # Here - each worker opens a different file and writes to it
    print('Worker time %4.3f ms' % (1000.*(time.time()-st)))


if __name__ == '__main__':
    # single process
    st = time.time()
    for x in map(worker, range(data_Y)):
        pass
    print('Single-process total time is %4.3f seconds' % (time.time()-st))
    print()

    # multiple processes
    pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print('Multiprocess total time is %4.3f seconds' % (time.time()-st))
    print()
    pool.close()
    pool.join()

If you increase data_Y and data_I back to their original sizes, the speed-up should increase toward the theoretical limit.
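The "each worker opens a different file" placeholder in worker() could be filled in roughly as follows. This is only a sketch under assumptions: the file naming scheme (dist_00000.npy and so on), the run() helper, the use of np.save, and the small array sizes are all made up for illustration, and the fork start method is assumed as above.

```python
import multiprocessing as mp
import os
import tempfile
import numpy as np

# Small stand-in sizes; the question uses data_Y = 11000, data_I = 90000.
data_Y, data_I, dist_n = 4, 6, 5
nrm = np.linspace(1, 10, dist_n)
rng = np.random.default_rng(0)
I = rng.standard_normal((data_I, 50))
Y = rng.standard_normal((data_Y, 50))

def worker(args):
    t, out_dir = args
    dist = np.zeros((data_I, dist_n))  # local to this call
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i, p] = np.sum(d ** nrm[p]) / nrm[p]
    # Hypothetical naming scheme: one file per t, so no two workers
    # ever write to the same file and no locking is needed.
    path = os.path.join(out_dir, "dist_%05d.npy" % t)
    np.save(path, dist)
    return path

def run(out_dir, n_procs=2):
    # Assumption: "fork" is available (Linux/macOS, not Windows).
    ctx = mp.get_context("fork")
    tasks = [(t, out_dir) for t in range(data_Y)]
    with ctx.Pool(n_procs) as pool:
        return sorted(pool.imap_unordered(worker, tasks))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in run(tmp):
            print(os.path.basename(path), np.load(path).shape)
```

Only the short file path is sent back to the parent, so the cost of pickling a full dist array for every t is avoided.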

2 Comments

Gilfoyle Over a year ago

Why don't you pass arguments like I or Y to worker()? Why is that not necessary?

bivouac0 Over a year ago

When multiprocessing forks a new process, you get a copy of the current process, which includes all its data (I and Y are globals and so are accessible to the worker function). You can pass them as arguments, but that's not required and will force them to be pickled, which imposes some additional overhead. It's probably good practice to put global I, Y at the top of worker(), although it's not strictly required here. I also see that dist is defined globally. It should probably be defined inside worker().
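If the results are needed in the main process rather than in files, one possible arrangement is the sketch below. It keeps I and Y as globals, creates dist inside the worker, and returns it together with t. The helper names compute_dist and collect and the small sizes are hypothetical, and fork is assumed again.

```python
import multiprocessing as mp
import numpy as np

data_Y, data_I, dist_n = 5, 4, 3  # small stand-in sizes
nrm = np.linspace(1, 10, dist_n)
rng = np.random.default_rng(1)
I = rng.standard_normal((data_I, 20))  # read-only globals, inherited via fork
Y = rng.standard_normal((data_Y, 20))

def compute_dist(t):
    # dist is created per call, so no state is shared between tasks.
    dist = np.zeros((data_I, dist_n))
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i, p] = np.sum(d ** nrm[p]) / nrm[p]
    return dist

def worker(t):
    # Return t along with the result, because imap_unordered yields
    # results in completion order, not in submission order.
    return t, compute_dist(t)

def collect(n_procs=2):
    # Assumption: "fork" is available (Linux/macOS, not Windows).
    ctx = mp.get_context("fork")
    results = [None] * data_Y
    with ctx.Pool(n_procs) as pool:
        for t, dist in pool.imap_unordered(worker, range(data_Y)):
            results[t] = dist  # the array was pickled back from the worker
    return results

if __name__ == "__main__":
    for t, dist in enumerate(collect()):
        print(t, dist.shape)
```

Only the integer t is pickled on the way in. The returned array is pickled on the way out, which is the overhead mentioned in the answer.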
