Performance issue in python with nested loop

Question

I was able to improve a code written in python a lot with numpy because of the dot product. Now I still have one part of the code which is still very slow. I still don't understand multithreading and if this could help here. In my opinion this should be possible here. Do you have a nice idea what to do here?

for x1 in range(a**l):
    for x2 in range(a**l):
        for x3 in range(a**l):
            f11 = 0
            cv1 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
            cv2 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
            if cv1 == 0:
                f11 += 1
            if cv2 == 0:
                f11 += 1
            re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                        numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
            f11 *= 1.0*(1-r)/2
            re[x1][x2][x3] += f11

A triply nested loop with each level running a ** l times? Unless a is 1, or a and l are very small, this is a ridiculously large amount of work to do, O((a**l)**3). For a of 3, l of 10, you'd be looking at 205 trillion executions of the inner loop work; even if you did nothing in the body of the inner loop, that's a lot, and you're not doing nothing. — ShadowRanger
– ShadowRanger, Commented May 6, 2016 at 21:11
@HighwayJohn: Usually 2? Or always 2? Because like I said, it makes a big difference. a of 2 and l of 10 is "only" a billion executions of the inner loop, but if a might be any larger value, the scaling factor is enormous. Even just generating and discarding the numbers the fastest possible way (deque(itertools.product(range(2**10), repeat=3), 0)) takes over 10 seconds on my machine. For a == 3 (47.5 bits of work), you'd expect it to take about 192,000x longer, about 22 days. Don't even think about a == 4. — ShadowRanger
– ShadowRanger, Commented May 6, 2016 at 21:52
So far, and that will probably not change it is always 2 luckily. If it would be larger I would of course make l smaller. Its a simulation, thus I can decide it. Do you have any idea how to improve my code? — HighwayJohn
– HighwayJohn, Commented May 6, 2016 at 22:15

JoshAdel · Accepted Answer · 2016-05-08 14:07:28Z

I'm attempted to re-create the conditions that the question was interested in, but first a smaller test case to illustrate a strategy. First the author's original implementation:

import numpy as np
import numba as nb
import numpy

def func(re, ws, a, l, r):

    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0
                cv1 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
                cv2 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                            numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
                f11 *= 1.0*(1-r)/2
                re[x1][x2][x3] += f11

Now with a simple translation to Numba, which is really well suited to these types of deeply nested looping problems when you're dealing with numpy arrays and numerical calculations:

@nb.njit
def func2(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = np.sum(np.abs(ws[x1] - ws[x2]))
                cv2 = np.sum(np.abs(ws[x1] - ws[x3]))

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = np.prod(np.abs(2*ws[x1]+ws[x2]+ws[x3] -  2)) - f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

and then with some further optimizations to get rid of temporary array creation:

@nb.njit
def func3(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = 0.0
                cv2 = 0.0
                for i in range(ws.shape[1]):
                    cv1 += np.abs(ws[x1,i] - ws[x2,i])
                    cv2 += np.abs(ws[x1,i] - ws[x3,i])

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = 1.0
                for i in range(ws.shape[1]):
                    y *= np.abs(2.0*ws[x1,i] + ws[x2,i] + ws[x3,i] - 2)
                y -= f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

So some simple test data:

a = 2
l = 5
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))

and now let's check that all three functions produce the same result:

re = numpy.zeros((a**l, a**l, a**l))
func(re, ws, a, l, r)

re2 = numpy.zeros((a**l, a**l, a**l))
func2(re2, ws, a, l, r)

re3 = numpy.zeros((a**l, a**l, a**l))
func3(re3, ws, a, l, r)

print np.allclose(re, re2)  # True
print np.allclose(re, re3)  # True

And some initial timings using the jupyter notebook %timeit magic:

%timeit func(re, ws, a, l, r)
%timeit func2(re2, ws, a, l, r)
%timeit func3(re3, ws, a, l, r)

1 loop, best of 3: 404 ms per loop
100 loops, best of 3: 14.2 ms per loop
1000 loops, best of 3: 605 µs per loop

func2 is ~28x times faster than the original implementation. func3 is ~680x faster. Note that I'm running on a Macbook laptop with an i7 processor, 16 GB of RAM and using Numba 0.25.0.

Ok, so now let's time the a=2 l=10 case that everyone is wringing their hands about:

a = 2
l = 10
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))
print 'setup complete'

%timeit -n 1 -r 1 func3(re, ws, a, l, r)

# setup complete
# 1 loop, best of 1: 45.4 s per loop

So this took 45 seconds on my machine single threaded, which seems reasonable if you aren't then doing this one calculation too many times.

Wow thanks a lot!! I will try this as fast out as possible :)
What if you are dealing with json in a nested for-loop, where you need to get values of some keys and put them into a dataframe ? What is the best approach for that ?
@ShashankAC that's a largely unrelated problem that should be posed in a new question.
Alright, I asked a separate question. stackoverflow.com/questions/60387457/…

ev-br · Accepted Answer · 2016-05-07 09:05:03Z

1

Before worrying about multiprocessing, try using what plain numpy offers.

First, make sure ws are numpy arrays, not lists of some such. Then

cv1 = numpy.ndarray.sum(numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
if cv1 == 0:
   f11 += 1

becomes f11 = np.nonzero(ws[x1] == ws[x2]).

Do the same for the rest of the code, and you'll be able to see more structure: np.product is just * and so on.

Then, re[x1][x2][x3] is not how you normally index numpy arrays, use re[x1, x2, x3]. This alone would save you quite a bit of time and memory allocations.

Once this is done, look if you can actually vectorize the expressions, i.e. use numpy all-array operations instead of plain python loops (it's likely that you can, but it's too hard to see with the current state of the code snippet).

answered May 7, 2016 at 9:05

ev-br

26.3k9 gold badges68 silver badges84 bronze badges

5 Comments

HighwayJohn Over a year ago

Why is it a different if I write re[x1][x2][x3] instead of re[x1, x2, x3] in terms of memory allocation?

HighwayJohn Over a year ago

I tried your code and it doesn't work:Expected 'Union[ndarray, Iterable]', got bool instead.

Han-Kwang Nienhuys Over a year ago

re[x1][x2][x3] is equivalent to re1=re[x1]; re2=re1[x2]; return re2[x3], even though you don't see those temporary variables re1 and re2. You do this in the inner loop.

HighwayJohn Over a year ago

Okay I changed this but it didn't really speed it up

HighwayJohn Over a year ago

ws is a l *a**l numpy.nsdarray. I think that is also why your code don't work. You can see in the comments above how it is created

Collectives™ on Stack Overflow

Performance issue in python with nested loop

2 Answers 2

4 Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related