numpy vectorization of double python for loop

Question

V is (n,p) numpy array typically dimensions are n~10, p~20000

The code I have now looks like

A = np.zeros(p)
for i in xrange(n):
    for j in xrange(i+1):
        A += F[i,j] * V[i,:] * V[j,:]

How would I go about rewriting this to avoid the double python for loop?

did you consider writing those for-loops in C (since they are simple), and using scipy.weave.blitz ? — usethedeathstar
– usethedeathstar, Commented Nov 21, 2013 at 7:53
I would prefer to avoid cython or weave.blitz for the moment if some python code will be "good enough" — user1984528
– user1984528, Commented Nov 21, 2013 at 17:54

Jaime · Accepted Answer · 2013-11-21 01:35:12Z

10

While Isaac's answer seems promising, as it removes those two nested for loops, you are having to create an intermediate array M which is n times the size of your original V array. Python for loops are not cheap, but memory access ain't free either:

n = 10
p = 20000
V = np.random.rand(n, p)
F = np.random.rand(n, n)

def op_code(V, F):
    n, p = V.shape
    A = np.zeros(p)
    for i in xrange(n):
        for j in xrange(i+1):
            A += F[i,j] * V[i,:] * V[j,:]
    return A

def isaac_code(V, F):
    n, p = V.shape
    F = F.copy()
    F[np.triu_indices(n, 1)] = 0
    M = (V.reshape(n, 1, p) * V.reshape(1, n, p)) * F.reshape(n, n, 1)
    return M.sum((0, 1))

If you now take both for a test ride:

In [20]: np.allclose(isaac_code(V, F), op_code(V, F))
Out[20]: True

In [21]: %timeit op_code(V, F)
100 loops, best of 3: 3.18 ms per loop

In [22]: %timeit isaac_code(V, F)
10 loops, best of 3: 24.3 ms per loop

So removing the for loops is costing you an 8x slowdown. Not a very good thing... At this point you may even want to consider whether a function taking about 3ms to evaluate requires any further optimization. IN case you do, there's a small improvement which can be had by using np.einsum:

def einsum_code(V, F):
    n, p = V.shape
    F = F.copy()
    F[np.triu_indices(n, 1)] = 0
    return np.einsum('ij,ik,jk->k', F, V, V)

And now:

In [23]: np.allclose(einsum_code(V, F), op_code(V, F))
Out[23]: True

In [24]: %timeit einsum_code(V, F)
100 loops, best of 3: 2.53 ms per loop

So that's roughly a 20% speed up that introduces code that may very well not be as readable as your for loops. I would say not worth it...

answered Nov 21, 2013 at 1:35

Jaime

67.7k19 gold badges128 silver badges164 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Isaac Over a year ago

Did not know about einsum; very neat!

DSM Over a year ago

Yep. One common heuristic for vectorizing code when it's a little troublesome to do the whole thing is to vectorize over the long axes and loop over the short ones, which gets you almost all the benefit for a fraction of the headache. But here, that's what the OP had already done!

user1984528 Over a year ago

Thanks, I ended up going with the einsum method after testing each with a range of inputs (n=5-30 & p=100-20000), I was not aware it was there.

Isaac · Accepted Answer · 2013-11-20 23:22:07Z

7

The difficult part about this is that you only want to take the sum of the elements with j <= i. If not for that then you could do the following:

M = (V.reshape(n, 1, p) * V.reshape(1, n, p)) * F.reshape(n, n, 1)
A = M.sum(0).sum(0)

If F is symmetric (if F[i,j] == F[j,i]) then you can exploit the symmetry of M above as follows:

D = M[range(n), range(n)].sum(0)
A = (M.sum(0).sum(0) - D) / 2.0 + D

That said, this is really not a great candidate for vectorization, as you have n << p and so your for-loops are not going to have much effect on the speed of this computation.

Edit: As Bill said below, you can just make sure that the elements of F that you don't want to use are set to zero first, and then the M.sum(0).sum(0) result will be what you want.

edited Nov 20, 2013 at 23:22

answered Nov 20, 2013 at 23:11

Isaac

3,6161 gold badge20 silver badges22 bronze badges

3 Comments

wflynny Over a year ago

What about F[np.triu_indices(n, 1)] = 0? That way the upper half of F (offset by 1) is zero and will not contribute to the sum.

askewchan Over a year ago

Also, a.sum(0).sum(0) can be written as np.sum(a, (0,1)) or a.sum((0,1)) which isn't any faster but I find it more readable.

Isaac Over a year ago

As I said, this code is really not a good candidate for vectorization.

flonk · Accepted Answer · 2013-11-21 13:22:29Z

1

The expression can be written as

$A_p = \sum_{ij}F_{ij}V_{ip}V_{jp}$

and thus you can sum it like this using the np.newaxis-construct:

na = np.newaxis
X = (np.tri(n)*F)[:,:,na]*V[:,na,:]*V[na,:,:]
X.sum(axis=1).sum(axis=0)

Here a 3D-array X[i,j,p] is constructed, and then the 2 first axes are summed, which results in a 1D array A[p]. Additionaly F was multiplied with a triangular matrix to restrict the summation according to the problem.

edited Nov 21, 2013 at 13:22

answered Nov 21, 2013 at 12:46

flonk

3,9163 gold badges27 silver badges41 bronze badges

Collectives™ on Stack Overflow

numpy vectorization of double python for loop

3 Answers 3

3 Comments

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related