
I am writing a simulation of a wireless network in Python using NumPy and Cython. Suppose there are no_nodes nodes scattered randomly on the 2D plane which send out some waveforms, and their respective receivers, again scattered randomly on the 2D plane. Each transmitting node produces a waveform I call output (each one can produce an output of a different length).

What I want to do is sum these outputs from each node into one big waveform that will be the input to each receiver for demodulation etc. Two key points:

  • the transmitters send asynchronously, so a start_clock and an end_clock have to be maintained per transmitting node in order to sum the waveforms properly
  • the output of transmitting node j is attenuated before being received by node i according to a function attenuate(i, j)
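Per the comments below, attenuate(i, j) is just a lookup into a distance-based matrix, so it can be precomputed once up front instead of being called per pair inside the loop. A minimal sketch, assuming power-law path loss (the positions, attn_const, and attn_expon are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes = 10

# hypothetical transmitter / receiver positions on the 2D plane
tx_pos = rng.random((no_nodes, 2))
rx_pos = rng.random((no_nodes, 2))

# pairwise distances: distances[i, j] = |rx_i - tx_j|
distances = np.linalg.norm(rx_pos[:, None, :] - tx_pos[None, :, :], axis=-1)

# power-law attenuation matrix; attn_const / attn_expon are assumed constants
attn_const, attn_expon = 1.0, 2.0
attenuation = (attn_const / distances) ** attn_expon  # shape (no_nodes, no_nodes)
```

The loop body can then read attenuation[i, j] instead of calling a function.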

So here is the code:

# create empty 2d array (no_nodes x no_samples for each waveform)
def received_waveforms(output, start_clock, end_clock, no_nodes):
    waveforms = np.zeros((no_nodes, max(end_clock)))
    for i in range(no_nodes):      # calculate the waveform for each receiver
        for j in range(no_nodes):  # sum the waveforms produced by each transmitter
            waveforms[i, start_clock[j]:end_clock[j]] += output[j, :] * attenuate(i, j)
    return waveforms

Some comments on the above:

  • output[j, :] is the output waveform of transmitter j
  • waveforms[i,:] is the waveform received by receiver i
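For reference, a self-contained toy version of the loop above, with made-up clocks and outputs, and with attenuate(i, j) replaced by a precomputed matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes = 4
out_len = 100  # hypothetical per-transmitter waveform length

# each transmitter starts at a random time and runs for out_len samples
start_clock = rng.integers(0, 50, size=no_nodes)
end_clock = start_clock + out_len
output = rng.random((no_nodes, out_len))
attenuation = rng.random((no_nodes, no_nodes))  # stands in for attenuate(i, j)

waveforms = np.zeros((no_nodes, end_clock.max()))
for i in range(no_nodes):      # waveform arriving at receiver i
    for j in range(no_nodes):  # contribution of transmitter j, shifted in time
        waveforms[i, start_clock[j]:end_clock[j]] += output[j] * attenuation[i, j]
```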

I hope it is fairly clear what I am trying to accomplish here. Because the waveforms produced are very large (about 10^6 samples), I also tried turning this code into Cython, but without any dramatic speedup (maybe 5-10x, but no more). I was wondering if there is anything else I can resort to to get a speedup, because it is a real bottleneck for the whole simulation (it takes almost as long to compute as the rest of the code, which is in fact quite a bit more complicated than this).

  • The code seems fairly efficient as it is - unless no_nodes is large. Are you sure the bottleneck isn't in attenuate? Commented May 9, 2014 at 13:19
  • no_nodes is about 10-20 at most. attenuate is just a simple multiplication drawn from a distance matrix Commented May 9, 2014 at 13:20
  • @user113478 so you can just compute the matrix attenuate = (attn_const/distances) ** attn_expon, which will yield an array the same size as distances, right? Commented May 9, 2014 at 14:32
  • oh, post that (in the original question) too. I mean, show as much code as you can. Your operation is linear so you could always reorder if that makes sense from an algorithmic perspective. Is your code to compute at a point the sum of all the Green functions? If so, where did you get the algorithm from? In the linear case (which it seems you have) my understanding is this is best solved in the Fourier domain. Commented May 9, 2014 at 14:59
  • @user113478 right, well it's fairly typical to perform convolutions in the Fourier domain. The problem is, you're asking for help on some small aspect of your algorithm, when I suspect the best strategy is to redesign the whole algorithm. My advice is go away and think hard about the algorithm. Commented May 9, 2014 at 15:11

3 Answers


This is a memory-bandwidth-bound problem. With about 3 GB/s of memory bandwidth, the best you can get out of this is around 2-4 ms for the inner loop. To reach that bound you need to block your inner loop so the operands stay in the CPU caches (numexpr does this for you):

# block size: chosen so all operands fit in the (next-to-)last-level cache;
# the first level is normally too small to be usable due to Python overhead
s = 15000
for i in range(no_nodes):
    for j in range(no_nodes):
        a = attenuation[i, j]
        o = output[j]
        w = waveforms[i]
        for k in range(0, w.size, s):
            u = min(k + s, w.size)
            w[k:u] += o[k:u] * a
        # or, equivalently: numexpr.evaluate("w + o * a", out=w)

Using float32 data instead of float64 should also halve the memory bandwidth requirements and roughly double the performance.
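The float32 saving is easy to verify directly (array sizes here are arbitrary):

```python
import numpy as np

# same logical array in both precisions
w64 = np.zeros((20, 1_000_000))                     # NumPy default: float64
w32 = np.zeros((20, 1_000_000), dtype=np.float32)

assert w32.nbytes * 2 == w64.nbytes  # half the bytes to stream per pass
```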

To get larger speedups you have to redesign your full algorithm to have better data locality.
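Filling in the arrays the snippet above assumes (sizes made up), and checking the blocked loop against the naive version:

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes, n_samples, s = 8, 200_000, 15_000

output = rng.random((no_nodes, n_samples))
attenuation = rng.random((no_nodes, no_nodes))

# naive: one full-length temporary per (i, j) pair
naive = np.zeros_like(output)
for i in range(no_nodes):
    for j in range(no_nodes):
        naive[i] += output[j] * attenuation[i, j]

# blocked: work on cache-sized chunks so operands stay in cache
blocked = np.zeros_like(output)
for i in range(no_nodes):
    for j in range(no_nodes):
        a, o, w = attenuation[i, j], output[j], blocked[i]
        for k in range(0, w.size, s):
            u = min(k + s, w.size)
            w[k:u] += o[k:u] * a

assert np.allclose(naive, blocked)
```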


4 Comments

Great workaround, thank you so much! Could you show roughly what the numexpr version would look like? I have it installed but haven't tried it much yet!
should be just numexpr.evaluate("w + o * a", out=w) instead of the loop
Do you know whether numexpr expressions can be run in cython?
You can call any Python function from Cython, and that includes numexpr. But if you already cythonize the inner loop you don't need numexpr anymore (unless you profit from numexpr's parallelization or VML support)

I think it is memory access that's crippling your performance, and there is probably little you can do about that. You can speed things up a little by using in-place operations and by preallocating and reusing buffers. As a toy example for your code, without the time alignment:

def no_buffer(output, attenuate):
    waveforms = np.zeros_like(output)
    for i in range(len(output)):
        for j in range(len(output)):
            waveforms[i,:] += output[j, :] * attenuate[i, j]

    return waveforms

def with_buffer(output, attenuate):
    waveforms = np.zeros_like(output)
    buffer_arr = np.empty_like(output[0])
    for i in range(len(output)):
        for j in range(len(output)):
            np.multiply(output[j, :], attenuate[i, j], out=buffer_arr)
            np.add(waveforms[i, :], buffer_arr, out=waveforms[i, :])

    return waveforms

o = np.random.rand(20, 10**6)
a = np.random.rand(20, 20)

In [17]: np.allclose(no_buffer(o, a), with_buffer(o, a))
Out[17]: True

In [18]: %timeit no_buffer(o, a)
1 loops, best of 3: 2.3 s per loop

In [19]: %timeit with_buffer(o, a)
1 loops, best of 3: 1.57 s per loop

Which I guess is better than nothing.

Of course, if you can get rid of the time alignment thing, your operation is simply a matrix multiplication, and it is best to let BLAS deal with that. On my system, using MKL:

In [21]: np.allclose(with_buffer(o, a), np.dot(o.T, a.T).T)
Out[21]: True

In [22]: %timeit np.dot(o.T, a.T).T
10 loops, best of 3: 123 ms per loop
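Worth noting: np.dot(o.T, a.T).T is the same contraction as the plain matrix product a @ o (or an einsum), which avoids the transposes entirely; a quick check with small stand-in arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
o = rng.random((20, 1000))  # smaller stand-ins for the arrays above
a = rng.random((20, 20))

# waveforms[i] = sum_j a[i, j] * o[j]  is a plain matrix product
w1 = np.dot(o.T, a.T).T
w2 = a @ o                          # equivalent, no transposes needed
w3 = np.einsum('ij,jk->ik', a, o)   # same contraction spelled out

assert np.allclose(w1, w2) and np.allclose(w1, w3)
```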

4 Comments

If it's really about size of the arrays and you can't use dot, it might be worthwhile looking into numexpr.
@Midnighter You are likely better off writing a few simple loops where you complete your dot product in chunks.
@Jaime, I thought I knew linear algebra, well I have changed my mind after that!! Great answer, thank you so much!
@Jaime, what if I need to convolve output[j,:] with attenuate[i,j,:], where attenuate would be a 3D array (no_nodes x no_nodes x no_samples_for_channel_model)? (Suppose attenuate is a channel model representing the channel that the signal from transmitter j goes through to reach receiver i.) Could you propose a workaround for that as well? Could that somehow be handled using np.dot?
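Regarding the convolution question in the last comment, one possible direction (a sketch under the assumption that the channel model is a short FIR kernel per transmitter-receiver pair, not anything from the answer above) is to batch all the convolutions in the Fourier domain:

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes, n_samples, k_len = 4, 5000, 64  # made-up sizes

output = rng.random((no_nodes, n_samples))
channel = rng.random((no_nodes, no_nodes, k_len))  # channel[i, j]: tx j -> rx i

n_out = n_samples + k_len - 1  # length of a full linear convolution

# batch every (i, j) convolution in the frequency domain, then sum over j
O = np.fft.rfft(output, n_out)              # (no_nodes, n_out//2 + 1)
H = np.fft.rfft(channel, n_out, axis=-1)    # (no_nodes, no_nodes, n_out//2 + 1)
waveforms = np.fft.irfft((H * O[None, :, :]).sum(axis=1), n_out)

# cross-check one receiver against direct np.convolve
ref = sum(np.convolve(output[j], channel[0, j]) for j in range(no_nodes))
assert np.allclose(waveforms[0], ref)
```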

Just for the sake of experimentation, suppose the output from each transmitter is time aligned, so that no clocks are required. I came up with a version that uses broadcasting heavily and thus eliminates the for loops completely. However, it is 3x slower. Here is the code I wrote:

import numpy as np
import time

def calc(no_nodes):
    output = np.random.rand(no_nodes, 700000)  # some random data, 7e5 samples here
    attenuate = np.random.rand(no_nodes, no_nodes)  # some random data
    start_time = time.time()
    output_per_node = np.zeros((no_nodes, no_nodes, 700000))
    output_per_node += output[None, :, :]
    data = attenuate[:, :, None] * output_per_node
    waveforms = np.sum(data, axis=1)
    end_time = time.time()
    print(end_time - start_time)
    return waveforms

The regular nested for-loop implementation would be:

def calc1(no_nodes):
    output = np.random.rand(no_nodes, 700000)
    attenuation = np.random.rand(no_nodes, no_nodes)
    waveforms = np.zeros((no_nodes, 700000))
    start_time = time.time()
    for i in range(no_nodes):
        for j in range(no_nodes):
            waveforms[i] += output[j] * attenuation[i, j]
    print(time.time() - start_time)
    return waveforms

So what is really going on here? NumPy is supposed to be blazingly fast, and I don't say it generally isn't, but in this particular example something is going wrong. Even if you convert both of these versions to Cython, the second one (with for loops) is much faster than the one with broadcasting. What do you believe I am doing wrong here? Note: try with no_nodes=10
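The broadcast version above materializes a (no_nodes, no_nodes, 7e5) intermediate, which is where the time goes. Since the time-aligned case is exactly the matrix product attenuation @ output, the same result can be had without any large temporary; a small-scale check (sizes reduced to keep the demo quick):

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes, n_samples = 10, 70_000

output = rng.random((no_nodes, n_samples))
attenuation = rng.random((no_nodes, no_nodes))

# broadcast version: builds a (no_nodes, no_nodes, n_samples) temporary
bcast = (attenuation[:, :, None] * output[None, :, :]).sum(axis=1)

# same contraction as a matrix product: no big temporary, BLAS-backed
matmul = attenuation @ output

assert np.allclose(bcast, matmul)
```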

For anyone interested, you can find the IPython notebook with the above code displaying the difference in performance, in both ipynb and html form, here:

Any feedback would be greatly appreciated.

4 Comments

I liked your approach. Did you check the equality with np.allclose(), for example, to see if one method gives the same result as the other? You should post this very last question as a new question, it will create a good discussion...
@SaulloCastro, yes they return exactly the same, in fact I will post an ipython notebook for anyone interested in a couple of minutes
In your solution using broadcasting you are creating an array 20 times larger than any array used in the other solution. I think that's where the slowdown is coming from. If you can really do without the time alignment, look carefully into your code: you have reimplemented matrix multiplication, and np.dot can do that many times faster than you...
@Jaime, could you elaborate on that? How would that be done using np.dot?
