I am working on code for a model based on some Fourier transforms. Currently I am trying to optimize a part of it so that it is usable with large amounts of data. While doing that, I found some strange behavior: a loop version of my code turns out to be faster than the same code written with numpy. The testing code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import timeit
def fourier_coef_loop(ts, times, k_max):
    coefs = np.zeros(k_max, dtype=float)
    t = 2.0 * np.pi * (times - times[0]) / times[-1]
    x = np.dot(np.arange(1, k_max + 1)[np.newaxis].T, t[np.newaxis])
    for k in xrange(1, k_max + 1):
        cos_k = np.cos(x[k - 1])
        coefs[k - 1] = (ts[-1] - ts[0]) + (ts[:-1] * (cos_k[:-1] - cos_k[1:])).sum()
    return coefs

def fourier_coef_np(ts, times, k_max):
    coefs = np.zeros(k_max, dtype=float)
    t = 2.0 * np.pi * (times - times[0]) / times[-1]
    x = np.dot(np.arange(1, k_max + 1)[np.newaxis].T, t[np.newaxis])
    coefs = np.add(np.einsum('ij,j->i', np.diff(np.cos(x)), -ts[:-1]), (ts[-1] - ts[0]))
    return coefs

if __name__ == '__main__':
    iterations = 10
    size = 20000
    setup = "from __main__ import fourier_coef_loop, fourier_coef_np, size\n" \
            "import numpy as np"
    # arg = np.random.normal(size=size)
    # print(np.all(fourier_coef_np(arg, np.arange(size,dtype=float), size / 2) == fourier_coef_loop(arg, np.arange(size,dtype=float), size / 2)))
    time_loop = timeit.timeit("fourier_coef_loop(np.random.normal(size=size), np.arange(size,dtype=float), size / 2)",
                              setup=setup, number=iterations)
    print("With loop: {} s".format(time_loop))
    time_np = timeit.timeit("fourier_coef_np(np.random.normal(size=size), np.arange(size,dtype=float), size / 2)",
                            setup=setup, number=iterations)
    print("With numpy: {} s".format(time_np))
It gives the following results:
With loop: 60.8385488987 s
With numpy: 64.9192998409 s
Can someone please tell me why the loop version is faster than the purely numpy version? I have totally run out of ideas. I would also appreciate any suggestions on how to make this particular function faster.
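For completeness, here is a rough sketch of how the commented-out correctness check could be done with a tolerant float comparison instead of exact equality, reusing the functions and size defined above (the two versions sum in a different order, so == will generally not match exactly):

arg = np.random.normal(size=size)
ref = fourier_coef_loop(arg, np.arange(size, dtype=float), size // 2)
vec = fourier_coef_np(arg, np.arange(size, dtype=float), size // 2)
# np.allclose tolerates the rounding differences caused by the different summation order
print(np.allclose(ref, vec))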
einsum('ij,j->i', ...) sounds a lot like a matrix-vector product (-> np.dot again), and whatever x = np.dot(np.arange(1,k_max+1)[np.newaxis].T, t[np.newaxis]) is supposed to do, I'm sure there's a cleaner way to do it.

With a larger size I get a memory error. For size=2000, timings are also similar. For 200 the np version has a substantial edge. So my guess is that for larger sizes memory management issues are chewing into the numpy times.
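For reference, a rough sketch of the equivalences the comments hint at, reusing the variable names from the question (this only illustrates the identities, it is not a benchmarked replacement):

import numpy as np

size = 2000                              # assumed test size
ts = np.random.normal(size=size)
times = np.arange(size, dtype=float)
k_max = size // 2

t = 2.0 * np.pi * (times - times[0]) / times[-1]
k = np.arange(1, k_max + 1, dtype=float)

# the column-vector/row-vector np.dot is just an outer product, i.e. broadcasting
x = k[:, np.newaxis] * t[np.newaxis, :]  # shape (k_max, size)

# einsum('ij,j->i', A, v) is a plain matrix-vector product, i.e. A.dot(v)
coefs = np.diff(np.cos(x)).dot(-ts[:-1]) + (ts[-1] - ts[0])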