import numpy as np

N, M = 1000, 4000000
a = np.random.uniform(0, 1, (N, M))
k = np.random.randint(0, N, (N, M))
out = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        out[k[i, j], j] += a[i, j]
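For this particular accumulation pattern (scatter-add into rows selected by `k`), a vectorized sketch using `np.add.at` would look like the following; the small `N`, `M` are just for a quick equivalence check, and the same call works at full scale:

```python
import numpy as np

# Small sizes for illustration only.
N, M = 100, 400
rng = np.random.default_rng(0)
a = rng.uniform(0, 1, (N, M))
k = rng.integers(0, N, (N, M))

# Loop version, for reference.
out_loop = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        out_loop[k[i, j], j] += a[i, j]

# Vectorized version: np.add.at performs unbuffered in-place
# accumulation, so repeated (row, column) index pairs add up
# instead of overwriting each other. The index arrays broadcast:
# k has shape (N, M), np.arange(M) has shape (M,).
out_vec = np.zeros((N, M))
np.add.at(out_vec, (k, np.arange(M)), a)

assert np.allclose(out_loop, out_vec)
```

Note that `out_vec[k, np.arange(M)] += a` would *not* be equivalent: plain fancy-indexed `+=` is buffered, so duplicate indices only receive one contribution, which is exactly why `np.add.at` exists.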
I work with very long for-loops; %%timeit on the above, with the loop body replaced by pass, yields
1min 19s ± 663 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this is unacceptable in context (C++ took 6.5 sec). There's no reason for the above to be done with Python objects; the arrays have well-defined types. Implementing this in C/C++ as an extension is overkill on both the developer and user ends; I'm just passing arrays to loop over and do arithmetic on.
Is there a way to tell Numpy "move this logic to C", or another library that can handle nested loops involving only arrays? I seek it for the general case, not workarounds for this specific example (but if you have one I can open a separate Q&A).
`numba` is a possibility, assuming conversion to vectorized `numpy` operations isn't feasible. `@numba.jit(nopython=True)` would be the first thing I'd think of. Whether it will fully optimize your case, I can't say, but it's worth a shot (it's by far the simplest tweak). I'll note, your code as rendered is not in a function, which will make standard CPython slower (just wrapping it in a function changes every read/write to your variables from a `dict` lookup to a C array indexing operation).

Testing locally (with `M` reduced to 40000), the original code took about 29.1 seconds user time; wrapping it in a function dropped it to 25.5 seconds (a small but meaningful change), and decorating that function with `@numba.jit(nopython=True)` dropped it to 2.5 seconds (though the first time it ran, the wall clock time was ~12.4 seconds, with the second run dropping to 3.6; loading `numba` itself and `jit`ing has some non-trivial startup costs, especially if, as in my case, the library has to be cached from NFS the first time).