import numpy as np

N, M = 1000, 4000000
a = np.random.uniform(0, 1, (N, M))
k = np.random.randint(0, N, (N, M))
out = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        out[k[i, j], j] += a[i, j]
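For this particular accumulation pattern (scatter-add into rows selected by `k`), a vectorized sketch using `np.add.at` would look like the following; the small `N`, `M` are just for a quick equivalence check, and the same call works at full scale:

```python
import numpy as np

# Small sizes for illustration only.
N, M = 100, 400
rng = np.random.default_rng(0)
a = rng.uniform(0, 1, (N, M))
k = rng.integers(0, N, (N, M))

# Loop version, for reference.
out_loop = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        out_loop[k[i, j], j] += a[i, j]

# Vectorized version: np.add.at performs unbuffered in-place
# accumulation, so repeated (row, column) index pairs add up
# instead of overwriting each other. The index arrays broadcast:
# k has shape (N, M), np.arange(M) has shape (M,).
out_vec = np.zeros((N, M))
np.add.at(out_vec, (k, np.arange(M)), a)

assert np.allclose(out_loop, out_vec)
```

Note that `out_vec[k, np.arange(M)] += a` would *not* be equivalent: plain fancy-indexed `+=` is buffered, so duplicate indices only receive one contribution, which is exactly why `np.add.at` exists.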
I work with very long for-loops; %%timeit on the above, with the loop body replaced by pass, yields
1min 19s ± 663 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this is unacceptable in context (C++ took 6.5 sec). There's no reason for the above to be done with Python objects; the arrays have well-defined types. Implementing this in C/C++ as an extension is overkill on both the developer and user ends; I'm just passing arrays to loop over and do arithmetic on.
Is there a way to tell Numpy "move this logic to C", or another library that can handle nested loops involving only arrays? I seek it for the general case, not workarounds for this specific example (but if you have one I can open a separate Q&A).
`numba` is a possibility, assuming conversion to vectorized `numpy` operations isn't feasible. `@numba.jit(nopython=True)` would be the first thing I'd think of. Whether it will fully optimize your case, I can't say, but it's worth a shot (it's by far the simplest tweak). I'll note, your code as rendered is not in a function, which will make standard CPython slower (just wrapping it in a function changes every read/write to your variables from a `dict` lookup to a C array indexing operation).

Testing locally (with `M` reduced to 40000), the original code took about 29.1 seconds user time; wrapping it in a function dropped it to 25.5 seconds (a small but meaningful change), and decorating that function with `@numba.jit(nopython=True)` dropped it to 2.5 seconds (though the first time it ran, the wall clock time was ~12.4 seconds, with the second run dropping to 3.6; loading `numba` itself and `jit`ing has some non-trivial startup costs, especially if, as in my case, the library has to be cached from NFS the first time).