
I have two vectors (1D arrays), a and b, and I want to compute c such that c[i] = sum_j 1/(a[i] + b[j]). My first guess was:

import numpy as np

a = np.arange(1,int(1e4))
b = np.arange(1, int(1e4))

intermediate = 1/(a[:,None] + b)
c = intermediate.sum(-1)

Sadly, a is quite big, so creating the intermediate array raises a memory error. The only solution I could come up with was to split a and use a for loop:

a_split = np.array_split(a, 100)
c_split = []
for array in a_split :
    intermediate = 1/(array[:,None] + b)
    c_split.append(intermediate.sum(-1))
c = np.concatenate(c_split)

But I've been told not to use for loops in Python (this step will be repeated many times, so performance is an issue), and it doesn't feel very clean to work around numpy like that. Is there a more elegant or more efficient way to avoid this huge intermediate array?


1 Answer


Loops in Python are slow, but so is adding 1e8 elements in numpy.

In short, I don't think the Python loop is going to be your bottleneck, but the full intermediate matrix really is too big to keep around (even at the 1e4 × 1e4 size shown it's roughly 0.8 GB of float64 values, and it grows as len(a) * len(b)), so you'll certainly need to do something. As you have it set up, you're doing about 100,000,000 element-wise operations in numpy and only 100 iterations of a Python loop, and although Python is relatively slow, it's not that slow.
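
As a quick sanity check on that memory figure (a rough estimate, assuming a float64 result, i.e. 8 bytes per element):

n = m = int(1e4)          # sizes of a and b in the example above
bytes_needed = n * m * 8  # float64: 8 bytes per element
print(bytes_needed / 1e9, "GB")
# 0.8 GB -- and it scales as len(a) * len(b), so bigger inputs blow up fast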

Overall, when doing these large calculations, you need a sense of what the various operations cost before you can optimize well.

To get a sense of it, I ran the following tests.

a = np.arange(1,int(1e4))
x = np.arange(10)

%%timeit
(1./(a+a)).sum()
# 20.2 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
for i in x:
    pass
# 1.07 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note this last one was for len(x)=10, so it's more like 0.1µs / loop.

That is, the per-iteration cost of the Python loop is about 200x smaller than the cost of summing one row, even if you went through the matrix line by line (that is, looped over a one element at a time). Line by line would probably be easier to write too, so you wouldn't need the complicated splitting into chunks, and the speed gain from using larger chunks would probably not be noticeable.
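
To put rough numbers on that (a back-of-envelope estimate using the timings above; the exact values depend on your machine):

n = int(1e4)                 # number of rows, i.e. len(a)
loop_overhead = n * 0.1e-6   # ~0.1 µs of pure-Python loop overhead per iteration
numpy_work = n * 20e-6       # ~20 µs to form and sum one 1e4-element row in numpy
print(f"loop overhead ≈ {loop_overhead*1e3:.1f} ms, numpy work ≈ {numpy_work*1e3:.0f} ms")
# loop overhead ≈ 1.0 ms, numpy work ≈ 200 ms

So even 10,000 trips around a Python loop cost about 1 ms, which is noise next to the ~200 ms of actual numpy work.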

I guess it's not hard to prove this:

Here's your original:

%%timeit
a_split = np.array_split(a, 100)
c_split = []
for array in a_split :
    intermediate = 1/(array[:,None] + b)
    c_split.append(intermediate.sum(-1))
c = np.concatenate(c_split)
# 321 ms ± 7.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And looping through a, item by item (but preserving your exact same structure):

%%timeit
a_split = np.array_split(a, len(a))
c_split = []
for array in a_split :
    intermediate = 1/(array[:,None] + b)
    c_split.append(intermediate.sum(-1))
c = np.concatenate(c_split)
# 252 ms ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it's actually faster to loop through a, and if you had written it from scratch this way, it would be much easier too. In fact, let's try that as well, just out of interest:

%%timeit
c = []
for x in a :
    intermediate = 1/(x + b)
    c.append(intermediate.sum())
c = np.array(c)
# 224 ms ± 6.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So... that's the fastest of all (though these time differences are usually too small to worry about; the real conclusion is that the chunking adds complexity without any meaningful gain in speed).

The main point is: make sure your inner loop runs inside numpy (here, forming and summing the 10,000 elements of each row; I won't show the timings, but doing that part in pure Python is orders of magnitude slower). Beyond that, the extra optimization effort is not always worth it.
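
For reference, a minimal sketch of that inner-loop comparison (the exact ratio is machine-dependent, so time it yourself with %timeit):

import numpy as np

b = np.arange(1, int(1e4))
x = 42.0

def row_sum_python(x, b):
    # Pure-Python inner loop: one boxed scalar operation per element.
    total = 0.0
    for y in b:
        total += 1.0 / (x + y)
    return total

def row_sum_numpy(x, b):
    # The same work, but vectorized inside numpy.
    return (1.0 / (x + b)).sum()

# In IPython:
#   %timeit row_sum_python(x, b)
#   %timeit row_sum_numpy(x, b)
assert np.isclose(row_sum_python(x, b), row_sum_numpy(x, b))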
