Loops in Python are slow, but so is adding 1e8 elements in numpy.
In short, I don't think the Python loop is going to be your bottleneck, but the intermediate matrix is about 400 GB, so you'll certainly need to break the problem up somehow. As you have it set up, you're doing 100,000,000 sums in numpy and only 100 iterations of the Python loop, and although Python is relatively slow, it's not that slow.
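(For the back-of-envelope: a full float64 intermediate of shape (len(a), len(b)) costs len(a) * len(b) * 8 bytes. The sizes below are hypothetical, chosen only to reproduce the ~400 GB figure; plug in your real lengths.)

# rough memory estimate for the full intermediate matrix
n_a, n_b = 200_000, 250_000        # placeholder sizes, not from the question
print(n_a * n_b * 8 / 1e9, "GB")   # 8 bytes per float64 element -> 400.0 GB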
Overall, for large calculations like this, you need a sense of the relative speeds of the pieces before you can optimize effectively. To get that sense, I ran the following tests.
import numpy as np

a = np.arange(1, int(1e4))
b = np.arange(1, int(1e4))  # b isn't defined in this answer; assuming it matches a in size so the snippets below run
x = np.arange(10)
%%timeit
# one row's worth of work: an element-wise divide plus a ~10k-element sum
(1./(a+a)).sum()
# 20.2 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
# cost of the bare Python loop itself
for i in x:
    pass
# 1.07 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Note this last one was for len(x)=10, so it's more like 0.1 µs per iteration.
That is, one pass through the loop costs about 200x less than one of those sums, even if you did the sum row-by-row of the matrix (that is, looping through a). Going row-by-row would probably be easier to write too: you avoid the fiddly splitting into chunks, and the speed gain from using larger chunks probably won't be noticeable.
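(Spelling out that arithmetic:)

loop_us = 1.07 / 10      # the 10-iteration loop took 1.07 µs -> ~0.107 µs per pass
sum_us = 20.2            # one ~10k-element sum from the first test
print(sum_us / loop_us)  # ~189, i.e. roughly 200x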
I guess it's not hard to prove this:
Here's your original:
%%timeit
a_split = np.array_split(a, 100)
c_split = []
for array in a_split:
    intermediate = 1/(array[:,None] + b)
    c_split.append(intermediate.sum(-1))
c = np.concatenate(c_split)
# 321 ms ± 7.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And looping through a, item by item (but preserving your exact same structure):
%%timeit
a_split = np.array_split(a, len(a))
c_split = []
for array in a_split:
    intermediate = 1/(array[:,None] + b)
    c_split.append(intermediate.sum(-1))
c = np.concatenate(c_split)
# 252 ms ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it's actually faster to loop through a, and if you had written it from scratch this way, it would have been much easier too. Actually, I'll do that as well, just out of interest:
%%timeit
c = []
for x in a:
    intermediate = 1/(x + b)
    c.append(intermediate.sum())
c = np.array(c)
# 224 ms ± 6.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So... the item-by-item version is actually the fastest of all, though these time differences are too small to worry about; the real conclusion is that the chunking adds complexity without a meaningful difference in speed.
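For reference, every version above is chunking up the same fully vectorized one-liner, which is fine at this test size but is exactly what runs out of memory at your real sizes:

# unchunked equivalent: materializes the whole (len(a), len(b)) intermediate at once
c = (1/(a[:,None] + b)).sum(-1)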
The main point is: make sure your inner loop is in numpy (here, the sum over the ~10000 elements; I won't show the timing, but it's about 10000x faster in numpy than in pure Python). Beyond that, the extra optimization effort often isn't worth it.
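(If you want to check that figure yourself, the pure-Python version of the 20 µs sum above would look something like this; I'll leave the timing to you:)

%%timeit
# same sum as (1./(a+a)).sum(), but every add, divide, and accumulate
# happens in the Python interpreter instead of inside numpy
s = 0.0
for ai in a:
    s += 1.0/(ai + ai)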