
I am trying to do a quite memory-intensive multiplication and I keep filling up my RAM. The idea is that I have a 2D Gaussian centered at (0,0) and another 2D Gaussian whose distance from (0,0) changes in time. For each time step I need to compute the product of the two Gaussians on a specific grid and sum over all indices, i.e. compute SUM_ij{g1_ij * g2_ij} at each time step, so that I end up with a 1D array of the same length as the time axis.

The code here is just pseudo-code. The problem is the creation of a 1001x1001x25000 array via xx[:,:,np.newaxis], which is huge:

import numpy as np
import numexpr as ne

def gaussian2d(x, y, x0, y0, x_std, y_std):
    return np.exp(-(x-x0)**2/x_std**2 - (y-y0)**2/y_std**2)

x = np.linspace(-5,5,1001)
y = np.linspace(-5,5,1001)

xx, yy = np.meshgrid(x,y)

g1 = gaussian2d(xx, yy, 0, 0, 0.25, 0.25)

x0 = np.random.rand(25000)
y0 = np.random.rand(25000)

X = np.subtract(xx[:,:,np.newaxis], x0)
Y = np.subtract(yy[:,:,np.newaxis], y0)

X_std = 0.75
Y_std = 0.75

temp = ne.evaluate('exp(-(X)**2/(2*X_std**2)-(Y)**2/(2*Y_std**2))')

final = np.sum(np.multiply(temp.T, g1), axis=(1,2))

A very slow alternative would be to just loop along the x0 length, but in the future x0 may be as long as 100000 points. The other solution would be to reduce the grid, but in that case I would lose resolution, and if the fixed function is not a Gaussian but something different, it may affect the calculations.

Any suggestion?

  • Are you looking to do a 2D convolution? Commented Jun 27, 2022 at 19:05
  • Also, did you mean gaussian2d in g1 = gauss2d(xx, yy, 0, 0, 0.25, 0.25)? Commented Jun 27, 2022 at 19:09
  • Did you compute the amount of RAM needed? 2 * 1001 * 1001 * 25000 * 8 = 373 GB (of float64) to store X,Y. Commented Jun 27, 2022 at 19:24
  • @MadPhysicist Basically yes, I need to do a 2D convolution; the problem is computing the second array, which is a Gaussian varying with time. And yes, I meant gaussian2d. Commented Jun 27, 2022 at 19:45
  • Yes I use append Commented Jun 27, 2022 at 20:36

1 Answer


There is no need for such HUGE arrays. xx and yy, as well as X and Y, contain the same line/column repeated 1001 times, which is a huge waste of memory! RAM is a scarce resource (in both throughput and space), so you should avoid operating on very large arrays and instead work on data that fits in the CPU caches, which are far faster, or even in CPU registers. You can rewrite the code using loops and a JIT compiler like Numba (or a static compiler like Cython) so it runs efficiently with none of the big arrays. In fact, thinking in terms of loops helps optimize the operation further even in pure NumPy (see later), so Numba/Cython is not strictly needed. Here is a naive implementation:

import numpy as np
import numba as nb

# Note: x and y are passed explicitly (the original snippet read them as globals,
# which does not work with an explicit Numba signature).
@nb.njit('(f8[:], f8[:], f8[:], f8[:], i8, i8, f8[:,::1])', parallel=True)
def compute(x, y, x0, y0, N, T, g1):
    X_std = 0.75
    Y_std = 0.75

    final = np.empty(T)
    for i in nb.prange(T):
        s = 0.0
        for k in range(N):
            for j in range(N):
                X = x[k] - x0[i]
                Y = y[j] - y0[i]
                temp = np.exp(-(X)**2/(2*X_std**2)-(Y)**2/(2*Y_std**2))
                s += g1[k, j] * temp
        final[i] = s
    return final

N = 1001
T = 25000
# [...] (same code as in the question without the big temporary arrays)
final = compute(x, y, x0, y0, N, T, g1)

This code is much faster and uses only a tiny amount of RAM compared to the initial code, which could not even run on a regular PC. The same strategy can be used to compute g1, so as not to create xx and yy at all.
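For instance, g1 can be built directly from the 1-D axes with broadcasting, never materializing the meshgrid arrays (a sketch using the question's parameters; the helper name gaussian2d_1d is just illustrative):

```python
import numpy as np

def gaussian2d_1d(x, y, x0, y0, x_std, y_std):
    # Broadcasting an (N,) row against an (N, 1) column yields the full
    # (N, N) grid directly, so xx and yy are never allocated.
    return np.exp(-(x[None, :] - x0)**2 / x_std**2
                  - (y[:, None] - y0)**2 / y_std**2)

x = np.linspace(-5, 5, 1001)
y = np.linspace(-5, 5, 1001)
g1 = gaussian2d_1d(x, y, 0, 0, 0.25, 0.25)  # same values as the meshgrid version
```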

Actually, the above code is not even optimized. One can split the exponential into two parts so as to pre-compute partial results using only basic math. The computation can then be factorized to reduce the number of mathematical operations even more. Here is a better implementation:

@nb.njit('(f8[:], f8[:], f8[:], f8[:], i8, i8, f8[:,::1])', parallel=True)
def compute_clever(x, y, x0, y0, N, T, g1):
    X_std = 0.75
    Y_std = 0.75

    final = np.empty(T)
    exp1 = np.empty((T, N))
    exp2 = np.empty((T, N))
    # Pre-compute the separable x- and y-parts of the exponential
    for i in nb.prange(T):
        for k in range(N):
            X = x[k] - x0[i]
            exp1[i, k] = np.exp(-(X)**2/(2*X_std**2))
    for i in nb.prange(T):
        for j in range(N):
            Y = y[j] - y0[i]
            exp2[i, j] = np.exp(-(Y)**2/(2*Y_std**2))
    # Factorized reduction: the x-part is pulled out of the inner loop
    for i in nb.prange(T):
        s = 0.0
        for k in range(N):
            s2 = 0.0
            for j in range(N):
                s2 += g1[k, j] * exp2[i, j]
            s += s2 * exp1[i, k]
        final[i] = s
    return final

Here are results with N=1001 and T=250 on my 6-core machine:

Naive Numpy:      2380  ms    (use about 4 GiB of RAM)
compute:           374  ms    (use few MiB of RAM)
compute_clever:     55  ms    (use few MiB of RAM)

Note that the code can be further optimized using register blocking, though it will make the code more complex. Also note that the last kernel can certainly be computed efficiently using np.einsum. exp1 and exp2 can also be computed using basic NumPy operations (though it will be a bit less efficient). Thus, you could even solve this in pure NumPy.
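The einsum route can be sketched in pure NumPy as follows (compute_numpy is a hypothetical name; it exploits the separability of the Gaussian, exp(-(X²+Y²)) = exp(-X²)·exp(-Y²), and uses the same g1[k, j] indexing convention as compute_clever):

```python
import numpy as np

def compute_numpy(x, y, x0, y0, g1, x_std=0.75, y_std=0.75):
    # (T, N) arrays: one row per time step, one column per grid point
    exp1 = np.exp(-(x0[:, None] - x[None, :])**2 / (2 * x_std**2))  # x-part
    exp2 = np.exp(-(y0[:, None] - y[None, :])**2 / (2 * y_std**2))  # y-part
    # final[i] = sum_{k,j} g1[k, j] * exp1[i, k] * exp2[i, j]
    return np.einsum('kj,ik,ij->i', g1, exp1, exp2, optimize=True)
```

This keeps only (T, N) temporaries instead of (N, N, T), and the contraction is delegated to einsum's optimized schedule.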


5 Comments

The problem is that for my specific case I cannot use Numba or Cython. I'm writing a series of functions for a specific package that already uses mpi4py (the pseudo-code here is in a for loop parallelized with MPI, so I can't add other parallel code). My bad that I had not specified it.
So this actually means that the code you provided is not representative of the actual code? This is critical, since the solution can change drastically depending on such "details", and without them the question can hardly be answered. Anyway, AFAIK Numba can be used with mpi4py, and you can just remove parallel=True and nb.prange to make it sequential. Even if Numba is still a problem, the end of the answer points out an einsum-based solution that can be written directly from the last function. By the way, AFAIK NumExpr also runs code in parallel, so the problem was already present...
The use of numexpr is due to the fact that it is significantly faster than np.exp and it does not seem to affect mpi4py. Following your suggestion, I wrote the code this way: def method2(g1, x, y, x0, y0): v = np.exp(-(x[:,np.newaxis]-x0)**2); n = np.exp(-(y[:,np.newaxis]-y0)**2); return np.einsum('ij,kj,ki->k', g1, v.T, n.T, optimize='optimal'). This does not use a lot of RAM (I can run it on my laptop with N=1001 and T=25000) and it takes around 30 s.
This is a relatively good implementation. By the way, note that einsum with an optimal schedule also uses multiple cores, exactly like the Numba code does (AFAIK it uses a BLAS implementation internally, like OpenBLAS or the MKL, which uses multiple threads). You need to configure the threading layer of the BLAS implementation to avoid inefficient over-subscription.
I think I will remove the optimal flag. It does not change the path with respect to greedy when looking at np.einsum_path.
