The vectorized solution is to swap the last two axes of arr1 and let the `@` operator broadcast over the leading axis:
>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626, 0.38299107, 0.40600482, 0.35990016],
[-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
[ 3.54989101, -0.31274318, 0.14188485, 0.19860495],
[ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],
[[ 0.67081341, -0.08816343, 0.35430089, 0.69962394],
[ 0.0316968 , 0.15129449, -0.51592291, 0.07118177],
[-0.22274906, -0.28955683, -1.78905988, 1.1486345 ],
[ 1.68432706, 1.93915798, 2.25785798, -2.34404577]]])
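To confirm the two approaches agree, here is a quick sanity check (not part of the original answer) comparing the batched product against an explicit per-slice loop:

```python
import numpy as np

N, M, D = 2, 3, 4
rng = np.random.default_rng(0)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Batched product: for each n, compute arr1[n].T @ arr2[n] in one call
vectorized = arr1.transpose(0, 2, 1) @ arr2

# Reference: explicit Python loop over the leading axis
looped = np.stack([arr1[i].T @ arr2[i] for i in range(N)])

print(np.allclose(vectorized, looped))  # → True
print(vectorized.shape)                 # → (2, 4, 4)
```

Since arr1.transpose(0, 2, 1) has shape (N, D, M) and arr2 has shape (N, M, D), matmul treats the leading N as a batch axis and multiplies the trailing (D, M) @ (M, D) pairs, giving (N, D, D).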
A simple benchmark for a large N:
In [225]: arr1.shape
Out[225]: (100000, 10, 50)
In [226]: %%timeit
...: result = []
...: for i in range(N):
...: result.append(arr1[i].T @ arr2[i])
...: result = np.array(result)
...:
...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using a pre-allocated list and skipping the conversion to an array after the loop ends, the performance is not much worse than vectorization, which means most of the overhead comes from the final data conversion:
In [375]: %%timeit
...: result = [None] * N
...: for i in range(N):
...: result[i] = arr1[i].T @ arr2[i]
...: # result = np.array(result)
...:
...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance of loop solution with data conversion:
In [376]: %%timeit
...: result = [None] * N
...: for i in range(N):
...: result[i] = arr1[i].T @ arr2[i]
...: result = np.array(result)
...:
...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
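The gap between the two timings is the cost of np.array(result): it has to allocate a fresh contiguous (N, D, D) buffer and copy every per-iteration matrix into it. A small sketch (with reduced sizes, my own illustration) showing that the conversion really does copy the data rather than reuse the list's arrays:

```python
import numpy as np

N, M, D = 100, 10, 50
rng = np.random.default_rng(0)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Loop into a Python list, then convert once at the end
result = [None] * N
for i in range(N):
    result[i] = arr1[i].T @ arr2[i]
converted = np.array(result)

# np.array allocates one contiguous block and copies each (D, D)
# result into it; the list's arrays are not reused.
print(converted.shape)                         # → (100, 50, 50)
print(np.shares_memory(converted, result[0]))  # → False (data was copied)
```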
Another option follows the answer by @9769953 with an additional optimization: write each product directly into a pre-allocated output array via the out= argument. To my surprise, its performance is almost the same as the vectorized solution:
In [378]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
...: np.matmul(ar1, ar2, out=res)
...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.einsum('ijk,ijl->ikl', arr1, arr2, optimize=True) is ~6x slower than the loop solution on a 2-core CPU. It seems optimize=True can't trigger BLAS calls for the trailing axes, which is unexpected.
result.append may create additional overhead as the list grows large, so pre-allocating an array of the final size, instead of using a list, would be wise in that case. Of course, the accepted answer does all of that for you.
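For reference, that einsum call computes exactly the same quantity as the batched matmul (the ~6x slowdown reported above is about speed, not correctness). A small check of the equivalence, with shapes of my own choosing:

```python
import numpy as np

N, M, D = 4, 10, 5
rng = np.random.default_rng(42)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# einsum spelling of arr1.transpose(0, 2, 1) @ arr2:
# sum over the shared axis j (length M), keeping batch axis i
via_einsum = np.einsum('ijk,ijl->ikl', arr1, arr2, optimize=True)
via_matmul = arr1.transpose(0, 2, 1) @ arr2

print(np.allclose(via_einsum, via_matmul))  # → True
print(via_einsum.shape)                     # → (4, 5, 5)
```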