
I have two 3D arrays of shape (N, M, D) and I want to perform an efficient row-wise (over N) matrix multiplication such that the resulting array has shape (N, D, D).

An inefficient code sample showing what I am trying to achieve:

import numpy as np

N = 100
M = 10
D = 50
arr1 = np.random.normal(size=(N, M, D))
arr2 = np.random.normal(size=(N, M, D))
result = []
for i in range(N):
    result.append(arr1[i].T @ arr2[i])  # (D, M) @ (M, D) -> (D, D)
result = np.array(result)

However, this approach is quite slow for large N due to the Python-level loop. Is there a more efficient way to achieve this computation without loops? I already tried to find a solution via tensordot and einsum, to no avail.
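For reference, the same contraction can be written with einsum by keeping the shared batch axis N in both inputs and the output; np.tensordot does not fit directly here, since it sums over every contracted axis, including the batch axis. A minimal sketch (whether it beats the loop depends on whether einsum can dispatch to BLAS; see the comments below):

# contract over M, keep the batch axis N in the output; result has shape (N, D, D)
result = np.einsum('nmd,nme->nde', arr1, arr2, optimize=True)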

  • When you say slow, do you mean slow for the given example values of 100, 10 and 50? Commented Sep 22, 2022 at 13:10
  • np.einsum('ijk,ijl->ikl', arr1, arr2, optimize=True) is ~6x slower than the loopy solution on a 2-core cpu. It seems optimize can't trigger BLAS calls for the trailing axes. That is unexpected. Commented Sep 22, 2022 at 13:39
  • @9769953 N could potentially be very large, say tens of thousands. Then the loop implementation would be inappropriate. Commented Sep 23, 2022 at 8:03
  • Indeed. But I reckon that if the actual matrix multiplication takes relatively long (because M and D are large), the loop overhead is minimal. Appending the results via result.append may add overhead as the list grows large, so pre-allocating an array of the final size, instead of using a list, would be wise in that case. Of course, the accepted answer does all of that for you. Commented Sep 23, 2022 at 8:25
  • @9769953 In my application at least, M will stay at moderate values around 10. I should have mentioned that. Thank you for the hint about pre-allocating memory; that would probably make the loop more competitive should D become large. I will definitely keep that in mind. Thanks! Commented Sep 23, 2022 at 8:29

2 Answers


The vectorized solution is to swap the last two axes of arr1; the @ operator then broadcasts over the leading N axis, performing all N matrix products in a single batched call:

>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626,  0.38299107,  0.40600482,  0.35990016],
        [-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
        [ 3.54989101, -0.31274318,  0.14188485,  0.19860495],
        [ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],

       [[ 0.67081341, -0.08816343,  0.35430089,  0.69962394],
        [ 0.0316968 ,  0.15129449, -0.51592291,  0.07118177],
        [-0.22274906, -0.28955683, -1.78905988,  1.1486345 ],
        [ 1.68432706,  1.93915798,  2.25785798, -2.34404577]]])
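
A quick sanity check that this matches the loop version:

>>> result_loop = np.array([arr1[i].T @ arr2[i] for i in range(N)])
>>> np.allclose(arr1.transpose(0, 2, 1) @ arr2, result_loop)
True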

A simple benchmark for a large N:

In [225]: arr1.shape
Out[225]: (100000, 10, 50)

In [226]: %%timeit
     ...: result = []
     ...: for i in range(N):
     ...:     result.append(arr1[i].T @ arr2[i])
     ...: result = np.array(result)
     ...:
     ...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using a pre-allocated list and skipping the data conversion after the loop, the performance is not much worse than the vectorized version, which shows that most of the overhead comes from the final conversion to an array:

In [375]: %%timeit
     ...: result = [None] * N
     ...: for i in range(N):
     ...:     result[i] = arr1[i].T @ arr2[i]
     ...: # result = np.array(result)
     ...:
     ...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance of the loop solution with the data conversion:

In [376]: %%timeit
     ...: result = [None] * N
     ...: for i in range(N):
     ...:     result[i] = arr1[i].T @ arr2[i]
     ...: result = np.array(result)
     ...:
     ...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Another test follows the answer by @9769953 and adds a further optimization. To my surprise, its performance is almost identical to the vectorized solution:

In [378]: %%timeit
     ...: result = np.empty_like(arr1, shape=(N, D, D))
     ...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
     ...:     np.matmul(ar1, ar2, out=res)
     ...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
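
The loop-free version can also write into a pre-allocated buffer, since np.matmul broadcasts over the leading axis and accepts out= for batched inputs as well (a minimal sketch):

result = np.empty((N, D, D))
np.matmul(arr1.transpose(0, 2, 1), arr2, out=result)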



For interest, I wondered about the loop overhead, which I guess is minimal compared to the matrix multiplication itself; in particular, though, the loop overhead is minimal compared to the potential reallocation of the list memory, which with N = 10000 could be significant.

Using a pre-allocated array instead of a list, I compared the loop result and the solution provided by Mechanic Pig, and achieved the following results on my machine:

In [10]: %timeit result1 = arr1.transpose(0, 2, 1) @ arr2
33.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

versus

In [14]: %%timeit
    ...: result = np.empty_like(arr1, shape=(N, D, D))
    ...: for i in range(N):
    ...:     result[i, ...] = arr1[i].T @ arr2[i]
    ...:
    ...:
48.5 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The pure NumPy solution is still faster, so that's good, but only by a factor of about 1.5. Not too bad. Depending on the needs, the loop may be clearer about what it intends (and easier to modify, in case there's a need for an if-statement or other shenanigans). And naturally, a simple comment above the faster solution can easily point out what it actually replaces.


Following the comments on this answer by Mechanic Pig, I've added below the timing results of a loop without a pre-allocated array (but with a pre-allocated list) and without conversion to a NumPy array, mainly so the results can be compared on the same machine:

In [11]: %%timeit
    ...: result = [None] * N
    ...: for i in range(N):
    ...:     result[i] = arr1[i].T @ arr2[i]
    ...:
49.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So interestingly, this result, without the conversion, is (a tiny bit) slower than the one with a pre-allocated array and direct assignment into the array.

4 Comments

Wow, indeed that is quite fast. I didn't realise that pre-allocating memory could make that much of a difference. Thanks for pointing that out!
I also added a test and found that a large part of the overhead of the loop solution comes from the final data conversion.
If we can avoid the copy of each calculation result, the loop solution can actually be almost as fast as vectorization.
@MechanicPig Yes, that makes sense, because the list won't actually grow that fast in size: it just contains pointers to each 50x50 matrix. That means there is little need for reallocation, certainly not for reallocating large blocks of (contiguous) data. But the last step tries to allocate a (contiguous?) block of memory for the 10,000x50x50 NumPy array and copy into it; even if such operations are highly optimised at a low level (libc or lower even), that will take time. A rough illustration follows below.
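
A rough sketch illustrating that last point: the list only stores object references, while np.array has to allocate one contiguous block and copy every matrix into it:

import sys

result_list = [arr1[i].T @ arr2[i] for i in range(N)]
print(sys.getsizeof(result_list))    # pointer storage only, roughly 8 bytes per entry
print(np.array(result_list).nbytes)  # 10000 * 50 * 50 * 8 bytes, i.e. about 200 MB, all copied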
