The vectorized solution is to swap the last two axes of arr1 and let the `@` operator broadcast over the leading axis:
>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626, 0.38299107, 0.40600482, 0.35990016],
[-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
[ 3.54989101, -0.31274318, 0.14188485, 0.19860495],
[ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],
[[ 0.67081341, -0.08816343, 0.35430089, 0.69962394],
[ 0.0316968 , 0.15129449, -0.51592291, 0.07118177],
[-0.22274906, -0.28955683, -1.78905988, 1.1486345 ],
[ 1.68432706, 1.93915798, 2.25785798, -2.34404577]]])
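To confirm the two approaches agree, here is a quick sanity check (not part of the original answer) comparing the batched product against an explicit per-slice loop:

```python
import numpy as np

N, M, D = 2, 3, 4
rng = np.random.default_rng(0)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Batched product: for each n, compute arr1[n].T @ arr2[n] in one call
vectorized = arr1.transpose(0, 2, 1) @ arr2

# Reference: explicit Python loop over the leading axis
looped = np.stack([arr1[i].T @ arr2[i] for i in range(N)])

print(np.allclose(vectorized, looped))  # → True
print(vectorized.shape)                 # → (2, 4, 4)
```

Since arr1.transpose(0, 2, 1) has shape (N, D, M) and arr2 has shape (N, M, D), matmul treats the leading N as a batch axis and multiplies the trailing (D, M) @ (M, D) pairs, giving (N, D, D).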
A simple benchmark for a large N:
In [225]: arr1.shape
Out[225]: (100000, 10, 50)
In [226]: %%timeit
...: result = []
...: for i in range(N):
...: result.append(arr1[i].T @ arr2[i])
...: result = np.array(result)
...:
...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using a pre-allocated list and skipping the conversion to an array after the loop ends, the performance is not much worse than vectorization, which means most of the overhead comes from the final data conversion:
In [375]: %%timeit
...: result = [None] * N
...: for i in range(N):
...: result[i] = arr1[i].T @ arr2[i]
...: # result = np.array(result)
...:
...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance of loop solution with data conversion:
In [376]: %%timeit
...: result = [None] * N
...: for i in range(N):
...: result[i] = arr1[i].T @ arr2[i]
...: result = np.array(result)
...:
...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
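The gap between the two timings is the cost of np.array(result): it has to allocate a fresh contiguous (N, D, D) buffer and copy every per-iteration matrix into it. A small sketch (with reduced sizes, my own illustration) showing that the conversion really does copy the data rather than reuse the list's arrays:

```python
import numpy as np

N, M, D = 100, 10, 50
rng = np.random.default_rng(0)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Loop into a Python list, then convert once at the end
result = [None] * N
for i in range(N):
    result[i] = arr1[i].T @ arr2[i]
converted = np.array(result)

# np.array allocates one contiguous block and copies each (D, D)
# result into it; the list's arrays are not reused.
print(converted.shape)                         # → (100, 50, 50)
print(np.shares_memory(converted, result[0]))  # → False (data was copied)
```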
Another option follows the answer by @9769953 with an additional optimization: write each product directly into a pre-allocated output array via the out= argument. To my surprise, its performance is almost the same as the vectorized solution:
In [378]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
...: np.matmul(ar1, ar2, out=res)
...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.einsum('ijk,ijl->ikl', arr1, arr2, optimize=True) is ~6x slower than the loop solution on a 2-core CPU. It seems optimize=True can't trigger BLAS calls for the trailing axes, which is unexpected.
result.append may create additional overhead as the list grows large, so pre-allocating an array of the final size, instead of using a list, would be wise in that case. Of course, the accepted answer does all of that for you.
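For reference, that einsum call computes exactly the same quantity as the batched matmul (the ~6x slowdown reported above is about speed, not correctness). A small check of the equivalence, with shapes of my own choosing:

```python
import numpy as np

N, M, D = 4, 10, 5
rng = np.random.default_rng(42)
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# einsum spelling of arr1.transpose(0, 2, 1) @ arr2:
# sum over the shared axis j (length M), keeping batch axis i
via_einsum = np.einsum('ijk,ijl->ikl', arr1, arr2, optimize=True)
via_matmul = arr1.transpose(0, 2, 1) @ arr2

print(np.allclose(via_einsum, via_matmul))  # → True
print(via_einsum.shape)                     # → (4, 5, 5)
```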