I was benchmarking a naive matrix transposition and noticed a very large performance discrepancy between:
- a naive kernel that reads the data contiguously and writes with a large stride;
- the same kernel reversed: large-stride reads and contiguous writes.

The benchmark is very crude; we don't flush caches between runs, for instance. I can ensure, though, that the clock frequency was constant and that nobody else was on the machine at the time. I pinned the process to the first core of the second NUMA node. The transpose is deliberately not (recursively) cache blocked.
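(For reference, flushing the two buffers between timed runs could be done along these lines on x86; this is only a sketch, assuming <immintrin.h> is available and 64-byte cache lines, and flush_buffer is a name used here for illustration, not part of the benchmark.)

#include <immintrin.h>  // _mm_clflush, _mm_mfence
#include <cstddef>

// Flush every cache line covering a buffer of n doubles.
void flush_buffer(const double* p, std::size_t n) {
    const char* base = reinterpret_cast<const char*>(p);
    for(std::size_t off = 0; off < n * sizeof(double); off += 64)
        _mm_clflush(base + off);
    _mm_mfence();  // make sure the flushes complete before the timed region starts
}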
Looking at the benchmark, we see that for large matrices (guaranteed to be out of cache) there is a benefit to contiguous reads. Why?

(Current status: I also benchmarked non-power-of-two sizes, even primes, and contiguous reads still come out ahead of contiguous writes for large matrices. I also ran naive transpose kernels on GPU, specifically on a CDNA2 MI250X device, and the opposite result is observed there: contiguous writes are considerably faster than contiguous reads, so CPU and GPU disagree here. You can read here that GPUs seem to prefer contiguous writes (1 & 2), and I have obtained similar results on the GPUs I tried. The CDNA 3 white paper mentions that writes use half a cache line.)
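(The naive GPU kernels in question are of roughly this shape; this is a HIP C++ sketch for illustration, with an assumed 2D launch geometry, not the exact code that produced the GPU results.)

#include <hip/hip_runtime.h>

// Adjacent threads vary fastest along i (threadIdx.x).
__global__ void transpose_read_contiguous_gpu(const double* A, double* B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if(i < N && j < N)
        B[j + i * N] = A[i + j * N];  // coalesced (contiguous) reads, stride-N writes
}

__global__ void transpose_write_contiguous_gpu(const double* A, double* B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if(i < N && j < N)
        B[i + j * N] = A[j + i * N];  // stride-N reads, coalesced (contiguous) writes
}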
I ran the code on Zen 3 (AMD EPYC 7A53) and Zen 4 (AMD EPYC 9654), with similar results:

On Zen 3:
Matrix size: 4x4 128 bytes
Read-contiguous transpose time: 4.41e-08 s
Write-contiguous transpose time: 4.31e-08 s
Read speedup over write: 0.977324
Matrix size: 8x8 512 bytes
Read-contiguous transpose time: 8.32e-08 s
Write-contiguous transpose time: 7.61e-08 s
Read speedup over write: 0.914663
Matrix size: 16x16 2048 bytes
Read-contiguous transpose time: 1.392e-07 s
Write-contiguous transpose time: 1.372e-07 s
Read speedup over write: 0.985632
Matrix size: 32x32 8192 bytes
Read-contiguous transpose time: 3.737e-07 s
Write-contiguous transpose time: 3.847e-07 s
Read speedup over write: 1.02944
Matrix size: 64x64 32768 bytes
Read-contiguous transpose time: 4.113e-06 s
Write-contiguous transpose time: 1.7404e-06 s
Read speedup over write: 0.423146
Matrix size: 128x128 131072 bytes
Read-contiguous transpose time: 4.21795e-05 s
Write-contiguous transpose time: 2.17801e-05 s
Read speedup over write: 0.516367
Matrix size: 256x256 524288 bytes
Read-contiguous transpose time: 0.000228837 s
Write-contiguous transpose time: 0.000143957 s
Read speedup over write: 0.629081
Matrix size: 512x512 2097152 bytes
Read-contiguous transpose time: 0.00213673 s
Write-contiguous transpose time: 0.00080198 s
Read speedup over write: 0.37533
Matrix size: 1024x1024 8388608 bytes
Read-contiguous transpose time: 0.00907537 s
Write-contiguous transpose time: 0.00660607 s
Read speedup over write: 0.727912
Matrix size: 2048x2048 33554432 bytes
Read-contiguous transpose time: 0.047444 s
Write-contiguous transpose time: 0.0615565 s
Read speedup over write: 1.29745
Matrix size: 4096x4096 134217728 bytes
Read-contiguous transpose time: 0.189514 s
Write-contiguous transpose time: 0.871162 s
Read speedup over write: 4.59682
On Zen 4:
Matrix size: 4x4 128 bytes
Read-contiguous transpose time: 4.3e-08 s
Write-contiguous transpose time: 4.3e-08 s
Read speedup over write: 1
Matrix size: 8x8 512 bytes
Read-contiguous transpose time: 7.32e-08 s
Write-contiguous transpose time: 7.61e-08 s
Read speedup over write: 1.03962
Matrix size: 16x16 2048 bytes
Read-contiguous transpose time: 1.352e-07 s
Write-contiguous transpose time: 1.442e-07 s
Read speedup over write: 1.06657
Matrix size: 32x32 8192 bytes
Read-contiguous transpose time: 3.325e-07 s
Write-contiguous transpose time: 4.517e-07 s
Read speedup over write: 1.3585
Matrix size: 64x64 32768 bytes
Read-contiguous transpose time: 4.1142e-06 s
Write-contiguous transpose time: 1.5844e-06 s
Read speedup over write: 0.385105
Matrix size: 128x128 131072 bytes
Read-contiguous transpose time: 2.02202e-05 s
Write-contiguous transpose time: 1.50385e-05 s
Read speedup over write: 0.743736
Matrix size: 256x256 524288 bytes
Read-contiguous transpose time: 0.000168397 s
Write-contiguous transpose time: 0.000142012 s
Read speedup over write: 0.843319
Matrix size: 512x512 2097152 bytes
Read-contiguous transpose time: 0.00201966 s
Write-contiguous transpose time: 0.000739887 s
Read speedup over write: 0.366342
Matrix size: 1024x1024 8388608 bytes
Read-contiguous transpose time: 0.00789253 s
Write-contiguous transpose time: 0.00646361 s
Read speedup over write: 0.818954
Matrix size: 2048x2048 33554432 bytes
Read-contiguous transpose time: 0.0333825 s
Write-contiguous transpose time: 0.0315571 s
Read speedup over write: 0.945318
Matrix size: 4096x4096 134217728 bytes
Read-contiguous transpose time: 0.153531 s
Write-contiguous transpose time: 0.326645 s
Read speedup over write: 2.12755
#include <chrono>
#include <iostream>
#include <vector>

// Column-major indexing: element (i, j) of an N x N matrix.
inline double& at(std::vector<double>& M, int N, int i, int j) {
    return M[i + j * N];
}

inline const double& at(const std::vector<double>& M, int N, int i, int j) {
    return M[i + j * N];
}

// Reads A contiguously (stride 1), writes B with stride N.
void transpose_read_contiguous(const std::vector<double>& A, std::vector<double>& B, int N) {
    for(int j = 0; j < N; ++j) {
        for(int i = 0; i < N; ++i) {
            at(B, N, j, i) = at(A, N, i, j);
        }
    }
}

// Reads A with stride N, writes B contiguously (stride 1).
void transpose_write_contiguous(const std::vector<double>& A, std::vector<double>& B, int N) {
    for(int i = 0; i < N; ++i) {
        for(int j = 0; j < N; ++j) {
            at(B, N, j, i) = at(A, N, i, j);
        }
    }
}

// Crude optimization barrier: the opaque (noinline) call makes the argument
// escape, so the compiler cannot discard the transposed stores.
template <typename T>
__attribute__((noinline)) void do_not_optimize(T&) {
}

int main() {
    for(int N = 4; N < 8192; N *= 2) {
        std::cout << "Matrix size: " << N << "x" << N << " " << N * N * sizeof(double) << " bytes \n";

        std::vector<double> A(N * N, 0.0);
        std::vector<double> B(N * N, 0.0);
        for(int j = 0; j < N; ++j)
            for(int i = 0; i < N; ++i)
                at(A, N, i, j) = i + j * N;

        auto t1 = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 10; ++i)
            transpose_read_contiguous(A, B, N);
        auto t2 = std::chrono::high_resolution_clock::now();
        do_not_optimize(A);
        do_not_optimize(B);
        std::chrono::duration<double> read_time = (t2 - t1) / 10.0;
        std::cout << " Read-contiguous transpose time: " << read_time.count() << " s\n";

        auto t3 = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 10; ++i)
            transpose_write_contiguous(A, B, N);
        auto t4 = std::chrono::high_resolution_clock::now();
        do_not_optimize(A);
        do_not_optimize(B);
        std::chrono::duration<double> write_time = (t4 - t3) / 10.0;
        std::cout << " Write-contiguous transpose time: " << write_time.count() << " s\n";

        std::cout << " Read speedup over write: " << (write_time.count() / read_time.count()) << "\n\n";
    }
    return 0;
}
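(As an aside, a stronger barrier than the empty noinline function would be the usual inline-asm escape; a sketch for GCC/Clang:)

template <typename T>
inline void do_not_optimize(T const& value) {
    // Empty asm with a "memory" clobber: the compiler must assume the object's
    // memory may be read through the pointer, so the stores feeding it are kept.
    asm volatile("" : : "g"(&value) : "memory");
}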
(A blocked transpose could additionally use SIMD shuffles: unpcklpd/unpckhpd, or better, AVX2 256-bit shuffles, or AVX-512 on Zen 4, e.g. vpermt2pd with 256-bit or 512-bit vectors. Even without SIMD, the scattered accesses within a block are to lines that stay hot in L1d cache.)
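For comparison, a minimal non-recursive cache-blocked scalar version could look like the following sketch; the tile size T = 32 is an assumption, not tuned. With T = 32 each tile of A and of B is 8 KiB, so both the contiguous and the strided side stay within L1d while a tile is processed.

#include <algorithm>  // std::min
#include <vector>

// Blocked variant of the transposes above (column-major, B = A^T).
void transpose_blocked(const std::vector<double>& A, std::vector<double>& B, int N, int T = 32) {
    for(int jj = 0; jj < N; jj += T)
        for(int ii = 0; ii < N; ii += T)
            for(int j = jj; j < std::min(jj + T, N); ++j)
                for(int i = ii; i < std::min(ii + T, N); ++i)
                    B[j + i * N] = A[i + j * N];  // contiguous reads, strided writes confined to one tile
}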