
I was benchmarking a naive matrix transposition and noticed a very large performance discrepancy between:

  • a naive kernel where we read the data contiguously and write with a large stride;
  • the same kernel with the loop order reversed: large-stride reads, contiguous writes.

The benchmark is very crude (we don't flush caches between runs, for instance), but I can ensure that the clock frequency was constant and that nobody else was on the machine at the time. I pinned the process to the first core of the second NUMA node. The transpose is deliberately not (recursively) cache-blocked.
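For reference, here is a minimal sketch of how such pinning can be done on Linux. The core id is a placeholder, not the id actually used for the numbers below; the real id of "first core of the second NUMA node" depends on the machine's topology (check `numactl --hardware` or `lscpu`). The same effect can be had from the shell with something like `taskset -c <core>` or `numactl --cpunodebind=1 --membind=1`.

// Sketch only: pin the calling process to one core before benchmarking.
#define _GNU_SOURCE   // for CPU_ZERO/CPU_SET/sched_setaffinity with glibc
#include <sched.h>
#include <cstdio>

int main() {
    const int core = 64;  // hypothetical id: "first core of the second NUMA node"

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling process"; the mask restricts it to a single core.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the transpose benchmark shown further down ...
    return 0;
}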

Looking at the results, we see that for large matrices (guaranteed to fall out of cache) there is a clear benefit to contiguous reads. Why?

(Current status: I also benchmarked non-power-of-two sizes, even primes, and contiguous reads still come out ahead of contiguous writes for large matrices. I also ran naive transpose kernels on GPU, specifically on a CDNA2 device (MI250X), and the opposite result is observed there: contiguous writes are considerably faster than contiguous reads. CPU and GPU are in disagreement here. You can read here (1 & 2) that GPUs seem to prefer contiguous writes; I have obtained similar results on GPUs myself. The CDNA 3 white paper mentions that writes use a half cache line.)
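For context, the GPU runs used naive kernels of the usual form; the following is an illustrative HIP-style sketch (not the exact kernels behind those GPU results), just to show which array each variant accesses coalesced: adjacent threads along x handle consecutive i, so the array indexed with i in the low position gets the coalesced accesses.

// Illustrative HIP sketch of naive GPU transpose kernels; the data is left
// uninitialized and nothing is timed or checked, this only shows the access pattern.
#include <hip/hip_runtime.h>

__global__ void transpose_read_coalesced(const double* A, double* B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        B[j + i * N] = A[i + j * N];  // coalesced reads, strided (uncoalesced) writes
}

__global__ void transpose_write_coalesced(const double* A, double* B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        B[i + j * N] = A[j + i * N];  // coalesced writes, strided (uncoalesced) reads
}

int main() {
    const int N = 4096;
    double *A = nullptr, *B = nullptr;
    hipMalloc(reinterpret_cast<void**>(&A), size_t(N) * N * sizeof(double));
    hipMalloc(reinterpret_cast<void**>(&B), size_t(N) * N * sizeof(double));
    dim3 block(16, 16), grid(N / 16, N / 16);  // assumes N is a multiple of 16
    transpose_read_coalesced<<<grid, block>>>(A, B, N);
    transpose_write_coalesced<<<grid, block>>>(A, B, N);
    hipDeviceSynchronize();
    hipFree(A);
    hipFree(B);
    return 0;
}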

I ran the CPU benchmark (full code at the end of the question) on Zen3 (AMD EPYC 7A53) and Zen4 (AMD EPYC 9654), with similar results:

On Zen3:

Matrix size: 4x4 128 bytes
  Read-contiguous transpose time: 4.41e-08 s
  Write-contiguous transpose time: 4.31e-08 s
  Read speedup over write: 0.977324
Matrix size: 8x8 512 bytes
  Read-contiguous transpose time: 8.32e-08 s
  Write-contiguous transpose time: 7.61e-08 s
  Read speedup over write: 0.914663
Matrix size: 16x16 2048 bytes
  Read-contiguous transpose time: 1.392e-07 s
  Write-contiguous transpose time: 1.372e-07 s
  Read speedup over write: 0.985632
Matrix size: 32x32 8192 bytes
  Read-contiguous transpose time: 3.737e-07 s
  Write-contiguous transpose time: 3.847e-07 s
  Read speedup over write: 1.02944
Matrix size: 64x64 32768 bytes
  Read-contiguous transpose time: 4.113e-06 s
  Write-contiguous transpose time: 1.7404e-06 s
  Read speedup over write: 0.423146
Matrix size: 128x128 131072 bytes
  Read-contiguous transpose time: 4.21795e-05 s
  Write-contiguous transpose time: 2.17801e-05 s
  Read speedup over write: 0.516367
Matrix size: 256x256 524288 bytes
  Read-contiguous transpose time: 0.000228837 s
  Write-contiguous transpose time: 0.000143957 s
  Read speedup over write: 0.629081
Matrix size: 512x512 2097152 bytes
  Read-contiguous transpose time: 0.00213673 s
  Write-contiguous transpose time: 0.00080198 s
  Read speedup over write: 0.37533
Matrix size: 1024x1024 8388608 bytes
  Read-contiguous transpose time: 0.00907537 s
  Write-contiguous transpose time: 0.00660607 s
  Read speedup over write: 0.727912
Matrix size: 2048x2048 33554432 bytes
  Read-contiguous transpose time: 0.047444 s
  Write-contiguous transpose time: 0.0615565 s
  Read speedup over write: 1.29745
Matrix size: 4096x4096 134217728 bytes
  Read-contiguous transpose time: 0.189514 s
  Write-contiguous transpose time: 0.871162 s
  Read speedup over write: 4.59682

On Zen4:

Matrix size: 4x4 128 bytes
  Read-contiguous transpose time: 4.3e-08 s
  Write-contiguous transpose time: 4.3e-08 s
  Read speedup over write: 1
Matrix size: 8x8 512 bytes
  Read-contiguous transpose time: 7.32e-08 s
  Write-contiguous transpose time: 7.61e-08 s
  Read speedup over write: 1.03962
Matrix size: 16x16 2048 bytes
  Read-contiguous transpose time: 1.352e-07 s
  Write-contiguous transpose time: 1.442e-07 s
  Read speedup over write: 1.06657
Matrix size: 32x32 8192 bytes
  Read-contiguous transpose time: 3.325e-07 s
  Write-contiguous transpose time: 4.517e-07 s
  Read speedup over write: 1.3585
Matrix size: 64x64 32768 bytes
  Read-contiguous transpose time: 4.1142e-06 s
  Write-contiguous transpose time: 1.5844e-06 s
  Read speedup over write: 0.385105
Matrix size: 128x128 131072 bytes
  Read-contiguous transpose time: 2.02202e-05 s
  Write-contiguous transpose time: 1.50385e-05 s
  Read speedup over write: 0.743736
Matrix size: 256x256 524288 bytes
  Read-contiguous transpose time: 0.000168397 s
  Write-contiguous transpose time: 0.000142012 s
  Read speedup over write: 0.843319
Matrix size: 512x512 2097152 bytes
  Read-contiguous transpose time: 0.00201966 s
  Write-contiguous transpose time: 0.000739887 s
  Read speedup over write: 0.366342
Matrix size: 1024x1024 8388608 bytes
  Read-contiguous transpose time: 0.00789253 s
  Write-contiguous transpose time: 0.00646361 s
  Read speedup over write: 0.818954
Matrix size: 2048x2048 33554432 bytes
  Read-contiguous transpose time: 0.0333825 s
  Write-contiguous transpose time: 0.0315571 s
  Read speedup over write: 0.945318
Matrix size: 4096x4096 134217728 bytes
  Read-contiguous transpose time: 0.153531 s
  Write-contiguous transpose time: 0.326645 s
  Read speedup over write: 2.12755
The benchmark code:

#include <chrono>
#include <iostream>
#include <vector>

// Column-major indexing: element (i, j) of an N x N matrix lives at i + j * N,
// so walking i at fixed j is the contiguous direction.
inline double& at(std::vector<double>& M, int N, int i, int j) {
    return M[i + j * N];
}
inline const double& at(const std::vector<double>& M, int N, int i, int j) {
    return M[i + j * N];
}

// Inner loop walks i: A(i, j) is read contiguously, B(j, i) is written with stride N.
void transpose_read_contiguous(const std::vector<double>& A, std::vector<double>& B, int N) {
    for(int j = 0; j < N; ++j) {
        for(int i = 0; i < N; ++i) {
            at(B, N, j, i) = at(A, N, i, j);
        }
    }
}

// Inner loop walks j: B(j, i) is written contiguously, A(i, j) is read with stride N.
void transpose_write_contiguous(const std::vector<double>& A, std::vector<double>& B, int N) {
    for(int i = 0; i < N; ++i) {
        for(int j = 0; j < N; ++j) {
            at(B, N, j, i) = at(A, N, i, j);
        }
    }
}

// Opaque, noinline call intended to keep the compiler from eliding the
// transposes, whose only visible effect is writing B.
template <typename T>
__attribute__((noinline)) void do_not_optimize(T&) {
}

int main() {
    for(int N = 4; N < 8192; N *= 2) {
        std::cout << "Matrix size: " << N << "x" << N << " " << N * N * sizeof(double) << " bytes \n";

        std::vector<double> A(N * N, 0.0);
        std::vector<double> B(N * N, 0.0);

        for(int j = 0; j < N; ++j)
            for(int i = 0; i < N; ++i)
                at(A, N, i, j) = i + j * N;

        // Time 10 back-to-back runs and average; caches are deliberately not flushed in between.
        auto t1 = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 10; ++i)
            transpose_read_contiguous(A, B, N);
        auto t2 = std::chrono::high_resolution_clock::now();
        do_not_optimize(A);
        do_not_optimize(B);

        std::chrono::duration<double> read_time = (t2 - t1) / 10.0;
        std::cout << "  Read-contiguous transpose time: " << read_time.count() << " s\n";

        auto t3 = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 10; ++i)
            transpose_write_contiguous(A, B, N);
        auto t4 = std::chrono::high_resolution_clock::now();
        do_not_optimize(A);
        do_not_optimize(B);

        std::chrono::duration<double> write_time = (t4 - t3) / 10.0;
        std::cout << "  Write-contiguous transpose time: " << write_time.count() << " s\n";
        std::cout << "  Read speedup over write: " << (write_time.count() / read_time.count()) << "\n\n";
    }

    return 0;
}
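Since the kernels above are deliberately left un-blocked, here for contrast is a hedged sketch of what a cache-blocked version could look like (the tile size is an assumption to be tuned, not something measured here; see also the cache-blocking comment below).

// Sketch of a cache-blocked transpose: walk both matrices tile by tile so the
// strided side stays inside a tile that fits in L1d. Same column-major layout
// as above. T = 32 gives 8 KiB per tile, so both tiles fit in a 32 KiB L1d.
#include <algorithm>
#include <vector>

void transpose_blocked(const std::vector<double>& A, std::vector<double>& B, int N) {
    const int T = 32;  // tile edge, an assumption to tune per machine
    for (int jj = 0; jj < N; jj += T)
        for (int ii = 0; ii < N; ii += T)
            for (int j = jj; j < std::min(jj + T, N); ++j)
                for (int i = ii; i < std::min(ii + T, N); ++i)
                    B[j + i * N] = A[i + j * N];
}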
Comments:

  • Interesting microbenchmark. Read misses stall execution, write misses can be buffered. But on the other hand, not writing all of a line means it'll have to get written back as it's evicted, as well as brought back in the next time you write part of it... but Zen does that nearly for free with its L3 cache being a victim cache, so the eviction from L2 can swap places with a line in L3, with the read request from L2 to L3 carrying the evicted data (chipsandcheese.com/i/174624671/bandwidth). Not sure if that explains everything, hence just a comment. Commented Nov 3 at 15:19
  • If you actually want a large transpose to run fast, you cache-block it, recursively doing larger and larger tiles. The smallest size can be some size that fits into SIMD vector regs so you do it with shuffle instructions like unpcklpd / unpckhpd, or better with AVX2 256-bit shuffles. Or AVX-512 on Zen 4, like vpermt2pd with 256-bit or 512-bit vectors. Even without SIMD, your scattered accesses will be to lines that stay hot in L1d cache. Commented Nov 3 at 15:23
  • One important thing is that you only test power-of-two matrices, and this typically causes cache thrashing. This is certainly more critical on non-contiguous reads because of the stalls and the high latency of the DRAM. Can you test this with matrices like 3127x3127 (or even 5353x5353)? (See the padded-leading-dimension sketch after these comments.) Commented Nov 3 at 18:27
  • Moreover, this sequential code should be bound by memory latency. In parallel, with multiple threads, I expect contiguous writes to be faster on large matrices (due to cache-line reloading and cache-line write allocation) -- at least on a single NUMA node. Commented Nov 3 at 18:30
  • I tested: 4095x4095 (134152200 bytes): read-contiguous 0.0330239 s, write-contiguous 0.0361488 s, read speedup 1.09462; 4093x4093 (134021192 bytes): read-contiguous 0.031891 s, write-contiguous 0.0357609 s, read speedup 1.12135; 4099x4099 (134414408 bytes): read-contiguous 0.0314977 s, write-contiguous 0.0355361 s, read speedup 1.12821. Something may be going bad for powers of two then. Commented Nov 4 at 13:17
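Following up on the power-of-two discussion in the comments above: one standard mitigation, not something tested in this question, is to pad the leading dimension so the strided accesses no longer use an exact power-of-two stride and stop aliasing onto the same cache sets. A hypothetical sketch using the same column-major convention as the code above (LD = N + 8 is an arbitrary padding choice):

// LD is the padded leading dimension (LD >= N); the matrix is stored
// column-major in a buffer of LD * N doubles, with only the first N rows used.
#include <vector>

void transpose_read_contiguous_padded(const std::vector<double>& A,
                                      std::vector<double>& B, int N, int LD) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            B[j + i * LD] = A[i + j * LD];  // strided accesses now use stride LD, not N
}

// Usage sketch:
//   int N = 4096, LD = N + 8;
//   std::vector<double> A(static_cast<size_t>(LD) * N), B(static_cast<size_t>(LD) * N);
//   transpose_read_contiguous_padded(A, B, N, LD);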
