
Is bit_casting arrays of one type to another needed to avoid UB? For example, I have this function:

void func(std::vector<int32_t>& dest, std::vector<std::byte>& src, const long stride){
    const auto ptr = reinterpret_cast<std::byte *>(dest.data());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const auto t = ptr + 4 * i;
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
    }
}

Do I need to use bit_cast instead?

void func2(std::vector<int32_t>& dest, std::vector<std::byte>& src, const long stride){
    for (std::size_t i = 0; i < src.size(); ++i) {
        alignas(std::int32_t) std::array<std::byte, 4> t;
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
        dest[i] = std::bit_cast<std::int32_t>(t);
    }
}

Or use memcpy?

void func3(std::vector<int32_t>& dest, std::vector<std::byte>& src, const long stride){
    for (std::size_t i = 0; i < src.size(); ++i) {
        alignas(std::int32_t) std::byte t[4];
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
        std::memcpy(&dest[i], t, sizeof t);
    }
}

From my tests, the bit_cast and memcpy versions seem to have some overhead, and the generated asm is different, even though I would expect it to be the same for scalar types: https://godbolt.org/z/Y1W585EWY

  • std::byte (and char) are exempt from strict aliasing, so your first example is fine (see the sketch after these comments). I would enable optimizations before comparing overhead though. Commented Apr 30, 2022 at 17:08
  • @HolyBlackCat Yes, the first example seems to be the fastest at -O3. Commented Apr 30, 2022 at 17:11
  • What you call overhead here isn't necessarily bad: there is some more arithmetic certainly, but also fewer stores, and the 4 separate stores of the first approach could be a bottleneck there. So I'd definitely measure that before deciding that it's bad, maybe it's OK or even better. Also, it looks doable to use SIMD for this (not with "gather", but using byte-permutes) would you be interested in that? Commented Apr 30, 2022 at 17:18
  • @harold SIMD? Is it portable? Please point me to some docs! Commented Apr 30, 2022 at 17:20
  • @ggg not portable, then again performance considerations are in general not portable. What's good for one target may not be good for another. Since you showed an x64 example here, I'll link you this documentation Commented Apr 30, 2022 at 17:27
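
To make the aliasing comment concrete, here is a minimal sketch of which direction the exemption works in (the function names are only illustrative):

#include <cstddef>
#include <cstdint>

// Allowed: the bytes of an existing int32_t may be read or written through
// std::byte* / unsigned char* glvalues -- this is what func() above relies on.
void poke_byte(std::int32_t& x, std::byte b) {
    auto* p = reinterpret_cast<std::byte*>(&x);
    p[0] = b;
}

// Not allowed: reading raw bytes through an int32_t* is still UB, because no
// int32_t object lives there; use std::memcpy or std::bit_cast for that direction.
std::int32_t peek_int(const std::byte* raw) {
    return *reinterpret_cast<const std::int32_t*>(raw); // UB
}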

1 Answer


I don't know whether it's UB there, but if you can use an unsigned type, you can convert this part:

t[0] = src[i];
t[1] = src[i + stride];
t[2] = src[i + 2 * stride];
t[3] = src[i + 3 * stride];

to this:

dest[i] = src[i] +   
          src[i+stride]   * (uint32_t) 256    + 
          src[i+stride*2] * (uint32_t) 65536  +
          src[i+stride*3] * (uint32_t) 16777216;
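
For reference, a minimal self-contained sketch of that multiply/shift combine as a full function (the unsigned element type, the std::size_t stride and the loop bound are assumptions; it presumes stride == dest.size() and src.size() == 4 * dest.size() so the strided reads stay in bounds):

#include <cstdint>
#include <vector>

// 256 = 1 << 8, 65536 = 1 << 16, 16777216 = 1 << 24, so the multiplies above
// are just shifts; written with shifts here for clarity.
void func_shift(std::vector<std::uint32_t>& dest,
                const std::vector<unsigned char>& src,
                const std::size_t stride)
{
    for (std::size_t i = 0; i < dest.size(); ++i) {
        dest[i] = static_cast<std::uint32_t>(src[i])
                | (static_cast<std::uint32_t>(src[i + stride])     << 8)
                | (static_cast<std::uint32_t>(src[i + 2 * stride]) << 16)
                | (static_cast<std::uint32_t>(src[i + 3 * stride]) << 24);
    }
}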

If you need a speedup, you can vectorize the operation:

// for avx512 (16 x 32-bit lanes)
vector1 = src[i]            .. src[i + 15]
vector2 = src[i + stride]   .. src[i + stride + 15]
vector3 = src[i + 2*stride] .. src[i + 2*stride + 15]
vector4 = src[i + 3*stride] .. src[i + 3*stride + 15]

then join them the same way but in vectorized form.

// either a 16-element vector extension:
destVect[i] = vector1 + vector2*256 + ....
// or just run over 16 elements at a time, like tiled computing:
for(j from 0 to 15)
   destVect[i+j] = ...

Maybe you don't even need explicit intrinsics. Just try simple loops working on arrays (or std::vector) of a SIMD-width number of elements; but encapsulation generally adds code bloat, so you may need to work on plain arrays on the stack.

Some compilers have a default minimum number of loop iterations before they vectorize, so you should test with different tile widths or apply a compiler flag that lets them vectorize even small loops.
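
As a concrete version of that tiling idea (same includes and size/stride assumptions as the sketch above; the function name is illustrative), a plain loop like this should auto-vectorize with -O3 and AVX-512 enabled:

// Process 16 uint32_t outputs per tile -- one ZMM register's worth -- so the
// inner loop maps onto AVX-512 lanes. Assumes dest.size() is a multiple of 16.
void func_tiled(std::vector<std::uint32_t>& dest,
                const std::vector<unsigned char>& src,
                const std::size_t stride)
{
    constexpr std::size_t tile = 16;
    for (std::size_t i = 0; i < dest.size(); i += tile) {
        for (std::size_t j = 0; j < tile; ++j) {   // the loop the compiler should vectorize
            dest[i + j] = static_cast<std::uint32_t>(src[i + j])
                        | (static_cast<std::uint32_t>(src[i + j + stride])     << 8)
                        | (static_cast<std::uint32_t>(src[i + j + 2 * stride]) << 16)
                        | (static_cast<std::uint32_t>(src[i + j + 3 * stride]) << 24);
        }
    }
}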

Here is a sample solution from a toy auto-vectorized SIMD library: https://godbolt.org/z/qMWsbsrG8

Output:

1000 operations took 5518 nanoseconds

This is with all of the stack-array allocation + kernel-launch overheads included.

For 10000 operations (https://godbolt.org/z/Mz1K75Kj1) it takes 2.4 nanoseconds per operation.

Here is 1000 operations but using only 16 work-items (a single ZMM register on AVX512): https://godbolt.org/z/r9GTfffG8

simd*1000 operations took 20551 nanoseconds

This is 1.25 nanoseconds per operation (at least on the godbolt.org server). On an FX8150 with a narrower SIMD width it takes ~6.9 nanoseconds per operation. If you write a version without the encapsulation, it should produce less code bloat and be faster.

Lastly, use multiple iterations for benchmarking: https://godbolt.org/z/dn9vj9seP

simd*1000 operations took 12367 nanoseconds
simd*1000 operations took 12420 nanoseconds
simd*1000 operations took 12118 nanoseconds
simd*1000 operations took 2753 nanoseconds
simd*1000 operations took 2694 nanoseconds
simd*1000 operations took 2691 nanoseconds
simd*1000 operations took 2839 nanoseconds
simd*1000 operations took 2698 nanoseconds
simd*1000 operations took 2702 nanoseconds
simd*1000 operations took 2711 nanoseconds
simd*1000 operations took 2718 nanoseconds
simd*1000 operations took 2710 nanoseconds

This is 0.17 nanoseconds per operation; 23 GB/s is not too bad for RAM on a client-shared server plus a simple loop. Explicitly using AVX intrinsics with no encapsulation should get you the maximum bandwidth of the L1/L2/L3 caches or RAM (depending on dataset size). But beware: if you do this on a real production server shared with other clients, your neighbors will feel the turbo downclocking during the AVX512-accelerated computations (unless integer work doesn't count as a heavy load for the AVX512 pipelines).

Compilers will behave differently. For example, clang produces this:

    .LBB0_4:
    vpslld  zmm1, zmm1, 8 // evil integer bit level hacking?
    vpslld  zmm2, zmm2, 16
    vpslld  zmm3, zmm3, 24
    vpord   zmm0, zmm1, zmm0
    vpternlogd      zmm3, zmm0, zmm2, 254 // what the code? (imm8 254 = 0xFE: a three-input OR of zmm3, zmm0, zmm2)
    vmovdqu64       zmmword ptr [r14 + 4*rax], zmm3
    add     rax, 16
    cmp     rax, 16000
    jne     .LBB0_4

while gcc produces much more code bloat. I don't know why.
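
If you want to bypass the auto-vectorizer entirely, an explicit AVX-512 intrinsics version of the same combine could look roughly like this (a sketch, not the code behind the godbolt links above; it needs AVX512F, e.g. -mavx512f, and assumes n is a multiple of 16):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void combine16(std::uint32_t* dest, const unsigned char* src,
               std::size_t stride, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16) {
        // widen 16 bytes from each strided position to 16 uint32 lanes
        __m512i b0 = _mm512_cvtepu8_epi32(_mm_loadu_si128(
                         reinterpret_cast<const __m128i*>(src + i)));
        __m512i b1 = _mm512_cvtepu8_epi32(_mm_loadu_si128(
                         reinterpret_cast<const __m128i*>(src + i + stride)));
        __m512i b2 = _mm512_cvtepu8_epi32(_mm_loadu_si128(
                         reinterpret_cast<const __m128i*>(src + i + 2 * stride)));
        __m512i b3 = _mm512_cvtepu8_epi32(_mm_loadu_si128(
                         reinterpret_cast<const __m128i*>(src + i + 3 * stride)));
        // shift each lane into its byte position and OR everything together,
        // which is what the vpslld / vpord / vpternlogd sequence above does
        __m512i v  = _mm512_or_si512(
                         _mm512_or_si512(b0, _mm512_slli_epi32(b1, 8)),
                         _mm512_or_si512(_mm512_slli_epi32(b2, 16),
                                         _mm512_slli_epi32(b3, 24)));
        _mm512_storeu_si512(dest + i, v);
    }
}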

(Note: you also overrun the src vector, since the loop runs to its last element and then indexes src[i + 3 * stride].)


5 Comments

  • Or use bit shifts. I did consider that, but didn't test it; I just assumed it would be the least performant on the CPU.
  • Actually the compiler knows better which one to use, depending on which CPU pipeline is free. Maybe the CPU has one pipeline for shifting and another for multiplication, so it can combine both. Imo, vectorization should hide a lot of the memory access latency since it works in chunks, and the multiplication/shifting would be boosted by SSE/AVX/AVX512 scaling.
  • Nice answer! Can you tell me where you learned SIMD?
  • I'm still learning, e.g. optimizing constant division with magic shifts by looking at OpenCL assembly output. Then I imitate either the intrinsics version of the assembly output or the kernel itself in the host environment.
  • Writing a kernel with implicit loops & implicit arrays is easier to move to a system with different SIMD hardware, but it's certainly not as fast as intrinsics, since the compiler sometimes creates too much code bloat. Sometimes clang is faster, sometimes gcc is.
