I don't know if there is UB in there, but if you can use an unsigned version, you can convert this part:
t[0] = src[i];
t[1] = src[i + stride];
t[2] = src[i + 2 * stride];
t[3] = src[i + 3 * stride];
to this:
dest[i] = src[i] +
          src[i+stride]   * (uint32_t) 256 +
          src[i+stride*2] * (uint32_t) 65536 +
          src[i+stride*3] * (uint32_t) 16777216;
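For reference, here is a small check (my own sketch, not from the question) that the multiply-by-powers-of-256 form gives the same 32-bit value as copying the four bytes into a uint32_t on a little-endian target, which is how I read the original t[] usage:

#include <cstdint>
#include <cstddef>
#include <cstring>
#include <cassert>

int main()
{
    const uint8_t src[] = {0x11, 0x22, 0x33, 0x44};
    const size_t stride = 1;                     // toy stride just for the check

    // Original form: gather the four bytes, then reinterpret as uint32_t.
    uint8_t t[4] = { src[0], src[0 + stride], src[0 + 2*stride], src[0 + 3*stride] };
    uint32_t a;
    std::memcpy(&a, t, 4);                       // little-endian: 0x44332211

    // Arithmetic form from above; src[i] lands in the lowest byte.
    uint32_t b = src[0]
               + src[0 + stride]     * (uint32_t)256
               + src[0 + 2 * stride] * (uint32_t)65536
               + src[0 + 3 * stride] * (uint32_t)16777216;

    assert(a == b);
    return 0;
}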
If you need a speedup, you can vectorize the operation:
// for AVX-512 (16 x 32-bit lanes)
vector1 = src[i] to src[i+15]
vector2 = src[i+stride] to src[i+stride+15]
vector3 = src[i+stride*2] to src[i+stride*2+15]
vector4 = src[i+stride*3] to src[i+stride*3+15]
then join them the same way but in vectorized form.
// either a 16-element vector extension
destVect[i] = vector1 + vector2*256 + ....
// or just run over 16 elements at a time, like tiled computing
for(j from 0 to 15)
    destVect[i+j] = ...
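If you do want to spell that out with intrinsics, one AVX-512 step could look like the sketch below. This is my own illustration (not the code behind the godbolt links) and it assumes src points at bytes and dest at uint32_t:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// One AVX-512 step: build 16 uint32_t outputs from 4 strided 16-byte loads.
// Requires AVX-512F; the caller guarantees 16 valid bytes at every offset.
inline void pack16_avx512(const uint8_t* src, uint32_t* dest,
                          size_t i, size_t stride)
{
    // Load 16 bytes from each strided row and zero-extend them to 16 x uint32_t.
    __m512i v1 = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i*)(src + i)));
    __m512i v2 = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i*)(src + i + stride)));
    __m512i v3 = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i*)(src + i + 2 * stride)));
    __m512i v4 = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i*)(src + i + 3 * stride)));

    // Shift each vector into its byte position and OR them together,
    // i.e. v1 + v2*256 + v3*65536 + v4*16777216 in vectorized form.
    __m512i packed = _mm512_or_si512(
        _mm512_or_si512(v1, _mm512_slli_epi32(v2, 8)),
        _mm512_or_si512(_mm512_slli_epi32(v3, 16), _mm512_slli_epi32(v4, 24)));

    _mm512_storeu_si512(dest + i, packed);
}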
Maybe you don't even need explicit intrinsics. Just try simple loops working on small arrays (tiles) of SIMD-width elements. In general, encapsulation adds code bloat, so you may need to do it on plain arrays on the stack.
Some compilers have a default minimum trip count below which they won't vectorize a loop, so test different tile widths or apply a compiler flag that lets them vectorize even small loops.
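A plain-loop sketch of that idea (my own code, not the linked library): the inner loop has a fixed trip count equal to the SIMD width, which is the shape auto-vectorizers like, and the tile width of 16 is just one value to test. Build with -O3 and an architecture flag such as -march=native so the compiler is allowed to emit AVX-512:

#include <cstdint>
#include <cstddef>

// Tiled version: simd = 16 matches one ZMM register of uint32_t on AVX-512.
// The caller guarantees src has at least count + 3*stride readable bytes.
void pack_tiled(const uint8_t* src, uint32_t* dest, size_t count, size_t stride)
{
    constexpr size_t simd = 16;
    size_t i = 0;
    for (; i + simd <= count; i += simd)
        for (size_t j = 0; j < simd; ++j)        // fixed-width inner loop, vectorizable
            dest[i + j] = src[i + j]
                        + src[i + j + stride]     * (uint32_t)256
                        + src[i + j + stride * 2] * (uint32_t)65536
                        + src[i + j + stride * 3] * (uint32_t)16777216;

    for (; i < count; ++i)                       // scalar tail for the leftover elements
        dest[i] = src[i]
                + src[i + stride]     * (uint32_t)256
                + src[i + stride * 2] * (uint32_t)65536
                + src[i + stride * 3] * (uint32_t)16777216;
}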
Here is a sample solution from a toy auto-vectorized SIMD library: https://godbolt.org/z/qMWsbsrG8
Output:
1000 operations took 5518 nanoseconds
This is with all the stack-array allocation and kernel-launch overhead included.
For 10000 operations (https://godbolt.org/z/Mz1K75Kj1), it takes 2.4 nanoseconds per operation.
Here are 1000 operations using only 16 work-items (a single ZMM register on AVX-512): https://godbolt.org/z/r9GTfffG8
simd*1000 operations took 20551 nanoseconds
That is roughly 1.25 nanoseconds per operation (at least on the godbolt.org server). On an FX8150 with a narrower SIMD width, it comes to ~6.9 nanoseconds per operation. If you write a version without the encapsulation, it should produce less code bloat and be faster.
Lastly, use multiple iterations for benchmarking: https://godbolt.org/z/dn9vj9seP
simd*1000 operations took 12367 nanoseconds
simd*1000 operations took 12420 nanoseconds
simd*1000 operations took 12118 nanoseconds
simd*1000 operations took 2753 nanoseconds
simd*1000 operations took 2694 nanoseconds
simd*1000 operations took 2691 nanoseconds
simd*1000 operations took 2839 nanoseconds
simd*1000 operations took 2698 nanoseconds
simd*1000 operations took 2702 nanoseconds
simd*1000 operations took 2711 nanoseconds
simd*1000 operations took 2718 nanoseconds
simd*1000 operations took 2710 nanoseconds
That is about 0.17 nanoseconds per operation. 23 GB/s is not bad for the RAM of a shared server plus a simple loop. Explicitly using AVX intrinsics with no encapsulation should get you the maximum bandwidth of the L1/L2/L3 caches or RAM (depending on dataset size). But beware: if you do this on a real production server shared with other tenants, your neighbors will feel the turbo underclock during the AVX-512-accelerated computations (unless integer work doesn't count as heavy load for the AVX-512 pipelines).
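If you want to reproduce this kind of repeated measurement without the toy library from the links above, a bare std::chrono harness is enough. This is my own sketch; the element count, stride and repeat count are arbitrary choices, not the linked code:

#include <chrono>
#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const size_t count = 16000, stride = 16000; // arbitrary sizes for the test
    std::vector<uint8_t>  src(count + 3 * stride, 1);
    std::vector<uint32_t> dest(count);

    // First repetitions are warm-up (page faults, caches, frequency ramp);
    // the later, stable numbers are the ones to quote.
    for (int rep = 0; rep < 12; ++rep)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < count; ++i)
            dest[i] = src[i]
                    + src[i + stride]     * (uint32_t)256
                    + src[i + stride * 2] * (uint32_t)65536
                    + src[i + stride * 3] * (uint32_t)16777216;
        auto t1 = std::chrono::high_resolution_clock::now();
        std::printf("%lld nanoseconds\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }

    return (int)(dest[count - 1] & 1);          // use the result so the loop isn't optimized away
}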
Compilers will behave differently. For example, clang produces this:
.LBB0_4:
vpslld zmm1, zmm1, 8 // evil integer bit level hacking?
vpslld zmm2, zmm2, 16
vpslld zmm3, zmm3, 24
vpord zmm0, zmm1, zmm0
vpternlogd zmm3, zmm0, zmm2, 254 // what the code?
vmovdqu64 zmmword ptr [r14 + 4*rax], zmm3
add rax, 16
cmp rax, 16000
jne .LBB0_4
while gcc produces much more code bloat. I don't know why.
(You also have an out-of-bounds read on the src vector: iterating all the way to the last element and then adding stride reads past the end.)
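A hedged sketch of the fix (the names n and stride are my assumptions): only produce outputs whose furthest input, i + 3*stride, is still inside src.

#include <cstddef>

// Largest number of outputs for which src[i + 3*stride] stays in bounds,
// given n readable source elements; loop with i < safe_count(n, stride).
inline size_t safe_count(size_t n, size_t stride)
{
    return (n > 3 * stride) ? n - 3 * stride : 0;
}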
std::byte (and char) are exempt from strict aliasing, so your first example is fine. I would enable optimizations (-O3) before comparing overhead, though.
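A tiny illustration of that exemption (my own example, not from the question): reading an object's bytes through unsigned char, char or std::byte pointers is allowed, while doing the same through, say, a uint16_t pointer would violate strict aliasing.

#include <cstdint>

// char, unsigned char and std::byte pointers may alias any object,
// so inspecting v byte by byte this way is well-defined.
inline unsigned low_byte(const uint32_t& v)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
    return p[0];                     // a reinterpret_cast to uint16_t* here would be UB
}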