So there is this original question that I assume most C++ developers are familiar with: Why is processing a sorted array faster than processing an unsorted array?
Answer: branch prediction
Then I tried to recreate the original question and discovered that it is no longer relevant (or at least the specific code there isn't), since compilers can now do it without branching, as you can see in the answer I received here: Why branch (miss)prediction doesn't impact performance (C++)?
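For reference, this is the kind of compare + count loop I mean (a sketch along the lines of the original question's code, not an exact copy):

#include <vector>

// Compare + count: at -O2/-O3, compilers typically emit branchless code
// for this (a conditional move or SIMD compares), so branch prediction
// never comes into play and sorting makes no difference.
long long count_above_5(const std::vector<int>& vec)
{
    long long count = 0;
    for (int x : vec)
    {
        if (x > 5)
            count++;
    }
    return count;
}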
So I thought to myself: if the compiler has a special case for compare + count, then to recreate the original problem I'll change the code to compare and multiply, as you can see here:
#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of size 1 million
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);
    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int sum_above_5 = 1;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            sum_above_5 *= vec[i];
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Count of numbers above 5 (before sorting): " << sum_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)
    start = std::chrono::high_resolution_clock::now();
    sum_above_5 = 1;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            sum_above_5 *= vec[i];
        }
    }
    end = std::chrono::high_resolution_clock::now();
    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Count of numbers above 5 (after sorting): " << sum_above_5 << std::endl;
    std::cout << "Time taken (after sorting): " << duration_after_sort << " ns" << std::endl;

    return 0;
}
Finally I got the result I wanted: when I run this code, the sorted version runs 5 times faster, as expected.
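For reference, a plausible way to build and run this (assuming GCC and that the source is saved as main.cpp; the comments at the end note that the choice of flags matters, since -O3 can auto-vectorize the loop):

g++ -O2 -std=c++17 main.cpp -o bench
./bench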
I was happy for a minute, until I looked at the disassembly on Godbolt and saw that the multiplication is done by this piece of code:
.L121:
        mov     eax, DWORD PTR [r13+0]   ; load 32-bit value from memory
        mov     edx, ebx                 ; copy ebx → edx
        imul    edx, eax                 ; edx = ebx * eax (signed 32-bit multiply)
        cmp     eax, 5                   ; compare current value with 5
        cmovl   ebx, edx                 ; if (eax < 5) ebx = edx
        add     r13, 4                   ; advance pointer to next int
        cmp     rbp, r13                 ; reached end?
        jne     .L121                    ; loop while (r13 != rbp)
There are no jumps after the cmp eax, 5. Or, as ChatGPT stated:
Unlike a jmp, this instruction doesn't cause branching. It's data-flow-based, not control-flow-based. That means:

- No pipeline flush or misprediction penalty.
- The CPU always executes the instruction, but the writeback is gated by the flags.
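In C++ terms, the loop the compiler generated behaves like this branchless select (a sketch of what the assembly does, not literal compiler output):

#include <vector>

// The multiply happens unconditionally every iteration (imul);
// the flags from the compare then gate a register select (cmovl)
// instead of a conditional jump.
int prod_below_5(const std::vector<int>& vec)
{
    int prod = 1;
    for (int x : vec)
    {
        int candidate = prod * x;            // imul edx, eax
        prod = (x < 5) ? candidate : prod;   // cmp eax, 5 / cmovl ebx, edx
    }
    return prod;
}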
So now I'm totally confused: why is the sorted version faster?
Comments on the question:

- sum_above_5 should be called sum_below_5 or prod_below_5 or something.
- Tested with -O3, -O3 -march=native, and -O2, with both compilers. The program's internal timing was within a few percent for both runs in every case. (With -O3 or -O3 -march=skylake, GCC and Clang auto-vectorize, using e.g. AVX2 vpmulld to do 32-bit multiply. Or more shuffling and SSE2 pmuludq widening 32->64-bit multiply if we don't allow it to use AVX2 or SSE4.)
- cmov always writes its output: it's just an ALU select operation. Unlike with ARM predicated instructions (which can fully NOP an instruction, even a load or store with a bad address), cmov doesn't require any support from the rest of the pipeline. It's just an instruction that reads 2 integer regs and FLAGS, and produces 1 output, exactly like adc or sbb except in how the ALU calculates the result.
- sum_above_5 is a product of all values below 5, and the "vector of size 1 million" has 100 million entries. For example, use more common names when the exact detail doesn't matter, like processed_value, or "Allocate a vector with a lot of values". Comments and descriptive names are good, but not when they are wrong and misleading.
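To see the auto-vectorization mentioned in the comments, one way (assuming GCC on x86-64 and that the source is saved as main.cpp) is to dump the assembly and look for the SIMD multiply:

g++ -O3 -march=skylake -S main.cpp   ; main.s should contain vpmulld (AVX2)
g++ -O3 -S main.cpp                  ; baseline SSE2: look for pmuludq instead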