So there is this original question that I assume most C++ developers are familiar with: Why is processing a sorted array faster than processing an unsorted array?

Answer: branch prediction

Then I tried to recreate the original question and discovered that it is no longer relevant (or at least the specific code there isn't), because compilers can now do it without branching, as you can see in the answer I received here: Why branch (miss)prediction doesn't impact performance (C++)?
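For context, the hot loop from that original question looks roughly like this (a sketch from memory, not the exact code; the function name is mine, and data holds values in the 0..255 range):

#include <vector>

// Classic sum-if loop: with random data the branch mispredicts about half
// the time, while with sorted data it becomes almost perfectly predictable.
// Modern compilers typically if-convert this to cmov or vectorize it,
// which is why the effect no longer reproduces as-is.
long long sum_large(const std::vector<int>& data)
{
    long long sum = 0;
    for (int x : data)
        if (x >= 128)
            sum += x;
    return sum;
}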

So I thought to myself: if the compiler has a specific branchless sequence for compare + count, then to recreate the original problem I'll change the code to compare and multiply, as you can see here:

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of size 1 million
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int sum_above_5 = 1;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            sum_above_5*=vec[i];
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (before sorting): " << sum_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)

    start = std::chrono::high_resolution_clock::now();
    sum_above_5 = 1;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            sum_above_5*= vec[i];
        }
    }
    end = std::chrono::high_resolution_clock::now();


    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (after sorting): " << sum_above_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}

Finally I got the result I wanted: when I run this code, the sorted version runs 5 times faster, as expected.

I was happy for a minute, until I looked at the disassembly on Godbolt and saw that the multiplication is done by this piece of code:

.L121:
    mov     eax, DWORD PTR [r13+0]   ; load 32-bit value from memory
    mov     edx, ebx                 ; copy ebx → edx
    imul    edx, eax                 ; edx = ebx * eax  (signed 32-bit multiply)
    cmp     eax, 5                   ; compare current value with 5
    cmovl   ebx, edx                 ; if (eax < 5)  ebx = edx
    add     r13, 4                   ; advance pointer to next int
    cmp     rbp, r13                 ; reached end?
    jne     .L121                    ; loop while (r13 != rbp)

There are no jumps after the cmp eax, 5. Or, as ChatGPT stated:

Unlike a jmp, this instruction doesn’t cause branching. It’s dataflow-based, not controlflow-based.

That means:

No pipeline flush or misprediction penalty.

The CPU always executes the instruction, but the writeback is gated by the flags.
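In C++ terms, the emitted loop behaves roughly like this branchless rewrite (a sketch; the function name is mine, and the ternary is what a compiler typically turns into cmovl):

#include <vector>

// Branchless equivalent of the asm above: the multiply always executes
// (imul), and the comparison only selects which value is kept (cmovl).
int product_below_5(const std::vector<int>& vec)
{
    int prod = 1;
    for (int v : vec)
    {
        int candidate = prod * v;           // imul edx, eax: always runs
        prod = (v < 5) ? candidate : prod;  // cmovl ebx, edx: data-flow select
    }
    return prod;
}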

So now I'm totally confused: why is the sorted version faster?

  • It's not inherently faster if the compiler made that asm for both; you must be seeing some kind of warmup effect, maybe CPU frequency. Or else the two loops have some difference, perhaps in code alignment or something, although I don't expect even the Skylake JCC erratum could create a 5x perf difference for that loop. What CPU model, and what compiler version/options? Also sum_above_5 should be called sum_below_5 or prod_below_5 or something. Commented Oct 13 at 19:15 (see the warm-up sketch after these comments)
  • I can't reproduce your result with GCC 15.2 or Clang 20.1 on x86-64 GNU/Linux on an i7-6700K, energy_performance_preference=performance (so it jumps to max turbo basically right away). I tried compiling with -O3, -O3 -march=native, and -O2, with both compilers. The program's internal timing was within a few percent for both runs in every case. Commented Oct 13 at 19:20
  • (And BTW, with -O3 or -O3 -march=skylake, GCC and Clang auto-vectorize, using e.g. AVX2 vpmulld to do 32-bit multiply. Or more shuffling and SSE2 pmuludq widening 32->64-bit multiply if we don't allow it to use AVX2 or SSE4.) Commented Oct 13 at 19:29
  • Also, ChatGPT's description is misleading in one detail: cmov always writes its output; it's just an ALU select operation. Unlike with ARM predicated instructions (which can fully NOP an instruction, even a load or store with a bad address), cmov doesn't require any support from the rest of the pipeline. It's just an instruction that reads 2 integer regs and FLAGS, and produces 1 output, exactly like adc or sbb except for how the ALU calculates the result. Commented Oct 13 at 19:37
  • Minor detail, but please adapt the comments and variable names to what the code really does. sum_above_5 is a product of all values below 5, and the "vector of size 1 million" has 100 million entries. Use more generic names when the exact detail doesn't matter, like processed_value, or "Allocate a vector with a lot of values". Comments and descriptive names are good, but not when they are wrong and misleading. Commented Oct 14 at 11:10
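To test the warm-up / CPU-frequency hypothesis from the first comment, one option is to time several passes over the same unsorted data and compare them (a sketch; the helper name and the fixed seed are mine, not from the question):

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical helper: times one pass of the compare-and-multiply loop.
static long long time_pass(const std::vector<int>& vec)
{
    auto start = std::chrono::steady_clock::now();
    int prod = 1;
    for (int v : vec)
        if (v < 5)
            prod *= v;
    auto end = std::chrono::steady_clock::now();
    std::printf("prod=%d ", prod); // keep the result alive so the loop isn't optimized out
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

int main()
{
    std::mt19937 gen(42); // fixed seed for repeatable runs
    std::uniform_int_distribution<> dis(0, 10);
    std::vector<int> vec(100'000'000);
    for (auto& val : vec) val = dis(gen);

    // Several passes over the SAME unsorted data: if only the first pass
    // is slow, the 5x gap was warm-up / frequency ramp-up, not sorting.
    for (int pass = 0; pass < 5; ++pass)
        std::printf("pass %d: %lld ns\n", pass, time_pass(vec));
}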
