
I've been studying cache locality recently and I'm trying to understand how CPUs access memory. I wrote an experiment to see whether there is a performance difference between looping over an array sequentially and using a lookup table to index into the data array. I was surprised to find the lookup method slightly faster. My code is below; I compiled with GCC on Windows (MinGW).

#include <stdlib.h>
#include <stdio.h>
#include <windows.h>

int main()
{
    DWORD dwElapsed, dwStartTime;

    //random arrangement of keys to lookup
    int lookup_arr[] = {0, 3, 8, 7, 2, 1, 4, 5, 6, 9};

    //data for both loops
    int data_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int data_arr2[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    //first loop, sequential access
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr1[i]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Normal loop completed: %d\n", dwElapsed);

    //second loop, indexes into data_arr2 using the lookup array
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr2[lookup_arr[i]]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Lookup loop completed: %d\n", dwElapsed);

    return 0;
}

Running this, I get:

Normal loop completed: 375
Lookup loop completed: 297
  • How many times did you run this program? Commented Dec 10, 2013 at 13:54
  • Did you check the assembler output? And what optimization flags did you use? Commented Dec 10, 2013 at 13:54
  • You need to span a much larger range of memory (512k or greater) if you want to see cache effects - and you will want to turn off optimization to see the "raw, underlying" performance. I would recommend you have a really big random array for the lookup and try again. Commented Dec 10, 2013 at 13:58
  • What happens if you run your "lookup" loop first? What if you run each many times and average their performance? Many other things may be happening at the same time that affect the timing of a short loop. A measurement is meaningless without an estimate of the error... Commented Dec 10, 2013 at 14:00
  • Pretty sure that on your second loop when i == 9 you will be accessing an invalid position... Commented Dec 10, 2013 at 14:00

2 Answers


Following up on my earlier comments, here is how to do this kind of measurement:

  1. Repeated measurements
  2. Estimate error
  3. Large memory block
  4. Randomized vs linear indices (so either way you have an indirection)

The result is a significant difference in speed with the "randomized indexing".

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>

#define N 1000000

int main(void) {
  int *rArr;
  int *rInd; // randomized indices
  int *lInd; // linear indices
  int ii;

  rArr = malloc(N * sizeof(int));
  rInd = malloc(N * sizeof(int));
  lInd = malloc(N * sizeof(int));

  for(ii = 0; ii < N; ii++) {
    lInd[ii] = ii;
    rArr[ii] = rand();
    rInd[ii] = rand() % N;
  }

  int loopCount;
  int sum = 0;
  clock_t startT, stopT; // clock() returns clock_t, not time_t
  double dt, totalT = 0, tt2 = 0;

  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[lInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;   // time for this iteration
    totalT += dt;          // accumulate sum of times...
    tt2 += dt * dt;        // ...and sum of squared times, for the error estimate
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf +- %lf\n", totalT / (double)(CLOCKS_PER_SEC),
         sqrt((tt2 - totalT * totalT / 100.0) / 100.0) / (double)(CLOCKS_PER_SEC));

  totalT = 0; tt2 = 0;
  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[rInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;
    totalT += dt;
    tt2 += dt * dt;
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf +- %lf\n", totalT / (double)(CLOCKS_PER_SEC),
         sqrt((tt2 - totalT * totalT / 100.0) / 100.0) / (double)(CLOCKS_PER_SEC));

  free(rArr);
  free(rInd);
  free(lInd);
  return 0;
}

Result - the sequential access is > 2x faster (on my machine):

sum is -1444272372
total time: 0.396539 +- 0.000219
sum is 546230204
total time: 0.756407 +- 0.001165

With -O3 optimization, the difference is even starker - a full 3x faster:

sum is -318372465
total time: 0.142444 +- 0.013230
sum is 1672130111
total time: 0.455804 +- 0.000402

1 Comment

In case anyone intends to use this method in future for doing performance measurements, note that the "error" I compute is the standard deviation on a SINGLE measurement - so strictly speaking the number I should have reported is "time per loop" which is 100 times smaller (and then the error is right). 3.96 +- 0.22 ms per loop vs 7.56 +- 1.2 ms would be more correct than what my (somewhat hastily written) code spits out.

I believe you are compiling without optimizations turned on. With -O2, g++ optimizes everything away so the run time is 0; without the flag I get results similar to yours.

After modifying the program so that values in data_arr1 and data_arr2 are actually used for something I get 78ms for both.

