
I've been studying cache locality recently and I'm trying to understand how CPUs access memory. I wrote an experiment to see whether there is a performance difference between looping over an array sequentially and using a lookup table to index into the data array. I was surprised to find the lookup method slightly faster. My code is below; I compiled with GCC on Windows (MinGW).

#include <stdlib.h>
#include <stdio.h>
#include <windows.h>

int main()
{
    DWORD dwElapsed, dwStartTime;

    //random arrangement of keys to lookup
    int lookup_arr[] = {0, 3, 8, 7, 2, 1, 4, 5, 6, 9};

    //data for both loops
    int data_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int data_arr2[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    //first loop, sequential access
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr1[i]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Normal loop completed: %d\n", dwElapsed);

    //second loop, indexes into data_arr2 using the lookup array
    dwStartTime = GetTickCount();
    for (int n = 0; n < 9000000; n++) {
        for (int i = 0; i < 10; i++)
            data_arr2[lookup_arr[i]]++;
    }
    dwElapsed = GetTickCount() - dwStartTime;
    printf("Lookup loop completed: %d\n", dwElapsed);

    return 0;
}

Running this, I get:

Normal loop completed: 375
Lookup loop completed: 297
  • How many times did you run this program? Commented Dec 10, 2013 at 13:54
  • Did you check the assembler output? And what optimization flags did you use? Commented Dec 10, 2013 at 13:54
  • You need to span a much larger range of memory (512k or greater) if you want to see cache effects - and you will want to turn off optimization to see the "raw, underlying" performance. I would recommend you have a really big random array for the lookup and try again. Commented Dec 10, 2013 at 13:58
  • What happens if you run your "lookup" loop first? What if you run each many times and average their performance? Many other things may be happening at the same time that affect the timing of a short loop. A measurement is meaningless without an estimate of the error... Commented Dec 10, 2013 at 14:00
  • Pretty sure that on your second loop when i == 9 you will be accessing an invalid position... Commented Dec 10, 2013 at 14:00

2 Answers


Following up on my earlier comments, here is how to do this kind of measurement:

  1. Repeated measurements
  2. Estimate error
  3. Large memory block
  4. Randomized vs linear indices (so either way you have an indirection)

The result is a significant difference in speed with the "randomized indexing".

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>

#define N 1000000

int main(void) {
  int *rArr;
  int *rInd; // randomized indices
  int *lInd; // linear indices
  int ii;

  rArr = malloc(N * sizeof(int));
  rInd = malloc(N * sizeof(int));
  lInd = malloc(N * sizeof(int));

  for(ii = 0; ii < N; ii++) {
    lInd[ii] = ii;
    rArr[ii] = rand();
    rInd[ii] = rand() % N;
  }

  int loopCount;
  int sum = 0;
  clock_t startT, stopT; // clock() returns clock_t, not time_t
  double dt, totalT = 0, tt2 = 0;

  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[lInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;   // time for this iteration
    totalT += dt;          // accumulate sum of times...
    tt2 += dt * dt;        // ...and sum of squared times, for the error estimate
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf +- %lf\n", totalT / (double)(CLOCKS_PER_SEC),
         sqrt((tt2 - totalT * totalT / 100.0) / 100.0) / (double)(CLOCKS_PER_SEC));

  totalT = 0; tt2 = 0;
  startT = clock();
  for(loopCount = 0; loopCount < 100; loopCount++) {
    for(ii = 0; ii < N; ii++) {
      sum += rArr[rInd[ii]];
    }
    stopT = clock();
    dt = stopT - startT;
    totalT += dt;
    tt2 += dt * dt;
    startT = stopT;
  }
  printf("sum is %d\n", sum);
  printf("total time: %lf +- %lf\n", totalT / (double)(CLOCKS_PER_SEC),
         sqrt((tt2 - totalT * totalT / 100.0) / 100.0) / (double)(CLOCKS_PER_SEC));

  free(rArr);
  free(rInd);
  free(lInd);
  return 0;
}

Result - the sequential access is > 2x faster (on my machine):

sum is -1444272372
total time: 0.396539 +- 0.000219
sum is 546230204
total time: 0.756407 +- 0.001165

With -O3 optimization, the difference is even starker - a full 3x faster:

sum is -318372465
total time: 0.142444 +- 0.013230
sum is 1672130111
total time: 0.455804 +- 0.000402

1 Comment

In case anyone intends to use this method in future for doing performance measurements, note that the "error" I compute is the standard deviation on a SINGLE measurement - so strictly speaking the number I should have reported is "time per loop" which is 100 times smaller (and then the error is right). 3.96 +- 0.22 ms per loop vs 7.56 +- 1.2 ms would be more correct than what my (somewhat hastily written) code spits out.

I believe you are compiling without optimizations turned on. With -O2, g++ optimizes everything away so the run time is 0; without the flag I get results similar to yours.

After modifying the program so that values in data_arr1 and data_arr2 are actually used for something I get 78ms for both.

