Optimizing array loop in C

Question

I have looked online and in my books, but I can't seem to get this. I was asked to optimize a small part of a program: specifically, to take an array and add up its contents within a small amount of time, using vi and gcc, without using the built-in optimizer. I have tried loop unrolling and a couple of other optimizations meant for products. Can you please help?

int length = ARRAY_SIZE;
int limit = length-4;
for (j=0; j < limit; j+=5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for(; j < length; j++){
    sum += array[j];    
}

The array values are non-constant ints and all values have been initialized.

How are you measuring the performance of the various alternatives? Have you investigated whether using pointers instead of array subscripting improves things? Have you investigated whether this code is actually the bottleneck in the program?
– Jonathan Leffler, Commented May 24, 2011 at 6:02
Unrolling by 5 is bad. Try 4 or 8 instead. Compile with -msse4.
– Anycorn, Commented May 24, 2011 at 6:04
@newb @missingno Here is a related question: stackoverflow.com/questions/5952636/…
– Peter G., Commented May 24, 2011 at 6:33
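
On the measurement question raised in the comments, one rough way to compare variants is to time many repetitions with the standard clock() function. This is only a sketch: the names sum_fn, sum_naive and time_sum are hypothetical, and clock() reports CPU time with limited resolution, so the repeat count needs to be large enough for the difference to show.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical function-pointer type shared by every summing variant. */
typedef int (*sum_fn)(const int *array, size_t n);

/* Plain reference loop, used as the baseline. */
int sum_naive(const int *array, size_t n)
{
    int sum = 0;
    size_t j;
    for (j = 0; j < n; j++)
        sum += array[j];
    return sum;
}

/* Calls fn `repeats` times, stores the last result in *result,
   and returns the elapsed CPU time in seconds. */
double time_sum(sum_fn fn, const int *array, size_t n, int repeats, int *result)
{
    volatile int sink = 0;   /* volatile so the calls cannot be discarded */
    clock_t start, end;
    int r;

    start = clock();
    for (r = 0; r < repeats; r++)
        sink = fn(array, n);
    end = clock();

    *result = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum once per variant with the same array and repeat count gives numbers that can be compared directly; checking *result against the baseline also catches variants that are fast but wrong.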

Drew Hoskins · Accepted Answer · 2011-05-24 06:30:27Z

11

Create sub-sums which then add up to a sum.

Here's a basic version of what it might look like

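int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (j=0; j < length-3; j+=4) {   /* stride is 4, so j+3 must stay in range */
    sum1 += array[j];
    sum2 += array[j+1];
    sum3 += array[j+2];
    sum4 += array[j+3];
}
sum = sum1 + sum2 + sum3 + sum4;
for (; j < length; j++) {          /* leftover elements */
    sum += array[j];
}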

This avoids some read-after-write dependencies - that is, the computation of sum2 in each loop iteration need not wait on the results of sum1 to execute, and the processor can schedule both lines in the loop simultaneously.
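
A minimal, self-contained sketch of the same idea, with the accumulators initialized and a cleanup loop for lengths that are not a multiple of four. The function name sum_four_accumulators and the size_t length parameter are assumptions made for illustration; they are not part of the original code.

```c
#include <stddef.h>

/* Hypothetical helper: sums n ints using four independent accumulators. */
int sum_four_accumulators(const int *array, size_t n)
{
    int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    int sum;
    size_t j;

    /* Main loop: each accumulator depends only on its own previous value,
       so the four additions do not have to wait on each other. */
    for (j = 0; j + 3 < n; j += 4) {
        sum1 += array[j];
        sum2 += array[j + 1];
        sum3 += array[j + 2];
        sum4 += array[j + 3];
    }

    sum = sum1 + sum2 + sum3 + sum4;

    /* Cleanup loop: at most three leftover elements. */
    for (; j < n; j++)
        sum += array[j];

    return sum;
}
```

Writing the loop condition as j + 3 < n rather than j < n - 3 keeps it safe for unsigned lengths shorter than four.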

edited May 24, 2011 at 6:30

answered May 24, 2011 at 6:19

Drew Hoskins

1 Comment

Paul R Over a year ago

+1: loop unrolling often doesn't help on modern CPUs, but breaking dependencies in simple loops like this can be beneficial

Anycorn · Accepted Answer · 2011-05-24 06:15:09Z

4

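Use the SSE/MMX instruction set:

#include <emmintrin.h>  /* SSE2 intrinsics */

__m128i vsum = _mm_setzero_si128();  /* four 32-bit partial sums, all zero */
for (j=0; j < limit; j+=4) {
    /* _mm_add_epi32 takes two vectors, so load four ints from array+j first */
    vsum = _mm_add_epi32(vsum, _mm_loadu_si128((const __m128i *)(array+j)));
}
/* the four lanes of vsum still need to be added together for the final sum */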

answered May 24, 2011 at 6:15

Anycorn

3 Comments

Drew Hoskins Over a year ago

I would assume, since this is homework, that he's supposed to just use C rather than find crazy processor instructions.

Dietrich Epp Over a year ago

Autovectorization will do this anyway, and it will correctly handle unaligned arrays.

Christian Rau Over a year ago

You shouldn't assume auto-vectorization, I think.

Mike Dunlavey · Accepted Answer · 2011-05-24 16:36:43Z

As it is, the loop is already unrolled by 5.

Since you're disabling the optimizer, all that indexing is going to cost you.

The first loop could be replaced by:

int* p = array;
for (j = 0; j < ARRAY_SIZE - 4; j += 5, p += 5){
  sum += p[0] + p[1] + p[2] + p[3] + p[4];
}

so it's not doing any variable indexing (multiplying j by sizeof(int) and adding it to the address); p[0] through p[4] are fixed offsets from p.
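
Taking that a step further, the counter j can be dropped entirely by comparing the pointer against an end pointer. This is a sketch under assumptions: the function name sum_pointer_walk and the size_t length parameter are made up for illustration, and whether it actually beats the indexed loop has to be measured.

```c
#include <stddef.h>

/* Hypothetical helper: sums n ints by walking a pointer, unrolled by 5. */
int sum_pointer_walk(const int *array, size_t n)
{
    const int *p = array;
    const int *end = array + n;                    /* one past the last element */
    const int *unrolled_end = array + (n - n % 5); /* last full group of five */
    int sum = 0;

    /* Unrolled part: only constant offsets from p, no index arithmetic on j. */
    for (; p < unrolled_end; p += 5)
        sum += p[0] + p[1] + p[2] + p[3] + p[4];

    /* Remainder: at most four elements. */
    for (; p < end; p++)
        sum += *p;

    return sum;
}
```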

Added: Of course, since ARRAY_SIZE is presumably a known constant, this is probably the fastest code, but you might need to write a code generator (or clever macro) to make it:

sum += array[0];
sum += array[1];
...
sum += array[ARRAY_SIZE - 1];

An example of such a macro is, if ARRAY_SIZE is a power of 2, like 64, you could have:

#define FOO64(i) FOO32(i); FOO32((i)+32)
#define FOO32(i) FOO16(i); FOO16((i)+16)
#define FOO16(i) FOO8(i); FOO8((i)+8)
#define FOO8(i) FOO4(i); FOO4((i)+4)
#define FOO4(i) FOO2(i); FOO2((i)+2)
#define FOO2(i) FOO1(i); FOO1((i)+1)
#define FOO1(i) sum += array[i]

FOO64(0);

You could apply the same idea to powers of other bases, like 10.
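
For instance, a base-10 variant for an array of exactly 100 elements might look like the sketch below. The macro names ADD1, ADD10 and ADD100 and the wrapper sum_100 are hypothetical, and the array size of 100 is an assumption.

```c
/* Each level expands to ten copies of the level below it. */
#define ADD1(i)   sum += array[i]
#define ADD10(i)  ADD1(i); ADD1((i)+1); ADD1((i)+2); ADD1((i)+3); ADD1((i)+4); \
                  ADD1((i)+5); ADD1((i)+6); ADD1((i)+7); ADD1((i)+8); ADD1((i)+9)
#define ADD100(i) ADD10(i); ADD10((i)+10); ADD10((i)+20); ADD10((i)+30); ADD10((i)+40); \
                  ADD10((i)+50); ADD10((i)+60); ADD10((i)+70); ADD10((i)+80); ADD10((i)+90)

/* Hypothetical wrapper: sums exactly 100 ints with no loop at all. */
int sum_100(const int *array)
{
    int sum = 0;
    ADD100(0);   /* expands to 100 separate "sum += array[k]" statements */
    return sum;
}
```

Because these macros expand to several statements, they should only be used where a statement list is allowed, not as the unbraced body of an if or a loop.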

paxdiablo · Accepted Answer · 2011-05-24 06:39:22Z

I'm not sure why you can't use the optimiser since, in my experience, it will usually produce faster code than the vast majority of "wanna-be" manual optimisers :-) In addition, you should make sure that this code is actually a problem area - there's no point optimising code that's already close to maximum speed, nor should you concern yourself with something that accounts for 0.01% of the time taken when there may be code elsewhere responsible for 20%.

Optimisation should be heavily targeted otherwise it's wasted effort.

Any solution other than the naive "just add the numbers together" will most likely have to use special features in the target CPU.

Provided you're willing to take a small hit on each update to the array (and this may not be an option given your "all values have been initialized" comment), you can get the sum in very quick time. Use a "class" to maintain the array and the sum side-by-side. Pseudo-code like:

def initArray (sz):
    allocate data as sz+1 integers
    foreach i 0 thru sz:
        set data[i] to 0

def killArray(data):
    free data

def getArray (data,indx):
    return data[indx+1]

def setArray (data,indx,val):
    data[0] = data[0] - data[indx+1] + val
    data[indx+1] = val

def sumArray(data):
    return data[0]

should do the trick.

The following complete C program shows a very rough first-cut which you can use as a basis for a more robust solution:

#include <stdio.h>
#include <stdlib.h>

static int *initArray (int sz) {
    int i;
    int *ret = malloc (sizeof (int) * (sz + 1));
    for (i = 0; i <= sz; i++)
        ret[i] = 0;
    return ret;
}

static void killArray(int *data) {
    free (data);
}

static int getArray (int *data, int indx) {
    return data[indx+1];
}

static void setArray (int *data, int indx, int val) {
    data[0] = data[0] - data[indx+1] + val;
    data[indx+1] = val;
}

static int sumArray (int *data) {
    return data[0];
}

int main (void) {
    int i;
    int *mydata = initArray (10);
    if (mydata != NULL) {
        setArray (mydata, 5, 27);
        setArray (mydata, 9, -7);
        setArray (mydata, 7, 42);
        for (i = 0; i < 10; i++)
            printf ("Element %d is %3d\n", i, getArray (mydata, i));
        printf ("Sum is %3d\n", sumArray (mydata));
    }
    killArray (mydata);
    return 0;
}

The output of this is:

Element 0 is   0
Element 1 is   0
Element 2 is   0
Element 3 is   0
Element 4 is   0
Element 5 is  27
Element 6 is   0
Element 7 is  42
Element 8 is   0
Element 9 is  -7
Sum is  62

As I said, this may not be an option but, if you can swing it, you'll be hard-pressed finding a faster way to get the sum than a single array index extraction.

And, as long as you're implementing a class to do this, you may as well use the first two elements for housekeeping, one for the current sum and one for the maximum index, so that you can avoid out-of-bounds errors by checking indx against the maximum.
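
A rough sketch of that two-slot layout, under assumptions: the names tracked_init, tracked_set, tracked_get and tracked_sum are hypothetical, the second slot stores the element count rather than the maximum index, and out-of-range accesses are reported with a return value of -1.

```c
#include <stdlib.h>

/* Layout: slot 0 holds the running sum, slot 1 the element count,
   and the user's elements start at slot 2. */
enum { SLOT_SUM = 0, SLOT_SIZE = 1, SLOT_DATA = 2 };

/* Allocates a zeroed array of sz elements plus the two housekeeping slots. */
int *tracked_init(int sz)
{
    int *data;
    if (sz < 0)
        return NULL;
    data = calloc((size_t)sz + SLOT_DATA, sizeof *data);
    if (data != NULL)
        data[SLOT_SIZE] = sz;
    return data;
}

/* Stores val at indx and adjusts the running sum; returns -1 if out of range. */
int tracked_set(int *data, int indx, int val)
{
    if (indx < 0 || indx >= data[SLOT_SIZE])
        return -1;
    data[SLOT_SUM] += val - data[SLOT_DATA + indx];
    data[SLOT_DATA + indx] = val;
    return 0;
}

/* Copies the element at indx into *out; returns -1 if out of range. */
int tracked_get(const int *data, int indx, int *out)
{
    if (indx < 0 || indx >= data[SLOT_SIZE])
        return -1;
    *out = data[SLOT_DATA + indx];
    return 0;
}

/* The sum is always available as a single array read. */
int tracked_sum(const int *data)
{
    return data[SLOT_SUM];
}
```

The array is released with a plain free(), since the housekeeping slots live in the same allocation as the data.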

It looks to me like the homework is meant to demonstrate what optimizers do, by having the student do it.

Thomas Matthews · Accepted Answer · 2011-05-24 19:13:57Z

You may gain more performance by prefetching data inside the unrolled loop.
I'll build on Drew's answer:

register int value1, value2, value3, value4;
for (j=0; j < limit; j+=4)
{
    // Prefetch the data
    value1 = array[j];
    value2 = array[j + 1];
    value3 = array[j + 2];
    value4 = array[j + 3];

    // Use the prefetched data
    sum1 += value1;
    sum2 += value2;
    sum3 += value3;
    sum4 += value4;
}
sum = sum1 + sum2 + sum3 + sum4;

The idea here is to have the processor load contiguous data into its cache, then operate on the cached data. In order for this to be effective, the compiler must not optimize away the prefetching; this could be done by declaring the temporary variables as volatile. I don't know if volatile can be combined with register.

Search the web for "Data driven design".

Koistinen · Accepted Answer · 2011-05-24 20:17:58Z

0

Since five seems to be the number of additions to do at a time in the sample, I do it here as well. Normally you do it with a power of 2, as Drew Hoskins suggested. By getting the modulo right at the beginning and stepping in the other direction, fewer values might be needed. Computing in a different order is something that is often profitable in scientific computing, not just for indexing. To see whether an optimization helps and by how much, testing is essential.

int sum1, sum2, sum3, sum4;

for(j = ARRAY_SIZE; j%5; j--){
    sum += array[j-1];
}
sum1 = sum2 = sum3 = sum4 = 0;
for (; j; j-=5) {
    sum += array[j-1];
    sum1 += array[j-2];
    sum2 += array[j-3];
    sum3 += array[j-4];
    sum4 += array[j-5];
}
sum += sum1+sum2+sum3+sum4;

edited May 24, 2011 at 20:17

answered May 24, 2011 at 13:37

Koistinen

Lindydancer · Accepted Answer · 2011-05-24 20:35:53Z

0

One solution would be to maintain a sum at all times. You would, of course, have to update it every time you change a value in the array, but if that doesn't happen often it might be worth the trouble.

edited May 24, 2011 at 20:35

answered May 24, 2011 at 6:06

Lindydancer
