GCC generates different code depending on array index value

Question

This code (arm):

void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008646B] ^= 1;
        sys.Delay_ms(14);
    }
}

...is compiled to folowing asm-code:

08000470:   ldr r4, [pc, #20]       ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472:   ldr r6, [pc, #24]       ; (0x800048c <blinkRed()+28>)
08000474:   movs r5, #14
08000476:   ldr r3, [r4, #0]
08000478:   eor.w r3, r3, #1
0800047c:   str r3, [r4, #0]
0800047e:   mov r0, r6
08000480:   mov r1, r5
08000482:   bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486:   b.n 0x8000476 <blinkRed()+6>

It is ok.

But, if I just change array index (-0x400)....

void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008606B] ^= 1;
        sys.Delay_ms(14);
    }
}

...I've got not so optimized code:

08000470:   ldr r4, [pc, #24]       ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472:   ldr r6, [pc, #28]       ; (0x8000490 <blinkRed()+32>)
08000474:   movs r5, #14
08000476:   ldr.w r3, [r4, #428]    ; 0x1ac
0800047a:   eor.w r3, r3, #1
0800047e:   str.w r3, [r4, #428]    ; 0x1ac
08000482:   mov r0, r6
08000484:   mov r1, r5
08000486:   bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a:   b.n 0x8000476 <blinkRed()+6>

The difference is that in the first case r4 is loaded with target address immediately (0x422191ac) and then access to memory is performed with 2-byte instructions, but in the second case r4 is loaded with some intermediate address (0x42218000) and then access to memory is performed with 4-bytes instruction with offset (+0x1ac) to target address (0x422181ac).

Why compiler does so?

I use: arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL

bb is:

__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];

In .ld it is defined as: in MEMORY section:

BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000

in SECTIONS section:

.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
    KEEP(*(.bitband))
} > BITBAND

Ummm... not sure if this is relevant, but... would the optimized version work as well? Maybe there are some preconditions for the faster load. This information would indicate whether it is the optimizer's problem, or architecture related. — luk32
– luk32, Commented May 10, 2015 at 12:59
Both versions work. I cannot imagine any preconditions that are available in one of case and not available in other. I changed index step by step, and find out that second version is compiled if array index is in the range from 0x00086020 to 0x000863FF. If array index is out of this range then the first (optimized) version is compiled. — Woodoo
– Woodoo, Commented May 10, 2015 at 13:32
As mentioned -mcpu=cortex-m3. It is STM32F100C6. bb is array of 32-bit words with fixed base address of 0x42000000. No alignment issues should exist. — Woodoo
– Woodoo, Commented May 10, 2015 at 14:05
you have not posted enough, the latter (both) look like an optimization, I dont really see there being a performance difference from what you have shown. for us to know why the compiler did something you need to show all the relevant code, I suspect you have not based on what the compiler output. — old_timer
– old_timer, Commented May 10, 2015 at 20:44

FPK · Accepted Answer · 2015-05-31 09:32:00Z

I would consider it an artefact/missing optimization opportunity of -O1.

It can be understood in more detail if we look at the code generated with -O- to load bb[...]:

First case:

movw    r2, #:lower16:bb
movt    r2, #:upper16:bb
movw    r3, #37292
movt    r3, 33
adds    r3, r2, r3
ldr r3, [r3, #0]

Second case:

movw    r3, #:lower16:bb
movt    r3, #:upper16:bb
add r3, r3, #2195456       ; 0x218000    = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]

The code in the second case is better and it can be done this way because the constant can be added with two add instructions (which is not the case if the index is 0x0008646B).

-O1 does only optimizations which are not time consuming. So apparently it merges early the add and the ldr so it misses later the opportunity to load the whole address with one pc relative ldr.

Compile with -O2 (or -fgcse) and the code looks like expected.

Collectives™ on Stack Overflow

GCC generates different code depending on array index value

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related