
I need to profile an application which performs a lot of array copies, so I ended up profiling this very simple function:

typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
        UChar *end = src + len;
        while (src < end)
                *dst++ = *src++;
}

I'm using Intel VTune to do the actual profiling, and from there I've seen that there are dramatic differences when compiling with gcc -O3 and "plain" gcc (4.4).

To understand the why and how, I got the assembly output of both compilations.

The unoptimized version is this one:

.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave

So I see that it loads a byte from *src (movzbl zero-extends it into edx), then stores it into *dst and increments both pointers: simple enough.

Then I saw the optimized version, and I couldn't make sense of it at all.

EDIT: here is the optimized assembly.

My question therefore is: what kinds of optimization can gcc do in this function?

  • Use memcpy - it should be way faster than your loop. Commented May 8, 2011 at 19:02
  • Please copy the optimized output to your question. Commented May 8, 2011 at 19:03
  • I cannot use memcpy, because I can have overlaps. Commented May 8, 2011 at 19:03
  • @akappa: Then use memmove. It can handle overlaps. Commented May 8, 2011 at 19:03
  • en.wikipedia.org/wiki/Restrict Commented May 8, 2011 at 19:03

5 Answers


That optimized code is quite a mess, but I can spot 3 loops (near L6, L13 and L12). I think gcc does what @GJ suggested (I upvoted him). The loop near L6 moves 4 bytes per iteration, while loop #2 moves only one byte and runs only sometimes after loop #1. I still can't figure out loop #3, since it looks identical to loop #2.


1 Comment

Yes, I think that's the only real optimization it can do here. Thanks!

Your unoptimized function moves byte by byte!

If you first calculate the length, then you can move 4 bytes at once and copy the remaining 1..3 bytes manually. If you can also ensure proper (4-byte) memory alignment, the copy function should be faster still. And there is no need to increment pointers on the stack; you can use registers. All these things will dramatically improve the speed of the function.

Or use a dedicated memory move function like memmove!
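A sketch of the 4-bytes-at-a-time idea (hypothetical copy_mem4, assuming the buffers do not overlap); the word accesses go through memcpy so they stay well-defined regardless of alignment or aliasing, and gcc expands such small fixed-size memcpy calls inline to single loads and stores:

```c
#include <stdint.h>
#include <string.h>

typedef unsigned char UChar;

/* Move 4 bytes per iteration, then the 0..3 leftover bytes one at a
   time.  Assumes src and dst do not overlap. */
void copy_mem4(const UChar *src, UChar *dst, unsigned int len) {
    while (len >= 4) {
        uint32_t tmp;
        memcpy(&tmp, src, 4);   /* expands to a single 32-bit load */
        memcpy(dst, &tmp, 4);   /* expands to a single 32-bit store */
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)               /* trailing 1..3 bytes */
        *dst++ = *src++;
}
```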

2 Comments

you can move 4 byte at once - how, in portable C? Dereferencing a uint32_t* pointer would have strict-aliasing and alignment UB. (This is a problem regardless of the underlying ISA: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?). The only portable way is a 4-byte memcpy(&tmp, src, sizeof(tmp)) that a good compiler will expand inline to a single mov load into a register. But then you might as well just call memmove for the whole thing.
In GNU C you can typedef long safe_long __attribute__((aligned(1),may_alias)); but your answer doesn't say that. Re: incrementing pointers on the stack: that's just from optimization being disabled. You could use the register keyword to keep variables in registers even at the default -O0, but seriously enable at least -O2 for production use.
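For reference, a sketch of the GNU C escape hatch mentioned in that comment, with hypothetical names (this is a GCC/Clang extension, not portable C):

```c
#include <stdint.h>

/* GNU C only: a 4-byte type with alignment 1 that is allowed to alias
   anything, so dereferencing it through byte pointers avoids both the
   strict-aliasing and the alignment UB of a plain uint32_t*. */
typedef uint32_t unaligned_u32 __attribute__((aligned(1), may_alias));

void copy_words(unsigned char *src, unsigned char *dst, unsigned int len) {
    while (len >= 4) {
        *(unaligned_u32 *)dst = *(const unaligned_u32 *)src;
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)
        *dst++ = *src++;
}
```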

Well, the kinds of optimization depend on the function and its properties. If the function were marked inline and were small enough, it would be turned into an unrolled loop of MOV instructions, which is faster than the REP-based variants (and it can avoid register spilling). For unknown sizes you get the REP MOVS family of instructions (starting with the largest word size to reduce the number of iterations for a constant size; otherwise it uses the size of the data unit you're copying).

If SSE is enabled, it would more than likely use either unrolled unaligned moves (MOVDQU) where the length permits, or looped unaligned moves if the length is great enough (I don't know whether it would use non-temporal prefetching; the gain from that depends on the block size). If the source/dest are aligned correctly, it'll try to use the faster aligned variants.

As it stands right now, the best you can get for that function is MOVSB when it's not inlined.



The fastest x86 instruction sequence gcc can generate here would be rep movsd, which copies 4 bytes at a time. The standard libc function memcpy in <string.h>, together with the special inlining gcc does for memcpy and many other functions in <string.h>, gives you the fastest possible result.
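For completeness, the loop in the question collapses to a one-line wrapper; this sketch uses memmove rather than memcpy because the question mentions possible overlaps, and gcc recognizes memmove as a builtin it can expand or specialize:

```c
#include <string.h>

typedef unsigned char UChar;

/* Overlap-safe replacement for the original byte-copy loop. */
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    memmove(dst, src, len);
}
```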

3 Comments

So it is just an application of loop unrolling (4 copies per iteration)?
Wouldn't movpd/movps be faster?
@akappa: rep movsd does loop unrolling, but it also contains the loop itself, i.e. it's one compound assembly instruction for copying a large data block. It should be a bit faster than a loop containing move, compare and jump instructions.

You would also benefit from using restrict here.
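A sketch of what that looks like (C99 restrict promises the compiler that the two buffers never overlap, which lets it vectorize or reorder the copy without emitting runtime overlap checks):

```c
typedef unsigned char UChar;

/* Same loop as the question, but restrict tells gcc that src and dst
   never alias, so -O3 is free to vectorize it aggressively. */
void copy_mem_r(const UChar *restrict src, UChar *restrict dst,
                unsigned int len) {
    const UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}
```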

