
I need to profile an application which performs a lot of array copies, so I ended up profiling this very simple function:

typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
        UChar *end = src + len;
        while (src < end)
                *dst++ = *src++;
}

I'm using Intel VTune to do the actual profiling, and from there I've seen that there are dramatic differences when compiling with gcc -O3 and "plain" gcc (4.4).

To understand the why and how, I got the assembly output of both compilations.

The unoptimized version is this one:

.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave

So I see that it loads a byte from *src (movzbl zero-extends it into edx), then stores it into *dst and increments both pointers: simple enough.

Then I saw the optimized version, and I couldn't make sense of it at all.

EDIT: here is the optimized assembly.

My question therefore is: what kinds of optimization can gcc do in this function?

  • Use memcpy - it should be way faster than your loop. Commented May 8, 2011 at 19:02
  • Please copy the optimized output to your question. Commented May 8, 2011 at 19:03
  • I cannot use memcpy, because I can have overlaps. Commented May 8, 2011 at 19:03
  • @akappa: Then use memmove. It can handle overlaps. Commented May 8, 2011 at 19:03
  • en.wikipedia.org/wiki/Restrict Commented May 8, 2011 at 19:03

5 Answers


That optimized code is quite a mess, but I can spot 3 loops (near L6, L13 and L12). I think gcc does what @GJ suggested (I upvoted him). The loop near L6 moves 4 bytes per iteration, while loop #2 moves only one byte and runs only sometimes after loop #1. I still can't figure out loop #3, since it looks identical to loop #2.


1 Comment

Yes, I think that's the only real optimization it can do here. Thanks!

Your unoptimized function moves byte by byte!

If you first calculate the length, then you can move 4 bytes at once and copy the remaining 1..3 bytes manually. If you can also ensure proper (4-byte) memory alignment, the copy function should be faster still. And there is no need to increment pointers on the stack; you can use registers. All these things will dramatically improve the speed of the function.

Or use a dedicated memory move function like memmove!
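A sketch of the 4-bytes-at-a-time idea (hypothetical copy_mem4, assuming the buffers do not overlap); the word accesses go through memcpy so they stay well-defined regardless of alignment or aliasing, and gcc expands such small fixed-size memcpy calls inline to single loads and stores:

```c
#include <stdint.h>
#include <string.h>

typedef unsigned char UChar;

/* Move 4 bytes per iteration, then the 0..3 leftover bytes one at a
   time.  Assumes src and dst do not overlap. */
void copy_mem4(const UChar *src, UChar *dst, unsigned int len) {
    while (len >= 4) {
        uint32_t tmp;
        memcpy(&tmp, src, 4);   /* expands to a single 32-bit load */
        memcpy(dst, &tmp, 4);   /* expands to a single 32-bit store */
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)               /* trailing 1..3 bytes */
        *dst++ = *src++;
}
```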

2 Comments

you can move 4 byte at once - how, in portable C? Dereferencing a uint32_t* pointer would have strict-aliasing and alignment UB. (This is a problem regardless of the underlying ISA: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?). The only portable way is a 4-byte memcpy(&tmp, src, sizeof(tmp)) that a good compiler will expand inline to a single mov load into a register. But then you might as well just call memmove for the whole thing.
In GNU C you can typedef long safe_long __attribute__((aligned(1),may_alias)); but your answer doesn't say that. Re: incrementing pointers on the stack: that's just from optimization being disabled. You could use the register keyword to keep variables in registers even at the default -O0, but seriously enable at least -O2 for production use.
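For reference, a sketch of the GNU C escape hatch mentioned in that comment, with hypothetical names (this is a GCC/Clang extension, not portable C):

```c
#include <stdint.h>

/* GNU C only: a 4-byte type with alignment 1 that is allowed to alias
   anything, so dereferencing it through byte pointers avoids both the
   strict-aliasing and the alignment UB of a plain uint32_t*. */
typedef uint32_t unaligned_u32 __attribute__((aligned(1), may_alias));

void copy_words(unsigned char *src, unsigned char *dst, unsigned int len) {
    while (len >= 4) {
        *(unaligned_u32 *)dst = *(const unaligned_u32 *)src;
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)
        *dst++ = *src++;
}
```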

Well, the kinds of optimization depend on the function and its properties. If the function were marked inline and were small enough, it would be turned into an unrolled loop of MOV instructions, which is faster than the REP-based variants (and it can avoid register spilling). For unknown sizes you get the REP MOVS family of instructions (starting with the largest word size to reduce the number of iterations for a constant size; otherwise it uses the size of the data unit you're copying).

If SSE is enabled, it would more than likely use either unrolled unaligned moves (MOVDQU) where the length permits, or looped unaligned moves if the length is great enough (I don't know whether it would use non-temporal prefetching; the gain from that depends on the block size). If the source/dest are aligned correctly, it'll try to use the faster aligned variants.

As it stands right now, the best you can get for that function is MOVSB when it's not inlined.



The fastest x86 instruction sequence gcc can generate here would be rep movsd, which copies 4 bytes at a time. The standard libc function memcpy in <string.h>, together with the special inlining gcc does for memcpy and many other functions in <string.h>, gives you the fastest possible result.
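For completeness, the loop in the question collapses to a one-line wrapper; this sketch uses memmove rather than memcpy because the question mentions possible overlaps, and gcc recognizes memmove as a builtin it can expand or specialize:

```c
#include <string.h>

typedef unsigned char UChar;

/* Overlap-safe replacement for the original byte-copy loop. */
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    memmove(dst, src, len);
}
```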

3 Comments

So it is just an application of loop unrolling (4 copies per iteration)?
Wouldn't movpd/movps be faster?
@akappa: rep movsd does loop unrolling, but it also contains the loop itself, i.e. it's one compound assembly instruction for copying a large data block. It should be a bit faster than a loop containing move, compare and jump instructions.

You would also benefit from using restrict here.
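A sketch of what that looks like (C99 restrict promises the compiler that the two buffers never overlap, which lets it vectorize or reorder the copy without emitting runtime overlap checks):

```c
typedef unsigned char UChar;

/* Same loop as the question, but restrict tells gcc that src and dst
   never alias, so -O3 is free to vectorize it aggressively. */
void copy_mem_r(const UChar *restrict src, UChar *restrict dst,
                unsigned int len) {
    const UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}
```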

