I need to profile an application which performs a lot of array copies, so I ended up profiling this very simple function:
typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}
I'm using Intel VTune to do the actual profiling, and from there I've seen that there are dramatic differences when compiling with gcc -O3 and "plain" gcc (4.4).
To understand the why and how, I looked at the assembly output of both compilations.
The unoptimized version is this one:
.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave
So I see that it first loads the src pointer from the stack into %eax, reads one byte through it (zero-extended into %edx by movzbl), loads the dst pointer, stores the byte through it, and then increments both pointers on the stack: simple enough.
Then I looked at the optimized version, and I couldn't make sense of it at all.
EDIT: here is the optimized assembly.
My question therefore is: what kinds of optimizations can gcc apply to this function?
memmove — unlike memcpy, it can handle overlapping regions. Since src and dst carry no restrict qualifier, the compiler must assume they may alias, so any replacement for the loop has to preserve its behaviour on overlapping buffers.