Understanding optimized assembly code generated by gcc

Question

I'm trying to understand what kind of optimizations gcc performs when the -O3 flag is set. I'm quite confused about the purpose of these two lines:

xor %esi, %esi
lea 0x0(%esi), %esi

The lea seems redundant to me. What is the point of using the lea instruction here?

Are you looking at disassembly output of an unlinked .o file, or raw -S output? If the latter, I don't know. If the former, that's probably where the offset goes after the linker computes it.
– torek, Commented Sep 30, 2013 at 2:41
Ah, in that case, perhaps it's because the linker calculated the offset at link time, and it turned out the offset was 0. I'd have to look at the source and/or intermediate -S output to be sure.
– torek, Commented Sep 30, 2013 at 2:45
@torek The assembly output from -S has not been passed to a linker. @REALFREE Is this followed by a loop? My guess is that it is just a multi-byte no-op for alignment purposes.
– ughoavgfhw, Commented Sep 30, 2013 at 3:18

ughoavgfhw · Accepted Answer · 2013-09-30 04:36:42Z

That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor fetches instruction bytes into the decoder in fixed-size chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents preceding instructions that will not be used from being loaded, maximizes the number of upcoming instructions that will be, and, possibly most importantly, ensures that the first instruction lies entirely in the first chunk, so it does not take two loads to execute it.
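The "two loads" point can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a 16-byte fetch chunk (the real size depends on the processor); the function name `chunks_touched` and the addresses are made up for illustration.

```python
def chunks_touched(address: int, length: int, chunk_size: int = 16) -> int:
    """Count how many fetch chunks an instruction of `length` bytes spans.

    `chunk_size` is an assumed fetch width, not a value taken from any
    particular processor.
    """
    if length < 1:
        raise ValueError("an instruction is at least one byte long")
    first_chunk = address // chunk_size
    last_chunk = (address + length - 1) // chunk_size
    return last_chunk - first_chunk + 1


if __name__ == "__main__":
    # A 5-byte instruction starting 2 bytes before a 16-byte boundary
    # straddles two chunks; the same instruction on the boundary fits in one.
    print(chunks_touched(0x40053E, 5))
    print(chunks_touched(0x400540, 5))
```

Under these assumptions, the unaligned loop head needs two chunks before its first instruction can be decoded, while the aligned one needs only one.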

The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this, no-ops are better.

The x86 architecture contains an instruction specifically for the purpose of doing nothing, nop. However, it is only one byte long, so it would take more than one of them to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to insert a single longer instruction that has no side effects. Therefore, the compiler inserted the lea instruction you see. It has no effect at all, and the compiler chose it because it has exactly the length required. In fact, recent processors have standard multi-byte no-op instructions, which are likely to be recognized during decode and never even executed.
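How many filler bytes are "exactly the length required" follows from the current address and the target alignment. Here is a minimal sketch, assuming a 16-byte alignment target (what gcc actually uses depends on the target and tuning options); `padding_needed` and the example addresses are hypothetical.

```python
def padding_needed(address: int, alignment: int = 16) -> int:
    """Return the number of filler bytes that move `address` up to the
    next multiple of `alignment` (0 if it is already aligned)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return -address % alignment


if __name__ == "__main__":
    # Hypothetical addresses at which a loop head would otherwise start.
    for addr in (0x40053A, 0x40053D, 0x400540):
        print(hex(addr), padding_needed(addr))
```

A gap of 3 or 6 bytes, for instance, could be filled by a single lea of that length rather than by several one-byte nops.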

Community · Accepted Answer · 2017-05-23 11:57:03Z

As ughoavgfhw explained, this is padding for better code alignment. You can find this lea in the list at the following link:

http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html

quote:

  1-byte: XCHG EAX, EAX
  2-byte: 66 NOP
  3-byte: LEA REG, 0 (REG) (8-bit displacement)
  4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
  5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
**6-byte: LEA REG, 0 (REG) (32-bit displacement)**
  7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
  8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
  9-byte: NOP WORD  PTR [EAX + EAX*1 + 0] (32-bit displacement)
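To see why the same LEA REG, 0 (REG) appears at both 3 and 6 bytes, it helps to build the 32-bit encoding by hand: opcode 0x8D, a ModRM byte, then the displacement. The sketch below is an illustration only; the helper name `lea_self_encoding` is made up, and it deliberately rejects ESP, whose encoding needs an extra SIB byte.

```python
# Register numbers used in the ModRM byte (32-bit mode).
REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "ebp": 5, "esi": 6, "edi": 7}


def lea_self_encoding(reg: str, disp_bytes: int) -> bytes:
    """Encode `lea 0x0(%reg), %reg` with an 8-bit or 32-bit zero displacement."""
    if reg not in REGS:
        raise ValueError("unsupported register (esp would need a SIB byte)")
    if disp_bytes == 1:
        mod = 0b01          # [reg + disp8]
    elif disp_bytes == 4:
        mod = 0b10          # [reg + disp32]
    else:
        raise ValueError("displacement must be 1 or 4 bytes")
    n = REGS[reg]
    modrm = (mod << 6) | (n << 3) | n   # destination and base are the same register
    return bytes([0x8D, modrm]) + bytes(disp_bytes)


if __name__ == "__main__":
    print(lea_self_encoding("esi", 1).hex(" "))   # 3-byte form
    print(lea_self_encoding("esi", 4).hex(" "))   # 6-byte form
```

The only difference between the two forms is the width of the zero displacement, which is what lets the compiler pick whichever length fills the gap.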

Also see this SO question, which describes it in more detail: What does NOPL do in x86 system?

Note that the xor itself is not a nop (it changes the value of the register), but it is also very cheap to perform since it is a zero idiom; see: What is the purpose of XORing a register with itself?
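The difference between the two instructions in the question can be modelled in a few lines. This is only a toy model of a 32-bit register value, with made-up function names, and it says nothing about timing:

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, like a 32-bit register


def xor_self(value: int) -> int:
    """Model `xor %esi, %esi`: the register always ends up as zero."""
    return (value ^ value) & MASK32


def lea_zero_disp(value: int) -> int:
    """Model `lea 0x0(%esi), %esi`: the register gets its own value plus 0."""
    return (value + 0) & MASK32


if __name__ == "__main__":
    before = 0xDEADBEEF
    print(hex(xor_self(before)))       # register is cleared
    print(hex(lea_zero_disp(before)))  # register is unchanged
```

So the xor does real work (it zeroes the register), while the lea leaves the register exactly as it was and only occupies space.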
