
I have written a very basic int main program as shown below:

#include <stdio.h>
#include <string.h>   /* strcpy is declared here, not in stdio.h */
#include <windows.h>

int main(int argc, char** argv)
{
    char buffer[500];

    strcpy(buffer, argv[1]);
    printf("strcpy gives %s\n", buffer);

    return 0;    
}

I compiled it with optimization enabled in Visual Studio 2019, and used IDA Pro Free to disassemble the program:

sub     rsp, 238h
mov     rax, cs:__security_cookie
xor     rax, rsp
mov     [rsp+238h+var_18], rax
mov     rax, [rdx+8]              <===== Do not understand what is happening here
lea     rdx, [rsp+238h+var_218]   <===== IDA says this line adds padding for destination buffer
sub     rdx, rax                  <===== What is happening here ?
db      66h, 66h                  <===== what is the significance of this line and the one immediately following
nop     word ptr [rax+rax+00000000h]

loc_1400010A0:                      ; CODE XREF: main+3C↓j

movzx   ecx, byte ptr [rax]
mov     [rdx+rax], cl
lea     rax, [rax+1]
test    cl, cl
jnz     short loc_1400010A0

lea     rdx, [rsp+238h+var_218]
lea     rcx, Format              ; "strcpy gives %s\n"
call    sub_140001010            ; calling printf
xor     eax, eax
mov     rcx, [rsp+238h+var_18]
xor     rcx, rsp        ; StackCookie
call    __security_check_cookie
add     rsp, 238h
retn

I have annotated the instructions which I don't understand and was hoping someone might be able to explain their high level meaning. Please ignore the fact that I am using unsafe runtime functions such as strcpy.

  • @genpfault: does the SO colorizer have a tag for assembly? Commented Oct 7 at 19:06
  • I recommend making a study of assembly language via books or free online sources. Commented Oct 7 at 19:25
  • @chqrlie: highlight.js docs claims to (x86asm) but I'm not seeing SO do anything when I use that alias; maybe I'm holding it wrong. Commented Oct 7 at 19:26
  • @chqrlie: not last I looked a couple years ago; I think they don't use/enable the x86asm highlighter. Some non-asm colorizers happen to work ok for # comment syntaxes like GAS, others for ; comment syntaxes, at least in some cases. In answers where I'm mixing C and asm blocks, I sometimes just use a name like nasm or x86asm or asm-x86 to identify the syntax for potential future colorizers, at least after automated search/replace to whatever the correct identifier will be. Commented Oct 7 at 19:39
  • @blogger13: which compiler made this machine code? I assume not GCC or Clang, since inlining strcpy with a naive byte-at-a-time loop is usually not a good idea, especially when it can see (from the destination array size) that the copy could be up to 500 bytes. GCC used to inline rep scasb for strlen at -O1 but stopped doing that because it's very slow in bad cases. (Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?) Commented Oct 7 at 19:43

2 Answers


The surprising code generated by MSVC is just a very long NOP used to align the loop starting at loc_1400010A0. It only appears if you ask for /O2, where strcpy is inlined. The db 66h, 66h bytes are extra, redundant prefixes that are consumed by the instruction decoder and ignored. Generating one long, cumbersome instruction with no side effect is preferable to a run of single-byte nop instructions because it saves CPU cycles: the instruction decoder is pipelined and only a single NOP gets executed.

With lower optimisation settings, the code is straightforward, as can be verified on Godbolt's Compiler Explorer.

This is the output for /O1 (optimize for size):

_DATA   SEGMENT
COMM    `__local_stdio_printf_options'::`2'::_OptionsStorage:QWORD                                                    ; `__local_stdio_printf_options'::`2'::_OptionsStorage
_DATA   ENDS
`string' DB 'strcpy gives %s', 0aH, 00H ; `string'

buffer$ = 32
__$ArrayPad$ = 544
argc$ = 576
argv$ = 584
main    PROC                      ; COMDAT
$LN4:
        sub     rsp, 568          ; 00000238H      ; make room for `buffer` and extra space
        mov     rax, QWORD PTR __security_cookie   ; store magic value after the array
        xor     rax, rsp                           ; to try and detect
        mov     QWORD PTR __$ArrayPad$[rsp], rax   ; buffer overflow
        mov     rdx, QWORD PTR [rdx+8]             ; get `argv[1]` as arg2
        lea     rcx, QWORD PTR buffer$[rsp]        ; get address of `buffer` as arg1
        call    strcpy                             ; self explanatory
        lea     rdx, QWORD PTR buffer$[rsp]        ; get address of `buffer` as arg2
        lea     rcx, OFFSET FLAT:`string'          ; get address of format string as arg1
        call    printf                             ; ditto
        xor     eax, eax                           ; return value (0)
        mov     rcx, QWORD PTR __$ArrayPad$[rsp]   ; read back sentinel at end of buffer
        xor     rcx, rsp                           ; decipher it
        call    __security_check_cookie            ; check and abort on memory corruption
        add     rsp, 568           ; 00000238H     ; restore stack
        ret     0                                  ; return to caller (0 seems redundant)
main    ENDP

The __security_check_cookie stuff attempts to thwart buffer overflow attacks by detecting whether the memory just after the buffer has been corrupted. An attacker can provide a long, custom-crafted command line argument to overwrite the return address and get arbitrary code executed with the program's privileges, but since the __security_cookie is unpredictable, the chance that the value overwritten there still matches is vanishingly small: 1 in 2^64.
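To make the mechanism concrete, here is a hedged C sketch of the cookie idea. The names copy_checked and the constant cookie value are invented for illustration; MSVC's real implementation randomizes the cookie at process start and stores it directly in the stack frame, as the asm above shows:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for __security_cookie; the real one is randomized at startup. */
static const uintptr_t security_cookie = 0x2B992DDFA232ULL;

/* Hypothetical illustration of the GS-cookie idea: the sentinel sits just
   past the buffer (like MSVC's __$ArrayPad$ slot), so a linear overflow
   must trample it before it can reach the return address. */
int copy_checked(const char *src, char *out)
{
    struct {
        char buffer[16];
        uintptr_t sentinel;              /* stored right after the buffer */
    } frame;
    size_t n = strlen(src) + 1;

    if (n > sizeof frame)
        return -1;                       /* would overflow past the sentinel too */
    frame.sentinel = security_cookie ^ (uintptr_t)&frame;  /* like xor rax, rsp */
    memcpy((char *)&frame, src, n);      /* buffer is the first member; a long
                                            src overwrites the sentinel */
    if ((frame.sentinel ^ (uintptr_t)&frame) != security_cookie)
        return -1;                       /* MSVC would call __report_gsfailure */
    strcpy(out, frame.buffer);
    return 0;
}
```

A string that fits in buffer passes the check; a longer one overwrites the sentinel and is detected, except with the same vanishingly small probability of an accidental match described above.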

Note however that stack randomisation (mapping the stack at a random 64-bit address at startup) makes it just as difficult to overwrite the return address with something meaningful. Yet if another flaw allows leaking the value of the stack pointer, the security-cookie method explained above may still prove useful.


7 Comments

chqrlie thanks. Could you please explain the following two lines of the disassembly? lea rdx, [rsp+238h+var_218] and the following line, sub rdx, rax, as I still don't understand.
lea rdx, [rsp+238h+var_218] computes the address of buffer: rsp+238h is the end of the stack space used for local variables, and var_218 is the offset of buffer from the end of this local area, i.e. -218h (-536): MSVC seems to add a padding area after the local variables, part of which holds the buffer overflow sentinel.
chqrlie thanks. One last line please sub rdx, rax ?
@blogger13: sub rdx,rax computes in rdx the distance between the source and destination arrays so mov [rdx+rax],cl stores the byte read by movzx ecx,byte ptr [rax] into the destination array. This way only rax needs incrementing, which is performed by lea rax,[rax+1], which might be more efficient than inc rax in this specific loop. movzx ecx,byte ptr [rax] is probably chosen over mov cl,byte ptr [rax] for the same reason. Optimizing strcpy is a classic research subject for which there is no perfect solution.
lea rax,[rax+1] isn't more efficient than add rax, 1 on any x86-64 CPU I'm aware of. Except possibly in-order Atom (pre-Silvermont); I think a 64-bit capable version of that existed at one point. It runs LEA on its AGU, earlier in the pipe than the ALU, so results are ready earlier to forward to later stuff. Except that might mean LEA competes for the AGU with the load/store, so couldn't pair with either. On a few CPUs, inc rax is worse because of not updating CF, including possibly Intel's Silvermont-based E-cores where it's 1 uop front-end but needs 2 pipes in the back-end.
See stackoverflow.com/questions/36510095/… . Again, the choices are between inc rax and add rax,1. There's no good reason to use LEA, although in this case it's not longer. Ice Lake and later, and I think AMD, can run LEA on every integer ALU port, but Skylake and earlier can't so LEA is generally worse unless you're trying to make sure a uop can't compete with some other insn in the loop. (This specific loop doesn't bottleneck on back-end ALU throughput.)
movzx loads: yes, mov cl, [rax] has to merge into RCX, so it's a micro-fused load+ALU-merge uop. With a false dependency on the old RCX. movzx is a pure load.

MSVC addresses buffer relative to the other pointer, allowing it to do only one increment but not need as many registers as with [base1 + index] and [base2 + index], and with one of the addressing-modes being a simple [reg].


MSVC inlined strcpy, as @chqrlie said. It surprisingly uses a naive byte-at-a-time loop that will be really slow for large copies.

The top of the loop is aligned, presumably to a multiple-of-16 address, like align 16 would do in asm source. The padding bytes which create that alignment are one long NOP, which IDA strangely disassembles across two separate lines. It uses the 0F 1F /0 opcode with three 66h operand-size prefix bytes, each of which sets the operand-size to 16-bit. (This of course doesn't matter; it's a NOP anyway, we just want something the decoders can chew through quickly.) The db 66h, 66h line is the first two of those extra operand-size prefixes: they're redundant, so apparently IDA decides to show them on a separate line. Other disassemblers show extra prefixes inline in the disassembly of the NOP, like o16 o16 nop ...

The function args (int argc and char *argv[]) are in registers at the top of main. Since this is Windows x64, argc is in ECX and argv is in RDX.

sub     rsp, 238h             ; reserve outgoing shadow space + buffer[500]
  [...] then some stack-smashing protection, storing a stack cookie which it will check before RET

mov     rax, [rdx+8]             ; char *srcp = load argv[1]
lea     rdx, [rsp+238h+var_218]   <===== IDA says this line adds padding for destination buffer
                     ; RDX points at buffer[].  IDA's comment is wrong or confusing.
sub     rdx, rax     ; RDX = buffer - srcp

rsp+238h is where the frame pointer would point if you were using RBP as a frame pointer. var_218 is -218h (-536), the offset of char buffer[500] within our stack frame, putting buffer at rsp+20h.

After the sub, [rdx+rax] would address buffer[0], while [rax] addresses argv[1][0]. (Let's call that srcp[0].)

This is setting up for the loc_1400010A0: loop over 2 arrays (argv[1][] and buffer[]) which addresses one relative to the other. This is an obscure but useful asm trick which GCC and Clang don't seem to know about, unfortunately. MSVC has been using it for the past few years, in loops in general, not just this inlined strcpy.

The "obvious" methods to loop over arrays are:

  • Increment two pointers:

    • Needs two add instructions in the loop. (One per array)
      (Plus a separate cmp ptr, end_ptr / jne .keep_looping for counted loops.)
    • only two total registers needed, although you do need to keep the original base addresses somewhere else if you need them later, potentially in stack space. But in terms of values that need to be in registers during the loop, only the two pointers. (And an end-pointer or counter if it's a counted loop, unlike here looking for a sentinel nul byte.)
    • Has simple [reg] addressing modes for both arrays
  • Increment an index (up toward a max, or up toward 0 indexing from the ends of the arrays)

    • Needs only one add or inc instruction.
      (And for counted loops, a separate cmp/jcc like with pointer increments. Unless you're counting up towards 0 if indexing from the ends of the arrays, then you can use inc rcx / jnz .top_of_loop.)
    • Needs three registers: both bases and an index (unless one of the arrays is static, but only without largeaddressaware or in 32-bit mode, since the only RIP-relative addressing mode is [rip+rel32] with no other registers).
    • All addressing modes are [reg + reg], which costs 1 extra byte of code size, and can be slower on Intel with some instructions (Micro fusion and addressing modes). (Including stores on Intel from Haswell to Skylake, where the port-7 AGU can only handle 1-register addressing modes. For loads it matters with instructions other than pure loads (like mov and movzx): instructions that read their destination or have more than 2 operands un-laminate at alloc/rename if they micro-fused in the decoders, unlike with 1-register addressing modes where they can stay micro-fused, saving front-end bandwidth and space in the ROB (ReOrder Buffer). AMD can keep loads fused with ALU uops regardless of addressing mode, AFAIK.)
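For concreteness, the two "obvious" shapes can be written out in C for a counted byte copy. These are illustrative functions, not anything MSVC emits verbatim; the comments sketch the rough asm each one lowers to:

```c
#include <stddef.h>

/* 1. Two pointer increments: two ADDs per iteration, but both arrays get
   simple [reg] addressing modes. */
void copy_ptrs(char *dst, const char *src, size_t n)
{
    const char *end = src + n;
    while (src != end)          /* cmp rsi, rdx / jne .keep_looping */
        *dst++ = *src++;        /* two pointer increments per iteration */
}

/* 2. One shared index: a single increment, but [base + index] addressing
   modes for both arrays, and three live registers. */
void copy_index(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i != n; ++i)
        dst[i] = src[i];        /* movzx eax,[rsi+rcx] / mov [rdi+rcx],al */
}
```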

Indexing one array relative to the other is sort of half-way between and has some advantages of both:

  • Only two registers (plus an end-pointer or counter for counted loops, unlike this one.) One of the registers will point at the end of an array, so if you need the original array bases after the loop, you again need to save them elsewhere.
  • Only one add. (Plus a compare against an end-pointer for counted loops.)
    This is true even with more arrays, addressing all the others relative to the one you increment.
  • One of the addressing modes is simple [reg], the other is [reg+reg]. If you choose wisely, you can make the simple one the one you use twice, or the one used with an instruction that Intel would un-laminate. MSVC is not wise, last time I looked at a loop where it did this with AVX instructions. :/ Same here: it used the indexed addressing-mode with the store instead of the load, which is worse on Haswell/Skylake CPUs¹.

It can also require an extra instruction or two to set up the pointers, like to copy and subtract.

// C-like pseudo-code for MSVC's asm
// Don't write source like this; the pointer subtraction is UB in C, safe in asm for machines like x86 with a flat memory model.
 char *RAX = argv[1];
 char *RDX = buffer - RAX;
 do {
     char tmp_src = *RAX;            // MOVZX load
     RDX[(uintptr_t)RAX] = tmp_src;  // MOV byte store
     RAX++;      // LEA pointer increment, silly compiler could have used ADD or INC
 } while(tmp_src != '\0');

Note that storing unconditionally, before the loop branch, does exactly what we want for strcpy: it copies the terminating 0.
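A compilable version of that pseudo-code, with the pointer difference taken through uintptr_t, which is well defined on flat-memory-model targets like x86-64 though still not portable ISO C (strcpy_relative is a name made up for this sketch):

```c
#include <stdint.h>

/* Mirror of MSVC's inlined loop: address the destination relative to the
   source so only one register needs incrementing. */
char *strcpy_relative(char *dst, const char *src)
{
    uintptr_t diff = (uintptr_t)dst - (uintptr_t)src;   /* sub rdx, rax */
    char c;
    do {
        c = *src;                                /* movzx ecx, byte ptr [rax] */
        *(char *)((uintptr_t)src + diff) = c;    /* mov [rdx+rax], cl */
        ++src;                                   /* lea rax, [rax+1] */
    } while (c != '\0');                         /* test cl,cl / jnz */
    return dst;
}
```

Like the asm, the store of c happens before the loop condition is tested, so the terminating 0 gets copied too.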


Using a negative index relative to the ends of both arrays also costs extra setup, but allows inc / jnz for only one uop of loop overhead for counted loops, if two-register addressing modes don't cost extra. (Or compared to simple indexing where you just zero a counter ahead of the loop, to make asm that works the same way C src[i] and dst[i] does. That's often the worst choice, especially with Intel CPUs.)

// The other rarely-used trick: indexing from the ends of arrays.
// Just to illustrate what the asm looks like; don't write source like this, although in this case it's fully portable and safe, just weird style.
 char *end_dst = array1 + size;
 char *end_src = array2 + size;
 for(ssize_t i = -size; i!=0 ; ++i) {
    use end_dst[i] and end_src[i];
 }
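Filling in the loop body, a compilable version of this second trick might look like the following (copy_negidx is an illustrative name):

```c
#include <stddef.h>

/* Count a negative index up toward zero from the ends of both arrays:
   the loop overhead is a single inc/jnz, at the cost of [reg+reg]
   addressing modes for both accesses. */
void copy_negidx(char *dst, const char *src, size_t size)
{
    char *end_dst = dst + size;
    const char *end_src = src + size;
    for (ptrdiff_t i = -(ptrdiff_t)size; i != 0; ++i)
        end_dst[i] = end_src[i];     /* inc rcx / jnz .top_of_loop */
}
```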

MSVC vs other compilers

GCC and Clang wouldn't inline strcpy with a naive byte-at-a-time loop. That's usually not a good idea, especially when it can see (from the destination array size) that the copy could be up to 500 bytes. GCC used to inline repne scasb for strlen at -O1 but stopped doing that because it's very slow in bad cases, and not great even for small cases since rep microcode startup still has a cost. (Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?).

One byte at a time is a total joke. strcpy can easily go 32 or 64 bytes at a time with 32-byte AVX2 vectors (YMM registers), although handling startup in a safe way costs extra instructions to check that we're not near the end of a 4K page. (The source string could be short and the next page unmapped, so we must not read it.) It's a difficult tradeoff if we expect typical string lengths of only a few bytes and also want to be good with lengths like a few hundred bytes. See also Why does glibc's strlen need to be so complicated to run quickly? which is basically the same problem, except doesn't have to store the string data, which is inconvenient for an odd length.
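The page-crossing guard mentioned above amounts to a simple alignment check; here is a sketch assuming 4K pages (the function name is invented for illustration):

```c
#include <stdint.h>

/* A 32-byte (YMM-width) load starting at p cannot touch the next page
   if p's offset within its 4K page is at most 4096 - 32.  Vectorized
   strlen/strcpy startup code needs a check like this before reading
   bytes it hasn't yet verified belong to the string. */
int safe_for_32byte_load(const void *p)
{
    return ((uintptr_t)p & 4095) <= 4096 - 32;
}
```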

MSVC is using cs:__security_cookie for stack-smashing protection, copying it (or an XOR of it and RSP) between the local vars and the return address.
GCC and Clang on Linux at least use thread-local storage for the stack-cookie. MSVC is I guess using BSS storage.

IIRC, IDA just uses cs: to show a RIP-relative addressing mode. It's a weird notation; a CS segment-override prefix would be allowed there. (And RIP-relative addressing still implies the DS segment, not CS, although there's basically no difference in 64-bit mode: both are fixed at base=0, limit=unlimited, flat memory model, and unlike 32-bit mode, CS segment overrides don't cause stores to fault.)


Footnote 1: Slightly worse at least with hyperthreading. With a bottleneck of 1 store per clock on those CPUs, and 1 loop-branch per clock even on more recent CPUs, the store-address uops aren't going to cause a bottleneck competing for ports 2 and 3 with loads. (https://www.realworldtech.com/haswell-cpu/4/).
If the loop did 2 loads and a store, then it would be a problem on Haswell/Skylake that MSVC used the indexed addressing-mode for the store. (Ports 2 and 3 have load units and can run store-address uops. Port 7 runs only 1-register store-address uops. Ice Lake and later split this up so store-address uops only run on dedicated ports, not ones that also handle loads.)

