MSVC addresses buffer relative to the other pointer, so it only needs one pointer increment per iteration, without needing as many registers as [base1 + index] and [base2 + index] would, and with one of the addressing modes being a simple [reg].
MSVC inlined strcpy, as @chqrlie said. It surprisingly uses a naive byte-at-a-time loop that will be really slow for large copies.
The top of the loop is aligned, presumably to a multiple-of-16 address, like align 16 would do in asm source. The padding bytes that create that alignment are one long NOP, which IDA strangely disassembles as two separate lines. It uses the 0F 1F /0 opcode with three 66h operand-size prefix bytes, each of which sets the operand-size to 16-bit. (That of course doesn't matter; it's a NOP anyway, we just want something the decoders can chew through quickly.) The db 66h, 66h line is the first two of those extra operand-size prefixes: they're redundant, so apparently IDA decides to show them on a separate line. Other disassemblers show extra prefixes as part of the NOP's disassembly, like o16 o16 nop ...
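For reference, here's one plausible byte sequence for that kind of padding NOP, written out as a C byte array so the pieces are labeled. The exact length and the ModRM/SIB/disp32 choice are assumptions for illustration; only the three 66h prefixes and the 0F 1F /0 opcode are known from the disassembly.

static const unsigned char long_nop[11] = {
    0x66, 0x66, 0x66,        // operand-size prefixes; IDA splits two of them off as "db 66h, 66h"
    0x0F, 0x1F,              // the long-NOP opcode, 0F 1F /0
    0x84, 0x00,              // ModRM + SIB: /0 with [rax + rax*1] and a 32-bit displacement
    0x00, 0x00, 0x00, 0x00   // disp32 = 0
};
// A disassembler that folds the redundant prefixes in shows this whole thing as
// one instruction, something like  o16 o16 nop word [rax+rax+0]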
The function args (int argc and char *argv[]) are in registers at the top of main. Since this is Windows x64, argc is in ECX and argv is in RDX.
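In C terms that's just the usual signature, annotated with the Windows x64 register mapping (the first four integer/pointer args go in RCX, RDX, R8, R9):

int main(int argc,        // arrives in ECX (the low 32 bits of RCX)
         char *argv[]);   // arrives in RDX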
sub rsp, 238h ; reserve outgoing shadow space + buffer[500]
[...] then some stack-smashing protection, storing a stack cookie which it will check before RET
mov rax, [rdx+8] ; char *srcp = load argv[1]
lea rdx, [rsp+238h+var_218] <===== IDA says this line adds padding for destination buffer
; RDX points at buffer[]. IDA's comment is wrong or confusing.
sub rdx, rax ; RDX = buffer - srcp
rsp+238h is where the frame pointer would point if you were using RBP as a frame pointer. var_218 is presumably -218h, i.e. -536: the offset of char buffer[500] within our stack frame. (238h - 218h = 20h = 32 bytes, so buffer[] starts right above the outgoing shadow space.)
After the sub, [rdx+rax] would address buffer[0], while [rax] addresses argv[1][0]. (Let's call that srcp[0].)
This is setting up for the loc_1400010A0: loop over 2 arrays (argv[1][] and buffer[]) which addresses one relative to the other. This is an obscure but useful asm trick which GCC and Clang don't seem to know about, unfortunately. MSVC has been using it for the past few years, in loops in general, not just this inlined strcpy.
The "obvious" methods to loop over arrays are:
Indexing one array relative to the other is sort of half-way between and has some advantages of both:
- Only two registers (plus an end-pointer or counter for counted loops, unlike this one.) One of the registers will point at the end of an array, so if you need the original array bases after the loop, you again need to save them elsewhere.
- Only one add. (Plus a compare against an end-pointer for counted loops.) This is true even with more arrays: address all the others relative to the one you increment.
- One of the addressing modes is a simple [reg], the other is [reg+reg]. If you choose wisely, you can make the simple one the one you use twice, or the one used with an instruction that Intel would un-laminate. MSVC was not wise the last time I looked at a loop where it did this with AVX instructions. :/ Same here: it used the indexed addressing mode for the store instead of the load, which is worse on Haswell/Skylake CPUs (see footnote 1).
The downside is that it can require an extra instruction or two to set up the pointers, e.g. a copy and a subtract.
// C-like pseudo-code for MSVC's asm.
// Don't write source like this; the pointer subtraction is UB in ISO C
// (the pointers don't point into the same object), but it's safe in asm
// for machines like x86 with a flat memory model.
char *RAX = argv[1];                // srcp
ptrdiff_t RDX = buffer - RAX;       // constant difference between the two arrays
char tmp_src;
do {
    tmp_src = *RAX;                 // MOVZX byte load
    RAX[RDX] = tmp_src;             // MOV byte store to buffer, via the [rdx+rax] addressing mode
    RAX++;                          // LEA pointer increment; silly compiler, ADD or INC would do
} while (tmp_src != '\0');
Note that storing unconditionally, before the loop branch, does exactly what we want for strcpy: it copies the terminating 0.
Using a negative index relative to the ends of both arrays also costs extra setup, but allows inc / jnz for only one uop of loop overhead for counted loops, if two-register addressing modes don't cost extra. (Or compared to simple indexing where you just zero a counter ahead of the loop, to make asm that works the same way C src[i] and dst[i] does. That's often the worst choice, especially with Intel CPUs.)
// The other rarely-used trick: indexing backward from the ends of both arrays.
// Just to illustrate what the asm looks like; don't write source like this,
// although in this case it's fully portable and safe, just weird style.
char *end_dst = array1 + size;
char *end_src = array2 + size;
for (ptrdiff_t i = -(ptrdiff_t)size; i != 0; ++i) {
    // ... use end_dst[i] and end_src[i] ...
}
MSVC vs other compilers
GCC and Clang wouldn't inline strcpy with a naive byte-at-a-time loop. That's usually not a good idea, especially when the compiler can see (from the destination array size) that the copy could be up to 500 bytes. GCC used to inline repne scasb for strlen at -O1, but stopped doing that because it's very slow in bad cases, and not great even for small cases since rep microcode startup still has a cost. (Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?).
One byte at a time is a total joke. strcpy can easily go 32 or 64 bytes at a time with 32-byte AVX2 vectors (YMM registers), although handling startup in a safe way costs extra instructions to check that we're not near the end of a 4K page. (The source string could be short and the next page unmapped, so we must not read it.) It's a difficult tradeoff if we expect typical string lengths of only a few bytes but also want to be good with lengths like a few hundred bytes. See also Why does glibc's strlen need to be so complicated to run quickly?, which is basically the same problem, except that strlen doesn't have to store the string data; the stores are the inconvenient part when the length isn't a whole number of vectors.
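For a flavor of the start-up safety check such a vectorized copy needs, here's a minimal sketch, assuming 4K pages and 32-byte loads. This is not MSVC's or glibc's actual code, and the function name is made up.

#include <stdint.h>

// True if a 32-byte load starting at p stays within p's own 4K page,
// so it can't fault even when the terminating 0 is in the first few bytes.
static inline int vec32_load_stays_in_page(const char *p)
{
    return ((uintptr_t)p & 4095u) <= 4096u - 32u;
}

When that check fails, real implementations typically fall back to a smaller or unaligned first step, then align the pointer so later full-width loads can never split a page.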
MSVC is using cs:__security_cookie for stack-smashing protection, copying it (or an XOR of it with RSP) to a stack slot between the local vars and the return address.
GCC and Clang, at least on Linux, use thread-local storage for the stack cookie. MSVC is, I guess, using plain static (BSS) storage.
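As a toy model of the /GS pattern, here's a sketch in plain C. It's illustrative only: the names are made up, source-level declaration order doesn't control where the compiler actually puts the cookie, and the real check is compiler-generated code that calls __security_check_cookie.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

uint64_t toy_security_cookie;                      // stands in for the global __security_cookie

void copy_arg_with_cookie(const char *arg)
{
    uint64_t stack_cookie = toy_security_cookie;   // real code may XOR in RSP as well
    char buffer[500];                              // the overflowable local the cookie guards
    strcpy(buffer, arg);                           // the inlined copy from the question
    if (stack_cookie != toy_security_cookie)       // checked just before returning
        abort();                                   // MSVC's version reports a GS failure and terminates
}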
IIRC, IDA just uses cs: to show a RIP-relative addressing mode. It's a weird notation: an actual CS segment-override prefix would be allowed there, but that's presumably not what's in the machine code. (And RIP-relative addressing still implies the DS segment, not CS, although there's basically no difference in 64-bit mode: both are fixed at base=0, limit=unlimited in the flat memory model, and unlike 32-bit mode, CS segment overrides don't cause stores to fault.)
Footnote 1: Slightly worse, at least with hyperthreading competing for the load/store ports. With a bottleneck of 1 store per clock on those CPUs, and 1 taken loop-branch per clock even on more recent CPUs, the store-address uops aren't going to create a bottleneck by competing with loads for ports 2 and 3. (https://www.realworldtech.com/haswell-cpu/4/).
If the loop did 2 loads and a store, then it would be a problem on Haswell/Skylake that MSVC used the indexed addressing-mode for the store. (Ports 2 and 3 have load units and can run store-address uops. Port 7 runs only 1-register store-address uops. Ice Lake and later split this up so store-address uops only run on dedicated ports, not ones that also handle loads.)