
what should be the behavior in the following case:

class C {
    boost::mutex mutex_;
    std::map<...> data_;
};

C& get() {
    static C c;
    return c;
}

int main() {
    get(); // is compiler free to optimize out the call? 
    ....
}

is compiler allowed to optimize out the call to get()?

the idea was to touch static variable to initialize it before multithreaded operations needed it

is this a better option?:

C& get() {
    static C *c = new C();
    return *c;
}
  • Are the constructor and destructor of C trivial? Commented Sep 20, 2010 at 5:07
  • @James in my case no, they have a mutex and a map Commented Sep 20, 2010 at 5:08
  • Even if the constructors are non-trivial, ::get::c is static; they get called in the same manner regardless of the call to ::get(), so the compiler could correctly optimize that call out without otherwise harming the functionality. Commented Sep 20, 2010 at 5:19
  • @TokenMacGuy: The compiler (and linker, probably) would have to ensure that the get() in main() is the first call to get(); this won't be the case if get() is called during dynamic initialization of static objects at namespace scope. Commented Sep 20, 2010 at 5:22
  • @Tok I thought for static vars it's not needed; they get inited once Commented Sep 20, 2010 at 5:44

4 Answers


Updated (2023) Answer:

In C++23 (N4950), any side effects of initializing a static local variable are observable as its containing block is entered. As such, unless the compiler can determine that initializing the variable has no visible side effects, it will have to generate code to call get() at the appropriate time (or to execute an inlined version of get(), as the case may be).

Contrary to earlier standards, C++23 no longer gives permission for dynamic initialization of a static local variable to be done "early" (as discussed below).

[stmt.dcl]/3:

Dynamic initialization of a block variable with static storage duration (6.7.5.2) or thread storage duration (6.7.5.3) is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

Original (2010) answer:

The C and C++ standards operate under a rather simple principle generally known as the "as-if rule" -- basically, that the compiler is free to do almost anything as long as no conforming code can discern the difference between what it did and what was officially required.

I don't see a way for conforming code to discern whether get was actually called in this case, so it looks to me like it's free to optimize it out.

At least as recently as N4296, the standard contained explicit permission to do early initialization of static local variables:

Constant initialization (3.6.2) of a block-scope entity with static storage duration, if applicable, is performed before its block is first entered. An implementation is permitted to perform early initialization of other block-scope variables with static or thread storage duration under the same conditions that an implementation is permitted to statically initialize a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

So, under this rule, initialization of the local variable could happen arbitrarily early in execution, so even if it has visible side effects, they're allowed to happen before any code that attempts to observe them. As such, you aren't guaranteed to see them, so optimizing it out is allowed.


12 Comments

I don't understand how the compiler could know that get doesn't perform any side effect... and thus how it could decide to optimize it. Unless get is pure (and how would it know about it ?), there is no reason not to execute it... is it ?
@Matthieu: The C standard defines a side effect as modifying a volatile variable or calling a library function. It's pretty easy for a compiler to figure out that get does neither.
If the definition of get is visible within the translation unit, yes it is; but if it is defined within another translation unit, would this be subject to LTO? I doubt it, but I don't know much about LTO yet.
@Matthieu: Hard to say -- from the viewpoint of the standard, there's no real separation between LTO and other optimization. That said, my guess would be that what usually gets called LTO wouldn't typically do this, but what gets called LTCG might.
Indeed, my reading is that C++20 and earlier specifically allow visible side-effects to happen early (but not late). Or any side-effects that might indirectly lead to I/O or whatever, like writing to non-volatile globals. I guess even reading globals that might not even be initialized yet if done early? C++23 requires all of that to happen as-if on first call, if there are any. Everything else is still up to the as-if rule.

Based on your edits, here's an improved version, with the same results.

Input:

struct C {
    int myfrob;
    int frob();
    C(int f);
};
C::C(int f) : myfrob(f) {}
int C::frob() { return myfrob; }

C& get() {
    static C *c = new C(5);
    return *c;
}

int main() {
    return get().frob(); // is compiler free to optimize out the call? 

}

Output:

; ModuleID = '/tmp/webcompile/_28088_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"

%struct.C = type { i32 }

@guard variable for get()::c = internal global i64 0            ; <i64*> [#uses=4]

declare i32 @__cxa_guard_acquire(i64*) nounwind

declare i8* @operator new(unsigned long)(i64)

declare void @__cxa_guard_release(i64*) nounwind

declare i8* @llvm.eh.exception() nounwind readonly

declare i32 @llvm.eh.selector(i8*, i8*, ...) nounwind

declare void @__cxa_guard_abort(i64*) nounwind

declare i32 @__gxx_personality_v0(...)

declare void @_Unwind_Resume_or_Rethrow(i8*)

define i32 @main() {
entry:
  %0 = load i8* bitcast (i64* @guard variable for get()::c to i8*), align 8 ; <i8> [#uses=1]
  %1 = icmp eq i8 %0, 0                           ; <i1> [#uses=1]
  br i1 %1, label %bb.i, label %_Z3getv.exit

bb.i:                                             ; preds = %entry
  %2 = tail call i32 @__cxa_guard_acquire(i64* @guard variable for get()::c) nounwind ; <i32> [#uses=1]
  %3 = icmp eq i32 %2, 0                          ; <i1> [#uses=1]
  br i1 %3, label %_Z3getv.exit, label %bb1.i

bb1.i:                                            ; preds = %bb.i
  %4 = invoke i8* @operator new(unsigned long)(i64 4)
          to label %invcont.i unwind label %lpad.i ; <i8*> [#uses=2]

invcont.i:                                        ; preds = %bb1.i
  %5 = bitcast i8* %4 to %struct.C*               ; <%struct.C*> [#uses=1]
  %6 = bitcast i8* %4 to i32*                     ; <i32*> [#uses=1]
  store i32 5, i32* %6, align 4
  tail call void @__cxa_guard_release(i64* @guard variable for get()::c) nounwind
  br label %_Z3getv.exit

lpad.i:                                           ; preds = %bb1.i
  %eh_ptr.i = tail call i8* @llvm.eh.exception()  ; <i8*> [#uses=2]
  %eh_select12.i = tail call i32 (i8*, i8*, ...)* @llvm.eh.selector(i8* %eh_ptr.i, i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*), i8* null) ; <i32> [#uses=0]
  tail call void @__cxa_guard_abort(i64* @guard variable for get()::c) nounwind
  tail call void @_Unwind_Resume_or_Rethrow(i8* %eh_ptr.i)
  unreachable

_Z3getv.exit:                                     ; preds = %invcont.i, %bb.i, %entry
  %_ZZ3getvE1c.0 = phi %struct.C* [ null, %bb.i ], [ %5, %invcont.i ], [ null, %entry ] ; <%struct.C*> [#uses=1]
  %7 = getelementptr inbounds %struct.C* %_ZZ3getvE1c.0, i64 0, i32 0 ; <i32*> [#uses=1]
  %8 = load i32* %7, align 4                      ; <i32> [#uses=1]
  ret i32 %8
}

Noteworthy: no code is emitted for ::get, but main still allocates ::get::c (at %4), with a guard variable checked as needed (at %2, and at the ends of invcont.i and lpad.i). LLVM here is inlining all of that stuff.

tl;dr: Don't worry about it, the optimizer normally gets this stuff right. Are you seeing an error?

4 Comments

Well there's no use of C struct in main() after the initialization, calling get() has no side effects apart from initialization and returning reference to c which you don't keep. So there's no possible case where optimizing that line out makes the code behave differently ... hard to blame the compiler. This is similar to the original question, except we don't know what code after the call does.
no error yet, but dont want to find out the hard way. thanks
If an optimization changes the behavior of your code, it's either because you're doing something that is undefined (you aren't in this case) or because the optimizer is broken.
This doesn't look better; it adds an unnecessary level of indirection for all future uses of the class/struct.

Your original code is safe. Don't introduce an extra level of indirection (a pointer variable that has to get loaded before the address of the std::map is available.)

As Jerry Coffin says, your code has to run as if it ran in source order. That includes running as-if it has constructed your boost or std::mutex and std::map before later stuff in main, such as starting threads.

Pre-C++11, the language standard and memory model weren't officially thread-aware, but stuff like this (thread-safe static-local initialization) worked anyway because compiler writers wanted their compilers to be useful. e.g. GCC 4.1 from 2006 (https://godbolt.org/z/P3sjo4Tjd) still uses a guard variable to make sure a single thread does the constructing in case multiple calls to get() happen at the same time.

Now, with C++11 and later, the ISO standard does include threads and it's officially required for that to be safe.


Since your program can't observe the difference, it's hypothetically possible that a compiler could choose to skip construction now and let it happen in the first thread to actually call get() in a way that isn't optimized away. That's fine: construction of static locals is thread-safe, with compilers like GCC and Clang using a "guard variable" that they check (read-only, with an acquire load) at the start of the function.

A file-scope static variable would avoid the load+test/branch fast-path overhead of the guard variable that happens every call, and would be safe as long as nothing calls get() before the start of main(). A guard variable is pretty cheap especially on ISAs like x86, AArch64, and 32-bit ARMv8 that have cheap acquire loads, but more expensive on ARMv7 for example where an acquire load uses a dmb ish full barrier.

If some hypothetical compiler actually did the optimization you're worried about, the difference could be in NUMA placement of the page of .bss holding static C c, if nothing else in that page was touched first. And potentially stalling other threads very briefly in their first calls to get() if construction isn't finished by the time a second thread also calls get().


Current GCC and clang don't in practice do this optimization

Clang 17 with libc++ makes the following asm for x86-64, with -O3. (demangled by Godbolt). The asm for get() is also inlined into main. GCC with libstdc++ is pretty similar, really only differing in the std::map internals.

get():
        movzx   eax, byte ptr [rip + guard variable for get()::c]  # all x86 loads are acquire loads
        test    al, al                       # check the guard variable
        je      .LBB0_1
        lea     rax, [rip + get()::c]        # retval = address of the static variable
   # end of the fast path through the function.
   # after the first call, all callers go through this path.
        ret

 # slow path, only reached if the guard variable is zero
.LBB0_1:
        push    rax
        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_acquire@PLT
        test    eax, eax   # check if we won the race to construct c,
        je      .LBB0_3    # or if we waited until another thread finished doing it.

        xorps   xmm0, xmm0
        movups  xmmword ptr [rip + get()::c+16], xmm0     # first 16 bytes of std::map<int,int> = NULL pointers
        movups  xmmword ptr [rip + get()::c], xmm0        # std::mutex = 16 bytes of zeros
        mov     qword ptr [rip + get()::c+32], 0          # another NULL
        lea     rsi, [rip + get()::c]                     # arg for __cxa_atexit
        movups  xmmword ptr [rip + get()::c+48], xmm0     # more zeros, maybe a root node?
        lea     rax, [rip + get()::c+48]                  
        mov     qword ptr [rip + get()::c+40], rax        # pointer to another part of the map object

        lea     rdi, [rip + C::~C() [base object destructor]]  # more args for atexit
        lea     rdx, [rip + __dso_handle]
        call    __cxa_atexit@PLT                 # register the destructor function-pointer with a "this" pointer

        lea     rdi, [rip + guard variable for get()::c]
        call    __cxa_guard_release@PLT          # "unlock" the guard variable, setting it to 1 for future calls
             # and letting any other threads return from __cxa_guard_acquire and see a fully-constructed object

.LBB0_3:                                     # epilogue
        add     rsp, 8
        lea     rax, [rip + get()::c]        # return value, same as in the fast path.
        ret

Even though the std::map is unused, constructing it involves calling __cxa_atexit (a C++-internals version of atexit) to register the destructor to free the red-black tree as the program exits. I suspect this is the part that's opaque to the optimizer and the main reason it doesn't get optimized like static int x = 123; or static void *foo = &bar; into pre-initialized space in .data with no run-time construction (and no guard variable).

Constant-propagation to avoid the need for any run-time initialization is what happens if struct C only includes std::mutex, which in GNU/Linux at least doesn't have a destructor and is actually zero-initialized. (C++ before C++23 allowed early init even when that included visible side-effects. This doesn't; compilers can still constant-propagate static int local_foo = an_inline_function(123); into some bytes in .data with no run-time call.)

GCC and Clang also don't optimize away the guard variable (if there's any run-time work to do), even though main doesn't start any threads at all, let alone before calling get(). A constructor in some other compilation unit (including a shared library) could have started another thread that called get() at the same time main did. (It's arguably a missed optimization with gcc -fwhole-program.)


If the constructors had any (potentially) visible side-effects, perhaps including a call to new since new is replaceable, compilers couldn't defer it because the C++ language rules say when the constructor is called in the abstract machine. (Compilers are allowed to make some assumptions about new, though, e.g. clang with libc++ can optimize away new / delete for an unused std::vector.)

Classes like std::unordered_map (a hash table instead of a red-black tree) do use new in their constructor.

I was testing with std::map<int,int>, so the individual objects don't have destructors with visible side-effects. A std::map<Foo,Bar> where Foo::~Foo prints something would make it matter when the static-local initializer runs, since that's when we call __cxa_atexit. Assuming destruction order happens in reverse of construction, waiting until later to call __cxa_atexit could lead to it being destructed sooner, leading to Foo::~Foo() calls happening too soon, potentially before instead of after some other visible side effect.

Or some other global data structure could maybe have references to the int objects inside a std::map<int,int>, and use those in its destructor. That wouldn't be safe if we destruct the std::map too soon.

(I'm not sure if ISO C++, or GNU C++, gives such ordering guarantees for sequencing of destructors. But if it does, that would be a reason compilers couldn't normally defer construction when it involves registering a destructor. And looking for that optimization in trivial programs isn't worth the cost in compile time.)


With file-scope static to avoid a guard variable

Notice the lack of a guard variable, making the fast path faster, especially for ISAs like ARMv7 that don't have a good way to do just an acquire barrier. https://godbolt.org/z/4bGx3Tasj

static C global_c;     // It's not actually global, just file-scoped static

C& get2() {
    return global_c;
}
# clang -O3 for x86-64
get2():
      # note the lack of a load + branch on a guard variable
        lea     rax, [rip + global_c]
        ret

main:
      # construction already happened before main started, and we don't do anything with the address
        xor     eax, eax
        ret
# GCC -O3 -mcpu=cortex-a15     // a random ARMv7 CPU
get2():
        ldr     r0, .L81          @ PC-relative load
        bx      lr

@ somewhere nearby, between functions
.L81:
        .word   .LANCHOR0+52      @ pointer to struct C global_c

main:
        mov     r0, #0
        bx      lr

The constructor code that does the stores and calls __cxa_atexit still exists, it's just in a separate function called _GLOBAL__sub_I_example.cpp: (clang) or _GLOBAL__sub_I_get(): (GCC), which the compiler adds to a list of init functions to be called before main.

Function-scoped local vars are normally fine, the overhead is pretty minimal, especially on x86-64 and ARMv8. But since you were worried about micro-optimizations like when std::map was constructed at all, I thought it was worth mentioning. And to show the mechanism compilers use to make this stuff work under the hood.

Comments


Whether the compiler optimizes the function call or not is basically unspecified as per the Standard. Unspecified behavior is behavior chosen from a finite set of possibilities, where the choice may not be consistent every time. In this case, the choice is 'to optimize' or 'not', which the Standard does not specify and the implementation is also not required to document, as it is a choice that may not be taken consistently by a given implementation.

If the idea is just to 'touch' it, will it help if we just add a dummy volatile variable and increment it in each call?

e.g.

C& getC() {
   volatile int dummy = 0;
   dummy++;
   // rest of the code
}

4 Comments

How do you define "first call"? In any case, the function here is simple enough that it can be entirely optimized out.
Sadly, Sutter mentions in one of his talks that a smart compiler would be able to discard the volatile qualifier in dummy. The rationale is that it can know for a fact that being on the stack it is not a variable that refers to special hardware. Also a pointer to the variable is not being passed to any other function, so the compiler can know for a fact that changes to dummy are only visible inside getC, and as such it could remove the volatile. After that if the compiler notices that the value is never used, it can completely remove the var. I don't know any compiler that does this.
@David Rodríguez - dribeas: well, I just tried it all out in llvm. In one case, the volatile variable was removed (leaving nothing more than a main { return 0 }), but in the other case, the volatile with increment was inlined into main with the rest of getC. I think this goes to prove your point that you just can't know what's gonna happen!
After inlining getC, a volatile store or increment is separate from initializing static C c; which has a std::map member. If the compiler was going to optimize away a call to its constructor before starting threads, the presence or absence of a volatile access before (or after) that wouldn't matter. If your reasoning was based on doing all or none of the function, that's not how it works.
