Your original code is safe. Don't introduce an extra level of indirection (a pointer variable that has to get loaded before the address of the std::map is available).
As Jerry Coffin says, your code has to run as if it ran in source order. That includes running as-if it has constructed your boost or std::mutex and std::map before later stuff in main, such as starting threads.
Pre-C++11, the language standard and memory model weren't officially thread-aware, but stuff like this (thread-safe static-local initialization) worked anyway because compiler writers wanted their compilers to be useful. For example, GCC 4.1 from 2006 (https://godbolt.org/z/P3sjo4Tjd) already uses a guard variable to make sure a single thread does the constructing in case multiple calls to get() happen at the same time.
Now, with C++11 and later, the ISO standard does include threads and it's officially required for that to be safe.
Since your program can't observe the difference, it's hypothetically possible that a compiler could choose to skip construction now and let it happen in the first thread to actually call get() in a way that isn't optimized away. That's fine: construction of static locals is thread-safe, with compilers like GCC and Clang using a "guard variable" that they check (read-only, with an acquire load) at the start of the function.
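As a minimal sketch of that lazy, once-only construction (not the code from the question): the hypothetical Counted type below counts constructor runs so the timing is observable. Giving the constructor a visible side effect like this is of course exactly what would stop a compiler from moving construction around, so treat it as an illustration of the language rule only.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counter, only here to make construction observable.
std::atomic<int> constructions{0};

// Hypothetical stand-in for the struct in the question.
struct Counted {
    Counted() { constructions.fetch_add(1, std::memory_order_relaxed); }
};

Counted& get_counted() {
    static Counted c;   // constructed by the first call that reaches this line
    return c;
}

// Returns true if construction was lazy and happened exactly once.
// Only meaningful if nothing called get_counted() earlier.
bool lazy_and_once() {
    int before = constructions.load();   // no constructor has run yet
    Counted& a = get_counted();          // first call: runs the constructor
    Counted& b = get_counted();          // later calls: just the guard check
    assert(&a == &b);                    // always the same object
    return before == 0 && constructions.load() == 1;
}
```

Calling lazy_and_once() as the first thing in main should return true; concurrent first calls from several threads would still construct the object only once.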
A file-scope static variable would avoid the load+test/branch fast-path overhead of the guard variable that happens every call, and would be safe as long as nothing calls get() before the start of main(). A guard variable is pretty cheap especially on ISAs like x86, AArch64, and 32-bit ARMv8 that have cheap acquire loads, but more expensive on ARMv7 for example where an acquire load uses a dmb ish full barrier.
If some hypothetical compiler actually did the optimization you're worried about, the difference could be in NUMA placement of the page of .bss holding static C c, if nothing else in that page was touched first. And potentially stalling other threads very briefly in their first calls to get() if construction isn't finished by the time a second thread also calls get().
Current GCC and clang don't in practice do this optimization
Clang 17 with libc++ makes the following asm for x86-64 with -O3 (demangled by Godbolt). The asm for get() is also inlined into main. GCC with libstdc++ is pretty similar, really only differing in the std::map internals.
get():
movzx eax, byte ptr [rip + guard variable for get()::c] # all x86 loads are acquire loads
test al, al # check the guard variable
je .LBB0_1
lea rax, [rip + get()::c] # retval = address of the static variable
# end of the fast path through the function.
# after the first call, all callers go through this path.
ret
# slow path, only reached if the guard variable is zero
.LBB0_1:
push rax
lea rdi, [rip + guard variable for get()::c]
call __cxa_guard_acquire@PLT
test eax, eax # check if we won the race to construct c,
je .LBB0_3 # or if we waited until another thread finished doing it.
xorps xmm0, xmm0
movups xmmword ptr [rip + get()::c+16], xmm0 # zero bytes 16..31: still inside the std::mutex (40 bytes on x86-64 glibc)
movups xmmword ptr [rip + get()::c], xmm0 # zero bytes 0..15: start of the std::mutex
mov qword ptr [rip + get()::c+32], 0 # zero bytes 32..39: last 8 bytes of the mutex
lea rsi, [rip + get()::c] # arg for __cxa_atexit
movups xmmword ptr [rip + get()::c+48], xmm0 # zero bytes 48..63: looks like the map's end node (NULL pointer) and size = 0
lea rax, [rip + get()::c+48]
mov qword ptr [rip + get()::c+40], rax # start of the std::map at c+40: pointer to its own end node, i.e. an empty tree
lea rdi, [rip + C::~C() [base object destructor]] # more args for atexit
lea rdx, [rip + __dso_handle]
call __cxa_atexit@PLT # register the destructor function-pointer with a "this" pointer
lea rdi, [rip + guard variable for get()::c]
call __cxa_guard_release@PLT # "unlock" the guard variable, setting it to 1 for future calls
# and letting any other threads return from __cxa_guard_acquire and see a fully-constructed object
.LBB0_3: # epilogue
add rsp, 8
lea rax, [rip + get()::c] # return value, same as in the fast path.
ret
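Written out by hand in C++, the fast path and slow path in that asm correspond roughly to the sketch below. This is an approximation under stated assumptions, not what the compiler literally does: c_guard plays the role of the guard variable, c_init_lock stands in for __cxa_guard_acquire / __cxa_guard_release, the struct C layout is a guess at the question's type, and there's no equivalent of the __cxa_atexit call, so the object is never destroyed. It needs C++17 for std::launder.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <new>

struct C {                        // assumed shape of the question's struct
    std::mutex m;
    std::map<int, int> table;
};

alignas(C) static unsigned char c_storage[sizeof(C)];  // raw zeroed space, like .bss
static std::atomic<bool> c_guard{false};                // hand-rolled "guard variable"
static std::mutex c_init_lock;                          // stand-in for __cxa_guard_acquire/release

C& get_by_hand() {
    // Fast path: one acquire load and a branch.
    if (!c_guard.load(std::memory_order_acquire)) {
        // Slow path: only one thread at a time gets past this lock.
        std::lock_guard<std::mutex> lock(c_init_lock);
        if (!c_guard.load(std::memory_order_relaxed)) {   // did we win the race?
            ::new (static_cast<void*>(c_storage)) C();    // construct in place
            c_guard.store(true, std::memory_order_release);  // publish the object
        }
        // Destructor registration (__cxa_atexit) is deliberately omitted here.
    }
    return *std::launder(reinterpret_cast<C*>(c_storage));
}

// Usage: every call returns the same, fully constructed object.
void check_by_hand() {
    C& a = get_by_hand();
    C& b = get_by_hand();
    assert(&a == &b);
    assert(a.table.empty());
}
```

The real thing is better than this sketch: the compiler-generated version doesn't need a separate lock object per variable, and it does run the destructor at exit.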
Even though the std::map is unused, constructing it involves calling __cxa_atexit (a C++-internals version of atexit) to register the destructor to free the red-black tree as the program exits. I suspect this is the part that's opaque to the optimizer and the main reason it doesn't get optimized like static int x = 123; or static void *foo = &bar; into pre-initialized space in .data with no run-time construction (and no guard variable).
Constant-propagation to avoid the need for any run-time initialization is what happens if struct C only includes std::mutex, which on GNU/Linux at least doesn't have a destructor and is actually zero-initialized. (C++ before C++23 allowed early init even when that included visible side-effects. That permission isn't needed here: with no visible side-effects, compilers can still constant-propagate static int local_foo = an_inline_function(123); into some bytes in .data with no run-time call.)
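A minimal sketch of that last case, for pasting into a compiler explorer. The body of an_inline_function is made up for illustration; any side-effect-free function the optimizer can see through should behave the same way. With optimization enabled I'd expect GCC and Clang to emit local_foo as pre-initialized data with no guard variable, but check the asm for your compiler rather than taking that on faith.

```cpp
#include <cassert>

// Hypothetical body: pure arithmetic, fully visible to the optimizer.
inline int an_inline_function(int x) { return 2 * x + 1; }

int& local_foo_ref() {
    // Dynamic initialization in the abstract machine, but with no visible
    // side effects, so the compiler may fold it to a constant in .data.
    static int local_foo = an_inline_function(123);
    return local_foo;
}

// Usage: the value is the same whether it was computed at compile time
// or on the first call.
void check_local_foo() {
    assert(local_foo_ref() == 247);   // 2 * 123 + 1
    local_foo_ref() = 5;
    assert(local_foo_ref() == 5);     // same object on every call
}
```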
GCC and Clang also don't optimize away the guard variable (if there's any run-time work to do), even though main doesn't start any threads at all, let alone before calling get(). A constructor in some other compilation unit (including a shared library) could have started another thread that called get() at the same time main did. (It's arguably a missed optimization with gcc -fwhole-program.)
If the constructors had any (potentially) visible side-effects, perhaps including a call to new since new is replaceable, compilers couldn't defer it because the C++ language rules say when the constructor is called in the abstract machine. (Compilers are allowed to make some assumptions about new, though, e.g. clang with libc++ can optimize away new / delete for an unused std::vector.)
Classes like std::unordered_map (a hash table instead of a red-black tree) do use new in their constructor.
I was testing with std::map<int,int>, so the individual objects don't have destructors with visible side-effects. A std::map<Foo,Bar> where Foo::~Foo prints something would make it matter when the static-local initializer runs, since that's when we call __cxa_atexit. Assuming destruction order happens in reverse of construction, waiting until later to call __cxa_atexit could lead to it being destructed sooner, leading to Foo::~Foo() calls happening too soon, potentially before instead of after some other visible side effect.
Or some other global data structure could maybe have references to the int objects inside a std::map<int,int>, and use those in its destructor. That wouldn't be safe if we destruct the std::map too soon.
(ISO C++ does give this ordering guarantee: objects with static storage duration are destroyed in the reverse order of the completion of their construction, interleaved with functions registered via atexit. So that's a reason compilers can't normally defer construction when it involves registering a destructor. And looking for that optimization in trivial programs isn't worth the cost in compile time.)
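A minimal sketch of why the timing of construction can become observable through destructors. The Noisy type, event_log(), early() and late() are all hypothetical names for illustration. The log is leaked on purpose so destructors can still append to it while the program is exiting.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Never destroyed, so it stays usable during static destruction.
std::vector<std::string>& event_log() {
    static auto* log = new std::vector<std::string>();
    return *log;
}

// Hypothetical type with visible side effects in both ctor and dtor.
struct Noisy {
    std::string name;
    explicit Noisy(std::string n) : name(std::move(n)) {
        event_log().push_back("ctor " + name);
    }
    ~Noisy() { event_log().push_back("dtor " + name); }
};

Noisy& early() { static Noisy n("early"); return n; }
Noisy& late()  { static Noisy n("late");  return n; }

// Usage: construction order follows the order of the first calls.
void touch_both() {
    early();
    late();
    assert(event_log().size() == 2);
    assert(event_log()[0] == "ctor early");
    assert(event_log()[1] == "ctor late");
    // At exit, destruction runs in reverse: "dtor late", then "dtor early".
}
```

If a compiler deferred constructing early()'s object until after late()'s, the two destructors would also swap, which is a change the program could observe.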
With file-scope static to avoid a guard variable
Notice the lack of a guard variable, making the fast path faster, especially for ISAs like ARMv7 that don't have a good way to do just an acquire barrier: https://godbolt.org/z/4bGx3Tasj
static C global_c; // It's not actually global, just file-scoped static
C& get2() {
return global_c;
}
# clang -O3 for x86-64
get2():
# note the lack of a load + branch on a guard variable
lea rax, [rip + global_c]
ret
main:
# construction already happened before main started, and we don't do anything with the address
xor eax, eax
ret
# GCC -O3 -mcpu=cortex-a15 // a random ARMv7 CPU
get2():
ldr r0, .L81 @ PC-relative load
bx lr
@ somewhere nearby, between functions
.L81:
.word .LANCHOR0+52 @ pointer to struct C global_c
main:
mov r0, #0
bx lr
The constructor code that does the stores and calls __cxa_atexit still exists; it's just in a separate function called _GLOBAL__sub_I_example.cpp (Clang) or _GLOBAL__sub_I_get() (GCC), which the compiler adds to a list of init functions to be called before main.
Function-scoped static local vars are normally fine; the overhead is pretty minimal, especially on x86-64 and ARMv8. But since you were worried about micro-optimizations like when the std::map gets constructed at all, I thought it was worth mentioning, and worth showing the mechanism compilers use to make this stuff work under the hood.