Reference:
A deep dive into SmallVector::push_back
https://llvm-compile-time-tracker.com/
Shrink-wrapping optimization
Shrink-wrapping is a compiler optimization technique designed to minimize the overhead of a function's prologue and epilogue by moving them so they execute only when absolutely necessary.To understand why this is useful, it helps to look at how a standard function handles memory and registers at the assembly level.
The Problem: Preemptive Over-Allocation
When a function executes, it often needs to save certain CPU registers to the stack (callee-saved registers) and allocate space for local variables. This setup is called the prologue. Before the function returns, it restores those registers and cleans up the stack, which is called the epilogue.By default, standard compilers place the prologue at the very beginning of a function and the epilogue at the very end.
If a function has a hot fast-path (e.g., an early return or a quick validation check) that doesn't actually use those registers or local variables, it still pays the CPU cycle tax of executing the prologue and epilogue.
How Shrink-Wrapping Fixes This
Shrink-wrapping analyzes the control flow graph (CFG) of a function to find the precise basic blocks where the saved registers or stack space are actually required. It then "shrinks" the scope of the prologue and epilogue, wrapping them tightly only around those specific paths.Before Shrink-Wrapping (Standard Layout)
Function_Start:
PUSH registers <-- Prologue runs every single time
ALLOCATE stack
IF (condition) THEN
GOTO Fast_Path
ELSE
GOTO Slow_Path
Fast_Path:
Result = 0 <-- Doesn't even need those registers!
GOTO Function_End
Slow_Path:
Use registers to do heavy math...
GOTO Function_End
Function_End:
DEALLOCATE stack <-- Epilogue runs every single time
POP registers
RETURN
After Shrink-Wrapping
Function_Start:
IF (condition) THEN
GOTO Fast_Path
ELSE
GOTO Slow_Path
Fast_Path:
Result = 0 <-- Pure profit: No stack overhead, no register spills
RETURN <-- Direct return
Slow_Path:
PUSH registers <-- Prologue is "wrapped" only around the slow path
ALLOCATE stack
Use registers to do heavy math...
DEALLOCATE stack
POP registers <-- Epilogue wrapped here too
RETURN
Limitations: The "Rejoin" Problem
Shrink-wrapping is incredibly effective for functions with early exits. However, it can fail or get disabled if the control flow graph gets messy.
If a fast path and a slow path diverge but then rejoin later in the function before a single, shared return point, the compiler may not be able to safely decouple the stack frames.
clang and gcc cmd:
$ clang -mllvm -debug-only=shrink-wrap
$ gcc -fshrink-wrap-separate (on at -O2)
How to solve if shrink-wrap not working?
Hand coding the tail duplication with fast and slow code path.
e.g.
LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {
// in case Elt aliases storage that grow() invalidates
// This is very much edge-case considered.
T Tmp = Elt;
// +1 is sufficient, while internally doing
// exponential growth algorithm.
this->grow(this->size() + 1);
std::memcpy(reinterpret_cast<void *>(this->end()), &Tmp, sizeof(T));
// size is not just +1 but applied with
// exponential growth algorithm.
this->set_size(this->size() + 1);
}
void push_back(ValueParamT Elt) {
if (LLVM_UNLIKELY(this->size() >= this->capacity()))
return growAndPushBack(Elt);
std::memcpy(reinterpret_cast<void *>(this->end()), &Elt, sizeof(T));
this->set_size(this->size() + 1);
}
mov eax, [rdi + 8]
cmp eax, [rdi + 12]
jae growAndPushBack # TAILCALL
mov rcx, [rdi]
mov [rcx + rax*4], esi
inc dword ptr [rdi + 8]
retSince there are no callee-saved registers, shrink-wrapping is unnecessary.
Moving the slow path into a separate COMDAT/section groups section as an out-of-line function further degrades its performance.
Moving the slow path into a separate COMDAT/section groups section as an out-of-line function further degrades its performance.
// noinline growAndPushBack is load-bearing for both Clang and GCC.
void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImpl<int> &v) {
for (unsigned l = 0; l < n; l += 2)
for (unsigned i = 0; i < 2; ++i)
v.push_back(i);
}The noinline attribute is required here. Without it, Clang and GCC may inline the helper function, which defeats the optimization by reintroducing the prologue.Creating a temporary copy (T Tmp = Elt) safely handles cases where Elt references the vector's own storage. While this copy is easily elided for small, trivially-copyable types, passing a larger element by reference to the out-of-line growAndPushBack() forces it to be memory-materialized. Because its address must remain stable across a non-inlined call, this defeats construct-in-place optimization for large types. However, this overhead is negligible compared to the cost of grow(), which must already copy or move all existing size() elements.








