Jul 2, 2026

[C++] is inline a function always a good idea?

Reference:
A deep dive into SmallVector::push_back
https://llvm-compile-time-tracker.com/


Shrink-wrapping optimization

Shrink-wrapping is a compiler optimization technique designed to minimize the overhead of a function's prologue and epilogue by moving them so they execute only when absolutely necessary.

To understand why this is useful, it helps to look at how a standard function handles memory and registers at the assembly level.

The Problem: Preemptive Over-Allocation

When a function executes, it often needs to save certain CPU registers to the stack (callee-saved registers) and allocate space for local variables. This setup is called the prologue. Before the function returns, it restores those registers and cleans up the stack, which is called the epilogue.

By default, standard compilers place the prologue at the very beginning of a function and the epilogue at the very end.

If a function has a hot fast-path (e.g., an early return or a quick validation check) that doesn't actually use those registers or local variables, it still pays the CPU cycle tax of executing the prologue and epilogue.


How Shrink-Wrapping Fixes This

Shrink-wrapping analyzes the control flow graph (CFG) of a function to find the precise basic blocks where the saved registers or stack space are actually required. It then "shrinks" the scope of the prologue and epilogue, wrapping them tightly only around those specific paths.


Before Shrink-Wrapping (Standard Layout)

Function_Start:
  PUSH registers      <-- Prologue runs every single time
  ALLOCATE stack
  
  IF (condition) THEN
    GOTO Fast_Path    
  ELSE
    GOTO Slow_Path
    
Fast_Path:
  Result = 0          <-- Doesn't even need those registers!
  GOTO Function_End

Slow_Path:
  Use registers to do heavy math...
  GOTO Function_End

Function_End:
  DEALLOCATE stack    <-- Epilogue runs every single time
  POP registers
  RETURN

After Shrink-Wrapping

Function_Start:
  IF (condition) THEN
    GOTO Fast_Path    
  ELSE
    GOTO Slow_Path
    
Fast_Path:
  Result = 0          <-- Pure profit: No stack overhead, no register spills
  RETURN              <-- Direct return

Slow_Path:
  PUSH registers      <-- Prologue is "wrapped" only around the slow path
  ALLOCATE stack
  Use registers to do heavy math...
  DEALLOCATE stack
  POP registers       <-- Epilogue wrapped here too
  RETURN


Limitations: The "Rejoin" Problem

Shrink-wrapping is incredibly effective for functions with early exits. However, it can fail or get disabled if the control flow graph gets messy.

If a fast path and a slow path diverge but then rejoin later in the function before a single, shared return point, the compiler may not be able to safely decouple the stack frames.


clang and gcc cmd:
$ clang -mllvm -debug-only=shrink-wrap
$ gcc -fshrink-wrap-separate (on at -O2) 


How to solve if shrink-wrap not working?

Hand coding the tail duplication with fast and slow code path.
e.g.
LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {
  // in case Elt aliases storage that grow() invalidates
  // This is very much edge-case considered.
  T Tmp = Elt;
  // +1 is sufficient, while internally doing
  // exponential growth algorithm.
  this->grow(this->size() + 1);
  std::memcpy(reinterpret_cast<void *>(this->end()), &Tmp, sizeof(T));
  // size is not just +1 but applied with 
  // exponential growth algorithm.
  this->set_size(this->size() + 1);
}

void push_back(ValueParamT Elt) {
  if (LLVM_UNLIKELY(this->size() >= this->capacity()))
    return growAndPushBack(Elt);
  std::memcpy(reinterpret_cast<void *>(this->end()), &Elt, sizeof(T));
  this->set_size(this->size() + 1);
}

mov    eax, [rdi + 8]
cmp    eax, [rdi + 12]
jae    growAndPushBack         # TAILCALL
mov    rcx, [rdi]
mov    [rcx + rax*4], esi
inc    dword ptr [rdi + 8]
ret
Since there are no callee-saved registers, shrink-wrapping is unnecessary.
Moving the slow path into a separate COMDAT/section groups section as an out-of-line function further degrades its performance.

Use:
// noinline growAndPushBack is load-bearing for both Clang and GCC.
void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImpl<int> &v) {
  for (unsigned l = 0; l < n; l += 2)
    for (unsigned i = 0; i < 2; ++i)
      v.push_back(i);
}
The noinline attribute is required here. Without it, Clang and GCC may inline the helper function, which defeats the optimization by reintroducing the prologue.

Creating a temporary copy (T Tmp = Elt) safely handles cases where Elt references the vector's own storage. While this copy is easily elided for small, trivially-copyable types, passing a larger element by reference to the out-of-line growAndPushBack() forces it to be memory-materialized. Because its address must remain stable across a non-inlined call, this defeats construct-in-place optimization for large types. However, this overhead is negligible compared to the cost of grow(), which must already copy or move all existing size() elements.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.