Low latency <-> High throughput
Server side is in the middle.
HFT, Audio, Game is more lean to the left.
2 Categories
- Efficient programming
- Programming for deterministic execution time
Most crucial thing: measuring!
Microbenchmarks are tricky
- Warm the cache
- Randomise the heap
- Measure release build with same compiler flags
- But optimisations change what code you are measuring!
- A lot of stuff can "constexpr away" in microbenchmark but not in production code
Writing efficient code requires...
- Knowledge of the programming language
- Knowledge of the libraries used
- Knowledge of the compiler
- Optimiser
- ABI (Itanium, Microsoft, ...)
- Knowledge of the hardware architecture
- CPU architecture (Instruction set, pipeline, SIMD...)
- Cache hierarchy (Registers, L1/2/3 cache)
- Prefetcher, translation lookaside buffer
- Branch predictor, branch target buffer
Avoid unnecessary work
- Avoid unnecessary copies
- Avoid unnecessary function calls / indirections
- inline functions
- prefer std: variant / CRTP / "deducing this" (since C++23)
over virtual functions - Make as many decisions as possible at compile time
- constexpr all the things
- Template metaprogramming
- Generate lookup tables at compile time
- Use efficient mathematical operations
- Fast approximations
- Use powers of two for sizes (compiler can replace division/mod with bit shifts)
- Lookup tables
- Many other techniques
Low-level bit manipulation in C++
- Object lifetime rules
- Aliasing rules
- Alignment rules
- Object representations
- Value representations
- std::bit_cast (since C++20)
- Implicit-lifetime types (since C++20)
- std::start_lifetime_as (since C++23)
The optimiser & undefined behaviour
- memory-related UB
- type system violation
- out-of-bounds
- lifetime violation (dangling pointers, use-after-free)
- uninitialised variables
- data races
- signed integer overflow
- infinite loops with no side effects
The optimiser & undefined behaviour
- std::assume_aligned (Ct+20)
- [[assume]] (C++23)
- [[noalias]] (C++26 ??)
- [unsequenced]] == [[gnu::const]]
[reproducible]] == [[gnu::pure]] (C23, C++26 ??)
Sub-properties of reproducible & unsequenced
- stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
- effectless: function that does not have observable side effects
- idempotent: repeated evaluation gives the same result (hence may read global state)
- independent: does not depend on other state than the arguments or constants (hence may write to globals)
- reproducible: effectless and idempotent
- unsequenced: stateless, effectless, idempotent, and independent
C++ currently does not have a way to tell the compiler that the pointer are not alias:
CPU pipeline hazards
- Branch hazard
- Data hazard
- Hardware hazard
- Limited amount of adders/shifters
- Limited amount of load/store units
- (latest Intel CPUs: 3 loads + 2 stores per cycle)
- → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD
[likely]] and [unlikely]]
- Does not affect branch predictor!
- Can affect code layout
- Have various of pitfalls
- Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
- Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
- A Journey Through Branch Prediction and Compiler Optimizations" (2022)
The 3 Types of Data Hazards
- RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
- WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
- WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.
Rarely have to worry about this breaking your program because the system handles it automatically:
- The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
- The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.
SIMD
- "Single instruction, multiple data"
- CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
- How to use?
- Auto-vectorisation
- Explicit vectorisation using SIMD libraries
- Google Highway, xsimd, vectorclass, eve, std:: simd proposal for C++26 (P1928)
- Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
- Writing intrinsics
- Writing assembly
- SWAR ("SIMD Within A Register")
Other SIMD considerations
- If you know the exact target CPU:
- use arch compiler flags
- If you don't:
- dynamic dispatch
- function multiversioning (GCC/Clang only)
function multiversioning:
#include <iostream>
// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2")))
void process_data() {
std::cout << "Running high-performance AVX2 vectorized version!\n";
// Fast vector math goes here
}
// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2")))
void process_data() {
std::cout << "Running mid-tier SSE4.2 version!\n";
}
// 3. The mandatory default fallback version
__attribute__((target("default")))
void process_data() {
std::cout << "Running generic baseline version.\n";
}
int main() {
// You call it like a regular function.
// The resolution happens completely behind the scenes.
process_data();
return 0;
}
Autovectorisation
- Highly dependent on compiler
- Only works with vectors/arrays of int, char, double etc. - no structs
- Workaround: struct of arrays instead of array of structs
- Traverse data linearly
- for loop, not while loop
- Number of iterations predetermined (ideally, known at compile time)
- No data-dependent break, goto, etc.
- No conditionals
- No data dependencies between array elements
- No aliasing