Jun 16, 2026

[low latency] What is low latency (define)

Low latency <-> High throughput

Server side is in the middle.

HFT, Audio, Game is more lean to the left.

2 Categories

  • Efficient programming
  • Programming for deterministic execution time
Most crucial thing: measuring!







Microbenchmarks are tricky

  • Warm the cache
  • Randomise the heap
  • Measure release build with same compiler flags
  • But optimisations change what code you are measuring!
    • A lot of stuff can "constexpr away" in microbenchmark but not in production code


Writing efficient code requires...

  • Knowledge of the programming language
  • Knowledge of the libraries used
  • Knowledge of the compiler
  • Optimiser
    • ABI (Itanium, Microsoft, ...)
  • Knowledge of the hardware architecture
    • CPU architecture (Instruction set, pipeline, SIMD...)
    •  Cache hierarchy (Registers, L1/2/3 cache)
    • Prefetcher, translation lookaside buffer
    • Branch predictor, branch target buffer


Avoid unnecessary work

  • Avoid unnecessary copies
  • Avoid unnecessary function calls / indirections
  • inline functions
  • prefer std: variant / CRTP / "deducing this" (since C++23)
    over virtual functions
  • Make as many decisions as possible at compile time
    • constexpr all the things
    • Template metaprogramming
    • Generate lookup tables at compile time
  • Use efficient mathematical operations
    • Fast approximations
    • Use powers of two for sizes (compiler can replace division/mod with bit shifts)
    • Lookup tables
    • Many other techniques


Low-level bit manipulation in C++

  • Object lifetime rules
  • Aliasing rules
  • Alignment rules
  • Object representations
  • Value representations
  • std::bit_cast (since C++20)
  • Implicit-lifetime types (since C++20)
  • std::start_lifetime_as (since C++23)


The optimiser & undefined behaviour

  • memory-related UB
  • type system violation
  • out-of-bounds
  • lifetime violation (dangling pointers, use-after-free)
  • uninitialised variables
  • data races
  • signed integer overflow
  • infinite loops with no side effects


The optimiser & undefined behaviour


Sub-properties of reproducible & unsequenced

  • stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
  • effectless: function that does not have observable side effects
  • idempotent: repeated evaluation gives the same result (hence may read global state)
  • independent: does not depend on other state than the arguments or constants (hence may write to globals)
  • reproducible: effectless and idempotent
  • unsequenced: stateless, effectless, idempotent, and independent

Assume explained:

  int f(int i) {
  	[[assume(++i == 43)]]
    return i;
  }
  // function f can be optimized to '42'
  


C++ currently does not have a way to tell the compiler that the pointer are not alias:



CPU pipeline hazards

  • Branch hazard
  • Data hazard
  • Hardware hazard
    • Limited amount of adders/shifters
    • Limited amount of load/store units
    • (latest Intel CPUs: 3 loads + 2 stores per cycle)
    • → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD



[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

  • Does not affect branch predictor!
  • Can affect code layout
  • Have various of pitfalls
    • Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
    • Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
    • A Journey Through Branch Prediction and Compiler Optimizations" (2022)

The 3 Types of Data Hazards

  • RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
  • WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
  • WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.

Rarely have to worry about this breaking your program because the system handles it automatically:

  • The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
  • The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.


SIMD

  • "Single instruction, multiple data"
  • CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
  • How to use?
  • Auto-vectorisation
  • Explicit vectorisation using SIMD libraries
    • Google Highway, xsimd, vectorclass, eve, std:: simd proposal for C++26 (P1928)
  • Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
  • Writing intrinsics
  • Writing assembly
  • SWAR ("SIMD Within A Register")

Other SIMD considerations

  • If you know the exact target CPU:
    • use arch compiler flags
  • If you don't:
    • dynamic dispatch
    • function multiversioning (GCC/Clang only)

function multiversioning:

#include <iostream>

// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2"))) 
void process_data() {
    std::cout << "Running high-performance AVX2 vectorized version!\n";
    // Fast vector math goes here
}

// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2"))) 
void process_data() {
    std::cout << "Running mid-tier SSE4.2 version!\n";
}

// 3. The mandatory default fallback version
__attribute__((target("default"))) 
void process_data() {
    std::cout << "Running generic baseline version.\n";
}

int main() {
    // You call it like a regular function. 
    // The resolution happens completely behind the scenes.
    process_data(); 
    return 0;
}  

Autovectorisation

  • Highly dependent on compiler
  • Only works with vectors/arrays of int, char, double etc. - no structs
  • Workaround: struct of arrays instead of array of structs
  • Traverse data linearly
  • for loop, not while loop
  • Number of iterations predetermined (ideally, known at compile time)
  • No data-dependent break, goto, etc.
  • No conditionals
  • No data dependencies between array elements
  • No aliasing




Cache miss

any optimization is pointless if there's a cache miss.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.