Ataraxia through Epoché: [low latency] What is low latency (define)

Low latency <-> High throughput

Server side is in the middle.

HFT, Audio, Game is more lean to the left.

2 Categories

Efficient programming
Programming for deterministic execution time

Most crucial thing: measuring!

Microbenchmarks are tricky

Warm the cache
Randomise the heap
Measure release build with same compiler flags
But optimisations change what code you are measuring!

A lot of stuff can "constexpr away" in microbenchmark but not in production code

Writing efficient code requires...

Knowledge of the programming language
Knowledge of the libraries used
Knowledge of the compiler
Optimiser

ABI (Itanium, Microsoft, ...)

Knowledge of the hardware architecture

CPU architecture (Instruction set, pipeline, SIMD...)
Cache hierarchy (Registers, L1/2/3 cache)
Prefetcher, translation lookaside buffer
Branch predictor, branch target buffer

Avoid unnecessary work

Avoid unnecessary copies
Avoid unnecessary function calls / indirections
inline functions
prefer std: variant / CRTP / "deducing this" (since C++23)
over virtual functions
Make as many decisions as possible at compile time

constexpr all the things
Template metaprogramming
Generate lookup tables at compile time

Use efficient mathematical operations

Fast approximations
Use powers of two for sizes (compiler can replace division/mod with bit shifts)
Lookup tables
Many other techniques

Low-level bit manipulation in C++

Object lifetime rules
Aliasing rules
Alignment rules
Object representations
Value representations
std::bit_cast (since C++20)
Implicit-lifetime types (since C++20)
std::start_lifetime_as (since C++23)

The optimiser & undefined behaviour

memory-related UB
type system violation
out-of-bounds
lifetime violation (dangling pointers, use-after-free)
uninitialised variables
data races
signed integer overflow
infinite loops with no side effects

The optimiser & undefined behaviour

std::assume_aligned (Ct+20)
[[assume]] (C++23)
[[noalias]] (C++26 ??)
[unsequenced]] == [[gnu::const]]
[reproducible]] == [[gnu::pure]] (C23, C++26 ??)

Sub-properties of reproducible & unsequenced

stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
effectless: function that does not have observable side effects
idempotent: repeated evaluation gives the same result (hence may read global state)
independent: does not depend on other state than the arguments or constants (hence may write to globals)
reproducible: effectless and idempotent
unsequenced: stateless, effectless, idempotent, and independent

Assume explained:

  int f(int i) {
  	[[assume(++i == 43)]]
    return i;
  }
  // function f can be optimized to '42'

C++ currently does not have a way to tell the compiler that the pointer are not alias:

CPU pipeline hazards

Branch hazard
Data hazard
Hardware hazard

Limited amount of adders/shifters
Limited amount of load/store units
(latest Intel CPUs: 3 loads + 2 stores per cycle)
→ sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD

https://vsdmars.blogspot.com/2022/02/cnotes-branchless-programming-in-c.html

[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

Does not affect branch predictor!
Can affect code layout
Have various of pitfalls

Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
A Journey Through Branch Prediction and Compiler Optimizations" (2022)

The 3 Types of Data Hazards

RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.

Rarely have to worry about this breaking your program because the system handles it automatically:

The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.

SIMD

"Single instruction, multiple data"
CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
How to use?
Auto-vectorisation
Explicit vectorisation using SIMD libraries

Google Highway, xsimd, vectorclass, eve, std:: simd proposal for C++26 (P1928)

Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
Writing intrinsics
Writing assembly
SWAR ("SIMD Within A Register")

Other SIMD considerations

If you know the exact target CPU:

use arch compiler flags

If you don't:

dynamic dispatch
function multiversioning (GCC/Clang only)

function multiversioning:

#include <iostream>

// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2"))) 
void process_data() {
    std::cout << "Running high-performance AVX2 vectorized version!\n";
    // Fast vector math goes here
}

// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2"))) 
void process_data() {
    std::cout << "Running mid-tier SSE4.2 version!\n";
}

// 3. The mandatory default fallback version
__attribute__((target("default"))) 
void process_data() {
    std::cout << "Running generic baseline version.\n";
}

int main() {
    // You call it like a regular function. 
    // The resolution happens completely behind the scenes.
    process_data(); 
    return 0;
}

Autovectorisation

Highly dependent on compiler
Only works with vectors/arrays of int, char, double etc. - no structs
Workaround: struct of arrays instead of array of structs
Traverse data linearly
for loop, not while loop
Number of iterations predetermined (ideally, known at compile time)
No data-dependent break, goto, etc.
No conditionals
No data dependencies between array elements
No aliasing

Cache miss

any optimization is pointless if there's a cache miss.

Ataraxia through Epoché

Jun 16, 2026

[low latency] What is low latency (define)

2 Categories

Microbenchmarks are tricky

Writing efficient code requires...

Avoid unnecessary work

Low-level bit manipulation in C++

The optimiser & undefined behaviour

The optimiser & undefined behaviour

Sub-properties of reproducible & unsequenced

CPU pipeline hazards

[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

The 3 Types of Data Hazards

SIMD

Other SIMD considerations

Autovectorisation

Cache miss

No comments:

Post a Comment

Jun 16, 2026

[low latency] What is low latency (define)

2 Categories

Microbenchmarks are tricky

Writing efficient code requires...

Avoid unnecessary work

Low-level bit manipulation in C++

The optimiser & undefined behaviour

The optimiser & undefined behaviour

Sub-properties of reproducible & unsequenced

CPU pipeline hazards

[likely]] and [unlikely]]https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

The 3 Types of Data Hazards

SIMD

Other SIMD considerations

Autovectorisation

Cache miss

No comments:

Post a Comment

[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html