Reference:
https://github.com/crill-dev/crill
Low latency <-> High throughput
Server side is in the middle.
HFT, Audio, Game is more lean to the left.
2 Categories
- Efficient programming
- Programming for deterministic execution time
Most crucial thing: measuring!
Efficient programming
Microbenchmarks are tricky
- Warm the cache
- Randomise the heap
- Measure release build with same compiler flags
- But optimisations change what code you are measuring!
- A lot of stuff can "constexpr away" in microbenchmark but not in production code
Writing efficient code requires...
- Knowledge of the programming language
- Knowledge of the libraries used
- Knowledge of the compiler
- Optimiser
- ABI (Itanium, Microsoft, ...)
- Knowledge of the hardware architecture
- CPU architecture (Instruction set, pipeline, SIMD...)
- Cache hierarchy (Registers, L1/2/3 cache)
- Prefetcher, translation lookaside buffer
- Branch predictor, branch target buffer
Avoid unnecessary work
- Avoid unnecessary copies
- Avoid unnecessary function calls / indirections
- inline functions
- prefer std: variant / CRTP / "deducing this" (since C++23)
over virtual functions - Make as many decisions as possible at compile time
- constexpr all the things
- Template metaprogramming
- Generate lookup tables at compile time
- Use efficient mathematical operations
- Fast approximations
- Use powers of two for sizes (compiler can replace division/mod with bit shifts)
- Lookup tables
- Many other techniques
Low-level bit manipulation in C++
- Object lifetime rules
- Aliasing rules
- Alignment rules
- Object representations
- Value representations
- std::bit_cast (since C++20)
- Implicit-lifetime types (since C++20)
- std::start_lifetime_as (since C++23)
The optimiser & undefined behaviour
- memory-related UB
- type system violation
- out-of-bounds
- lifetime violation (dangling pointers, use-after-free)
- uninitialised variables
- data races
- signed integer overflow
- infinite loops with no side effects
The optimiser & undefined behaviour
- std::assume_aligned (Ct+20)
- [[assume]] (C++23)
- [[noalias]] (C++26 ??)
- [unsequenced]] == [[gnu::const]]
[reproducible]] == [[gnu::pure]] (C23, C++26 ??)
Sub-properties of reproducible & unsequenced
- stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
- effectless: function that does not have observable side effects
- idempotent: repeated evaluation gives the same result (hence may read global state)
- independent: does not depend on other state than the arguments or constants (hence may write to globals)
- reproducible: effectless and idempotent
- unsequenced: stateless, effectless, idempotent, and independent
C++ currently does not have a way to tell the compiler that the pointer are not alias:
CPU pipeline hazards
- Branch hazard
- Data hazard
- Hardware hazard
- Limited amount of adders/shifters
- Limited amount of load/store units
- (latest Intel CPUs: 3 loads + 2 stores per cycle)
- → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD
[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html
- Does not affect branch predictor!
- Can affect code layout
- Have various of pitfalls
- Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
- Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
- A Journey Through Branch Prediction and Compiler Optimizations" (2022)
The 3 Types of Data Hazards
- RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
- WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
- WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.
Rarely have to worry about this breaking your program because the system handles it automatically:
- The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
- The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.
SIMD
- "Single instruction, multiple data"
- CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
- How to use?
- Auto-vectorisation
- Explicit vectorisation using SIMD libraries
- Google Highway, xsimd, vectorclass, eve, std::simd proposal for C++26 (P1928)
- Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
- Writing intrinsics
- Writing assembly
- SWAR ("SIMD Within A Register")
Other SIMD considerations
- If you know the exact target CPU:
- use arch compiler flags
- If you don't:
- dynamic dispatch
- function multiversioning (GCC/Clang only)
function multiversioning:
#include <iostream>
// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2")))
void process_data() {
std::cout << "Running high-performance AVX2 vectorized version!\n";
// Fast vector math goes here
}
// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2")))
void process_data() {
std::cout << "Running mid-tier SSE4.2 version!\n";
}
// 3. The mandatory default fallback version
__attribute__((target("default")))
void process_data() {
std::cout << "Running generic baseline version.\n";
}
int main() {
// You call it like a regular function.
// The resolution happens completely behind the scenes.
process_data();
return 0;
}
Autovectorisation
- Highly dependent on compiler
- Only works with vectors/arrays of int, char, double etc. - no structs
- Workaround: struct of arrays instead of array of structs
- Traverse data linearly
- for loop, not while loop
- Number of iterations predetermined (ideally, known at compile time)
- No data-dependent break, goto, etc.
- No conditionals
- No data dependencies between array elements
- No aliasing
Cache miss
any optimization is pointless if there's a cache miss.
Minimise data cache misses
- Data locality
- Align data on cachelines
- Concurrency aspect (true/false sharing)
- Contiguous data traversal (also good for prefetcher)
- "Almost always vector"
- Cache-friendly associative containers
- std::flat_set/std::flat_map (C++23), Abseil containers
- Cache-friendly algorithms
- Cache-friendly binary search
Minimise instruction cache misses
- Consider generated code layout & alignment
- Avoid branches
- Avoid virtual functions
- std::variant
- Compile-time polymorphism
- CRTP, mixins, "deducing this" (since C++23)
Keep the cache warm
- Data cache
- Periodically poke data on a timer
- Instruction cache
- Periodically run the hot path with dummy input/output
Other fun hardware problems
- CPU throttling
- Due to overheating
- spread thermal dissipation
- Due to low activity
- keep CPU busy with pause instructions
Programming for deterministic execution time
What not to do in the hot path:
- Dynamic memory allocations/deallocations
- blocking the thread
- I/O
- exceptions
- context switches / mode switches (user/kernel space)
- syscalls
- calling into unknown code
- loops without definite bounds
- algorithms > O(1), or with no statically known upper bound on N
Dynamic memory allocations/deallocations
Avoiding allocations
- Do not use data types that allocate
- Do not use algorithms that allocate
- Do not use data structures that allocate
Don't use(in real time):
- std::stable_sort
- std::stable_partition
- std::inplace_merge
Do use(in real time):
std::array
std::pair
std::tuple
std::optional
std::variant
Don't use(in real time):
std::any
std::function
std::vector, std::deque, std::list etc.
Customer allocators(also for coroutine):
tcmalloc, rpmalloc...
are not good for ultra low latency and real-time:
- minimising average cost, not worst case
- not constant time
- multithreaded (locks)
- eventually go to OS to request dynamic memory
Custom Allocators
Preallocate everything
- Monotonic allocators
- std::pmr::monotonic_buffer_resource
- Pool allocators
- std::pmr::unsynchronised_pool_resource
- Frame allocators
- Arena allocators
- Double-ended allocators for big/small buffers
- Lock-free allocator (require helper thread)
Lambda does not malloc.
Coroutine might.
rely on the optimiser?
→ Eyal Zedaka: "Using Coroutines to Implement C++
Exceptions for Freestanding Environments" (CppCon 2021)
- create and suspend coroutine frame upfront
- write your own promise type, defining its own custom operator new and operator delete
- Don't use coroutines in a low-latency/real-time scenario
Blocking the thread
Don't use mutex. (and spanner violates all these rules in every aspects lol)
Concurrency
- atomic == indivisible, race-free
- lock-free == at least one thread is guaranteed to make progress
- wait-free == all threads are guaranteed to make progress
Wait-free concurrency
- Can't block hot path (no mutexes, no spinning)
- Can't do any syscalls
- Can't do anything with unbounded/non-deterministic runtime
- Only available synchronisation mechanism:
- std::atomic
- std::atomic_ref (since C++20)
- never on hot path:
- spin on std::atomic, compare_exchange in a loop, etc.
- use std::atomic<T>::wait/notify_one/notify_all
- always:
- static_assert(std::atomic<T>::is_always_lock_free);
How:
- Passing data to/from "hot path" thread
- wait-free queue
- Sharing data with "hot path" thread
- Hot path reads
- spinlock try_lock (involve syscall.) Solution: spinlock with try_lock.
- Use with spinlock, not std: mutex!
- Use progressive backoff to avoid wasting energy
- Tradeoffs:
- Reader always wait-free
- Single reader
- Only works if reading is allowed to fail
- Reader blocks writer(s) who needs to spin
- one writer (or multiple writers who block each other)
- "spin-on-write"
- Easy to use
- Reader always wait-free
- Tradeoffs:
- Single reader
- read is 2x faster then std::atomic::exchange; because .load read from
local CPU cache(since the value is MESI shared) - reader blocks writer(s) who need to spin
- one writer (or multiple writers who block each other)
- writer needs to heap-allocate + copy
std::unique_ptr‹biquad_coefficients> storage;
std::atomic<biquad_coefficients*> coeffs;
void process(audio_buffer& buffer) {
auto* current_coeffs = coeffs.exchange(nullptr);
process_biquad(buffer, *current_coffs) ;
coeffs.store(current_coeffs) ;
}
void update_coeffs (biquad_coefficients new_coeffs) {
auto new_coeffs = std::make_unique‹biquad_coefficients>(new_coeffs);
for (auto* expected = storage.get();
!coeffs.compare_exchange_weak(expected, new_coeffs.get());
expected = storage.get() /* spin */;)
storage = std::move(new_coffs);
}
// Or just use crill lib
crill::spin_on_write_object<biquad_coefficients> coeffs;
void process (audio_buffer& buffer) {
auto read_ptr = coeffs. lock_read();
process_biquad (buffer, *read_ptr);
}
void update_coeffs(biquad_coefficients new_coeffs) {
coeffs. update(new_coeffs);
}
- read is now a single atomic load (= no overhead on modern platforms)
- Multiple concurrent readers & writers
- Readers don't block writers
- But we need to solve deferred reclamation problem!
- hard...
- https://www.youtube.com/watch?v=7fKxIZOyBCE
crill::defer_reclaim_object<biquad_coefficients> coeffs; void process(audio_buffer& buffer) { uto read_ptr = coeffs. Lock_read() ; process_biquad (buffer, *read_ptr); } void update_coeffs (biquad_coefficients new_coeffs) { coeffs.update(new_coeffs); } void timer_callback(){ coeffs.reclaim(); } - Reading always wait-free
- Multiple readers
- algorithm simplifies greatly if single reader
- read is single atomic load (= no overhead on modern platforms)
- Readers do not block writers
- Writer needs to do heap allocation
- User needs to manage reclamation
- Variation: reclaim_on_write_object
- Reading always wait-free
- Multiple readers
- algorithm simplifies greatly if single reader
- read is single atomic load (= no overhead on modern platforms)
- Writer waits for active reader(s) to finish
- Writing is fast and does not require heap allocation
- No need to manage reclamation
- Double-buffering (just atomic SWAP to update the data.)
- Always be aware of ABA problem. Thus always having state in the single
atomic variable.std::array<frequency_spectrum, 2> slots; std::atomic<int> idx = {0}; void process (audio_buffer& audio_in) { auto spectrum = calculate_spectrum(audio_in); int write_id = id. load(); slots [write_idx] = spectrum; } void update_spectrum() { int read_id = idx.fetch_xor(1); draw_spectrum(slots[read_idx]); } // Solve the ABA problem std: : array<frequency_spectrum, 2> slots; std: :atomic<int> idx = {0}; enum { BIT_IDX = (1 << 0), BIT_NEWDATA = (1 << 1), BIT_BUSY = (1 << 2), }; void process (audio_buffer& audio_in) { auto spectrum = calculate_spectrum(audio_in); int write_idx = idx.fetch_or(BIT_BUSY) & BIT_IDX; slots [write_idx] = spectrum; idx.store ((write_idx & BIT_IDX) | BIT_NEWDATA); } // ... - SeqLock (HFT technique)
- Writing always wait-free
- Good solution if writing happens more rarely, data not too large
- Tradeoffs:
- Single writer, multiple readers
- Readers lock-free but not wait-free (might have to retry unbounded nr of times)
- Data must be trivially copyable
- Overhead of copying data atomically on both reader & writer
std::atomic<std::size_t> seq = 0; // single thread doing the write. void store(T t) noexcept { std::size_t old_seq = seq.fetch_add(1); /* // Faster auto old_seq = seq.load(std::memory_order_relaxed); seq.store(old_seq + 1, std::memory_order_relaxed); std::atomic_thread_fence(std::memory_order_release); */ // write data... seq. store(old_seq + 2); } bool try_load(T& t) const noexcept { std::size_t seq1 = seq.load(std::memory_order_acquired); if (seq1 % 2 != 0) return false; // read data... /* Use Byte-wise atomic memcpy to read the data; or chunk it; DO NOT USE memcpy for (size_t i = 0; i < count; ++i) { reinterpret_cast<char*>(dest)[i] = atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order_relaxed); } atomic_thread_fence(order); */ std::atomic_thread_fence(std::memory_order_acquired); std: :size_t seq2 = seq.load(std::memory_order_relaxed); return seq1 == seq2; }
Ring-Buffer, as always
I/O
- With other threads: Push message into wait-free specific queue
- With other processes: Shared memory
- With hardware: Direct Memory Access (DMA)
Reading data from disk on the hot path
- Pre-load into RAM and lock address range (to prevent swap-out) / MMAP
- mlock (POSIX)
mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process's virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager. Memory locking and unlocking are performed in units of whole pages. - VirtualLock (Windows)
- Disk streaming
- Pre-load into RAM first ~ 100 ms of every possible sound
- Once it starts playing, start filling in the rest from disk (on a background thread)
No exceptions.
Avoiding context switches/mode switches
- Mainstream operating systems: Thread priority
- Real-time operating systems: deterministic thread scheduler
- If you control the hardware:
- Kernel bypass
- If your hot path is in a single thread,
- and you don't care about efficiency of other threads:
- Turn off hyperthreading
- Pin hot path thread to one CPU core