Jun 16, 2026

[low latency] What is low latency (definition)

Reference:
https://github.com/crill-dev/crill

Low latency <-> High throughput

Server side is in the middle.

HFT, Audio, Game is more lean to the left.

2 Categories

  • Efficient programming
  • Programming for deterministic execution time
Most crucial thing: measuring!







Efficient programming

Microbenchmarks are tricky

  • Warm the cache
  • Randomise the heap
  • Measure release build with same compiler flags
  • But optimisations change what code you are measuring!
    • A lot of stuff can "constexpr away" in microbenchmark but not in production code


Writing efficient code requires...

  • Knowledge of the programming language
  • Knowledge of the libraries used
  • Knowledge of the compiler
  • Optimiser
    • ABI (Itanium, Microsoft, ...)
  • Knowledge of the hardware architecture
    • CPU architecture (Instruction set, pipeline, SIMD...)
    •  Cache hierarchy (Registers, L1/2/3 cache)
    • Prefetcher, translation lookaside buffer
    • Branch predictor, branch target buffer


Avoid unnecessary work

  • Avoid unnecessary copies
  • Avoid unnecessary function calls / indirections
  • inline functions
  • prefer std: variant / CRTP / "deducing this" (since C++23)
    over virtual functions
  • Make as many decisions as possible at compile time
    • constexpr all the things
    • Template metaprogramming
    • Generate lookup tables at compile time
  • Use efficient mathematical operations
    • Fast approximations
    • Use powers of two for sizes (compiler can replace division/mod with bit shifts)
    • Lookup tables
    • Many other techniques


Low-level bit manipulation in C++

  • Object lifetime rules
  • Aliasing rules
  • Alignment rules
  • Object representations
  • Value representations
  • std::bit_cast (since C++20)
  • Implicit-lifetime types (since C++20)
  • std::start_lifetime_as (since C++23)


The optimiser & undefined behaviour

  • memory-related UB
  • type system violation
  • out-of-bounds
  • lifetime violation (dangling pointers, use-after-free)
  • uninitialised variables
  • data races
  • signed integer overflow
  • infinite loops with no side effects


The optimiser & undefined behaviour


Sub-properties of reproducible & unsequenced

  • stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
  • effectless: function that does not have observable side effects
  • idempotent: repeated evaluation gives the same result (hence may read global state)
  • independent: does not depend on other state than the arguments or constants (hence may write to globals)
  • reproducible: effectless and idempotent
  • unsequenced: stateless, effectless, idempotent, and independent

Assume explained:

  int f(int i) {
  	[[assume(++i == 43)]]
    return i;
  }
  // function f can be optimized to '42'
  


C++ currently does not have a way to tell the compiler that the pointer are not alias:



CPU pipeline hazards

  • Branch hazard
  • Data hazard
  • Hardware hazard
    • Limited amount of adders/shifters
    • Limited amount of load/store units
    • (latest Intel CPUs: 3 loads + 2 stores per cycle)
    • → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD



[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

  • Does not affect branch predictor!
  • Can affect code layout
  • Have various of pitfalls
    • Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
    • Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
    • A Journey Through Branch Prediction and Compiler Optimizations" (2022)

The 3 Types of Data Hazards

  • RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
  • WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
  • WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.

Rarely have to worry about this breaking your program because the system handles it automatically:

  • The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
  • The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.


SIMD

  • "Single instruction, multiple data"
  • CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
  • How to use?
  • Auto-vectorisation
  • Explicit vectorisation using SIMD libraries
    • Google Highway, xsimd, vectorclass, eve, std::simd proposal for C++26 (P1928)
  • Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
  • Writing intrinsics
  • Writing assembly
  • SWAR ("SIMD Within A Register")

Other SIMD considerations

  • If you know the exact target CPU:
    • use arch compiler flags
  • If you don't:
    • dynamic dispatch
    • function multiversioning (GCC/Clang only)

function multiversioning:

#include <iostream>

// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2"))) 
void process_data() {
    std::cout << "Running high-performance AVX2 vectorized version!\n";
    // Fast vector math goes here
}

// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2"))) 
void process_data() {
    std::cout << "Running mid-tier SSE4.2 version!\n";
}

// 3. The mandatory default fallback version
__attribute__((target("default"))) 
void process_data() {
    std::cout << "Running generic baseline version.\n";
}

int main() {
    // You call it like a regular function. 
    // The resolution happens completely behind the scenes.
    process_data(); 
    return 0;
}  

Autovectorisation

  • Highly dependent on compiler
  • Only works with vectors/arrays of int, char, double etc. - no structs
  • Workaround: struct of arrays instead of array of structs
  • Traverse data linearly
  • for loop, not while loop
  • Number of iterations predetermined (ideally, known at compile time)
  • No data-dependent break, goto, etc.
  • No conditionals
  • No data dependencies between array elements
  • No aliasing




Cache miss

any optimization is pointless if there's a cache miss.
Minimise data cache misses
  • Data locality
  • Align data on cachelines
  • Concurrency aspect (true/false sharing)
  • Contiguous data traversal (also good for prefetcher)
  • "Almost always vector"
  • Cache-friendly associative containers
  • std::flat_set/std::flat_map (C++23), Abseil containers
  • Cache-friendly algorithms
    • Cache-friendly binary search
Minimise instruction cache misses
  • Consider generated code layout & alignment
  • Avoid branches
  • Avoid virtual functions
    • std::variant
    • Compile-time polymorphism
    • CRTP, mixins, "deducing this" (since C++23)

Keep the cache warm
  • Data cache
    • Periodically poke data on a timer
  • Instruction cache
    • Periodically run the hot path with dummy input/output


Other fun hardware problems
  • CPU throttling
    • Due to overheating
      • spread thermal dissipation
    • Due to low activity
      • keep CPU busy with pause instructions




Programming for deterministic execution time

What not to do in the hot path:

  • Dynamic memory allocations/deallocations
  • blocking the thread
  • I/O
  • exceptions
  • context switches / mode switches (user/kernel space)
  • syscalls
  • calling into unknown code
  • loops without definite bounds
  • algorithms > O(1), or with no statically known upper bound on N

Dynamic memory allocations/deallocations

Avoiding allocations
  • Do not use data types that allocate
  • Do not use algorithms that allocate
  • Do not use data structures that allocate
Don't use(in real time):
  • std::stable_sort 
  • std::stable_partition 
  • std::inplace_merge

Do use(in real time):
std::array
std::pair
std::tuple
std::optional 
std::variant

Don't use(in real time):
std::any
std::function
std::vector, std::deque, std::list etc.


Customer allocators(also for coroutine):
tcmalloc, rpmalloc...
are not good for ultra low latency and real-time:
  • minimising average cost, not worst case
  • not constant time
  • multithreaded (locks)
  • eventually go to OS to request dynamic memory

Custom Allocators
Preallocate everything
  • Monotonic allocators
  • std::pmr::monotonic_buffer_resource
  • Pool allocators
  • std::pmr::unsynchronised_pool_resource
  • Frame allocators
  • Arena allocators
  • Double-ended allocators for big/small buffers
  • Lock-free allocator (require helper thread)


Lambda does not malloc.

Coroutine might.
rely on the optimiser?
→ Eyal Zedaka: "Using Coroutines to Implement C++
Exceptions for Freestanding Environments" (CppCon 2021)
  • create and suspend coroutine frame upfront
  • write your own promise type, defining its own custom operator new and operator delete
  • Don't use coroutines in a low-latency/real-time scenario


Blocking the thread

Don't use mutex. (and spanner violates all these rules in every aspects lol)


Concurrency
  • atomic == indivisible, race-free
  • lock-free == at least one thread is guaranteed to make progress
  • wait-free == all threads are guaranteed to make progress

Wait-free concurrency
  • Can't block hot path (no mutexes, no spinning)
  • Can't do any syscalls
  • Can't do anything with unbounded/non-deterministic runtime
  • Only available synchronisation mechanism:
    • std::atomic
    • std::atomic_ref (since C++20)
    • never on hot path:
      • spin on std::atomic, compare_exchange in a loop, etc.
      • use std::atomic<T>::wait/notify_one/notify_all
  • always:
    • static_assert(std::atomic<T>::is_always_lock_free);
How:
  • Passing data to/from "hot path" thread
    • wait-free queue
  • Sharing data with "hot path" thread
    • Hot path reads
      • spinlock try_lock (involve syscall.) Solution: spinlock with try_lock.
        • Use with spinlock, not std: mutex! 
        • Use progressive backoff to avoid wasting energy
        • Tradeoffs:
          • Reader always wait-free
          • Single reader
          • Only works if reading is allowed to fail
          • Reader blocks writer(s) who needs to spin
          • one writer (or multiple writers who block each other)
      • "spin-on-write"
        • Easy to use
        • Reader always wait-free
        • Tradeoffs:
          • Single reader
          •  read is 2x faster then std::atomic::exchange; because .load read from
            local CPU cache(since the value is MESI shared)
          •  reader blocks writer(s) who need to spin
          • one writer (or multiple writers who block each other)
          • writer needs to heap-allocate + copy
        std::unique_ptr‹biquad_coefficients> storage;
        std::atomic<biquad_coefficients*> coeffs;
        void process(audio_buffer& buffer) { auto* current_coeffs = coeffs.exchange(nullptr); process_biquad(buffer, *current_coffs) ; coeffs.store(current_coeffs) ; } void update_coeffs (biquad_coefficients new_coeffs) { auto new_coeffs = std::make_unique‹biquad_coefficients>(new_coeffs); for (auto* expected = storage.get(); !coeffs.compare_exchange_weak(expected, new_coeffs.get()); expected = storage.get() /* spin */;) storage = std::move(new_coffs); } // Or just use crill lib crill::spin_on_write_object<biquad_coefficients> coeffs; void process (audio_buffer& buffer) { auto read_ptr = coeffs. lock_read(); process_biquad (buffer, *read_ptr); } void update_coeffs(biquad_coefficients new_coeffs) { coeffs. update(new_coeffs); }
      • RCU
        • read is now a single atomic load (= no overhead on modern platforms)
        • Multiple concurrent readers & writers
        • Readers don't block writers
        • But we need to solve deferred reclamation problem!
          • hard...
          • https://www.youtube.com/watch?v=7fKxIZOyBCE
            crill::defer_reclaim_object<biquad_coefficients> coeffs;
            void process(audio_buffer& buffer) {
              uto read_ptr = coeffs. Lock_read() ;
              process_biquad (buffer, *read_ptr);
            }
            
            void update_coeffs (biquad_coefficients new_coeffs) {
              coeffs.update(new_coeffs);
            }
            
            void timer_callback(){
              coeffs.reclaim();
            }
          • Reading always wait-free
          • Multiple readers
          • algorithm simplifies greatly if single reader
          • read is single atomic load (= no overhead on modern platforms)
          • Readers do not block writers
          • Writer needs to do heap allocation
          • User needs to manage reclamation
        • Variation: reclaim_on_write_object
          • Reading always wait-free
          • Multiple readers
          • algorithm simplifies greatly if single reader
          • read is single atomic load (= no overhead on modern platforms)
          • Writer waits for active reader(s) to finish
          • Writing is fast and does not require heap allocation
          • No need to manage reclamation
    • Hot path writes
      • Double-buffering (just atomic SWAP to update the data.)
        • Always be aware of ABA problem. Thus always having state in the single
          atomic variable.
          std::array<frequency_spectrum, 2> slots;
          std::atomic<int> idx = {0};
          
          void process (audio_buffer& audio_in) {
          	auto spectrum = calculate_spectrum(audio_in);
          	int write_id = id. load();
          	slots [write_idx] = spectrum;
          }
          
          void update_spectrum() {
          	int read_id = idx.fetch_xor(1);
          	draw_spectrum(slots[read_idx]);
          }
          
          // Solve the ABA problem
          std: : array<frequency_spectrum, 2> slots;
          std: :atomic<int> idx = {0};
          enum {
          	BIT_IDX = (1 << 0),
          	BIT_NEWDATA = (1 << 1), 
              BIT_BUSY = (1 << 2),
          };
          void process (audio_buffer& audio_in) {
          	auto spectrum = calculate_spectrum(audio_in);
          	int write_idx = idx.fetch_or(BIT_BUSY) & BIT_IDX;
          	slots [write_idx] = spectrum;
          	idx.store ((write_idx & BIT_IDX) | BIT_NEWDATA);
          }
          // ...
          
      • SeqLock (HFT technique)
        • Writing always wait-free
        • Good solution if writing happens more rarely, data not too large
        • Tradeoffs:
          • Single writer, multiple readers
          • Readers lock-free but not wait-free (might have to retry unbounded nr of times)
          • Data must be trivially copyable
          • Overhead of copying data atomically on both reader & writer
      • std::atomic<std::size_t> seq = 0;
        // single thread doing the write.
        void store(T t) noexcept {
        	std::size_t old_seq = seq.fetch_add(1);
            /*
            // Faster
            auto old_seq = seq.load(std::memory_order_relaxed);
            seq.store(old_seq + 1, std::memory_order_relaxed);
        
            std::atomic_thread_fence(std::memory_order_release);
            */
        	// write data...
        	seq. store(old_seq + 2);
        }
        
        bool try_load(T& t) const noexcept {
        	std::size_t seq1 = seq.load(std::memory_order_acquired);
        	if (seq1 % 2 != 0) return false;
        	
            // read data...
            /*
            
            Use Byte-wise atomic memcpy to read the data;
            or chunk it; DO NOT USE memcpy
            
            for (size_t i = 0; i < count; ++i) {
        		reinterpret_cast<char*>(dest)[i] =
        			atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order_relaxed);
            }
        	
            atomic_thread_fence(order);
            */
            
            std::atomic_thread_fence(std::memory_order_acquired);
        	std: :size_t seq2 = seq.load(std::memory_order_relaxed);
        	return seq1 == seq2;
        }
Ring-Buffer, as always



I/O

  • With other threads: Push message into wait-free specific queue
  • With other processes: Shared memory
  • With hardware: Direct Memory Access (DMA)

Reading data from disk on the hot path
  • Pre-load into RAM and lock address range (to prevent swap-out) / MMAP
    • mlock (POSIX)
      mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process's virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager. Memory locking and unlocking are performed in units of whole pages.
    • VirtualLock (Windows)
  • Disk streaming
    • Pre-load into RAM first ~ 100 ms of every possible sound
    • Once it starts playing, start filling in the rest from disk (on a background thread)

No exceptions.

Avoiding context switches/mode switches
  • Mainstream operating systems: Thread priority
  • Real-time operating systems: deterministic thread scheduler
  • If you control the hardware:
    • Kernel bypass
  • If your hot path is in a single thread,
  • and you don't care about efficiency of other threads:
    • Turn off hyperthreading
    • Pin hot path thread to one CPU core




[Optimization] Peephole Optimizations

Reference:
https://blog.regehr.org/archives/2485

Local Peephole Optimization and Compositional Refinement

Let f be a function targeted for peephole optimization. 
We can structurally decompose f into a target redex o (the optimizable instruction sequence) and a continuation context r (the residual program), such that:
f = r ∘ o

By definition, a correct optimization yields a refined sequence o′ that refines the original sequence o, denoted as:
o ⊑ o′

To lift this local optimization to the function level, we rely on the compositionality of the refinement relation. For any valid refinement definition, the context r must act as a monotonic operator:
If x ⊑ y, then r(x) ⊑ r(y)

Applying compositionality to our decomposition:
r(o) ⊑ r(o′)

Because f = r(o), it transitively follows that:
f ⊑ r(o′)

This proves that the locally optimized function refines the original function. By inductively applying this compositionality principle, we can scale this local refinement guarantee across the entire program.

Jun 13, 2026

[RACI]

The 4 RACI Roles

  • Responsible (The "Doer"): The person (or team) who actually performs the work to complete the task. They are hands-on and do the heavy lifting.
    Rule of thumb: There should be at least one 'R', but multiple contributors can share this responsibility.
  • Accountable (The "Owner"): The individual ultimately answerable for the correct and thorough completion of the deliverable. They delegate the work and sign off on the final result.
    Rule of thumb: There must be exactly one Accountable person per task to guarantee decision-making authority.
  • Consulted (The "Expert"): People who provide subject-matter expertise or input before a decision or action is finalized.
    Rule of thumb: Communication is two-way.
  • Informed (The "Loop"): People who need to be kept up-to-date on project progress, usually stakeholders or leadership.
    Rule of thumb: Communication is one-way—they are simply updated, not required to make decisions.

May 27, 2026

[C++] pointer arithmetic

Refernece:
[C++] Object Lifetimes reading minute
class Vec {
  public:
    double* data() { return &x; }

  private:
    double x,y,z;
};

Eigen::Map<...>(absl::Span<Vec>);

*(Vect::data() + 1) // Does not give us y

Take away: 

* Even though x, y, and z are allocated sequentially in memory without padding,
physical layout does not supersede semantic rules.
* The layout guarantees mean you can safely memcpy the data, or cast a Vec* to a double* to access the first element (x). It does not grant permission to use pointer arithmetic on double* to slide across the members.
* The pointer arithmetic is only guaranteed within the type of array.
* Pointer to variable only is considered as pointer to array of size 1.
* Thus any pointer arithmetic on single variable is considered out-of-bound; compiler is free to assume anything.

Explain:

Only char*, unsigned char*, and std::byte* are explicitly granted an exception in the standard to 
examine the raw object representation. double* enjoys no such privilege.

Fix:

class Vec {
 public:
  double* data() { return data_; } // Legal: returns pointer to element 0 of a 3-element array
 private:
  double data_[3]; // x=data_[0], y=data_[1], z=data_[2]
};

May 10, 2026

心胸格局 器大識深

職場上遇到值得學習的大老

實在是可遇不可求。

與智者同行,與強者為伍。

格局大的人在種樹,

格局小的人在砍樹。

格局大的人懂得長期投資,

願意合作共贏,把餅做大;

所以有些人身邊的貴人越來越多,

有些人身邊的人卻越來越少。

不是能力的差距,

而是格局的差距。

與北七工作實在是心累 LoL