Jun 16, 2026

[low latency] What is low latency (define)

Low latency <-> High throughput

Server side is in the middle.

HFT, Audio, Game is more lean to the left.

2 Categories

  • Efficient programming
  • Programming for deterministic execution time
Most crucial thing: measuring!







Microbenchmarks are tricky

  • Warm the cache
  • Randomise the heap
  • Measure release build with same compiler flags
  • But optimisations change what code you are measuring!
    • A lot of stuff can "constexpr away" in microbenchmark but not in production code


Writing efficient code requires...

  • Knowledge of the programming language
  • Knowledge of the libraries used
  • Knowledge of the compiler
  • Optimiser
    • ABI (Itanium, Microsoft, ...)
  • Knowledge of the hardware architecture
    • CPU architecture (Instruction set, pipeline, SIMD...)
    •  Cache hierarchy (Registers, L1/2/3 cache)
    • Prefetcher, translation lookaside buffer
    • Branch predictor, branch target buffer


Avoid unnecessary work

  • Avoid unnecessary copies
  • Avoid unnecessary function calls / indirections
  • inline functions
  • prefer std: variant / CRTP / "deducing this" (since C++23)
    over virtual functions
  • Make as many decisions as possible at compile time
    • constexpr all the things
    • Template metaprogramming
    • Generate lookup tables at compile time
  • Use efficient mathematical operations
    • Fast approximations
    • Use powers of two for sizes (compiler can replace division/mod with bit shifts)
    • Lookup tables
    • Many other techniques


Low-level bit manipulation in C++

  • Object lifetime rules
  • Aliasing rules
  • Alignment rules
  • Object representations
  • Value representations
  • std::bit_cast (since C++20)
  • Implicit-lifetime types (since C++20)
  • std::start_lifetime_as (since C++23)


The optimiser & undefined behaviour

  • memory-related UB
  • type system violation
  • out-of-bounds
  • lifetime violation (dangling pointers, use-after-free)
  • uninitialised variables
  • data races
  • signed integer overflow
  • infinite loops with no side effects


The optimiser & undefined behaviour


Sub-properties of reproducible & unsequenced

  • stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
  • effectless: function that does not have observable side effects
  • idempotent: repeated evaluation gives the same result (hence may read global state)
  • independent: does not depend on other state than the arguments or constants (hence may write to globals)
  • reproducible: effectless and idempotent
  • unsequenced: stateless, effectless, idempotent, and independent


C++ currently does not have a way to tell the compiler that the pointer are not alias:



CPU pipeline hazards

  • Branch hazard
  • Data hazard
  • Hardware hazard
    • Limited amount of adders/shifters
    • Limited amount of load/store units
    • (latest Intel CPUs: 3 loads + 2 stores per cycle)
    • → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD



[likely]] and [unlikely]]

  • Does not affect branch predictor!
  • Can affect code layout
  • Have various of pitfalls
    • Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
    • Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
    • A Journey Through Branch Prediction and Compiler Optimizations" (2022)

The 3 Types of Data Hazards

  • RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
  • WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
  • WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.

Rarely have to worry about this breaking your program because the system handles it automatically:

  • The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
  • The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.


SIMD

  • "Single instruction, multiple data"
  • CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
  • How to use?
  • Auto-vectorisation
  • Explicit vectorisation using SIMD libraries
    • Google Highway, xsimd, vectorclass, eve, std:: simd proposal for C++26 (P1928)
  • Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
  • Writing intrinsics
  • Writing assembly
  • SWAR ("SIMD Within A Register")

Other SIMD considerations

  • If you know the exact target CPU:
    • use arch compiler flags
  • If you don't:
    • dynamic dispatch
    • function multiversioning (GCC/Clang only)

function multiversioning:

#include <iostream>

// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2"))) 
void process_data() {
    std::cout << "Running high-performance AVX2 vectorized version!\n";
    // Fast vector math goes here
}

// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2"))) 
void process_data() {
    std::cout << "Running mid-tier SSE4.2 version!\n";
}

// 3. The mandatory default fallback version
__attribute__((target("default"))) 
void process_data() {
    std::cout << "Running generic baseline version.\n";
}

int main() {
    // You call it like a regular function. 
    // The resolution happens completely behind the scenes.
    process_data(); 
    return 0;
}  

Autovectorisation

  • Highly dependent on compiler
  • Only works with vectors/arrays of int, char, double etc. - no structs
  • Workaround: struct of arrays instead of array of structs
  • Traverse data linearly
  • for loop, not while loop
  • Number of iterations predetermined (ideally, known at compile time)
  • No data-dependent break, goto, etc.
  • No conditionals
  • No data dependencies between array elements
  • No aliasing



[Optimization] Peephole Optimizations

Reference:
https://blog.regehr.org/archives/2485

Local Peephole Optimization and Compositional Refinement

Let f be a function targeted for peephole optimization. 
We can structurally decompose f into a target redex o (the optimizable instruction sequence) and a continuation context r (the residual program), such that:
f = r ∘ o

By definition, a correct optimization yields a refined sequence o′ that refines the original sequence o, denoted as:
o ⊑ o′

To lift this local optimization to the function level, we rely on the compositionality of the refinement relation. For any valid refinement definition, the context r must act as a monotonic operator:
If x ⊑ y, then r(x) ⊑ r(y)

Applying compositionality to our decomposition:
r(o) ⊑ r(o′)

Because f = r(o), it transitively follows that:
f ⊑ r(o′)

This proves that the locally optimized function refines the original function. By inductively applying this compositionality principle, we can scale this local refinement guarantee across the entire program.

Jun 13, 2026

[RACI]

The 4 RACI Roles

  • Responsible (The "Doer"): The person (or team) who actually performs the work to complete the task. They are hands-on and do the heavy lifting.
    Rule of thumb: There should be at least one 'R', but multiple contributors can share this responsibility.
  • Accountable (The "Owner"): The individual ultimately answerable for the correct and thorough completion of the deliverable. They delegate the work and sign off on the final result.
    Rule of thumb: There must be exactly one Accountable person per task to guarantee decision-making authority.
  • Consulted (The "Expert"): People who provide subject-matter expertise or input before a decision or action is finalized.
    Rule of thumb: Communication is two-way.
  • Informed (The "Loop"): People who need to be kept up-to-date on project progress, usually stakeholders or leadership.
    Rule of thumb: Communication is one-way—they are simply updated, not required to make decisions.

May 27, 2026

[C++] pointer arithmetic

Refernece:
[C++] Object Lifetimes reading minute
class Vec {
  public:
    double* data() { return &x; }

  private:
    double x,y,z;
};

Eigen::Map<...>(absl::Span<Vec>);

*(Vect::data() + 1) // Does not give us y

Take away: 

* Even though x, y, and z are allocated sequentially in memory without padding,
physical layout does not supersede semantic rules.
* The layout guarantees mean you can safely memcpy the data, or cast a Vec* to a double* to access the first element (x). It does not grant permission to use pointer arithmetic on double* to slide across the members.
* The pointer arithmetic is only guaranteed within the type of array.
* Pointer to variable only is considered as pointer to array of size 1.
* Thus any pointer arithmetic on single variable is considered out-of-bound; compiler is free to assume anything.

Explain:

Only char*, unsigned char*, and std::byte* are explicitly granted an exception in the standard to 
examine the raw object representation. double* enjoys no such privilege.

Fix:

class Vec {
 public:
  double* data() { return data_; } // Legal: returns pointer to element 0 of a 3-element array
 private:
  double data_[3]; // x=data_[0], y=data_[1], z=data_[2]
};

May 10, 2026

心胸格局 器大識深

職場上遇到值得學習的大老

實在是可遇不可求。

與智者同行,與強者為伍。

格局大的人在種樹,

格局小的人在砍樹。

格局大的人懂得長期投資,

願意合作共贏,把餅做大;

所以有些人身邊的貴人越來越多,

有些人身邊的人卻越來越少。

不是能力的差距,

而是格局的差距。

與北七工作實在是心累 LoL