Mar 25, 2025

[Bits manipulation]

align:

return (value + align_size - 1) & (~(align_size - 1);

has_single_bit:

return x && !(x & (x - 1));

midpoint:

return (a & b) + (a ^ b) / 2;

負i保 減一去
Set union A | B 
Set intersection A & B 
Set subtraction A & ~B 
Set negation ALL_BITS ^ A or ~A 
Set bit A |= 1 << bit 
Clear bit A &= ~(1 << bit) 
Test bit (A & 1 << bit) != 0 
Extract last bit A & -A or A & ~(A - 1) or x ^ (x & (x - 1)) 
Remove last bit A & (A - 1) 
Get all 1-bits ~0

Mar 14, 2025

[cppcon-2024] Ultrafast Trading Systems in C++

Reference:
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024  


Latency constraint for algorithmic trading

It's not just about being fast but being fast to be accurate


Latency constraint to not drop packets on the NIC

Limited number of buffers

Not reading fast enough will cause the application to drop packets, causing an outage.

Up to hundreds of thousands of price updates per second


Principle 1: Most of time no need for node containers(associated container)

Paper: Optimizing Open Addressing


Principle 2: Understanding the problem(by looking at the data)


Principle 3: hand tailored(specialized) algorithms are key to achieve performance.

Running perf on a benchmark

Trick
  1.  fork() before bm code
  2.  $ perf stat -I 10000 -M Frontend_Bound, Backend_Bound, Bad_Speculation, Retiring, -p pid 


  3. $ perf record - g -p <pid>

Branchless binary search

reference:




Access HW counts with libpapi




Binary search - memory access

Linear search is blazing fast.

Principle 4: Simplicity is the ultimate sophistication.


Principle 5: Mechanical sympathy. Harmony with hardware.


I-Cache missies - likely/unlikely attributes.


I-Cache misses - Immediately Invoked Function Expressions(IIFE)
inline or not inline, inline might cause I-Cache misses



Lambda, Functor vs. std::function
Use Lambda, because std::function do type erasure, which is hard to debug.



Transport: networking & concurrency
General pattern:
 Kernel bypass when receiving data from the exchange (or other low-latency signals)
 Dispatch / Fan-out to processes on the same server.


Userspace Networking 




Principle 6: True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove.



Shared memory
  • Why shared memory
    • if you don't need sockets, no need to pay for their complexity
    • As fast as it gets, kernel isn't involved in any operations(only if you mmap it)
    • multi processes requires it - which is good for minimizing operational risk.
  • What works well in shared memory
    • Contiguous blocks of data: arrays.
    • One writer, one or multiple readers, stay away from multiple writers.
    • shm_open, mmap, munmap, shm_unlink, ftruncate, flock...


Concurrent queues



Principle 7: Choose the right tool for the right task.


FastQueue - Design (how Go's goroutine queue is designed)






Feb 12, 2025

[Algorithm] Fowler–Noll–Vo (FNV) hash

static uint64_t FNVhash64(uint64_t val) {
  const uint64_t FNV_offset_basis_64 = 0xCBF29CE484222325L;
  const uint64_t FNV_prime_64 = 1099511628211L;

  uint64_t hashval = FNV_offset_basis_64;
  for (int i = 0; i < 8; i++) {
    hashval = hashval ^ (val & 0xff);
    hashval = hashval * FNV_prime_64;
    val >>= 8;
  }
  return (hashval & 0x8000000000000000) ? -hashval : hashval;
}

Jan 31, 2025

[C++] transparently replaceable

https://eel.is/c++draft/basic.life#9

An object o1 is transparently replaceable by an object o2 if 

(9.1) the storage that o2 occupies exactly overlays the storage that o1 occupied,

(9.2) o1 and o2 are of the same type (ignoring the top-level cv-qualifiers)

(9.3) o1 is not a const, complete object

(9.4) neither o1 nor o2 is a potentially-overlapping subobject ([intro.object])

(9.5) either o1 and o2 are both complete objects, or o1 and o2 are direct subobjects of objects p1 and p2 , respectively, and p1 is transparently replaceable by p2 .


Thus, corner case of std::optional<T>:

If optional<T> holds its T subobject using a [[no_unique_address]] member (in order to pack its bool into the tail padding of T), then you can't use a pointer to the old object to transparently point to the new one.