Mar 14, 2025

[cppcon-2024] Ultrafast Trading Systems in C++

Reference:
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024  


Latency constraint for algorithmic trading

It's not just about being fast: you have to be fast in order to be accurate.


Latency constraint to not drop packets on the NIC

Limited number of buffers

Not reading fast enough will cause the application to drop packets, causing an outage.

Up to hundreds of thousands of price updates per second


Principle 1: Most of the time there is no need for node-based (associative) containers

Paper: Optimizing Open Addressing


Principle 2: Understand the problem (by looking at the data)


Principle 3: Hand-tailored (specialized) algorithms are key to achieving performance.

Running perf on a benchmark

Trick
  1. fork() before the benchmark code
  2. $ perf stat -I 10000 -M Frontend_Bound,Backend_Bound,Bad_Speculation,Retiring -p <pid>
  3. $ perf record -g -p <pid>

Branchless binary search

reference:




Access HW counters with libpapi




Binary search - memory access

Linear search is blazing fast.

Principle 4: Simplicity is the ultimate sophistication.


Principle 5: Mechanical sympathy. Work in harmony with the hardware.


I-Cache misses - [[likely]]/[[unlikely]] attributes.


I-Cache misses - Immediately Invoked Function Expressions (IIFE)
To inline or not to inline: inlining cold code can itself cause I-Cache misses.



Lambda, Functor vs. std::function
Prefer a lambda or functor: std::function performs type erasure, which adds an indirect call and is hard to debug.



Transport: networking & concurrency
General pattern:
 Kernel bypass when receiving data from the exchange (or other low-latency signals)
 Dispatch / Fan-out to processes on the same server.


Userspace Networking 




Principle 6: True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove.



Shared memory
  • Why shared memory
    • If you don't need sockets, there's no need to pay for their complexity
    • As fast as it gets: after the initial mmap, the kernel isn't involved in any operation
    • Multi-process setups require it, which is good for minimizing operational risk
  • What works well in shared memory
    • Contiguous blocks of data: arrays.
    • One writer, one or multiple readers, stay away from multiple writers.
    • shm_open, mmap, munmap, shm_unlink, ftruncate, flock...


Concurrent queues



Principle 7: Choose the right tool for the right task.


FastQueue - Design (similar to how Go's goroutine queue is designed)





