Reference:
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024
Latency constraint for algorithmic trading
It's not just about being fast; you have to be fast in order to be accurate.
Latency constraint to not drop packets on the NIC
Limited number of buffers
Not reading fast enough will cause the application to drop packets, causing an outage.
Up to hundreds of thousands of price updates per second
Principle 1: Most of the time there is no need for node containers (the standard associative containers).
Paper: Optimizing Open Addressing
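The alternative is a flat, open-addressing table that keeps entries in one contiguous array. A minimal sketch of linear probing, assuming a power-of-two capacity (illustrative code, not from the talk):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <vector>

    // Minimal linear-probing open-addressing map: keys and values
    // live in one contiguous array, so probing stays cache-friendly,
    // unlike chasing heap nodes in std::unordered_map.
    template <typename K, typename V>
    class FlatMap {
        struct Slot { K key{}; V value{}; bool used = false; };
        std::vector<Slot> slots_;

    public:
        explicit FlatMap(std::size_t capacity) : slots_(capacity) {}  // power of two

        bool insert(const K& key, const V& value) {
            const std::size_t mask = slots_.size() - 1;
            std::size_t i = std::hash<K>{}(key) & mask;
            for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
                if (!slots_[i].used || slots_[i].key == key) {
                    slots_[i] = Slot{key, value, true};
                    return true;
                }
            }
            return false;  // table full
        }

        std::optional<V> find(const K& key) const {
            const std::size_t mask = slots_.size() - 1;
            std::size_t i = std::hash<K>{}(key) & mask;
            for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
                if (!slots_[i].used) return std::nullopt;  // end of probe chain
                if (slots_[i].key == key) return slots_[i].value;
            }
            return std::nullopt;
        }
    };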
Principle 2: Understand the problem (by looking at the data).
Principle 3: Hand-tailored (specialized) algorithms are key to achieving performance.
Running perf on a benchmark
Trick: branchless binary search (sketched below).
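A minimal sketch of the idea, over a sorted int array (illustrative code, not the talk's exact implementation): each step replaces the if/else of classic binary search with arithmetic the compiler can turn into a conditional move, so there is no branch to mispredict.

    #include <cstddef>

    // Branchless lower_bound: the comparison result feeds a
    // multiply/add instead of an if/else, which compilers typically
    // lower to cmov, avoiding mispredictions on unpredictable input.
    const int* branchless_lower_bound(const int* base, std::size_t len, int value) {
        if (len == 0) return base;
        while (len > 1) {
            const std::size_t half = len / 2;
            // Advance base by half iff the probed element is < value.
            base += static_cast<std::size_t>(base[half - 1] < value) * half;
            len -= half;
        }
        return base + static_cast<std::size_t>(*base < value);
    }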
Reference: access hardware counters with libpapi.
Binary search is bound by memory access: its probes jump around the array, defeating the caches and the prefetcher.
For small arrays, linear search is blazing fast: the sequential access pattern is exactly what the hardware likes.
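For comparison, a minimal linear-scan lower bound (illustrative sketch):

    #include <cstddef>

    // Plain linear scan over a small sorted array: sequential access
    // lets the prefetcher hide memory latency, so for small sizes this
    // often beats binary search despite doing more comparisons.
    std::size_t linear_lower_bound(const int* data, std::size_t len, int value) {
        std::size_t i = 0;
        while (i < len && data[i] < value) ++i;
        return i;  // index of the first element >= value
    }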
Principle 4: Simplicity is the ultimate sophistication.
Principle 5: Mechanical sympathy. Harmony with hardware.
I-Cache misses - [[likely]]/[[unlikely]] attributes.
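A small sketch of the C++20 attributes (illustrative code): hinting the cold branch lets the compiler lay out the hot path contiguously and move error handling out of the way.

    #include <cstdio>

    // [[unlikely]] marks the cold path so the compiler can keep the
    // hot path's instructions contiguous, reducing I-cache pressure.
    void on_message(int size) {
        if (size <= 0) [[unlikely]] {
            std::printf("bad message size: %d\n", size);  // cold: error handling
            return;
        }
        std::printf("processing %d bytes\n", size);       // hot path
    }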
I-Cache misses - Immediately Invoked Function Expressions(IIFE)
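An IIFE is a lambda defined and invoked in a single expression. A sketch (illustrative names, not the talk's code): it groups complex initialization into one callable, which also makes it easy to keep cold setup logic out of the hot path.

    #include <string>

    // IIFE: the lambda runs immediately (note the trailing parentheses),
    // letting a const variable be initialized with non-trivial logic that
    // stays grouped -- and outlineable -- instead of being spread inline
    // through the hot function.
    std::string make_label(bool formal) {
        const std::string greeting = [&] {
            return formal ? std::string("Good day") : std::string("Hi");
        }();  // invoked immediately
        return greeting + ", world";
    }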
To inline or not to inline: inlining can itself cause I-Cache misses by bloating the hot path.
Lambda, Functor vs. std::function
Use lambdas (or functors): std::function does type erasure, which adds an indirect call and is hard to debug.
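A sketch of the difference (illustrative names): taking the callable as a template parameter preserves its concrete type, so the compiler can inline the call; std::function hides it behind type erasure.

    #include <functional>

    // Statically dispatched: the lambda's concrete type is preserved,
    // so the calls can be inlined.
    template <typename F>
    int apply_twice(F&& f, int x) { return f(f(x)); }

    // Type-erased: indirect call through std::function (may also allocate).
    int apply_twice_erased(const std::function<int(int)>& f, int x) {
        return f(f(x));
    }

    int main() {
        auto inc = [](int v) { return v + 1; };
        return apply_twice(inc, 1) + apply_twice_erased(inc, 1);  // 3 + 3
    }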
Transport: networking & concurrency
General pattern:
- Kernel bypass when receiving data from the exchange (or other low-latency signals)
- Dispatch / fan-out to processes on the same server
Userspace Networking
- Solarflare - industry standard low-latency NIC (https://www.amd.com/en/products/ethernet-adapters.html)
- OpenOnLoad (https://github.com/majek/openonload)
- Easy to use: compatible with BSD sockets (no code changes needed!)
- $ onload --profile=latency ./your_binary
- TCPDirect (https://stacresearch.com/system/files/resource/files/STAC-Summit-7-November-2016-Solarflare.pdf)
- Custom TCP/UDP stack
- Reduced set of features
- EF_VI (https://docs.amd.com/r/en-US/ug1586-onload-user/ef_vi)
- Layer 2 API: interface, buffers and memory management (similar to DPDK)
- Lowest latency
Principle 6: True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove.
Shared memory
- Why shared memory
- If you don't need sockets, there's no need to pay for their complexity
- As fast as it gets: the kernel isn't involved in any operation (only in the initial mmap)
- Multi-process architectures require it - which is good for minimizing operational risk
- What works well in shared memory
- Contiguous blocks of data: arrays.
- One writer, one or many readers; stay away from multiple writers
- shm_open, mmap, munmap, shm_unlink, ftruncate, flock... (see the sketch below)
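A minimal single-writer setup with POSIX shared memory (the segment name and size are illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    // Create a shared-memory segment and map it. After mmap, reads and
    // writes are plain memory accesses -- no kernel involvement. A real
    // system would place a ring buffer here, with atomic indices for
    // one-writer/many-reader coordination.
    int main() {
        const char* name = "/md_feed";        // illustrative segment name
        const std::size_t size = 1 << 20;     // 1 MiB

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { std::perror("shm_open"); return 1; }
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { std::perror("ftruncate"); return 1; }

        void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }

        std::memcpy(mem, "hello", 6);         // writer publishes data

        munmap(mem, size);
        close(fd);
        shm_unlink(name);                     // remove the segment when done
        return 0;
    }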