Reference:
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024
Latency constraint for algorithmic trading
It's not just about being fast; you have to be fast in order to be accurate.
Latency constraint to not drop packets on the NIC
Limited number of buffers
Not reading fast enough will cause the application to drop packets, causing an outage.
Up to hundreds of thousands of price updates per second
Principle 1: Most of the time there is no need for node containers (the standard associative containers).
Paper: Optimizing Open Addressing
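The alternative is a flat, open-addressing table that keeps entries in one contiguous array. A minimal sketch of linear probing, assuming a power-of-two capacity (illustrative code, not from the talk):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <vector>

    // Minimal linear-probing open-addressing map: keys and values
    // live in one contiguous array, so probing stays cache-friendly,
    // unlike chasing heap nodes in std::unordered_map.
    template <typename K, typename V>
    class FlatMap {
        struct Slot { K key{}; V value{}; bool used = false; };
        std::vector<Slot> slots_;

    public:
        explicit FlatMap(std::size_t capacity) : slots_(capacity) {}  // power of two

        bool insert(const K& key, const V& value) {
            const std::size_t mask = slots_.size() - 1;
            std::size_t i = std::hash<K>{}(key) & mask;
            for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
                if (!slots_[i].used || slots_[i].key == key) {
                    slots_[i] = Slot{key, value, true};
                    return true;
                }
            }
            return false;  // table full
        }

        std::optional<V> find(const K& key) const {
            const std::size_t mask = slots_.size() - 1;
            std::size_t i = std::hash<K>{}(key) & mask;
            for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
                if (!slots_[i].used) return std::nullopt;  // end of probe chain
                if (slots_[i].key == key) return slots_[i].value;
            }
            return std::nullopt;
        }
    };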
Principle 2: Understand the problem (by looking at the data).
Principle 3: Hand-tailored (specialized) algorithms are key to achieving performance.
Running perf on a benchmark
Trick: branchless binary search (sketched below).
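A minimal sketch of the idea, over a sorted int array (illustrative code, not the talk's exact implementation): each step replaces the if/else of classic binary search with arithmetic the compiler can turn into a conditional move, so there is no branch to mispredict.

    #include <cstddef>

    // Branchless lower_bound: the comparison result feeds a
    // multiply/add instead of an if/else, which compilers typically
    // lower to cmov, avoiding mispredictions on unpredictable input.
    const int* branchless_lower_bound(const int* base, std::size_t len, int value) {
        if (len == 0) return base;
        while (len > 1) {
            const std::size_t half = len / 2;
            // Advance base by half iff the probed element is < value.
            base += static_cast<std::size_t>(base[half - 1] < value) * half;
            len -= half;
        }
        return base + static_cast<std::size_t>(*base < value);
    }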
Reference: access hardware counters with libpapi.
Binary search is bound by memory access: its probes jump around the array, defeating the caches and the prefetcher.
For small arrays, linear search is blazing fast: the sequential access pattern is exactly what the hardware likes.
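For comparison, a minimal linear-scan lower bound (illustrative sketch):

    #include <cstddef>

    // Plain linear scan over a small sorted array: sequential access
    // lets the prefetcher hide memory latency, so for small sizes this
    // often beats binary search despite doing more comparisons.
    std::size_t linear_lower_bound(const int* data, std::size_t len, int value) {
        std::size_t i = 0;
        while (i < len && data[i] < value) ++i;
        return i;  // index of the first element >= value
    }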
Principle 4: Simplicity is the ultimate sophistication.
Principle 5: Mechanical sympathy. Harmony with hardware.
I-Cache misses - [[likely]]/[[unlikely]] attributes.
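A small sketch of the C++20 attributes (illustrative code): hinting the cold branch lets the compiler lay out the hot path contiguously and move error handling out of the way.

    #include <cstdio>

    // [[unlikely]] marks the cold path so the compiler can keep the
    // hot path's instructions contiguous, reducing I-cache pressure.
    void on_message(int size) {
        if (size <= 0) [[unlikely]] {
            std::printf("bad message size: %d\n", size);  // cold: error handling
            return;
        }
        std::printf("processing %d bytes\n", size);       // hot path
    }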
I-Cache misses - Immediately Invoked Function Expressions(IIFE)
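An IIFE is a lambda defined and invoked in a single expression. A sketch (illustrative names, not the talk's code): it groups complex initialization into one callable, which also makes it easy to keep cold setup logic out of the hot path.

    #include <string>

    // IIFE: the lambda runs immediately (note the trailing parentheses),
    // letting a const variable be initialized with non-trivial logic that
    // stays grouped -- and outlineable -- instead of being spread inline
    // through the hot function.
    std::string make_label(bool formal) {
        const std::string greeting = [&] {
            return formal ? std::string("Good day") : std::string("Hi");
        }();  // invoked immediately
        return greeting + ", world";
    }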
To inline or not to inline: inlining can itself cause I-Cache misses by bloating the hot path.
Lambda, Functor vs. std::function
Use lambdas (or functors): std::function does type erasure, which adds an indirect call and is hard to debug.
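A sketch of the difference (illustrative names): taking the callable as a template parameter preserves its concrete type, so the compiler can inline the call; std::function hides it behind type erasure.

    #include <functional>

    // Statically dispatched: the lambda's concrete type is preserved,
    // so the calls can be inlined.
    template <typename F>
    int apply_twice(F&& f, int x) { return f(f(x)); }

    // Type-erased: indirect call through std::function (may also allocate).
    int apply_twice_erased(const std::function<int(int)>& f, int x) {
        return f(f(x));
    }

    int main() {
        auto inc = [](int v) { return v + 1; };
        return apply_twice(inc, 1) + apply_twice_erased(inc, 1);  // 3 + 3
    }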
Transport: networking & concurrency
General pattern:
- Kernel bypass when receiving data from the exchange (or other low-latency signals)
- Dispatch / fan-out to processes on the same server
Userspace Networking
- Solarflare - industry standard low-latency NIC (https://www.amd.com/en/products/ethernet-adapters.html)
- OpenOnLoad (https://github.com/majek/openonload)
- Easy to use: compatible with BSD sockets (no code changes needed!)
- $ onload --profile=latency ./your_binary
- TCPDirect (https://stacresearch.com/system/files/resource/files/STAC-Summit-7-November-2016-Solarflare.pdf)
- Custom TCP/UDP stack
- Reduced set of features
- EF_VI (https://docs.amd.com/r/en-US/ug1586-onload-user/ef_vi)
- Layer 2 API: interface, buffers and memory management (similar to DPDK)
- Lowest latency
Principle 6: True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove.
Shared memory
- Why shared memory
- If you don't need sockets, there's no need to pay for their complexity
- As fast as it gets: the kernel isn't involved in any operation (only in the initial mmap)
- Multi-process architectures require it - which is good for minimizing operational risk
- What works well in shared memory
- Contiguous blocks of data: arrays.
- One writer, one or many readers; stay away from multiple writers
- shm_open, mmap, munmap, shm_unlink, ftruncate, flock... (see the sketch below)
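A minimal single-writer setup with POSIX shared memory (the segment name and size are illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    // Create a shared-memory segment and map it. After mmap, reads and
    // writes are plain memory accesses -- no kernel involvement. A real
    // system would place a ring buffer here, with atomic indices for
    // one-writer/many-reader coordination.
    int main() {
        const char* name = "/md_feed";        // illustrative segment name
        const std::size_t size = 1 << 20;     // 1 MiB

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { std::perror("shm_open"); return 1; }
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { std::perror("ftruncate"); return 1; }

        void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }

        std::memcpy(mem, "hello", 6);         // writer publishes data

        munmap(mem, size);
        close(fd);
        shm_unlink(name);                     // remove the segment when done
        return 0;
    }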