Reference:
Branchless Programming in C++ - Fedor Pikus - CppCon 2021
Computer Architecture - A quantitative approach (Hennessy, Patterson) Appendix-C
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html
What determines performance?
- Optimal algorithm
Get the result with minimal work. - Efficient use of language
do not do any unnecessary work - Efficient use of hardware
use all available resources
at the same time
all the time
Hazards
- Structural; use stall/bubble
- Data; forwarding, stall/bubble in the middle of pipeline
- Branch;
Freeze/flush; holding or deleting any instructions after the branch until the branch destination is known.
Treat every branch as not taken.
Treat every branch as taken.
Delayed branch.
As for branch hazards, we usually use static branch prediction by profiling.
A branch is usually bimodally distributed.
As for dynamic branch prediction; we use branch history table.
Tools
Google benchmark is our friend https://github.com/google/benchmark
KDAB hotspot https://github.com/KDAB/hotspot
KDE heaptrack https://github.com/KDE/heaptrack
$ perf stat our_binary // shows branch-misses
When do benchmark like this(branch mis-predicting), we should avoid the predicting is done by the compiler, which is damn smart to generate efficient code.
- Optimizing away branches almost always results in doing more work.
- CPU usually has idle compute resources which it can handle a bit of extra work.
- Branch mis-prediction is very expensive.
- Trade off between the extra work vs. the code of the branch is usually impossible to predict; must be measured.
Less branch means better
Tricks:
use hashmap[] as branches. hashmap is O(1).
However; keep in mind this cause extra memory thus use iff branch is poorly predicted and the extra hashmap computations are small.
- Sometimes the compiler will do a branchless transformation for you. (conditional move instruction)
- Compiler's branchless optimization is usually better than ours'.
- This is almost always branchless in reality:
return cond ? x : y; - Never optimize code preemptively.
- Optimize only if the profiler shows high mis-prediction rate.
- Optimizations depend on the compiler.
if (b) s += x;
//vs.
s += b * x;
- Sometimes branchless code is not really branchless
- Indirect function calls are similar to branches
if (cond) f1(); else f2(); - Can be converted to branchess:
funcptr f[2] = {&f2, &f1};
(f[cond])(); - Above code almost never works.
If f1 and f2 were inlined.. - Always measure
Lessons learned
- Predicted branches are cheap
- Mis-predicted branches are very expensive - pipeline flush
- Optimization - user fewer(or zero) branches
- Always use profiler to detect and validate optimization locations
- Don't fight with the compiler - sometimes it does the job for you
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.