Video: https://www.youtube.com/watch?v=3_FXy3cT5C8
Baseline:
Choose a baseline for measuring!
e.g.:
- std::sort
- iostream
- scanf
Differential:
- Run baseline 2n times, measure t(2a)
- Run baseline n times and contender n times, measure t(a+b)
- Relative improvement:
r = t(2a) / (2*t(a+b) - t(2a))
Some overhead noise cancels out.
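The differential scheme can be sketched like this (the harness and the names `baseline`/`contender` are mine, not from the talk):

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Time n interleaved runs of f and g, in seconds.
template <class F, class G>
double timeIt(F f, G g, int n) {
  auto start = Clock::now();
  for (int i = 0; i < n; ++i) { f(); g(); }
  return std::chrono::duration<double>(Clock::now() - start).count();
}

// Relative improvement: r = t(2a) / (2*t(a+b) - t(2a)).
// Running the baseline inside both measurements cancels some overhead noise.
template <class F, class G>
double relativeImprovement(F baseline, G contender, int n) {
  double t2a = timeIt(baseline, baseline, n);   // t(2a): baseline twice
  double tab = timeIt(baseline, contender, n);  // t(a+b): baseline + contender
  return t2a / (2 * tab - t2a);
}
```

With r > 1 the contender is faster than the baseline; r < 1 means it is slower.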
Common benchmarking pitfalls:
- Debug builds (never measure those).
- Different setup for baseline and measured code.
- Including ancillary work (malloc, printf, etc.) in the measurement.
- Mixtures: measure t(a) + t(b), improve t(a), conclude t(b) got improved.
- Optimizing rare cases, pessimizing others.
- Prefer static linking and PDC (position-dependent code).
- Prefer 64-bit code, 32-bit data, e.g. two 32-bit ints instead of one 64-bit int.
- Prefer 32-bit array indexing to pointers.
- Prefer a[i++] to a[++i]  # I am gooood :-) Data dependency!
  # a[i++] simultaneously accesses the array element and increments i.
  # PREFETCH!!! Man, I am really good…
- Prefer regular memory access patterns.
- Minimize flow, avoid data dependencies.
- Use static const for all immutables, constexpr in C++11 and beyond.
  Beware cache issues.
- Use the stack for most variables:
  Hot; 0-cost addressing, like struct/class data members.
- Globals: aliasing issues.
- thread_local is slowest; use local caching.
  (1 instruction on Windows and Linux, 3-4 on OSX.)
- Prefer 32-bit ints to all other sizes.
  (Because of cache-load alignment? Nah, because the CPU ALU can process up to 64 bits of data at once: with two 32-bit ints it can do two ops in parallel; with one 64-bit int, only one op.)
  64-bit may make some code 20x slower.
  8- and 16-bit computations use conversion to 32 bits and back.
- Use small ints in arrays.
- Prefer unsigned to signed (the opposite of the Google coding standard: https://google.github.io/styleguide/cppguide.html), except when converting to floating point.
- Most numbers are small.
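A toy illustration of several of these points together (the function and names are mine): small ints in the array, a 32-bit local for the computation, and 32-bit indexing rather than pointer arithmetic.

```cpp
#include <cstdint>
#include <vector>

// Small ints in the array: compact and cache-friendly.
// The computation itself uses a 32-bit int, since 8/16-bit arithmetic
// gets converted to 32 bits and back anyway.
uint32_t sumBytes(const std::vector<uint8_t>& a) {
  uint32_t total = 0;                      // hot variable, 32-bit
  for (uint32_t i = 0; i < a.size(); ++i)  // 32-bit array indexing
    total += a[i];                         // a[i] widens to 32 bits for free
  return total;
}
```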
Floating point:
- Double precision is as fast as single precision.
- Extended precision is just a bit slower.
- Do not mix the three.
- 1-2 FP add/sub units.
- 1-2 FP mul/div units.
- SSE accelerates throughput for certain computation kernels.
- ints -> FPs cheap, FPs -> ints expensive.
Strength reduction:
- Don’t waste time replacing a /= 2 with a >>= 1; the compiler does that for you.
- Speed hierarchy (fastest to slowest):
  - comparisons
  - (u)int add, sub, bits, shift
  - FP add, sub (separate unit!)
  - (u)int32 mul; FP mul
  - FP division, remainder (for the exponent it’s just a subtraction)
  - (u)int division, remainder (only division by 2 is fast; in general it’s a search op)
OK, if you really need to do (u)int division, look:
Also, the code above was measured!
The unroll factor of 4 is because of the cache line.
Remember, comparisons are CHEAP!
Even better (using LIKELY):
When you write a loop, always start by writing an infinite loop!
Inside it, use conditional returns!
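The code shown in the talk at this point is (roughly) a decimal digit counter; this reconstruction follows the pattern the notes above describe: an infinite loop with conditional returns, LIKELY on the common branches, cheap comparisons doing most of the work, and one (u)int division per four digits.

```cpp
#include <cstdint>

// Hint the compiler which branch is the common case (GCC/Clang builtin).
#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Count decimal digits of v. Comparisons are CHEAP; the expensive
// (u)int division happens only once per four digits.
uint32_t digits10(uint64_t v) {
  uint32_t result = 1;  // every number has at least one digit
  for (;;) {            // start with an infinite loop...
    if (LIKELY(v < 10)) return result;         // ...conditional returns inside
    if (LIKELY(v < 100)) return result + 1;
    if (LIKELY(v < 1000)) return result + 2;
    if (LIKELY(v < 10000)) return result + 3;
    v /= 10000U;        // one division per 4 digits
    result += 4;
  }
}
```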
Minimize indirect writes: Why??
- Disables enregistering.
- A write is really a read and a write: first load the cache line (64 bytes) from memory, modify it, then write it back.
- But if you write a full 64 bytes, it’s FAST: no load is needed, the whole cache line is written directly.
  i.e. PADDING is a performance trick!!!
- Maculates (dirties) the cache.
Recap:
- Fewer indirect writes
- Regular access patterns
- Fast on small numbers
- Data dependencies reduced
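A small sketch of minimizing indirect writes (the function and names are mine): accumulate in a local variable, which the compiler can keep in a register, and do a single indirect write at the end.

```cpp
#include <cstddef>
#include <cstdint>

// Bad: *out is updated through memory on every iteration. The indirect
// write disables enregistering and may alias the input array.
void countEvenIndirect(const uint32_t* a, size_t n, uint32_t* out) {
  *out = 0;
  for (size_t i = 0; i < n; ++i)
    if ((a[i] & 1u) == 0) ++*out;   // read-modify-write each time
}

// Better: a local accumulator can live in a register; one indirect write.
void countEven(const uint32_t* a, size_t n, uint32_t* out) {
  uint32_t count = 0;               // enregistered, hot
  for (size_t i = 0; i < n; ++i)
    count += (a[i] & 1u) == 0;      // cheap comparison, no memory write
  *out = count;                     // single indirect write
}
```

Both versions compute the same result; only the write pattern differs.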
Checkpoint:
- We can’t improve what we can’t measure
- Always choose good baselines
- Optimize code judiciously for today’s dynamics