Jan 11, 2016

[C++] Writing Fast Code I

Video: https://www.youtube.com/watch?v=3_FXy3cT5C8

Baseline:
Choose a baseline to measure against!

e.g.:
  • std::sort
  • iostream
  • scanf
Differential timing:
  • Run the baseline 2n times, measure t(2a)
  • Run the baseline n times and the contender n times, measure t(a+b)
  • Relative improvement:
    r = t(2a) / (2*t(a+b) - t(2a))
    Some overhead noise cancels out.
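A minimal sketch of differential timing, assuming two hypothetical workloads `baseline` and `contender` (placeholders for, e.g., std::sort vs. your own sort):

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Time `reps` runs of f, in seconds.
template <class F>
double timeIt(F f, int reps) {
    auto start = Clock::now();
    for (int i = 0; i < reps; ++i) f();
    return std::chrono::duration<double>(Clock::now() - start).count();
}

// r = t(2a) / (2*t(a+b) - t(2a)); fixed per-run overheads cancel.
template <class A, class B>
double relativeImprovement(A baseline, B contender, int n) {
    double t2a = timeIt(baseline, 2 * n);                    // t(2a)
    double tab = timeIt(baseline, n) + timeIt(contender, n); // t(a+b)
    return t2a / (2 * tab - t2a);
}
```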

Common benchmarking pitfalls:
  • Measuring Debug builds.
  • Different setup for baseline and measured code.
  • Including ancillary work in the measurement:
    malloc, printf etc.
  • Mixtures: measure t(a) + t(b), improve t(a),
    conclude t(b) got improved.
  • Optimizing rare cases, pessimizing the others.
Generalities:
  • Prefer static linking and PDC (position-dependent code).
  • Prefer 64-bit code, 32-bit data, e.g. two 32-bit ints instead of one 64-bit int.
  • Prefer 32-bit array indexing to pointers.
  • Prefer a[i++] to a[++i] # no data dependency:
    # a[i++] uses the old value of i for the address, so the
    # load doesn't wait on the increment — it can be PREFETCHED.
  • Prefer regular memory access patterns.
  • Minimize control flow, avoid data dependencies.
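A small illustrative routine (hypothetical, not from the talk) that follows the guidelines above — 32-bit indexing instead of pointer arithmetic, a[j++] rather than a[++j], and a regular forward access pattern:

```cpp
#include <cstdint>

// Copy the nonzero elements of src into dst; return the count.
uint32_t copyNonZero(const int32_t* src, uint32_t n, int32_t* dst) {
    uint32_t j = 0;                          // 32-bit index, 64-bit code
    for (uint32_t i = 0; i < n; ++i) {
        if (src[i] != 0) dst[j++] = src[i];  // a[j++], not a[++j]
    }
    return j;
}
```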
Storage pecking order:
  • Use static const for all immutables; constexpr in C++11 and beyond.
    Beware cache issues.
  • Use the stack for most variables:
    hot, and 0-cost addressing, like struct/class data members.
  • Globals: aliasing issues.
  • thread_local is slowest; use local caching.
    Access costs 1 instruction on Windows and Linux,
    3-4 on OS X.
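What "local caching" of a thread_local looks like — read it into a stack variable once, work there, write it back once (names are illustrative):

```cpp
// One TLS read + one TLS write, regardless of how much work happens.
thread_local int tlsCounter = 0;

int bumpMany(int times) {
    int local = tlsCounter;                    // read TLS once
    for (int i = 0; i < times; ++i) ++local;   // work on the stack copy
    tlsCounter = local;                        // write TLS once
    return local;
}
```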
Integrals:
  • Prefer 32-bit ints to all other sizes.
    (Not because of cache load alignment — because the CPU's ALUs can
    process up to 64 bits of data at a time: two 32-bit ints can be
    processed in parallel, while one 64-bit int allows only one op.)
    64-bit may make some code 20x slower.
    8- and 16-bit computations use conversion to 32 bits and back.
  • Use small ints in arrays.
  • Prefer unsigned to signed (the opposite of the Google C++ style
    guide: https://google.github.io/styleguide/cppguide.html),
    except when converting to floating point.
  • Most numbers are small.
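"Small ints in arrays, 32-bit for computation" in one illustrative example (my own, not from the talk) — a uint8_t table packs 8x more entries per 64-byte cache line than uint64_t, while the arithmetic widens to 32-bit anyway:

```cpp
#include <cstdint>

// Popcount of a nibble, stored as small ints to stay cache-friendly.
static const uint8_t kPopcount4[16] =
    {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Table lookups; the computation itself happens in 32-bit.
uint32_t popcount8(uint8_t b) {
    return kPopcount4[b & 0xF] + kPopcount4[b >> 4];
}
```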

Floating point:
  • Double precision is as fast as single precision.
  • Extended precision is just a bit slower.
    Do not mix the three.
  • 1-2 FP add/sub units
  • 1-2 FP mul/div units
  • SSE accelerates throughput for certain computation kernels
  • int -> FP is cheap, FP -> int is expensive
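What "do not mix the three" means in practice — match literal suffixes to the working type so no hidden conversions sneak in (a sketch of my own):

```cpp
// All-float and all-double versions: the 'f' suffix on literals keeps
// the float version free of float <-> double conversions.
float  scaleF(float x)  { return 0.5f * x + 1.0f; }
double scaleD(double x) { return 0.5  * x + 1.0;  }
```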

Strength reduction:
  • Don't waste time replacing a /= 2 with a >>= 1 — the compiler
    already does this.
  • Speed hierarchy:
    • comparisons
    • (u)int add, sub, bits, shift
    • FP add, sub (separate unit!)
    • (u)int32 mul, FP mul
    • FP division, remainder (dividing by a power of 2 is just an
      exponent subtraction)
    • (u)int division, remainder (essentially a search op; only
      division by powers of 2 is fast)

OK, if you really need to do (u)int division, look:
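(The snippet from the slides didn't survive the copy — this is my sketch of the digit-counting routine shown in the talk, which trades most divisions for cheap comparisons, amortizing one division by 10000 over four digits:)

```cpp
#include <cstdint>

// Count the decimal digits of v. Comparisons (cheap) handle four
// digits per iteration; only one expensive division per iteration.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {                 // infinite loop + conditional returns
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000u;
        result += 4;
    }
}
```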


Also, the code above was measured!
The factor of 4 is because of the cache line.
Remember, comparisons are CHEAP!

Even better: (Using LIKELY)
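(This snippet is also missing — a sketch of the LIKELY variant, assuming a GCC/Clang-style __builtin_expect macro. The hint tells the compiler that small values dominate, per "most numbers are small":)

```cpp
#include <cstdint>

#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Same routine, with branches annotated as likely-taken so the
// common small-number paths are laid out for fall-through.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (LIKELY(v < 10)) return result;
        if (LIKELY(v < 100)) return result + 1;
        if (LIKELY(v < 1000)) return result + 2;
        if (LIKELY(v < 10000)) return result + 3;
        v /= 10000u;
        result += 4;
    }
}
```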

When writing a loop, always start by writing an infinite loop!
Inside it, use conditional returns!

Minimize indirect writes. Why?
  • Disables enregistering.
  • A write is really a read and a write:
    • The CPU first loads the cache line (64 bytes), modifies it,
      and writes it back.
    • If you write a full 64-byte cache line, it's FAST: no load is
      needed, the data is written directly to the cache line.
    • i.e. PADDING (out to a full line) is a performance trick!
  • Maculates the cache.
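The classic fix for indirect writes in a loop (an illustrative example of my own): accumulate in a local, which the compiler can keep in a register, and write through the pointer once at the end:

```cpp
// One indirect write total, instead of one per iteration
// (e.g. *out += a[i] would re-read and re-write memory each step).
void sumInto(const int* a, int n, int* out) {
    int acc = 0;                            // enregistered
    for (int i = 0; i < n; ++i) acc += a[i];
    *out = acc;                             // single indirect write
}
```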
Summary:
  • Fewer indirect writes
  • Regular access pattern
  • Fast on small numbers
  • Data dependencies reduced

Checkpoint:
  • We can’t improve what we can’t measure
  • Always choose good baselines
  • Optimize code judiciously for today’s dynamics
