Jan 11, 2016

[C++] Writing Fast Code I

Video: https://www.youtube.com/watch?v=3_FXy3cT5C8

Baseline:
Choose a baseline to measure against!

e.g.:
  • std::sort
  • iostream
  • scanf
Differential timing:
  • Run the baseline 2n times, measure t(2a)
  • Run the baseline n times and the contender n times, measure t(a+b)
  • Relative improvement:
    r = t(2a) / (2*t(a+b) - t(2a))
    Some overhead noise cancels out.
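A minimal sketch of differential timing, assuming two hypothetical workloads `baseline` and `contender` (placeholders for, e.g., std::sort vs. your own sort):

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Time `reps` runs of f, in seconds.
template <class F>
double timeIt(F f, int reps) {
    auto start = Clock::now();
    for (int i = 0; i < reps; ++i) f();
    return std::chrono::duration<double>(Clock::now() - start).count();
}

// r = t(2a) / (2*t(a+b) - t(2a)); fixed per-run overheads cancel.
template <class A, class B>
double relativeImprovement(A baseline, B contender, int n) {
    double t2a = timeIt(baseline, 2 * n);                    // t(2a)
    double tab = timeIt(baseline, n) + timeIt(contender, n); // t(a+b)
    return t2a / (2 * tab - t2a);
}
```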

Common benchmarking pitfalls:
  • Measuring Debug builds.
  • Different setup for baseline and measured code.
  • Including ancillary work in the measurement:
    malloc, printf etc.
  • Mixtures: measure t(a) + t(b), improve t(a),
    conclude t(b) got improved.
  • Optimizing rare cases, pessimizing the others.
Generalities:
  • Prefer static linking and PDC (position-dependent code).
  • Prefer 64-bit code, 32-bit data, e.g. two 32-bit ints instead of one 64-bit int.
  • Prefer 32-bit array indexing to pointers.
  • Prefer a[i++] to a[++i] # no data dependency:
    # a[i++] uses the old value of i for the address, so the
    # load doesn't wait on the increment — it can be PREFETCHED.
  • Prefer regular memory access patterns.
  • Minimize control flow, avoid data dependencies.
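A small illustrative routine (hypothetical, not from the talk) that follows the guidelines above — 32-bit indexing instead of pointer arithmetic, a[j++] rather than a[++j], and a regular forward access pattern:

```cpp
#include <cstdint>

// Copy the nonzero elements of src into dst; return the count.
uint32_t copyNonZero(const int32_t* src, uint32_t n, int32_t* dst) {
    uint32_t j = 0;                          // 32-bit index, 64-bit code
    for (uint32_t i = 0; i < n; ++i) {
        if (src[i] != 0) dst[j++] = src[i];  // a[j++], not a[++j]
    }
    return j;
}
```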
Storage pecking order:
  • Use static const for all immutables; constexpr in C++11 and beyond.
    Beware cache issues.
  • Use the stack for most variables:
    hot, and 0-cost addressing, like struct/class data members.
  • Globals: aliasing issues.
  • thread_local is slowest; use local caching.
    Access costs 1 instruction on Windows and Linux,
    3-4 on OS X.
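What "local caching" of a thread_local looks like — read it into a stack variable once, work there, write it back once (names are illustrative):

```cpp
// One TLS read + one TLS write, regardless of how much work happens.
thread_local int tlsCounter = 0;

int bumpMany(int times) {
    int local = tlsCounter;                    // read TLS once
    for (int i = 0; i < times; ++i) ++local;   // work on the stack copy
    tlsCounter = local;                        // write TLS once
    return local;
}
```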
Integrals:
  • Prefer 32-bit ints to all other sizes.
    (Not because of cache load alignment — because the CPU's ALUs can
    process up to 64 bits of data at a time: two 32-bit ints can be
    processed in parallel, while one 64-bit int allows only one op.)
    64-bit may make some code 20x slower.
    8- and 16-bit computations use conversion to 32 bits and back.
  • Use small ints in arrays.
  • Prefer unsigned to signed (the opposite of the Google C++ style
    guide: https://google.github.io/styleguide/cppguide.html),
    except when converting to floating point.
  • Most numbers are small.
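"Small ints in arrays, 32-bit for computation" in one illustrative example (my own, not from the talk) — a uint8_t table packs 8x more entries per 64-byte cache line than uint64_t, while the arithmetic widens to 32-bit anyway:

```cpp
#include <cstdint>

// Popcount of a nibble, stored as small ints to stay cache-friendly.
static const uint8_t kPopcount4[16] =
    {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Table lookups; the computation itself happens in 32-bit.
uint32_t popcount8(uint8_t b) {
    return kPopcount4[b & 0xF] + kPopcount4[b >> 4];
}
```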

Floating point:
  • Double precision is as fast as single precision.
  • Extended precision is just a bit slower.
    Do not mix the three.
  • 1-2 FP add/sub units
  • 1-2 FP mul/div units
  • SSE accelerates throughput for certain computation kernels
  • int -> FP is cheap, FP -> int is expensive
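What "do not mix the three" means in practice — match literal suffixes to the working type so no hidden conversions sneak in (a sketch of my own):

```cpp
// All-float and all-double versions: the 'f' suffix on literals keeps
// the float version free of float <-> double conversions.
float  scaleF(float x)  { return 0.5f * x + 1.0f; }
double scaleD(double x) { return 0.5  * x + 1.0;  }
```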

Strength reduction:
  • Don't waste time replacing a /= 2 with a >>= 1 — the compiler
    already does this.
  • Speed hierarchy:
    • comparisons
    • (u)int add, sub, bits, shift
    • FP add, sub (separate unit!)
    • (u)int32 mul, FP mul
    • FP division, remainder (dividing by a power of 2 is just an
      exponent subtraction)
    • (u)int division, remainder (essentially a search op; only
      division by powers of 2 is fast)

OK, if you really need to do (u)int division, look:
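(The snippet from the slides didn't survive the copy — this is my sketch of the digit-counting routine shown in the talk, which trades most divisions for cheap comparisons, amortizing one division by 10000 over four digits:)

```cpp
#include <cstdint>

// Count the decimal digits of v. Comparisons (cheap) handle four
// digits per iteration; only one expensive division per iteration.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {                 // infinite loop + conditional returns
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000u;
        result += 4;
    }
}
```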


Also, the code above was measured!
The factor of 4 is because of the cache line.
Remember, comparisons are CHEAP!

Even better: (Using LIKELY)
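(This snippet is also missing — a sketch of the LIKELY variant, assuming a GCC/Clang-style __builtin_expect macro. The hint tells the compiler that small values dominate, per "most numbers are small":)

```cpp
#include <cstdint>

#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Same routine, with branches annotated as likely-taken so the
// common small-number paths are laid out for fall-through.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (LIKELY(v < 10)) return result;
        if (LIKELY(v < 100)) return result + 1;
        if (LIKELY(v < 1000)) return result + 2;
        if (LIKELY(v < 10000)) return result + 3;
        v /= 10000u;
        result += 4;
    }
}
```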

When writing a loop, always start by writing an infinite loop!
Inside it, use conditional returns!

Minimize indirect writes. Why?
  • Disables enregistering.
  • A write is really a read and a write:
    • The CPU first loads the cache line (64 bytes), modifies it,
      and writes it back.
    • If you write a full 64-byte cache line, it's FAST: no load is
      needed, the data is written directly to the cache line.
    • i.e. PADDING (out to a full line) is a performance trick!
  • Maculates the cache.
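The classic fix for indirect writes in a loop (an illustrative example of my own): accumulate in a local, which the compiler can keep in a register, and write through the pointer once at the end:

```cpp
// One indirect write total, instead of one per iteration
// (e.g. *out += a[i] would re-read and re-write memory each step).
void sumInto(const int* a, int n, int* out) {
    int acc = 0;                            // enregistered
    for (int i = 0; i < n; ++i) acc += a[i];
    *out = acc;                             // single indirect write
}
```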
Summary:
  • Fewer indirect writes
  • Regular access pattern
  • Fast on small numbers
  • Data dependencies reduced

Checkpoint:
  • We can’t improve what we can’t measure
  • Always choose good baselines
  • Optimize code judiciously for today’s dynamics
