[2018]
Before tackling Chandler's CppCon 2018 talk,
review what he gave last year (2017):
CppCon 2017: Chandler Carruth “Going Nowhere Faster”
Make Code Fast:
1. Use efficient algorithms, fast data structures & idioms
2. Benchmark the code that matters, and understand why
3. Use hybrid data structures to optimize allocations & cache locality (see the sketch below)
Only care about performance that you BENCHMARK.
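To illustrate point 3, here's a minimal sketch of a hybrid (small-buffer) vector in the spirit of llvm::SmallVector. This is my own illustrative code, not LLVM's implementation; the names (SmallVec, inline_buf_) are made up, and it assumes T is default-constructible and copyable.
---
#include <cstddef>
#include <vector>

// Minimal small-buffer vector sketch: small sizes live in the inline buffer
// (no heap allocation, good cache locality); once it outgrows N, elements
// spill into a std::vector on the heap.
template <typename T, std::size_t N>
class SmallVec {
    T inline_buf_[N];        // inline storage, lives inside the object
    std::vector<T> heap_;    // spill storage, used only after growing past N
    std::size_t size_ = 0;
    bool spilled_ = false;

public:
    void push_back(const T &v) {
        if (!spilled_ && size_ < N) {
            inline_buf_[size_++] = v;
            return;
        }
        if (!spilled_) {     // first spill: copy the inline elements over
            heap_.assign(inline_buf_, inline_buf_ + size_);
            spilled_ = true;
        }
        heap_.push_back(v);
        ++size_;
    }

    T &operator[](std::size_t i) { return spilled_ ? heap_[i] : inline_buf_[i]; }
    std::size_t size() const { return size_; }
};
---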
Additional references:
Mike Acton's 2014 talk: "Data-Oriented Design and C++" [CppCon 2014]
Read and write considered harmful - Hubert Matthews [ACCU 2018]
Profiling:
Use counters to track cache miss rates.
Use EfficiencySanitizer to optimize data structures.
Tools:
CPU Register reference:
Cheatsheet:
%rax: the 64-bit version of %eax
Clamp loop example:
int run(int a, int b) { return a >= b ? a : b; }
$ gcc -O3 -S -masm=intel
.LFB0:
        .cfi_startproc
        cmp     edi, esi
        mov     eax, esi
        cmovge  eax, edi
        ret
        .cfi_endproc
The [assembly] CMOVGE instruction is the prologue of this talk.
Example from the talk:
void run() {
    std::vector<int> v;
    // ... init v ...
    for (auto &i : v)
        i = i > 255 ? 100 : i;
}
Keep in mind: most of the time, branches aren't good for performance.
The premise of this talk: CMOVGE should be faster than branches, but it's not.
Why is CMOVGE slower?
x86 doesn't execute the assembly directly; it decodes instructions into micro-ops (µops), so the assembly is really the higher-level code here.
The reorder buffer captures those µops and effectively unrolls the small loop:
instead of branching back, the loop body is laid out as a continuous stream of µops,
and the results are written back through the reorder buffer.
Since the loop is unrolled, register conflicts appear,
because the same architectural register is used again and again.
So, x86 has this 'register renaming' concept.
--quote:
Register renaming is a form of pipelining that deals with data dependences between instructions by renaming their register operands. An assembly language programmer or a compiler specifies these operands using architectural registers - the registers that are explicit in the instruction set architecture. Renaming replaces architectural register names by, in effect, value names, with a new value name for each instruction destination operand. This eliminates the name dependences (output dependences and antidependences) between instructions and automatically recognizes true dependences.
The recognition of true data dependences between instructions permits a more flexible life cycle for instructions. By maintaining a status bit for each value indicating whether or not it has been computed yet, it allows the execution phase of two instruction operations to be performed out of order when there are no true data dependences between them. This is called out-of-order execution.
After looking at the process of renaming operands we will look at the life cycle of an instruction in a register renaming architecture. Then we will look at a generic hardware organization for it and some possible performance enhancements. Finally, we will look at a brief history of the register renaming concept.
end quote--
Once the loop is unrolled into per-iteration sets of µops, those sets can
run 'concurrently', and work from later iterations can be computed ahead of time,
iff there are no dependencies between them.
This is called "speculative execution".
On the ALUs, µops can execute out of order.
So, why is CMOVGE slow?
CMOVGE acts as a binary operator: it has to evaluate both operands, so it blocks and waits on them.
Branches are thus faster than CMOV here.
The initial example from Chandler's talk shows that a simple
i = i > 100 ? 100 : i;
can be turned into "if i <= 100, branch to the next loop iteration" (skipping the store),
which is thus faster.
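To make this concrete, here's a small sketch (my own code, not the talk's benchmark) of the two source-level forms: the ternary form, which compilers typically lower to CMOV plus an unconditional store, and the if-form, which hides the store behind a branch. Actual codegen depends on the compiler and flags, so check the assembly.
---
#include <vector>

// Ternary form: a value is selected (typically via CMOV) and stored on every iteration.
void clamp_cmov(std::vector<int> &v) {
    for (auto &i : v)
        i = i > 255 ? 100 : i;
}

// If-form: compare + branch; the store happens only when the value actually changes.
void clamp_branch(std::vector<int> &v) {
    for (auto &i : v)
        if (i > 255)
            i = 100;
}
---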
Live demo with tool 'perf'
As we can see, the CMOV version shows a high percentage of backend cycles idle.
The unrolled-branch version shows an even higher percentage of backend cycles idle.
Why? Because it stores to memory on every iteration, so it stalls every cycle
waiting on those stores. The CMOV version has lower backend idle because it spends
its cycles computing the CMOV's operands and then waiting on the CMOV itself.
C++20 alert :-)
[[likely]] / [[unlikely]] attributes _will_ turn CMOV into jumps :-)
However, be aware: benchmark to see whether the static attribute hints really
agree with your logic, or the opposite.
Read:
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html
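As a sketch of the C++20 syntax (whether it actually flips CMOV back into a jump is compiler-dependent, so benchmark as noted above):
---
#include <vector>

// C++20 attribute on the rarely-taken path of the clamp; the hint is static,
// so verify with a benchmark that it matches the real data distribution.
void clamp_hinted(std::vector<int> &v) {
    for (auto &i : v) {
        if (i > 255) [[unlikely]]
            i = 100;
    }
}
---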
Again, std::vector is gooooooood.... push_back guarantees the vector grows as needed.
Questions from the audience:
1. Why can't CMOV be translated to the same code as branches?
It sounds to me like asking whether
---
return 42 + std::async([]{ sleep(10); return 42; }).get();
---
can return early without waiting on .get()?
2. How about the store buffer?
From Chandler: the store buffer doesn't help us here.
Agreed; the store buffer serves a different purpose.
Reference:
http://vsdmars.blogspot.com/2018/09/concurrency-c-wrap-up-2018.html
3. total store order (TSO)
https://en.wikipedia.org/wiki/Memory_ordering
4. When is CMOV useful?
When the dependencies are already required anyway.
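A hedged example of that case (my own illustration, not from the talk): a running-max reduction, where the select already sits on the loop-carried dependency chain, so a branch can't unlock extra parallelism and CMOV is a natural lowering.
---
#include <climits>
#include <vector>

// The select feeds the next iteration, so every iteration depends on the
// previous one anyway -- the "wait for both operands" cost of CMOV is
// already paid by the dependency chain.
int running_max(const std::vector<int> &v) {
    int m = INT_MIN;
    for (int x : v)
        m = x > m ? x : m;
    return m;
}
---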
Tools:
Intel® Architecture Code Analyzer
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
CMDs:
$ perf stat BINARY
$ perf list
$ perf stat -e L1-icache-loads BINARY
$ perf record BINARY
$ perf report
$ clang++ -MMD -MT file.o -MF file.o.d -std=c++17 -Wall -O3 -fno-exceptions -fvisibility=hidden -gmlt -fno-omit-frame-pointer -pthread files.cpp -S -o files.s -mllvm -x86-cmov-converter-threshold=0 -stats # -gmlt: minimal debug info (line tables only).