Reference:
The gperftools CPU profiler works via link-time instrumentation of the code (link with -lprofiler):
$ clang++ -g -O3 -mavx2 -Wall -pedantic x1.cpp x2.cpp -lprofiler -o example
$ CPUPROFILE=prof.data CPUPROFILE_FREQUENCY=1000 ./example
$ google-pprof --text ./example prof.data
$ google-pprof --text --lines ./example prof.data
$ google-pprof --pdf ./example prof.data > prof.pdf
$ perf stat ./example
$ perf list // list available counter types
$ perf stat -e <counter_type> ./example
$ perf record -c <sampling_count> ./example
$ perf report // reads the perf.data file produced by perf record
$ perf stat -e cycles,instructions,L1-dcache-load-misses,L1-dcache-loads ./program
Profiling for branch mispredictions:
$ perf stat ./benchmark
First record the profile data:
$ perf record -e branches,branch-misses ./benchmark // perf list shows the valid -e arguments
Then read it:
$ perf report
Micro-benchmark example:
system_clock::time_point t1 = system_clock::now();
// Our code
system_clock::time_point t2 = system_clock::now();
cout << "Sort time: " <<
duration_cast<milliseconds>(t2 - t1).count() << "ms (" << count << " comparisons)" << endl;
auto t0 = system_clock::now();
// ... do some work ...
auto t1 = system_clock::now();
auto delta_t = duration_cast<milliseconds>(t1 - t0); // duration_cast needs an explicit target duration type
cout << "Time: " << delta_t.count() << endl;
clock_gettime is Linux (POSIX)-specific.
Beware: subtract the seconds first and only then add the nanosecond difference; otherwise you lose significant digits of the result by subtracting two large numbers.
If the reported CPU time does not match the real time, the machine is likely overloaded (many other processes are competing for the CPU), or the program is running out of memory: if it uses more than the physical memory on the machine, it falls back to the much slower disk swap, and the CPUs cannot do any work while the program waits for pages to come in from disk.
double duration(timespec a, timespec b) {
return a.tv_sec - b.tv_sec + 1e-9*(a.tv_nsec - b.tv_nsec);
}
{
timespec rt0, ct0, tt0;
clock_gettime(CLOCK_REALTIME, &rt0);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct0);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt0);
// Code
constexpr double X = 1e6;
double s = 0;
for (double x = 0; x < X; x += 0.1) s += sin(x);
// Code end
timespec rt1, ct1, tt1;
clock_gettime(CLOCK_REALTIME, &rt1);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct1);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt1);
cout << "Real time: " << duration(rt1, rt0) << "s, "
"CPU time: " << duration(ct1, ct0) << "s, "
"Thread time: " << duration(tt1, tt0) << "s" << endl;
}
The CMOVcc instructions don't compare the source and destination. They use the flags from a previous comparison (or another flag-setting operation) to decide whether the move happens.
This avoids branch mispredictions. (Compilers often emit cmov for the ternary operator.)
e.g. this copies edx to ecx if eax and ebx are equal:
cmp eax, ebx
cmove ecx, edx
same as:
cmp eax, ebx
jne skip
mov ecx, edx
skip:
As long as the operands are already in registers, the processor can execute several operations at once.
e.g.
void BM_add_multiply(benchmark::State& state) {
... prepare data ...
for (auto _ : state) {
unsigned long a1 = 0, a2 = 0;
for (size_t i = 0; i < N; ++i) {
a1 += p1[i] + p2[i]; // p1[i] and p2[i] are already in registers
a2 += p1[i] * p2[i]; // same operands, independent computation:
// accumulating a2 as well has no performance hit; same speed as the a1 loop alone.
}
benchmark::DoNotOptimize(a1);
benchmark::DoNotOptimize(a2);
benchmark::ClobberMemory();
}
state.SetItemsProcessed(N*state.iterations());
}
There is a limit to how many operations can be executed at once: the processor has only so many execution units capable of doing integer computations. Still, it is instructive to push the CPU to its limits by adding more and more instructions to one iteration.
Beware of data dependencies: a dependency chain adds extra cycles to the processing (classic pipelining, computer architecture 101).
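The data-dependency point can be sketched like this (function names are mine): a single accumulator forms a loop-carried dependency chain, so every addition waits for the previous one; two independent accumulators let the CPU overlap the additions.

```cpp
#include <cstddef>

// Serial: each addition depends on the previous one (loop-carried chain).
long sum1(const long* p, size_t n) {
    long a = 0;
    for (size_t i = 0; i < n; ++i) a += p[i];
    return a;
}

// Two independent chains: the CPU can execute both additions in parallel,
// often nearly doubling throughput for latency-bound loops.
long sum2(const long* p, size_t n) {
    long a = 0, b = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        a += p[i];
        b += p[i + 1];
    }
    if (n % 2) a += p[n - 1];  // odd-length tail
    return a + b;
}
```

Both return the same result; only the shape of the dependency graph differs. (At -O3 the compiler may do this unrolling itself, which is why the effect is easiest to see with benchmark::DoNotOptimize in place.)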
Use MCA:
#define MCA_START __asm volatile("# LLVM-MCA-BEGIN");
#define MCA_END __asm volatile("# LLVM-MCA-END");
...
for (size_t i = 0; i < N; ++i) {
MCA_START
a1 += p1[i] + p2[i];
MCA_END
}
$ clang++ -std=c++20 benchmark.cpp -g -O3 -mavx2 -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 -timeline
Can we benchmark memory access speed with the code below? No: the store buffer and the L1/L2/L3 caches sit between the CPU and memory.
volatile int* p = new int;
*p = 42;
for (auto _ : state) {
benchmark::DoNotOptimize(*p);
//... repeat access *p 32 times ...
benchmark::DoNotOptimize(*p);
}
state.SetItemsProcessed(32*state.iterations());
delete p;
#define REPEAT2(x) x x
#define REPEAT4(x) REPEAT2(x) REPEAT2(x)
#define REPEAT8(x) REPEAT4(x) REPEAT4(x)
#define REPEAT16(x) REPEAT8(x) REPEAT8(x)
#define REPEAT32(x) REPEAT16(x) REPEAT16(x)
#define REPEAT(x) REPEAT32(x)
template <class Word>
void BM_read_seq(benchmark::State& state) {
const size_t size = state.range(0);
void* memory = ::malloc(size);
void* const end = static_cast<char*>(memory) + size;
volatile Word* const p0 = static_cast<Word*>(memory);
Word* const p1 = static_cast<Word*>(end);
for (auto _ : state) {
for (volatile Word* p = p0; p != p1; ) {
REPEAT(benchmark::DoNotOptimize(*p++);)
}
benchmark::ClobberMemory();
}
Word fill = {}; // default-constructed fill value for the write benchmark
// (in practice the write loop belongs in a separate benchmark function,
// since a benchmark::State can only be iterated once):
for (auto _ : state) {
for (volatile Word* p = p0; p != p1; ) {
REPEAT(benchmark::DoNotOptimize(*p++ = fill);)
}
benchmark::ClobberMemory();
}
::free(memory);
state.SetBytesProcessed(size*state.iterations());
state.SetItemsProcessed((p1 - p0)*state.iterations());
}
#define ARGS ->RangeMultiplier(2)->Range(1<<10, 1<<30)
BENCHMARK_TEMPLATE1(BM_read_seq, unsigned int) ARGS;
BENCHMARK_TEMPLATE1(BM_read_seq, unsigned long) ARGS;
// for SSE (16-byte) and AVX (32-byte) vector types
#include <emmintrin.h>
#include <immintrin.h>
BENCHMARK_TEMPLATE1(BM_read_seq, __m128i) ARGS;
BENCHMARK_TEMPLATE1(BM_read_seq, __m256i) ARGS;
Spectre attack: Fedor gave a great explanation of the notorious Spectre attack.
Follow-up refresher: Chandler Carruth, “Spectre: Secrets, Side-Channels, Sandboxes, and Security”, CppCon 2018.