Reference:
The gperftools CPU profiler works via link-time instrumentation of the code (link with -lprofiler):
$ clang++ -g -O3 -mavx2 -Wall -pedantic x1.cpp x2.cpp -lprofiler -o example
$ CPUPROFILE=prof.data CPUPROFILE_FREQUENCY=1000 ./example
$ google-pprof --text ./example prof.data
$ google-pprof --text --lines ./example prof.data
$ google-pprof --pdf ./example prof.data > prof.pdf
$ perf stat ./example
$ perf list // list available counter types
$ perf stat -e <counter_type> ./example
$ perf record -c <sampling_count> ./example
$ perf report // reads the perf.data file produced by perf record
$ perf stat -e cycles,instructions,L1-dcache-load-misses,L1-dcache-loads ./program
Profiling for branch mispredictions:
$ perf stat ./benchmark
First record the profile data:
$ perf record -e branches,branch-misses ./benchmark // perf list shows the valid -e arguments
Then read it:
$ perf report
Micro-benchmark example:
system_clock::time_point t1 = system_clock::now();
// Our code
system_clock::time_point t2 = system_clock::now();
cout << "Sort time: " <<
duration_cast<milliseconds>(t2 - t1).count() << "ms (" << count << " comparisons)" << endl;
auto t0 = system_clock::now();
// ... do some work ...
auto t1 = system_clock::now();
auto delta_t = duration_cast<milliseconds>(t1 - t0); // duration_cast needs an explicit target duration type
cout << "Time: " << delta_t.count() << endl;
clock_gettime is Linux (POSIX)-specific.
Beware: subtract the seconds first and only then add the nanosecond difference; otherwise you lose significant digits of the result by subtracting two large numbers.
If the reported CPU time does not match the real time, the machine is likely overloaded (many other processes are competing for the CPU), or the program is running out of memory: if it uses more than the physical memory on the machine, it falls back to the much slower disk swap, and the CPUs cannot do any work while the program waits for pages to come in from disk.
double duration(timespec a, timespec b) {
return a.tv_sec - b.tv_sec + 1e-9*(a.tv_nsec - b.tv_nsec);
}
{
timespec rt0, ct0, tt0;
clock_gettime(CLOCK_REALTIME, &rt0);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct0);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt0);
// Code
constexpr double X = 1e6;
double s = 0;
for (double x = 0; x < X; x += 0.1) s += sin(x);
// Code end
timespec rt1, ct1, tt1;
clock_gettime(CLOCK_REALTIME, &rt1);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ct1);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tt1);
cout << "Real time: " << duration(rt1, rt0) << "s, "
"CPU time: " << duration(ct1, ct0) << "s, "
"Thread time: " << duration(tt1, tt0) << "s" << endl;
}
The CMOVcc instructions don't compare the source and destination. They use the flags from a previous comparison (or another flag-setting operation) to decide whether the move happens.
This avoids branch mispredictions. (Compilers often emit cmov for the ternary operator.)
e.g. this copies edx to ecx if eax and ebx are equal:
cmp eax, ebx
cmove ecx, edx
same as:
cmp eax, ebx
jne skip
mov ecx, edx
skip:
As long as the operands are already in registers, the processor can execute several operations at once.
e.g.
void BM_add_multiply(benchmark::State& state) {
... prepare data ...
for (auto _ : state) {
unsigned long a1 = 0, a2 = 0;
for (size_t i = 0; i < N; ++i) {
a1 += p1[i] + p2[i]; // p1[i] and p2[i] are already in registers
a2 += p1[i] * p2[i]; // same operands, independent computation:
// accumulating a2 as well has no performance hit; same speed as the a1 loop alone.
}
benchmark::DoNotOptimize(a1);
benchmark::DoNotOptimize(a2);
benchmark::ClobberMemory();
}
state.SetItemsProcessed(N*state.iterations());
}
There is a limit to how many operations can be executed at once: the processor has only so many execution units capable of doing integer computations. Still, it is instructive to push the CPU to its limits by adding more and more instructions to one iteration.
Beware of data dependencies: a dependency chain adds extra cycles to the processing (classic pipelining, computer architecture 101).
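The data-dependency point can be sketched like this (function names are mine): a single accumulator forms a loop-carried dependency chain, so every addition waits for the previous one; two independent accumulators let the CPU overlap the additions.

```cpp
#include <cstddef>

// Serial: each addition depends on the previous one (loop-carried chain).
long sum1(const long* p, size_t n) {
    long a = 0;
    for (size_t i = 0; i < n; ++i) a += p[i];
    return a;
}

// Two independent chains: the CPU can execute both additions in parallel,
// often nearly doubling throughput for latency-bound loops.
long sum2(const long* p, size_t n) {
    long a = 0, b = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        a += p[i];
        b += p[i + 1];
    }
    if (n % 2) a += p[n - 1];  // odd-length tail
    return a + b;
}
```

Both return the same result; only the shape of the dependency graph differs. (At -O3 the compiler may do this unrolling itself, which is why the effect is easiest to see with benchmark::DoNotOptimize in place.)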
Use MCA:
#define MCA_START __asm volatile("# LLVM-MCA-BEGIN");
#define MCA_END __asm volatile("# LLVM-MCA-END");
...
for (size_t i = 0; i < N; ++i) {
MCA_START
a1 += p1[i] + p2[i];
MCA_END
}
$ clang++ -std=c++20 benchmark.cpp -g -O3 -mavx2 -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 -timeline
Can we benchmark memory access speed with the code below? No: the store buffer and the L1/L2/L3 caches sit between the CPU and memory.
volatile int* p = new int;
*p = 42;
for (auto _ : state) {
benchmark::DoNotOptimize(*p);
//... repeat access *p 32 times ...
benchmark::DoNotOptimize(*p);
}
state.SetItemsProcessed(32*state.iterations());
delete p;
#define REPEAT2(x) x x
#define REPEAT4(x) REPEAT2(x) REPEAT2(x)
#define REPEAT8(x) REPEAT4(x) REPEAT4(x)
#define REPEAT16(x) REPEAT8(x) REPEAT8(x)
#define REPEAT32(x) REPEAT16(x) REPEAT16(x)
#define REPEAT(x) REPEAT32(x)
template <class Word>
void BM_read_seq(benchmark::State& state) {
const size_t size = state.range(0);
void* memory = ::malloc(size);
void* const end = static_cast<char*>(memory) + size;
volatile Word* const p0 = static_cast<Word*>(memory);
Word* const p1 = static_cast<Word*>(end);
for (auto _ : state) {
for (volatile Word* p = p0; p != p1; ) {
REPEAT(benchmark::DoNotOptimize(*p++);)
}
benchmark::ClobberMemory();
}
Word fill = {}; // default-constructed fill value for the write benchmark
// (in practice the write loop belongs in a separate benchmark function,
// since a benchmark::State can only be iterated once):
for (auto _ : state) {
for (volatile Word* p = p0; p != p1; ) {
REPEAT(benchmark::DoNotOptimize(*p++ = fill);)
}
benchmark::ClobberMemory();
}
::free(memory);
state.SetBytesProcessed(size*state.iterations());
state.SetItemsProcessed((p1 - p0)*state.iterations());
}
#define ARGS ->RangeMultiplier(2)->Range(1<<10, 1<<30)
BENCHMARK_TEMPLATE1(BM_read_seq, unsigned int) ARGS;
BENCHMARK_TEMPLATE1(BM_read_seq, unsigned long) ARGS;
// for SSE (16-byte) and AVX (32-byte) vector types
#include <emmintrin.h>
#include <immintrin.h>
BENCHMARK_TEMPLATE1(BM_read_seq, __m128i) ARGS;
BENCHMARK_TEMPLATE1(BM_read_seq, __m256i) ARGS;
Spectre attack: Fedor gave a great explanation of the notorious Spectre attack.
Follow-up refresher: Chandler Carruth, “Spectre: Secrets, Side-Channels, Sandboxes, and Security”, CppCon 2018.