Ataraxia through Epoché: [C++] Next-Gen C++ Optimization Techniques

Reference:
Unlocking Modern CPU Power - Next-Gen C++ Optimization Techniques - Fedor G Pikus - C++Now 2024
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html
https://vsdmars.blogspot.com/2022/10/book-art-of-writing-efficient-programs.html

RCU:
https://vsdmars.blogspot.com/2024/07/c-rcu.html

TLB:
https://vsdmars.blogspot.com/2020/07/virtual-memory-refresh.html
https://vsdmars.blogspot.com/2020/07/pacific-2018re-read-designing-for.html
https://vsdmars.blogspot.com/2018/11/pacific-2018-designing-for-efficient.html

Modern CPUs rely on caches and pipelining to a much greater degree than older CPUs did.
The penalty for not using caches and for disrupting pipelines is correspondingly greater.

Memory access is characterized by bandwidth and latency.
Bandwidth is much higher than 'one word per latency'.
Random access speed is limited by latency.
Sequential access speed is limited by bandwidth.
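
A minimal sketch of how the sequential vs. random difference could be measured. It is not taken from the talk; the names sum_in_order, make_index and time_traversal are hypothetical. The same array is summed twice, once through an ascending index (sequential access) and once through a shuffled index (random access). The index array itself is always read sequentially, so only the accesses into data change pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sum data[] in the order given by index[]. An ascending index gives a
// sequential access pattern; a shuffled index gives a random one.
std::uint64_t sum_in_order(const std::vector<std::uint64_t>& data,
                           const std::vector<std::size_t>& index) {
    std::uint64_t sum = 0;
    for (std::size_t i : index) {
        assert(i < data.size());  // every index must refer to a valid element
        sum += data[i];
    }
    return sum;
}

// Build the index 0..n-1, optionally shuffled with a fixed seed so that
// runs are repeatable.
std::vector<std::size_t> make_index(std::size_t n, bool shuffled) {
    std::vector<std::size_t> index(n);
    std::iota(index.begin(), index.end(), std::size_t{0});
    if (shuffled) {
        std::mt19937_64 rng(42);
        std::shuffle(index.begin(), index.end(), rng);
    }
    return index;
}

// Wall-clock seconds for one traversal. The sum is handed back through
// `out` so the compiler cannot discard the loop.
double time_traversal(const std::vector<std::uint64_t>& data,
                      const std::vector<std::size_t>& index,
                      std::uint64_t& out) {
    const auto start = std::chrono::steady_clock::now();
    out = sum_in_order(data, index);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

To see the effect, call time_traversal with an array much larger than the last-level cache and compare the two timings; with a small array both patterns fit in cache and the gap mostly disappears. Both traversals must return the same sum, which is a cheap sanity check.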

Prefetch attempts to predict future memory accesses and transfers memory content into cache in advance.
Random access defeats prediction.

Key

In NUMA, the basic unit is NUMA node.

Solution to cross NUMA node latency-bound program

Trick: task_count_ as in main thread.

Even better: batch processing.

Redesigning data structures for NUMA is intrusive.

CMD:

$ /sbin/lspci

$ cat /sys/bus/pci/devices/xxx/numa_node
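
The same lookup can be done from inside a program. The following is a sketch only, assuming Linux sysfs and C++17; parse_numa_node and read_numa_node are hypothetical helper names, and the device directory is supplied by the caller (the xxx part above stays whatever lspci reports). The file holds a single integer, and -1 means the kernel reports no NUMA node for the device.

```cpp
#include <filesystem>
#include <fstream>
#include <istream>
#include <optional>

// Parse the contents of a sysfs numa_node file: a single integer.
// Returns std::nullopt if no integer could be read.
std::optional<int> parse_numa_node(std::istream& in) {
    int node = 0;
    if (!(in >> node)) return std::nullopt;
    return node;
}

// Read <device_dir>/numa_node, where device_dir is a device directory
// such as one under /sys/bus/pci/devices.
// Returns std::nullopt if the file is missing or unreadable.
std::optional<int> read_numa_node(const std::filesystem::path& device_dir) {
    std::ifstream file(device_dir / "numa_node");
    if (!file) return std::nullopt;
    return parse_numa_node(file);
}
```

A caller would pass the device directory and treat both std::nullopt and -1 as "no known NUMA node" before deciding where to pin the thread that talks to that device.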

$ numactl

GPU

I/O bound program

Real world cases

1) Old code runs slower on faster hardware

NUMA comes into play

The kernel flushes everything if the TLB is outdated, through a 'TLB shootdown', which is an inter-processor interrupt. The shootdown kernel code runs on the CPU that receives the interrupt. The shootdown is counted as 'system time' in the profiler.