Jun 30, 2026

[Design] Cache friendly design, CPU/Memory level

Reference:
What do you mean by "Cache Friendly"? - Björn Fahller
What Every Programmer Should Know About Memory

Simplistic model of cache behaviour

Includes

  • The cache is small
  • and consists of fixed size lines
  • and is very very fast when hit
  • and missing is very slow

Excludes

  • Multiple levels of caches
  • Associativity
  • Threading

All models are wrong, but some are useful


$ valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes --branch-sim=yes
$ perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ perf record -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ pert report
$ hotspot perf.data

macos:
$ samply


Rule of thumb

follow/chasing pointer -> cache miss.

So,
Get rid of pointers.
Use std::vector (contiguous memory) vs. prev/next pointers.
std::vector be aware of memmove/memcpy.
memmove cause cache miss; if possible, memcpy.
Thus try to move less data.

binary tree/search still chasing pointer, 
can we get Log(n) lookup/insert without chasing pointers?
Heap. Use contiguous memory(e.g. std::vector) to implement.

But the heap is not searchable. Need extra contiguous memory to store the elements in order;
and the element holds the index to the contigous memory. Thus when the smallest heap
is processed, use the index to the contiguous memory and process that chunk.
Trade off: extra memory usage for speed and less cache miss.

Can we make it better?
B-Heap(binary heap implemented to keep subtrees in a single page); smaller heap fit into cache-line.
(not really better, depends on use case. i.e. if within a loop, the cache is hot, good.
otherwise, not worth the extra instructions due to cache is flushed anyways.)













No comments:

Post a Comment

Note: Only a member of this blog may post a comment.