Ataraxia through Epoché: [Design] Cache friendly design, CPU/Memory level

Jun 30, 2026

[Design] Cache friendly design, CPU/Memory level

Reference:
What do you mean by "Cache Friendly"? - Björn Fahller
What Every Programmer Should Know About Memory

Simplistic model of cache behaviour

Includes

The cache is small
and consists of fixed size lines
and is very very fast when hit
and missing is very slow

Excludes

Multiple levels of caches
Associativity
Threading

All models are wrong, but some are useful

$ valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes --branch-sim=yes
$ perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ perf record -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ pert report
$ hotspot perf.data

macos:

$ samply

Rule of thumb

follow/chasing pointer -> cache miss.

So,

Get rid of pointers.

Use std::vector (contiguous memory) vs. prev/next pointers.

std::vector be aware of memmove/memcpy.

memmove cause cache miss; if possible, memcpy.

Thus try to move less data.

binary tree/search still chasing pointer,

can we get Log(n) lookup/insert without chasing pointers?

Heap. Use contiguous memory(e.g. std::vector) to implement.

But the heap is not searchable. Need extra contiguous memory to store the elements in order;

and the element holds the index to the contigous memory. Thus when the smallest heap

is processed, use the index to the contiguous memory and process that chunk.

Trade off: extra memory usage for speed and less cache miss.

Can we make it better?

B-Heap(binary heap implemented to keep subtrees in a single page); smaller heap fit into cache-line.

(not really better, depends on use case. i.e. if within a loop, the cache is hot, good.

otherwise, not worth the extra instructions due to cache is flushed anyways.)

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Subscribe to: Post Comments (Atom)