These notes are organized into six sections:
- Cache Lines
- Hardware Prefetch
- Access Locality
- Multiple CPU Core consideration
- Write Combined Memory
- Address Translation
This talk can serve as a supplement to Stoyan Nikolov's talk 'OOP Is Dead, Long Live Data-oriented Design'.
Cache Lines
Transfers between memory and cache occur in units of cache lines (typically 64 bytes).
(Think about Data Oriented Design)
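A minimal sketch of the data-oriented-design point, using hypothetical `ParticleAoS`/`ParticlesSoA` types (not from the talk): because a whole cache line is fetched at once, packing the hot field contiguously means every fetched line carries useful data.

```cpp
#include <vector>

// Array-of-structs: scanning only x still drags the cold fields
// into cache, because they share x's cache line.
struct ParticleAoS {
    float x, y, z;
    float cold[13]; // cold data occupying the rest of the 64-byte line
};

// Struct-of-arrays: all x values are packed; one 64-byte line
// holds 16 consecutive floats, all of them useful to the scan.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

float sum_x_soa(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float v : p.x) s += v; // streams through dense cache lines
    return s;
}
```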
Hardware Prefetch
Predictable access patterns are faster: the hardware prefetcher detects sequential or constant-stride streams and fetches lines ahead of the load.
Aim for sequential locality.
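A small illustration of why sequential locality matters (a sketch, not code from the talk): a stride-1 scan is the pattern the prefetcher handles best, while pointer-chasing gives it nothing to predict.

```cpp
#include <vector>

// Stride-1 access: the hardware prefetcher can run several cache
// lines ahead of the loads, hiding memory latency.
long sum_sequential(const std::vector<int>& v) {
    long s = 0;
    for (int x : v) s += x; // predictable, prefetcher-friendly
    return s;
}

// Contrast: a linked list visits nodes at unpredictable addresses,
// so every hop can be a full cache miss the prefetcher cannot hide.
struct Node { int value; Node* next; };
long sum_list(const Node* n) {
    long s = 0;
    for (; n; n = n->next) s += n->value; // serialized, cache-hostile
    return s;
}
```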
Access Locality
Cache locality
- spatial
- temporal
Prefer contiguous containers such as std::vector.
Prefer hash maps with a flat (contiguous) key layout over node-based ones.
Multiple CPU Core consideration
Cache coherence (MESI protocol): a write on one core invalidates that cache line in every other core's cache, so beware of false sharing.
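The standard mitigation can be sketched as follows (an illustration, not code from the talk): pad per-thread data to its own cache line so MESI invalidations from one thread never evict another thread's line.

```cpp
#include <atomic>

// Two counters packed next to each other would share a 64-byte cache
// line; under MESI every increment by one core invalidates the line
// in the other core's cache ("false sharing"), even though the
// threads never read each other's counter. alignas(64) gives each
// counter a line of its own.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == 64,
              "each counter occupies exactly one cache line");

PaddedCounter counters[2]; // one per thread; no shared line
```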
Write Combined Memory
Use compiler intrinsics:
- SSE2
  - _mm_stream_si32: store 4 bytes
  - _mm_stream_si128: store 16 bytes
- AVX
  - _mm256_stream_si256: store 32 bytes
- AVX-512
  - _mm512_stream_si512: store 64 bytes
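A minimal usage sketch of the SSE2 variant (assumes an x86 target): non-temporal stores go through write-combining buffers and bypass the cache, which suits large buffers written once and not read back soon.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Fill a buffer with non-temporal (streaming) stores.
// _mm_stream_si32 writes 4 bytes without polluting the cache.
void fill_streaming(int32_t* dst, int32_t value, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, value);
    _mm_sfence(); // ensure the streamed stores are globally visible
}
```

The `_mm_sfence()` at the end matters: streaming stores are weakly ordered, so a fence is needed before other threads read the buffer.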
Address Translation
TLB size (with 4 KiB pages)
- Address translation can be a significant overhead.
- Large pages can help.
Linux
- Explicit huge pages (hugetlbfs)
  - Allocate on hugetlbfs
  - Access via mmap or shared memory
- Transparent Huge Pages (THP)
  - Beware of latency spikes (e.g. from page compaction)
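A sketch of requesting explicit huge pages via mmap on Linux (assumes huge pages have been reserved, e.g. through /proc/sys/vm/nr_hugepages; falls back to normal 4 KiB pages otherwise): each 2 MiB page covers 512 times more address space per TLB entry than a 4 KiB page.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Try to map `bytes` backed by explicit huge pages; fall back to
// normal pages if no huge pages are reserved. Minimal error handling.
void* alloc_maybe_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) // no huge pages available: plain anonymous map
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p;
}
```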