Nov 14, 2018

[Pacific++ 2018] "Designing for Efficient Cache Usage" - Scott McMillan



The talk is organized into six sections:
  • Cache Lines
  • Hardware Prefetch
  • Access Locality
  • Multiple CPU Core Considerations
  • Write Combined Memory
  • Address Translation

This talk works well as a supplement to Stoyan Nikolov's talk 'OOP Is Dead, Long Live Data-oriented Design'.



Cache Lines
Transfers between memory and cache occur in whole cache lines (typically 64 bytes).
(Think about Data Oriented Design)
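A minimal sketch of the cache-line argument, in the Data-Oriented Design style (the `Particle` layout and field names are my own illustration, not from the talk): when a loop reads only one field, a struct-of-arrays layout makes every fetched 64-byte line carry useful data.

```cpp
#include <vector>

// "Array of structs": reading only x still drags the whole 32-byte
// struct through the cache, wasting most of each 64-byte line.
struct Particle {
    float x, y, z;     // position (hot in the loop below)
    float vx, vy, vz;  // velocity (cold here)
    int   id;
    float mass;
};

// "Struct of arrays": hot fields are packed contiguously, so each
// cache line fetched carries 16 useful floats.
struct Particles {
    std::vector<float> x, y, z;
};

float sum_x_soa(const Particles& p) {
    float s = 0.0f;
    for (float v : p.x) s += v;  // streams through one dense array
    return s;
}
```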


Hardware Prefetch
The hardware prefetcher recognizes predictable (sequential or constant-stride) access patterns and fetches cache lines ahead of use.
Prefer sequential access; irregular patterns such as pointer chasing defeat the prefetcher.
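A small sketch contrasting the two patterns (my own example, not from the talk): a stride-1 scan is prefetch-friendly, while a linked-list traversal is not, because each `next` pointer is unpredictable until loaded.

```cpp
#include <cstddef>

// Stride-1 loop: the hardware prefetcher can fetch lines ahead
// of the load and hide memory latency.
long sum_sequential(const int* data, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += data[i];
    return s;
}

// Pointer chasing: the next address is unknown until the current
// node arrives, so each miss pays full memory latency.
struct Node { int value; Node* next; };
long sum_list(const Node* head) {
    long s = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        s += n->value;
    return s;
}
```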



Access Locality
Cache locality
  • spatial
  • temporal
Use vector.

Prefer a flat (open-addressing) hash map, which stores keys and values contiguously instead of in separately allocated nodes.
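A minimal flat hash map sketch to show why the layout is cache-friendly (illustration only: fixed power-of-two capacity, no deletion or resizing, integer keys assumed): probing walks adjacent slots, which usually sit on the same or the next cache line, unlike node-based `std::unordered_map`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

class FlatMap {
    struct Slot { uint64_t key = 0; int value = 0; bool used = false; };
    std::vector<Slot> slots_;  // all entries inline in one contiguous array
public:
    // capacity must be a power of two (mask trick below)
    explicit FlatMap(std::size_t capacity) : slots_(capacity) {}

    void insert(uint64_t key, int value) {
        std::size_t i = key & (slots_.size() - 1);  // cheap modulo
        while (slots_[i].used && slots_[i].key != key)
            i = (i + 1) & (slots_.size() - 1);      // linear probe: next slot
        slots_[i] = Slot{key, value, true};         // is usually nearby in cache
    }

    std::optional<int> find(uint64_t key) const {
        std::size_t i = key & (slots_.size() - 1);
        while (slots_[i].used) {
            if (slots_[i].key == key) return slots_[i].value;
            i = (i + 1) & (slots_.size() - 1);
        }
        return std::nullopt;
    }
};
```

Production flat maps (e.g. open-addressing designs with metadata bytes and SIMD probing) refine this idea, but the locality win comes from the same contiguous storage.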


Multiple CPU Core Considerations
Cores keep their caches coherent via the MESI protocol: a write by one core invalidates that cache line in every other core's cache, so independent data sharing a line "ping-pongs" between cores (false sharing).
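A common fix, sketched below (my own example): pad per-core counters so each lives on its own cache line, assuming the usual 64-byte line size (C++17's `std::hardware_destructive_interference_size` can supply this portably).

```cpp
#include <atomic>

// Without the alignment, a and b could share one 64-byte line, and
// two threads incrementing them would invalidate each other's copy
// on every write, even though the data is logically independent.
struct Counters {
    alignas(64) std::atomic<long> a{0};  // own cache line
    alignas(64) std::atomic<long> b{0};  // own cache line
};

static_assert(sizeof(Counters) >= 128,
              "each counter occupies a separate cache line");
```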


Write Combined Memory
Use compiler intrinsics:
  • SSE2
    • _mm_stream_si32: store 4 bytes
    • _mm_stream_si128: store 16 bytes
  • AVX
    • _mm256_stream_si256: store 32 bytes
  • AVX-512
    • _mm512_stream_si512: store 64 bytes
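A hedged usage sketch for the SSE2 intrinsics above (x86-only; `fill_stream` and its alignment assumption are mine): non-temporal stores go through write-combining buffers and bypass the cache, which suits large fills the CPU will not re-read soon, since they avoid both reading the destination line first and evicting hot data.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_stream_si32

// Assumes dst is 16-byte aligned for the 128-bit streaming store.
void fill_stream(int32_t* dst, std::size_t n, int32_t value) {
    const __m128i v = _mm_set1_epi32(value);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)          // 16 bytes per non-temporal store
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    for (; i < n; ++i)                  // scalar tail
        _mm_stream_si32(dst + i, value);
    _mm_sfence();  // order streaming stores before later loads/stores
}
```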


Address Translation
The TLB holds a limited number of entries, each mapping a 4 KiB page by default.
  • Address translation can be a significant overhead.
  • Large pages can help.
Linux
  • Huge TLB Page
    • Allocate on hugetlbfs
    • Access via mmap or shared memory
  • Transparent Huge Pages
    • Beware of latency spikes.
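A Linux-only sketch of requesting an explicit huge-page mapping (the fallback logic is my own; it assumes huge pages have been reserved by the administrator, e.g. via `vm.nr_hugepages`, and otherwise degrades to normal 4 KiB pages): one 2 MiB page covers what would otherwise cost 512 TLB entries.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Try a MAP_HUGETLB anonymous mapping; fall back to normal pages if
// no huge pages are reserved (mmap fails with ENOMEM in that case).
void* alloc_maybe_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  // fallback: ordinary 4 KiB pages
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```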
