Jul 4, 2020

[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan

Reference:
Latency Numbers Every Programmer Should Know:
https://colin-scott.github.io/personal_website/research/interactive_latency.html



Cache Lines
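
Memory travels between RAM and the cache hierarchy in fixed-size lines (typically 64 bytes on x86), so touching one byte pulls in its whole line. A minimal sketch of querying the line-size hint that C++17 provides (assumes your standard library implements it):

    #include <iostream>
    #include <new>  // std::hardware_destructive_interference_size (C++17)

    // Cache lines are fixed-size, typically 64 bytes on x86; C++17
    // exposes a portable compile-time hint for that size.
    int main() {
        std::cout << "cache line size hint: "
                  << std::hardware_destructive_interference_size << " bytes\n";
    }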



 

Hardware Prefetch

  • Predictable access patterns are faster: the hardware prefetcher can stream data ahead of the loop
  • We want sequential locality (see the sketch below)
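
A minimal sketch of the difference (illustrative function names): the same matrix summed in two traversal orders. The sequential walk lets the prefetcher stream cache lines ahead of the loop; the strided walk defeats it and likely misses on most accesses.

    #include <cstddef>
    #include <vector>

    // Row-major walk over a rows*cols matrix stored in one contiguous
    // vector: consecutive addresses, prefetch-friendly.
    long long sum_row_major(const std::vector<int>& m,
                            std::size_t rows, std::size_t cols) {
        long long sum = 0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                sum += m[r * cols + c];
        return sum;
    }

    // Column-major walk over the same data: a stride of cols*sizeof(int)
    // bytes per access, so most loads likely miss the cache.
    long long sum_col_major(const std::vector<int>& m,
                            std::size_t rows, std::size_t cols) {
        long long sum = 0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                sum += m[r * cols + c];
        return sum;
    }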

 

Access Locality

  • Cache locality
    • Spatial
    • Temporal
  • Make sure the algorithm/data structure you choose honors cache locality.
  • Look beyond big-O notation: constant factors can differ significantly between cache hits and misses.
  • Large benefit in hitting the faster cache levels.
  • In C++, allocators matter; in Go, the runtime's allocator handles this for you.
  • In C++, prefer a flat map (contiguous storage) over std::map (see the sketch after this list).
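
A minimal flat-map lookup sketch over a sorted std::vector, the technique that boost::container::flat_map and C++23 std::flat_map package up: contiguous storage keeps the binary search within a few cache lines, whereas each std::map step dereferences a pointer to a separately allocated tree node.

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Look up a key in a vector of (key, value) pairs kept sorted by key.
    // The binary search touches a handful of contiguous cache lines
    // instead of chasing pointers through red-black tree nodes.
    const int* find_flat(const std::vector<std::pair<int, int>>& m, int key) {
        auto it = std::lower_bound(
            m.begin(), m.end(), key,
            [](const std::pair<int, int>& e, int k) { return e.first < k; });
        return (it != m.end() && it->first == key) ? &it->second : nullptr;
    }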

 
 

Multiple CPU Core Considerations
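
The classic multi-core cache pitfall is false sharing: two threads writing different variables that happen to share one cache line force that line to ping-pong between the cores' caches. A minimal sketch that pads the hot counters onto separate lines (64 bytes is a typical x86 line size, not a guarantee):

    #include <atomic>
    #include <thread>

    // Without the alignment, both counters could land on one cache line
    // and every increment would bounce that line between the two cores.
    struct Counters {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    int main() {
        Counters c;
        std::thread t1([&] {
            for (int i = 0; i < 1'000'000; ++i)
                c.a.fetch_add(1, std::memory_order_relaxed);
        });
        std::thread t2([&] {
            for (int i = 0; i < 1'000'000; ++i)
                c.b.fetch_add(1, std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
    }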



 
 

Write-Combined Memory

  • Writes accumulate in a buffer and flush as full 64-byte operations
  • Causes of partial buffer flushes (avoid these):
    • Not writing all bytes covered by a buffer
    • Writing too many streams at once
    • Atomic read-modify-write operations
  • Causes of reads from write-combined memory:
    • C++ bit fields
    • Compiler optimizations
    • Virtually always an accident (read the implementation's source code)
    • Solution: expose a write-only interface
  • Non-temporal writes on x86 (reference: "Optimizing Cache Usage With Nontemporal Accesses"; a usage sketch follows this list)
    • Use compiler intrinsics:
      • SSE2
        • _mm_stream_si32: store 4 bytes
        • _mm_stream_si128: store 16 bytes
      • AVX
        • _mm256_stream_si256: store 32 bytes
      • AVX-512
        • _mm512_stream_si512: store 64 bytes
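
A minimal usage sketch of the SSE2 streaming store (hypothetical helper; assumes dst is 16-byte aligned and count is a multiple of four):

    #include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
    #include <cstddef>
    #include <cstdint>

    // Fill a write-only buffer with non-temporal stores that bypass the
    // cache entirely; appropriate for large buffers we won't read back.
    void fill_streaming(std::uint32_t* dst, std::size_t count,
                        std::uint32_t value) {
        const __m128i v = _mm_set1_epi32(static_cast<int>(value));
        for (std::size_t i = 0; i < count; i += 4) {
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
        }
        _mm_sfence();  // order the streamed stores before later normal stores
    }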



Address Translation

  • Platform-specific
  • Not directly pageable
  • Difficult/slow to allocate
  • Linux: use huge pages (see the sketch below)
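
A minimal Linux sketch requesting an explicit 2 MiB huge page via mmap's MAP_HUGETLB flag (assumes huge pages were reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages); one huge page covers what would otherwise be 512 four-KiB TLB entries:

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const std::size_t size = 2 * 1024 * 1024;  // one 2 MiB huge page
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            std::perror("mmap(MAP_HUGETLB)");  // fails if none are reserved
            return 1;
        }
        // ... use p: the whole region is mapped by a single TLB entry ...
        munmap(p, size);
        return 0;
    }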
