Mar 11, 2022

[C++] memory alignment wrap up

Reference:
Björn Andrist - C++ High Performance (2nd edition)



Memory Alignment

  1. The CPU reads memory into its registers one word at a time.
  2. The word size is 64 bits on a 64-bit architecture, 32 bits on a 32-bit architecture, and so forth.
  3. For the CPU to work efficiently when working with different data types, it has restrictions on the addresses where objects of different types are located.
  4. Every type in C++ has an alignment requirement that defines the addresses at which an object of a certain type should be located in memory.
  5. If the alignment of a type is 1, it means that the objects of that type can be located at any byte address. If the alignment of a type is 2, it means that the number of bytes between successive allowed addresses is 2.
    Quote:
    "An alignment is an implementation-defined integer value representing the number
    of bytes between successive addresses at which a given object can be allocated."
  6. Use alignof to find out the alignment of a type:
    // Possible output is 4
    std::cout << alignof(int) << '\n';
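    The alignment rules above can also be checked at compile time. This is a minimal sketch: the struct name Mixed and the helper is_power_of_two are hypothetical, and because concrete alignment values are implementation-defined, only relationships the language guarantees are asserted.

```cpp
#include <cstddef>

// Hypothetical aggregate used only for illustration.
struct Mixed {
    char c;
    double d;
};

// Hypothetical helper: true if n is a non-zero power of two.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// char can live at any byte address.
static_assert(alignof(char) == 1);
// A class is at least as strictly aligned as its strictest member.
static_assert(alignof(Mixed) >= alignof(double));
// Alignments are always powers of two.
static_assert(is_power_of_two(alignof(Mixed)));
// The size of a type is a multiple of its alignment, so array elements stay aligned.
static_assert(sizeof(Mixed) % alignof(Mixed) == 0);
```

    If any of these assumptions did not hold on a given compiler, the translation unit would fail to compile instead of misbehaving at run time.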
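  7. Use std::align() rather than a modulo on the pointer value to check the alignment of an object; converting a pointer to an integer yields an implementation-defined value.
    std::has_single_bit (from <bit>, C++20) checks whether x is an integral power of two.

    #include <bit>      // std::has_single_bit
    #include <cassert>
    #include <cstddef>  // std::size_t
    #include <limits>   // std::numeric_limits
    #include <memory>   // std::align

    bool is_aligned(void* ptr, std::size_t alignment) {
        assert(ptr != nullptr);
        assert(std::has_single_bit(alignment)); // Power of 2

        auto s = std::numeric_limits<std::size_t>::max();
        auto aligned_ptr = ptr;
        // std::align moves aligned_ptr up to the next aligned address if needed
        std::align(alignment, 1, aligned_ptr, s);

        return ptr == aligned_ptr; // Unchanged means ptr was already aligned
    }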
    Another code tip, from CacheLib/HotHashDetector.h:
    // Enforce that the number of buckets is a power of two.
    assert((numBuckets & (numBuckets - 1)) == 0);
      
  8. new and malloc() are guaranteed to always return memory suitably aligned for any scalar type.
  9. The <cstddef> header provides us with a type called std::max_align_t, whose alignment requirement is at least as strict as all the scalar types.
    auto* p = new char{};
    auto max_alignment = alignof(std::max_align_t);
    assert(is_aligned(p, max_alignment)); // True
      
    Let's allocate char two times in a row with new:
    auto* p1 = new char{'a'};
    auto* p2 = new char{'b'};
    The space between p1 and p2 depends on the alignment requirements of std::max_align_t: each allocation starts at a suitably aligned address, so the two one-byte objects are typically not adjacent in memory.
  10. It is possible to specify custom alignment requirements that are stricter than the default alignment when declaring a variable using the alignas specifier.

    Let's say we have a cache line size of 64 bytes and that we, for some reason, want to ensure that two variables are placed on separate cache lines. We could do the following:
    alignas(64) int x{};
    alignas(64) int y{};
    // x and y will be placed on different cache lines
  11. It's also possible to specify a custom alignment when defining a type.
    The following is a struct that will occupy exactly one cache line when being used:
    struct alignas(64) CacheLine {
        std::byte data[64];
    };
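    The effect of alignas on a type can be verified with static_assert. The sketch below assumes the compiler supports 64-byte alignment (mainstream compilers do, but extended alignments are implementation-defined); PaddedCounter is a hypothetical type added for illustration.

```cpp
#include <cstddef>

struct alignas(64) CacheLine {
    std::byte data[64];
};

// The type is both aligned to and exactly as large as one 64-byte line.
static_assert(alignof(CacheLine) == 64);
static_assert(sizeof(CacheLine) == 64);

// Hypothetical type: a small payload with a strict alignment.
// sizeof is rounded up to a multiple of the alignment, so the
// compiler adds 60 bytes of padding after a 4-byte int.
struct alignas(64) PaddedCounter {
    int value{};
};

static_assert(alignof(PaddedCounter) == 64);
static_assert(sizeof(PaddedCounter) % 64 == 0);
```

    Because the size is a multiple of the alignment, neighbouring elements of an array of PaddedCounter never share a 64-byte block.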

  12. The stricter alignment requirements are also satisfied when allocating objects on the heap.
  13. In order to support dynamic allocation of types with non-default alignment requirements, C++17 introduced new overloads of operator new() and operator delete() which accept an alignment argument of type std::align_val_t.
  14. There is also a C11 function, aligned_alloc(), available since C++17 as std::aligned_alloc() in <cstdlib>, which can be used to manually allocate aligned heap memory.
    e.g.
    constexpr auto ps = std::size_t{4096}; // Page size
    struct alignas(ps) Page {
        std::byte data_[ps];
    };
    
    auto* page = new Page{};
    assert(is_aligned(page, ps));
    
    // Use page ...
    delete page;
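    The explicit forms mentioned in items 13 and 14 might be used as sketched below. The function names aligned_new_works and aligned_alloc_works are hypothetical, and std::aligned_alloc is not available on every toolchain (MSVC, for instance, does not provide it), so treat this as an illustration under those assumptions rather than portable production code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <memory>
#include <new>

// Same idea as is_aligned() above: std::align leaves the pointer
// untouched only if it already satisfies the alignment.
bool is_aligned(void* ptr, std::size_t alignment) {
    assert(ptr != nullptr);
    auto space = std::numeric_limits<std::size_t>::max();
    void* aligned_ptr = ptr;
    std::align(alignment, 1, aligned_ptr, space);
    return ptr == aligned_ptr;
}

// Hypothetical demo: raw allocation through the C++17 aligned overloads.
bool aligned_new_works() {
    constexpr auto alignment = std::align_val_t{64};
    void* raw = ::operator new(256, alignment);
    const bool ok = is_aligned(raw, 64);
    ::operator delete(raw, alignment);  // must pass the same alignment
    return ok;
}

// Hypothetical demo: C11-style allocation; the size is kept a
// multiple of the alignment, as older C standards require.
bool aligned_alloc_works() {
    void* raw = std::aligned_alloc(64, 256);
    if (raw == nullptr) return false;
    const bool ok = is_aligned(raw, 64);
    std::free(raw);  // memory from aligned_alloc is released with free
    return ok;
}
```

    In everyday code a plain new expression on an over-aligned type, as in the Page example, picks the aligned overloads automatically; the explicit calls are mainly useful when writing allocators.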
  15. Memory pages are not part of the C++ abstract machine, so there is no portable way to programmatically get hold of the page size of the currently running system.
  16. However, you could use boost::mapped_region::get_page_size() or a platform-specific system call, such as getpagesize(), on Unix systems.
  17. A final caveat to be aware of is that the set of supported alignments is defined by the implementation of the standard library you are using, and not by the C++ standard.
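For the page-size query mentioned in items 15 and 16, a POSIX-only sketch could look like the following. It uses sysconf(_SC_PAGESIZE) as the platform-specific call; the wrapper name page_size is hypothetical, and the code will not compile on non-POSIX systems.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // POSIX only: sysconf, _SC_PAGESIZE

// Hypothetical wrapper: returns the page size of the running system in bytes.
std::size_t page_size() {
    const long size = ::sysconf(_SC_PAGESIZE);
    assert(size > 0 && "sysconf(_SC_PAGESIZE) failed");
    return static_cast<std::size_t>(size);
}
```

Since the value is only known at run time, it cannot be used as the argument of alignas; a type such as Page has to hard-code a size and, at most, check it against page_size() during start-up.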

Type size padding

  1. The compiler sometimes needs to add extra bytes, padding, to our user-defined types.
    e.g.
    class Document {
        bool is_cached_{};
        double rank_{};
        int id_{};
    };
    
    // turns to
    class Document {
        bool is_cached_{};
        std::byte padding1[7]; // Invisible padding inserted by compiler
        double rank_{};
        int id_{};
        std::byte padding2[4]; // Invisible padding inserted by compiler
    };

    Better:
    class Document {
        double rank_{}; // Rearranged data members
        int id_{};
        bool is_cached_{};
    };
    
    // thus
    class Document {
        double rank_{};
        int id_{};
        bool is_cached_{};
        std::byte padding[3]; // Invisible padding inserted by compiler
    };
  2. As a general rule, you can place the biggest data members at the beginning and the smallest members at the end.
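    The effect of reordering can be measured with sizeof and offsetof. In this sketch, DocumentV1 and DocumentV2 are hypothetical names for the two layouts shown above (declared as structs so the members are accessible), and padding_bytes is a hypothetical helper. Exact sizes are implementation-defined; on a typical 64-bit ABI they are 24 and 16 bytes.

```cpp
#include <cstddef>

// Original member order: small, large, medium.
struct DocumentV1 {
    bool is_cached_{};
    double rank_{};
    int id_{};
};

// Rearranged: largest member first, smallest last.
struct DocumentV2 {
    double rank_{};
    int id_{};
    bool is_cached_{};
};

// Hypothetical helper: bytes in T that hold no member data.
template <class T>
constexpr std::size_t padding_bytes() {
    return sizeof(T) - (sizeof(bool) + sizeof(double) + sizeof(int));
}

// Each member sits at an offset that satisfies its own alignment.
static_assert(offsetof(DocumentV1, rank_) % alignof(double) == 0);
static_assert(offsetof(DocumentV1, id_) % alignof(int) == 0);
// Reordering never needs more padding than the original order here.
static_assert(padding_bytes<DocumentV2>() <= padding_bytes<DocumentV1>());
```

    Adding a static_assert on sizeof next to a hot data structure is a cheap way to notice when a later change to the member list reintroduces padding.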
  3. From a performance perspective, there can also be cases where you want to align objects to cache lines to minimize the number of cache lines an object spans.
  4. While we are on the subject of cache friendliness, it can also be beneficial to place data members that are frequently used together next to each other, e.g. placing a std::mutex as the first data member to avoid a ping-pong effect on the whole type.
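  5. For a standard data type that needs 16-byte alignment (long double on x86-64, for example), malloc already guarantees that the returned blocks will be aligned correctly.
    Section 7.20.3 of C99 states: "The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object."
    When a 16-byte boundary has to be enforced by hand, the block can be over-allocated and the padding size stored just before the returned pointer: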
    https://vsdmars.blogspot.com/2021/03/goticket-make-64-bit-fields-64-bit.html
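    #include <cstdint>  // std::uintptr_t
    #include <cstdlib>  // std::malloc, std::free, std::size_t

    void* malloc16(std::size_t s) {
        // Allocate 16 extra bytes so there is always room to move forward.
        auto* porig = static_cast<unsigned char*>(std::malloc(s + 0x10));
        if (porig == nullptr) return nullptr;       // catch out of memory
        // Add 16 to the address, then clear the lower 4 bits. The result is
        // the next 16-byte boundary strictly after porig, so the padding is
        // 1..16 bytes and there is always one byte available in front of p.
        auto addr = (reinterpret_cast<std::uintptr_t>(porig) + 16) & ~std::uintptr_t{0xf};
        auto* p = reinterpret_cast<unsigned char*>(addr);
        *(p - 1) = static_cast<unsigned char>(p - porig); // store padding size
        return p;
    }

    void free16(void* p) {
        if (p == nullptr) return;                   // mirror free(NULL)
        auto* porig = static_cast<unsigned char*>(p);
        porig -= *(porig - 1);                      // subtract stored padding
        std::free(porig);                           // free the original block
    }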
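
The same over-allocation idea can be expressed with std::align instead of manual bit masking. This is a sketch under assumptions: AlignedBlock, allocate_aligned and release are hypothetical names, and the original malloc pointer is kept in a struct rather than hidden in front of the aligned address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical handle: keeps the malloc'ed pointer for free()
// next to the aligned pointer handed to the user.
struct AlignedBlock {
    void* raw{};      // pointer returned by malloc
    void* aligned{};  // first suitably aligned address inside the block
};

AlignedBlock allocate_aligned(std::size_t size, std::size_t alignment) {
    // std::align requires a power-of-two alignment.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::size_t space = size + alignment - 1;  // worst-case slack
    void* raw = std::malloc(space);
    if (raw == nullptr) return {};
    void* cursor = raw;
    // Moves cursor forward to the first aligned address that still
    // leaves room for size bytes, and returns that address.
    void* aligned = std::align(alignment, size, cursor, space);
    return {raw, aligned};
}

void release(AlignedBlock& block) {
    std::free(block.raw);
    block = {};
}
```

A caller would use block.aligned for storage and pass the whole handle to release() when done. Compared with malloc16, this trades the hidden one-byte header for an explicit second pointer.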
