Ataraxia through Epoché: [debugging techniques] C++ on Nightmare Mode

Reference:
https://youtu.be/dUfNsVxplQA?si=wuAKES3uxdRruCXO

SIGBUS Memory alignment bug

Unsafe:

The Lie: The static_cast<const std::uint32_t *>(p) is a promise to the compiler. You are saying, "Trust me, this void* pointer p definitely points to a memory location that is properly aligned for a std::uint32_t."
The Requirement: On many architectures (like ARM, SPARC, and MIPS), a std::uint32_t (a 4-byte integer) must be located at a memory address that is a multiple of 4. For example, 0x1000, 0x1004, or 0x1008 are all aligned. An address like 0x1001 is unaligned.
The Machine Code: When the compiler sees the dereference (*), it trusts your promise and generates the most efficient instruction to load a 4-byte integer. On ARM, this might be a single LDR (Load Register) instruction.
The Crash: If p happens to be an unaligned address (like 0x1001) and the CPU enforces alignment for that load, the instruction cannot complete. The processor traps the access and raises an exception, and the operating system typically handles it by sending a SIGBUS (Bus Error) signal to your program, which crashes it.

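(Note: This code often "works" on x86/x86-64 (like most PCs), but it is not actually safe. x86 hardware handles most unaligned scalar loads itself, sometimes at a performance cost, for example when the access straddles a cache line. The cast is still undefined behavior in C++, so the optimizer is free to emit alignment-requiring vector instructions that do fault.)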

std::uint32_t decode(const void* p) {
  return boost::endian::little_to_native(*static_cast<const std::uint32_t *>(p));
}
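
To make the alignment requirement concrete, here is a minimal sketch of a check for it. The helper name is_aligned_for is hypothetical (it is not from the talk), and it assumes the usual case where converting a pointer to std::uintptr_t yields the numeric address.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when the address in p is a multiple of alignof(T).
// Assumes std::uintptr_t holds the plain numeric address, which is the case on
// mainstream platforms but is implementation-defined in general.
template <class T>
bool is_aligned_for(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Usage: the start of an aligned buffer passes, one byte further in does not.
void is_aligned_example() {
  alignas(std::uint32_t) unsigned char buffer[8] = {};
  assert(is_aligned_for<std::uint32_t>(buffer));
  assert(!is_aligned_for<std::uint32_t>(buffer + 1));
}
```

A check like this is useful in assertions and tests; it does not make the static_cast version safe to ship.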

Safe:

The Safe Variable: You first declare a local variable, std::uint32_t v. Because this variable is declared on the stack, the compiler guarantees that v itself is properly aligned. Its address (&v) will be a multiple of 4.
The Honest Copy: std::memcpy makes no assumptions about alignment. Its job is to copy bytes, one way or another.
The Machine Code: The compiler knows that p (the source) might be unaligned and &v (the destination) is aligned, so it generates instructions that are safe for the target. On a strict-alignment target this often means loading the 4 bytes from p one at a time (e.g., using four LDRB - Load Byte instructions) and reassembling them into the aligned v. On targets that allow unaligned loads, the memcpy call is usually folded into a single load instruction.
The Result: The compiler never emits a 4-byte load that the target cannot perform on an unaligned address. When it falls back to byte-level access, that is always safe, because a single byte has no alignment requirement.

This memcpy pattern is the standard, portable, and optimizer-friendly way to safely read a value from a potentially unaligned buffer.

std::uint32_t decode(const void* p) {
  std::uint32_t v = 0; // local variable, guaranteed aligned.
  std::memcpy(&v, p, sizeof(v));
  boost::endian::little_to_native_inplace(v);
  return v;
}
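
For experimenting without Boost, the sketch below is a dependency-free variant. It keeps the memcpy step but swaps little_to_native for explicit byte shifts, which give the same little-endian result on any host. The name load_le32 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical Boost-free variant: copy the bytes out with memcpy (no alignment
// assumption), then assemble the little-endian value with shifts so the result
// does not depend on the host byte order.
std::uint32_t load_le32(const void* p) {
  unsigned char b[4];
  std::memcpy(b, p, sizeof(b));
  return std::uint32_t{b[0]} | (std::uint32_t{b[1]} << 8) |
         (std::uint32_t{b[2]} << 16) | (std::uint32_t{b[3]} << 24);
}

// Usage: decode a value that starts at an odd offset inside a buffer.
void load_le32_example() {
  const unsigned char buffer[] = {0xFF, 0x78, 0x56, 0x34, 0x12};
  assert(load_le32(buffer + 1) == 0x12345678u);
}
```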

This problem is common enough that C++20 added std::bit_cast, which reinterprets the bits of one object as another type of the same size.

#include <array>   // std::array
#include <bit>     // std::bit_cast
#include <cstdint> // std::uint32_t
#include <cstring> // std::memcpy
// ... boost::endian headers ...

std::uint32_t decode(const void* p) {
  // 1. Copy the 4 bytes at 'p' into a local, properly aligned array.
  //    Dereferencing reinterpret_cast<const std::array<unsigned char, 4>*>(p)
  //    often works in practice, but it is formally undefined behavior unless
  //    such an array object actually lives at 'p', so copy the bytes instead.
  std::array<unsigned char, 4> bytes{};
  std::memcpy(bytes.data(), p, bytes.size());

  // 2. Reinterpret the bits of the byte array as a uint32_t.
  //    std::bit_cast requires sizeof(From) == sizeof(To), which is why the
  //    source is a 4-byte array: a single unsigned char is only 1 byte, and a
  //    const unsigned char* has the size of a pointer, not of the data.
  std::uint32_t v = std::bit_cast<std::uint32_t>(bytes);

  // 3. Fix endianness, same as before.
  boost::endian::little_to_native_inplace(v);
  return v;
}

Classic time-travel compiler optimization bug

This happens because the compiler sees that p is dereferenced at foo(*p). Dereferencing a null pointer is undefined behavior, so the compiler may assume p is not nullptr and eliminate the if (!p) check completely.
Raymond Chen has a post on this: "Undefined behavior can result in time travel (among other things, but time travel is the funkiest)".

GCC and Clang provide the flag -fno-delete-null-pointer-checks to disable this optimization; the Chromium project also enables it by default.
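
The notes do not include the code for this case, so the following is a reconstructed sketch of the pattern, not the example from the talk. The names foo, risky, and checked are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for the foo() mentioned above.
int foo(int value) { return value + 1; }

// Risky ordering: *p is evaluated before the null check. Because a null p
// would already be undefined behavior at foo(*p), the compiler is allowed to
// assume p is non-null and drop the check below.
int risky(const int* p) {
  int result = foo(*p);
  if (!p) return -1;  // may be removed by the optimizer
  return result;
}

// Safer ordering: check first, dereference only afterwards.
int checked(const int* p) {
  if (!p) return -1;
  return foo(*p);
}

// Usage: only checked() may be called with nullptr.
void null_check_example() {
  int x = 41;
  assert(risky(&x) == 42);
  assert(checked(&x) == 42);
  assert(checked(nullptr) == -1);
}
```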

Compiler might not honor your code

#include <immintrin.h> // __m128, _mm_cvtss_f32

float get_first_element(__m128 v)
{
  return _mm_cvtss_f32(v); // extract the lowest float lane
}


_Z17get_first_elementDv4_f:
.LFB6474:
.cfi_startproc
endbr64
ret
.cfi_endproc

The intrinsic produces no instruction at all here. Under the x86-64 System V calling convention, v arrives in xmm0 and a float is returned in the low lane of xmm0, so the value is already in place and only endbr64 and ret remain.

Standard ambiguity

When you are faced with surprising cross-platform behavior, referring back to the standard can save you a lot of time (every word matters).

// get the epoch
auto epoch =
  std::chrono::system_clock::now().time_since_epoch();
// send it to a peer
send_network(epoch);

Until C++20

The epoch of system_clock is unspecified, but most implementations use Unix Time
(i.e., time since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds).

Since C++20

system_clock measures Unix Time (i.e., time since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds).
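
The tick period of system_clock is implementation-defined as well, so a raw duration is ambiguous on the wire even when both sides agree on the epoch. A minimal sketch of one way to pin the unit down, assuming a hypothetical wire format of a signed 64-bit microsecond count (to_wire_micros and from_wire_micros are illustrative names):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical wire format: signed 64-bit count of microseconds since the
// system_clock epoch. The epoch is only guaranteed to be Unix Time from C++20.
std::int64_t to_wire_micros(std::chrono::system_clock::time_point tp) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             tp.time_since_epoch())
      .count();
}

// Inverse conversion on the receiving side.
std::chrono::system_clock::time_point from_wire_micros(std::int64_t micros) {
  return std::chrono::system_clock::time_point{
      std::chrono::duration_cast<std::chrono::system_clock::duration>(
          std::chrono::microseconds{micros})};
}

// Usage: a value survives the round trip at microsecond resolution.
void wire_time_example() {
  assert(to_wire_micros(from_wire_micros(1000000)) == 1000000);
}
```

Before C++20 the two peers would still have to agree on the epoch out of band, or verify it for each platform they target.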

Alignment

alignas(64) at the class level specifies that the object as a whole must be aligned on a 64-byte boundary.
alignas() can also be applied to individual data members (since C++11), so there is no need for manual padding.

Wrong:

// alignas applies to the struct as a whole, not to each data member, so both
// atomics still share one cache line and false sharing still happens.
struct alignas(64) my_struct {
  std::atomic<int> one;
  std::atomic<int> two;
};

Correct:
(Note: alignas(std::hardware_destructive_interference_size) does not work with the Mac compiler as of 2025.)

struct my_struct {
  alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
  alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
};
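
The difference between the two layouts can be checked directly. The sketch below assumes a 64-byte cache line (kCacheLine is a hypothetical constant standing in for std::hardware_destructive_interference_size), and the names packed_pair, padded_pair, and member_distance are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// alignas on the struct only: the members stay adjacent.
struct alignas(kCacheLine) packed_pair {
  std::atomic<int> one;
  std::atomic<int> two;
};

// alignas on each member: every member starts its own cache line.
struct padded_pair {
  alignas(kCacheLine) std::atomic<int> one;
  alignas(kCacheLine) std::atomic<int> two;
};

// Distance in bytes between the two members of an object.
template <class T>
std::size_t member_distance(const T& s) {
  const auto a = reinterpret_cast<std::uintptr_t>(&s.one);
  const auto b = reinterpret_cast<std::uintptr_t>(&s.two);
  return static_cast<std::size_t>(b - a);
}

static_assert(sizeof(packed_pair) == kCacheLine, "one line for both members");
static_assert(sizeof(padded_pair) == 2 * kCacheLine, "one line per member");

// Usage: only the per-member version separates the atomics.
void layout_example() {
  packed_pair a;
  padded_pair b;
  assert(member_distance(a) < kCacheLine);
  assert(member_distance(b) >= kCacheLine);
}
```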

Lessons learned

For example, malloc_trim should be called manually before the thread that allocated the arena memory goes to sleep; otherwise the freed memory is not returned to the system. A thread_local context object would usually have solved this issue.
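
A minimal sketch of such a call, assuming glibc (malloc_trim is a glibc extension declared in <malloc.h>). The helper name trim_heap_before_sleep is hypothetical, and on other C libraries it does nothing.

```cpp
#include <cassert>
#include <cstdlib>
#if defined(__GLIBC__)
#include <malloc.h>  // malloc_trim (glibc extension)
#endif

// Hypothetical helper to call before a worker thread blocks for a long time.
// Returns malloc_trim's result (1 if memory was released to the system,
// 0 if not) or -1 when malloc_trim is not available.
int trim_heap_before_sleep() {
#if defined(__GLIBC__)
  return malloc_trim(0);
#else
  return -1;
#endif
}

// Usage: free a large block, then ask the allocator to give memory back.
void trim_example() {
  void* block = std::malloc(1 << 20);
  std::free(block);
  const int released = trim_heap_before_sleep();
  assert(released == -1 || released == 0 || released == 1);
}
```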