Nov 2, 2025

[debugging techniques] C++ on Nightmare Mode


SIGBUS Memory alignment bug


Unsafe:
  • The Lie: The static_cast<const std::uint32_t *>(p) is a promise to the compiler. You are saying, "Trust me, this void* pointer p definitely points to a memory location that is properly aligned for a std::uint32_t."
  • The Requirement: On many architectures (like ARM, SPARC, and MIPS), a std::uint32_t (a 4-byte integer) must be located at a memory address that is a multiple of 4. For example, 0x1000, 0x1004, or 0x1008 are all aligned. An address like 0x1001 is unaligned.
  • The Machine Code: When the compiler sees the dereference (*), it trusts your promise and generates the most efficient instruction to load a 4-byte integer. On ARM, this might be a single LDR (Load Register) instruction.
  • The Crash: If p happens to be an unaligned address (like 0x1001), the CPU's LDR instruction cannot execute. The processor hardware itself traps this invalid operation and triggers an exception. The operating system handles this exception by sending a SIGBUS (Bus Error) signal to your program, which causes it to crash.
(Note: This code often "works" on x86/x86-64 (like most PCs), but it's still not safe: the dereference through a misaligned pointer is undefined behavior in C++, and even where the hardware tolerates the unaligned access it can incur a performance penalty.)
std::uint32_t decode(const void* p) {
  return boost::endian::little_to_native(*static_cast<const std::uint32_t *>(p));
} 
Safe:
  • The Safe Variable: You first declare a local variable, std::uint32_t v. Because this variable is declared on the stack, the compiler guarantees that v itself is properly aligned. Its address (&v) will be a multiple of 4.
  • The Honest Copy: std::memcpy makes no assumptions about alignment. Its job is to copy bytes, one way or another.
  • The Machine Code: The compiler knows that p (the source) might be unaligned and &v (the destination) is aligned. It will generate a sequence of safe instructions to accomplish this. This often means it will load the 4 bytes from p one byte at a time (e.g., using four LDRB - Load Byte instructions) and then reassemble them into the aligned v variable.
  • The Result: No unaligned 4-byte integer load is ever attempted. The CPU is only asked to do byte-level access, which is always safe and has no alignment requirement.
This memcpy pattern is the standard, portable, and optimizer-friendly way to safely read a value from a potentially unaligned buffer.
std::uint32_t decode(const void* p) {
  std::uint32_t v = 0; // local variable, guaranteed aligned.
  std::memcpy(&v, p, sizeof(v));
  boost::endian::little_to_native_inplace(v);
  return v;
}

This was such a well-known issue that C++20 added std::bit_cast:
#include <bit>     // Required for std::bit_cast
#include <array>   // Required for std::array
#include <cstdint> // Required for std::uint32_t
// ... boost::endian headers ...

std::uint32_t decode(const void* p) {
  // 1. Cast 'p' to a pointer to an array of 4 bytes.
  //    This cast itself is just a reinterpretation; no memory is read.
  //    Why an array of 4 bytes and not const unsigned char*? std::bit_cast
  //    has a safety check requiring sizeof(From) == sizeof(To), and
  //    sizeof(unsigned char) is just one byte, not 4.
  const auto* byte_ptr = reinterpret_cast<const std::array<unsigned char, 4>*>(p);

  // 2. Dereference the pointer.
  //    This performs a *safe* copy of 4 bytes from the (potentially unaligned)
  //    address 'p' into a local, *aligned* 'bytes' object.
  std::array<unsigned char, 4> bytes = *byte_ptr;

  // 3. Reinterpret the bits of the byte array as a uint32_t.
  std::uint32_t v = std::bit_cast<std::uint32_t>(bytes);

  // 4. Fix endianness, same as before.
  boost::endian::little_to_native_inplace(v);
  return v;
}


Classic time-travel compiler optimization bug

This happens because the compiler sees that p is dereferenced at foo(*p); if that didn't crash, p cannot be a null pointer, so the if (!p) check is eliminated completely.
Raymond Chen has a post[Undefined behavior can result in time travel (among other things, but time travel is the funkiest)] on this.

There's a compiler flag, -fno-delete-null-pointer-checks, which the Chromium project enables by default.


Compiler might not honor your code

This function compiles down to a bare ret: under the x86-64 calling convention the __m128 argument arrives in xmm0, and a float return value is also expected in xmm0, so _mm_cvtss_f32 needs no instructions at all.
#include <xmmintrin.h> // _mm_cvtss_f32

float get_first_element(__m128 v)
{
  return _mm_cvtss_f32(v);
}

_Z17get_first_elementDv4_f:
.LFB6474:
.cfi_startproc
endbr64
ret
.cfi_endproc



Standard ambiguity

When you are faced with surprising cross-platform behavior, it can save you a lot of time to refer back to the standard (every word matters):
// get the epoch
auto epoch =
  std::chrono::system_clock::now().time_since_epoch();
// send it to a peer
send_network(epoch);
Until C++20
  • The epoch of system_clock is unspecified, but most implementations use Unix Time
    (i.e., time since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds).

Since C++20
  • system_clock measures Unix Time (i.e., time since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds).


Alignment

  • alignas(64) at the class level specifies that the object must be aligned on a 64-byte boundary.
  • In C++20 you can use alignas() on data members, with no need for manual padding.
Wrong:
// alignas applies to the struct, not the data members, so false sharing can
// still occur: one and two end up adjacent in the same cache line.
struct alignas(64) my_struct {
  std::atomic<int> one;
  std::atomic<int> two;
}; 

Correct:
P.S. alignas(std::hardware_destructive_interference_size) won't work with the Mac compiler (as of 2025).
struct my_struct {
  alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
  alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
}; 

Lessons learned


E.g., malloc_trim should be called manually before the thread that allocated the arena memory goes to sleep; otherwise, the freed memory will not be returned to the system. A thread_local context object would usually have avoided this issue.









Oct 25, 2025

[Unix IO] minute for IO models - W. Richard Stevens

Blocking IO


Nonblocking I/O Model


I/O Multiplexing Model
Disadvantage: using select requires two system calls (select and recvfrom) instead of one
Advantage: we can wait for more than one descriptor to be ready (see the select function later in this chapter)




Signal-Driven I/O Model

Signal-driven I/O is rarely used in modern applications because its disadvantages generally outweigh its benefits. Newer APIs like epoll (Linux), kqueue (BSD/macOS), and IOCP (Windows) are far superior.

  1. Complexity of Signal Handling: Dealing with signals is notoriously difficult and error-prone. Signal handlers have many restrictions (e.g., only a limited set of functions, known as async-signal-safe functions, can be safely called from within them).

  2. Unreliable Signal Queuing: Signals are not queued. If two I/O events occur in rapid succession, the kernel might only deliver a single SIGIO signal. This means the signal handler must be written in a loop to read (or write) until the operation would block (returning an EWOULDBLOCK or EAGAIN error). Forgetting this leads to lost I/O events.

  3. Complexity in Multithreading: Signal handling in a multithreaded program is extremely complex. It's often unclear which thread will receive the signal, leading to difficult synchronization problems.

  4. Still a Synchronous Model: According to the POSIX standard, signal-driven I/O is still a synchronous I/O model. While the notification is asynchronous, the actual I/O call (e.g., recvfrom()) is initiated by the process and can still block in the signal handler or main loop. This is different from true asynchronous I/O (AIO), where the kernel performs the entire operation (including the data copy) and only notifies the process upon completion.





Asynchronous I/O Model
The main difference between this model and the signal-driven I/O model is that with signal-driven I/O, the kernel tells us when an I/O operation can be initiated, but with asynchronous I/O, the kernel tells us when an I/O operation is complete.
Advantage:
  • True Asynchronicity: This is the only model (besides the modern io_uring) that is truly asynchronous. The application is completely unblocked and can perform other computations while the I/O is in progress.
  • Parallel I/O and Computation: It allows an application to overlap its computations with its I/O operations, which can lead to significant performance gains, especially in data-intensive applications like database servers.
  • Request Queuing: You can submit (queue) multiple I/O requests to the kernel at once, allowing the kernel to potentially optimize the scheduling of these operations (e.g., re-ordering disk reads).
Disadvantages:
The POSIX AIO model, despite its theoretical benefits, is rarely used and generally not recommended on Linux for several critical reasons:
  • Poor Linux Implementation: This is the biggest problem. The standard glibc implementation of POSIX AIO is not a true kernel-level AIO. Instead, it's implemented in user-space by creating a pool of worker threads. When you call aio_read(), it just hands the request to one of these hidden threads, which then performs a normal blocking read(). This adds all the overhead of threading and synchronization, often making it slower and more resource-intensive than just managing your own thread pool.
  • Limited to Disk Files: Even the "true" kernel AIO support (which requires using O_DIRECT) works only for disk files. It does not work for network sockets. For high-performance networking, epoll (I/O multiplexing) is the standard.
  • API Complexity: The API is complex, requiring you to manage aiocb (AIO control block) structures for every request and handle notifications, which often fall back to signals (with all their associated problems) or require you to poll for completion.
  • Superseded by io_uring: On modern Linux, AIO is considered obsolete. io_uring is the modern, high-performance interface for true asynchronous I/O. It works for both file I/O and network I/O, is vastly more efficient, and is designed to eliminate the flaws of POSIX AIO.



Comparison chart:




Oct 17, 2025

[C++] SFINAE only match to signature (shallow match)

Notice the difference between ApplyIndexForConstexpr and ApplyIndexForConstexprElse. SFINAE only matches on the signature (a shallow match): `ApplyIndexForConstexpr` generates two explicit template instantiations, while `ApplyIndexForConstexprElse` generates a single one, because its `else` chain lets the compiler discard the recursive branch once the first condition holds.
#include <functional>
#include <iostream>
#include <optional>


template <size_t I, typename F>
constexpr std::optional<int> ApplyIndexForConstexpr(F f) {
    if constexpr(sizeof(F) == 1){
        return I;
    } 
    if constexpr(I - 1 == 0){
        return std::nullopt;
    } else {
        return ApplyIndexForConstexpr<I-1>(f);
    }
}

template <size_t I, typename F>
constexpr std::optional<int> ApplyIndexForConstexprElse(F f) {
    if constexpr(sizeof(F) == 1){
        return I;
    } else if constexpr(I - 1 == 0){
        return std::nullopt;
    } else {
        return ApplyIndexForConstexpr<I-1>(f);
    }
}

int main() {
 auto run = []{};
 ApplyIndexForConstexpr<2>(run);
 ApplyIndexForConstexprElse<2>(run);
}
Generated code:
#include <functional>
#include <iostream>
#include <optional>

template<size_t I, typename F>
inline constexpr std::optional<int> ApplyIndexForConstexpr(F f)
{
  if constexpr(sizeof(F) == 1) {
    return std::optional<int>(I);
  } 
  
  if constexpr((I - 1) == 0) {
    return std::optional<int>(std::nullopt_t(std::nullopt));
  } else /* constexpr */ {
    return ApplyIndexForConstexpr<I - 1>(f);
  } 
  
}

/* First instantiated from: insights.cpp:31 */
#ifdef INSIGHTS_USE_TEMPLATE
template<>
inline constexpr std::optional<int> ApplyIndexForConstexpr<2, __lambda_30_13>(__lambda_30_13 f)
{
  if constexpr(true) {
    return std::optional<int>(2UL);
  } 
  
  if constexpr(false) {
  } else /* constexpr */ {
    return ApplyIndexForConstexpr<2UL - 1>(__lambda_30_13(f));
  } 
  
}
#endif


/* First instantiated from: insights.cpp:14 */
#ifdef INSIGHTS_USE_TEMPLATE
template<>
inline constexpr std::optional<int> ApplyIndexForConstexpr<1, __lambda_30_13>(__lambda_30_13 f)
{
  if constexpr(true) {
    return std::optional<int>(1UL);
  } 
  
  if constexpr(true) {
    return std::optional<int>(std::nullopt_t(std::nullopt));
  } else /* constexpr */ {
  } 
  
}
#endif


template<size_t I, typename F>
inline constexpr std::optional<int> ApplyIndexForConstexprElse(F f)
{
  if constexpr(sizeof(F) == 1) {
    return std::optional<int>(I);
  } else /* constexpr */ {
    if constexpr((I - 1) == 0) {
      return std::optional<int>(std::nullopt_t(std::nullopt));
    } else /* constexpr */ {
      return ApplyIndexForConstexpr<I - 1>(f);
    } 
    
  } 
  
}

/* First instantiated from: insights.cpp:32 */
#ifdef INSIGHTS_USE_TEMPLATE
template<>
inline constexpr std::optional<int> ApplyIndexForConstexprElse<2, __lambda_30_13>(__lambda_30_13 f)
{
  if constexpr(true) {
    return std::optional<int>(2UL);
  } else /* constexpr */ {
  } 
  
}
#endif


int main()
{
    
  class __lambda_30_13
  {
    public: 
    inline /*constexpr */ void operator()() const
    {
    }
    
    using retType_30_13 = auto (*)() -> void;
    inline constexpr operator retType_30_13 () const noexcept
    {
      return __invoke;
    };
    
    private: 
    static inline /*constexpr */ void __invoke()
    {
      __lambda_30_13{}.operator()();
    }
    
    public: 
    // inline /*constexpr */ __lambda_30_13(const __lambda_30_13 &) noexcept = default;
    
  };
  
  __lambda_30_13 run = __lambda_30_13{};
  ApplyIndexForConstexpr<2>(__lambda_30_13(run));
  ApplyIndexForConstexprElse<2>(__lambda_30_13(run));
  return 0;
}

[C++][template] Double-checked stop technique

godbolt:
https://godbolt.org/z/Yenv16xzj

#include <functional>
#include <iostream>
#include <optional>


template <size_t I, typename F>
constexpr std::optional<int> ApplyIndexForOverflow(F f) {
    return ApplyIndexForOverflow<I - 1>(f);
}

template <size_t I, typename F>
constexpr std::optional<int> ApplyIndexFor(F f) {
    if (I == 0) {
        return std::nullopt;
    }
    // double-checked stop: without clamping the template argument, the
    // compiler would try to instantiate an unbounded chain of template
    // instances (as in `ApplyIndexForOverflow` above) and exceed the
    // instantiation depth at compile time.
    return ApplyIndexFor<(I == 0 ? 0 : I - 1 )>(f);
}

template <size_t I, typename F>
constexpr std::optional<int> ApplyIndexForConstexpr(F f) {
    if constexpr(sizeof(F) == 1){
        return I;
    }
    if constexpr(I - 1 == 0){
        return std::nullopt;
    } else {
        return ApplyIndexForConstexpr<I-1>(f);
    }
}

int main() {
 auto run = []{};
 ApplyIndexForOverflow<100>(run);
 ApplyIndexFor<100>(run);
 ApplyIndexForConstexpr<100>(run);
}

Oct 4, 2025

[algorithm] trampoline pattern

Using a trampoline to avoid stack overflow in deep recursion (when tail-call optimization is not applicable):
#include <iostream>
#include <vector>
#include <numeric>
#include <variant>
#include <functional>
#include <utility>

// --- Bounce, Step, and trampoline definitions ---

template<typename T>
struct Bounce;

template<typename T>
using Step = std::variant<T, Bounce<T>>;

template<typename T>
struct Bounce {
    std::function<Step<T>()> thunk;
};

template<typename T>
T trampoline(Step<T> first_step) {
    Step<T> current_step = std::move(first_step);
    while (std::holds_alternative<Bounce<T>>(current_step)) {
        current_step = std::get<Bounce<T>>(current_step).thunk();
    }
    return std::get<T>(current_step);
}

// --- The trampolined sum function ---

Step<long> sum_trampolined(const std::vector<long>& data, size_t index, long current_sum) {
    if (index == data.size()) {
        return current_sum;
    }
    return Bounce<long>{
        [=]() {
            return sum_trampolined(data, index + 1, current_sum + data[index]);
        }
    };
}


int main() {
    // This will now work without crashing!
    std::vector<long> large_vec(200000, 1);

    // To start the process, we create the very first step.
    Step<long> first_step = sum_trampolined(large_vec, 0, 0);

    // The trampoline function runs the computation to completion.
    long total = trampoline(first_step);

    std::cout << "Trampolined sum of large vector: " << total << std::endl;
    std::cout << "The program finished successfully." << std::endl;

    return 0;
}