May 15, 2025

[Book] Programming massively parallel processors(PMPP) - Wen-mei Hwu, reading minute - ch1 - ch6

Reference:

CUDA C++ Programming Guide
NVIDIA Nsight Compute CUDA code optimization

[virtual memory] recap

Intel TBB
Intel TBB Task Scheduler
[oneTBB] accessor note

Error handling recap; TLPI
EINTR and What It Is Good For

kernel namespace recap
https://vsdmars.blogspot.com/2018/12/linuxnamespace-wrap-up.html
https://vsdmars.blogspot.com/2018/12/linuxnamespace-mount-mnt.html
https://vsdmars.blogspot.com/2018/06/linuxkernel-namespace.html

cache
[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan
[Go][design] high through-put low contention in memory cache

Locking
[futex] futex skim through
[Concurrency] [C++][Go] Wrap up - 2018
[C++] memory model in depth; memory spilling.

Algorithm/Implementation
Under my github/leetcode project



Software abstraction (CUDA runtime)

A similar idea exists in Go, ref: [Go][note] Analysis of the Go runtime scheduler paper note. The difference comes from Go/C++ being general-purpose languages running on a CPU (sequential), while CUDA/OpenCL are C/C++ extensions running on a GPU. Because the underlying hardware architectures differ, the abstractions differ as well.


SPMD single-program multiple-data 
host CPU-based code
kernel GPU device code / function, more details later. Basically the same code / IR run in parallel by many threads.
grid the group of all threads launched by one kernel call; a grid is organized into blocks.
_h naming convention for a host variable in CPU code
_d naming convention for a device variable (e.g. a pointer to device memory) in CPU code
__host__  callable from host, executed on host, executed by a host thread (e.g. a Linux thread).
__global__ callable from host or device, executed on device, executed by a grid of device threads.
__device__ callable from device, executed on device, executed by the calling device thread.
If a function is declared with both the __host__ and __device__ qualifiers, NVCC generates two versions, one for host and one for device.
block a group of threads; the size is normally a multiple of 32 (for hardware-efficiency reasons), and all blocks in a grid have the same size. Threads in a block can execute in any order with respect to each other.
SM streaming multiprocessor; each SM has several processing units called CUDA cores. It is designed to execute all threads in a warp following the single-instruction multiple-data (SIMD) model.
HBM high-bandwidth memory
Warp a warp groups 32 threads together; a block of threads is partitioned into warps of 32 threads each. Scheduling is done per warp. Also think of it as single-instruction, multiple-threads (SIMT).

FLOP floating-point operation
FLOP/B FLOP-to-byte ratio.
GPU global memory bandwidth example: 1555 GB/second; 1555 GB/s * 0.25 FLOP/B = 389 GFLOPS

const readonly variables

 blockIdx index of the block within the grid (the "area code" in the book's phone-number analogy)
 blockDim number of threads per block
 threadIdx index of the thread within its block (the "local phone line")
These three built-in variables let each thread in the kernel work out which data element it operates on.


OUR_GLOBAL_FUNC<<<number of blocks, threads per block>>>(args...);
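Tying the qualifiers, the built-in index variables, and the launch syntax together, a minimal vector-add sketch in the spirit of the chapter (my own paraphrase, names like vecAddKernel are mine; error checking omitted):

__global__ void vecAddKernel(const float* A_d, const float* B_d, float* C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index: "area code" * block size + "phone line"
    if (i < n)                                     // boundary check for the last, partially filled block
        C_d[i] = A_d[i] + B_d[i];
}

void vecAdd(const float* A_h, const float* B_h, float* C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<(n + 255) / 256, 256>>>(A_d, B_d, C_d, n); // ceil(n/256.0) blocks, 256 threads per block

    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}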


Thread Scheduling

Block scheduling

When a kernel is called, the CUDA runtime launches a grid of threads that execute the kernel code. These threads are assigned to SMs on a block-by-block basis: all threads in a block are simultaneously assigned to the same SM. Some blocks are reserved for the system to execute, so not all of an SM's blocks are scheduled for the user kernel.
Multiple blocks are likely to be simultaneously assigned to the same SM. The idea behind warp scheduling is that the threads inside the same warp run the same instructions (same kernel), so an instruction fetch is a one-time effort for the whole warp. Also, the data that threads in the same warp access tends to be contiguous, and therefore prefetchable/cache friendly.
Moreover, threads in the same block can interact with each other in ways that threads across different blocks cannot, such as barrier synchronization.

synchronization / transparent scalability

  __syncthreads() blocks until every thread in the same block reaches that code location. If a __syncthreads() statement is present, it must be executed by all threads in a block, i.e.:
__global__ void incorrect_barrier_example(int n) {
    if (threadIdx.x % 2) {
        __syncthreads(); // sync point-1
    } else {
        __syncthreads(); // sync point-2
    }
}
This is wrong because not all threads reach the same barrier synchronization point.
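A corrected sketch of the example above (mine): do the divergent work inside the branch, but make sure the barrier itself is reached by every thread of the block.

__global__ void correct_barrier_example(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n && threadIdx.x % 2)
        data[i] *= 2.0f;  // divergent work is fine...

    __syncthreads();      // ...as long as all threads of the block reach the same barrier
}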
Not only do all threads in a block have to be assigned to the same SM, they also need to be assigned to that SM simultaneously, i.e. a block can begin execution only when the runtime system has secured all the resources needed by all threads in the block to complete execution.

The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability.


Control Divergence

Execution works well when either all threads in a warp take the if-path or all take the else-path. Otherwise the warp has to execute the branch in two passes: one pass runs the if-path code while the else-path threads of the same warp sit idle (no-op), and the other pass runs the else-path code while the if-path threads sit idle. On older architectures these two passes run in sequence; on newer architectures they can make progress independently, which is called independent thread scheduling.
Because of this, avoid branching on threadIdx; use the data for divergence control instead, which also relates to data locality in the cache. One important fact: the performance impact of control divergence decreases as the size of the vectors being processed increases.
One cannot assume that all threads in a warp have the same execution timing (even though they run the same fetched instructions). Therefore, use the __syncwarp() barrier synchronization when warp-level ordering is needed.
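A hedged sketch (mine) of the two kinds of branching discussed above: the boundary check diverges only in the last, partially filled warp, so its cost shrinks as n grows, while branching on threadIdx parity splits every warp into two passes.

__global__ void divergence_example(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {                 // data/boundary check: diverges only in the last warp
        if (threadIdx.x % 2)     // parity check: splits every warp into two passes
            data[i] *= 2.0f;
        else
            data[i] *= 0.5f;
    }

    // Every thread of the warp reaches this point, so the (default full-mask)
    // warp barrier is valid; do not rely on implicit lock-step timing.
    __syncwarp();
}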



Latency tolerance

Compare with the simple case, i.e. a CPU: context switches on a single core are needed because resources (registers, cache, etc.) are limited, and the OS makes sure code runs in a preemptive-scheduling fashion.
Likewise, an SM only has enough execution units to execute a subset of all the threads assigned to it at any point in time.
In recent designs, each SM can execute instructions for only a small number of warps at any given point in time.
GPU SMs achieve zero-overhead scheduling by holding all the execution states for the assigned warps in hardware registers, so there is no need to save and restore state when switching from one warp to another.
This allows the GPU to oversubscribe threads to SMs.
Automatic/Local variables declared in the kernel are placed into registers.
Each SM in A100 GPU has 65,536 registers.
65536/2048(threads) = 32 registers per thread/kernel.

In some cases, the compiler may perform register spilling to reduce the per-thread register requirement and thus raise the level of occupancy. However, this can increase latency because data has to be fetched from memory instead of directly from registers.

The cudaDeviceProp struct has a bunch of fields describing the hardware spec (see the query sketch after the field list).
e.g.
 multiProcessorCount Number of multiprocessors on device
 clockRate Clock frequency in kilohertz
 regsPerBlock 32-bit registers available per block
 warpSize  Warp size in threads
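A minimal query sketch using the CUDA runtime API (cudaGetDeviceCount / cudaGetDeviceProperties; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("device %d: %s\n", dev, prop.name);
        std::printf("  SMs: %d  clock: %d kHz\n", prop.multiProcessorCount, prop.clockRate);
        std::printf("  regs/block: %d  warp size: %d\n", prop.regsPerBlock, prop.warpSize);
    }
}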


Variable declaration scope and lifetime

automatic scalar variables    [mem]register    [scope]thread    [lifetime]grid
automatic array variables    [mem]local    [scope]thread    [lifetime]grid
__device__ __shared__    [mem]shared    [scope]block    [lifetime]grid
__device__    [mem]global    [scope]grid    [lifetime]application
__device__ __constant__    [mem]constant    [scope]grid    [lifetime]application
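A small sketch (mine, names made up) pairing each row of the table with a declaration; assumes the kernel is launched with <= 256 threads per block:

__constant__ float coeff[16];    // constant memory, grid scope, application lifetime
__device__ int visit_count;      // global memory, grid scope, application lifetime

__global__ void scope_example(float* out, int n) {
    __shared__ float tile[256];  // shared memory, block scope, lifetime of the grid
    float local = 1.0f;          // automatic scalar -> register, thread scope
    float arr[4] = {0, 0, 0, 0}; // automatic array  -> local memory, thread scope

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = local + arr[0] + coeff[0];
    __syncthreads();
    if (i < n)
        out[i] = tile[threadIdx.x];
}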

API





[C++] global const variable initialization

Reference:

[C++][C++20] consteval / constexpr
[C++/Rust] use of thread_local in code
[Note] linking notes


Initializing order

  1. Perform constant initialization and zero initialization.
  2. Perform dynamic initialization.
  3. Start executing main()  How main() is executed on Linux
  4. Perform dynamic initialization of function-local static variables, as needed.
  5. (Note: it is impossible to read uninitialized global memory, because zero initialization always happens first.)
  6. Destroy function-local static variables that were initialized, in reverse order.
  7. Destroy the other globals in reverse order.

header.h
inline int global = compute();
// Always OK: `global` is defined above `h` in the same header, and `h` is inline
// (the compiler takes care of ordering their initialization).
inline int h = global;

tu.cpp
#include "header.h"
int a = global; // not certain: cross-TU ordering does not guarantee `global` is initialized before `a`.

Templated variables are init. at some point before main()

template<typename T>
int type_id = get_id<T>();

template<typename T>
struct templ {
 static inline int type_id = get_id<T>();
};

Compiler is allowed to perform dynamic initialization at compile-time.

int f(int i) {
  return 2 * i; // not constexpr
}

int the_answer = f(21); // might happen at compile-time; even not constexpr.

Compiler is allowed to perform dynamic init. after main() has started executing.

i.e. an unused global variable might not be initialized at all.
const auto start_time = std::chrono::system_clock::now();

int main() {
  auto t = start_time; // might be init. at this point.
}

The static initialization order fiasco

Dynamic initialization order of globals in different translation units is not specified.

Guideline
Unless otherwise allowed by its documentation, do not access global state in the constructor of another global.
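A hedged two-TU sketch of the fiasco (file names and types are made up; constructor and log() bodies omitted): whether `logger` is constructed before `user` depends on unspecified cross-TU initialization order.

logger.h
struct Logger { Logger(); void log(const char* msg); };
extern Logger logger;

logger.cpp
#include "logger.h"
Logger logger; // dynamically initialized

user.cpp
#include "logger.h"
struct User {
  User() { logger.log("User ctor"); } // may run before `logger` is constructed
};
User user;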

* Solution 1

Globals initialized by some constant expression are initialized during static initialization without dynamic initialization.
Constant expressions cannot access globals that are dynamically initialized.

Guideline
Whenever possible, make global variables constexpr.
constexpr float pi = 3.14;
constexpr std::size_t threshold = 3;

Guideline
Whenever possible, use constant initialization.
// https://en.cppreference.com/w/cpp/thread/mutex/mutex
std::mutex mutex; // Has constexpr constructor.


Always look at the code paths that call functions which are not constexpr.

e.g.
constexpr int compute(bool use_default) {
  if (use_default)
    return 42; // constexpr
  else
    return std::getchar(); // runtime.
}

int default_value = compute(true); // constexpr
int non_default = compute(false); // runtime

C++20

constinit = constexpr - const

  • variable is initialized using constant initialization (or error)
  • variable is not const
thus:
constinit std::mutex mutex; // OK.

constinit int default_value = compute(true); // constant-initialized; OK.
constinit int non_default = compute(false); // runtime, thus error.

Guideline
Declare global variables constinit to check for initialization problems.

Guideline
Try to add a constexpr constructor to every type.
e.g.
Default constructors of containers(with default allocator)
Default constructors of resource handles(files, sockets, ...)

Guideline
Do not use constinit if need to do lazy initialization.

* Solution 2:

Lazy initialization.

Global& global() {
  static Global g;
  return g;
}

// The initialization point of this cached global is not deterministic.
Global& cached = global(); // Bad because it can't be constinit.

Guideline
Never cache a (reference to) a function-local static in a global variable.
template<typename Tag, typename T>
class lazy_init {
  public:
    constexpr lazy_init() = default;

    T& get() {
      static T global;
      return global;
    }

    T& operator*() { return get();}
    T* operator->() { return &get();}
};

constinit lazy_init<TAG, Logger> global; // OK
global->log(); // The first call takes time, beware.
logger.h
class Logger{...};

extern constinit lazy_init<TAG, Logger> global;
logger.cpp
constinit lazy_init<TAG, Logger> global;

Global variable destruction

Global variables are destroyed in reverse dynamic initialization order.
Beware: a constinit variable is always initialized first (during static initialization); that is its purpose.
A a;
constinit B b;

void func(){
  static C c;
}

int main() {
  func();
}
Resulting order:
init. b
init. a
init. c
destroy c
destroy b
destroy a


Destruction order of globals in different TUs is not specified.

Guideline
Unless otherwise allowed by its documentation, do not access global state in the destructor of another global.
This applies to constinit globals.

Rule
  • The order of dynamic initialization is not specified.
  • Exception: within one translation unit, variables are initialized from top to bottom.
    1) You must include the header that declares a global before using it, so
    2) every global defined in that header is initialized before your global.
  • Do not use function-local static variables if there's a chance they might be used after main()
  • Do not use function-local static variables.
template<typename Tag, typename T>
class lazy_init {
  public:
    constexpr lazy_init() = default;

    T& get() {
      // technique: 1) thread safe, 2) initializes the data member exactly once.
      static bool dummy = (storage_.initialize(), true);
      return storage_.get();
    }
    }
  private:
    storage<T> storage_;
};

constinit lazy_init<struct global_tag, std::string> global;
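The storage<T> helper used above is not defined in these notes; a minimal sketch of what it is assumed to provide (raw aligned bytes plus explicit initialize/get/destroy) might look like:

#include <new>

template<typename T>
class storage {
  public:
    constexpr storage() = default;

    template<typename... Args>
    void initialize(Args&&... args) {
      // placement new starts the lifetime of the T inside the raw buffer
      ::new(static_cast<void*>(buffer_)) T(static_cast<Args&&>(args)...);
    }

    T& get() { return *std::launder(reinterpret_cast<T*>(buffer_)); }

    void destroy() { get().~T(); }

  private:
    alignas(T) unsigned char buffer_[sizeof(T)];
};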

Nifty counters

  • Ensure the nifty counter object is included in the definition file.
  • The nifty counter approach doesn't work for templated objects.

header.h
extern constinit nifty_init<Global> global_init; // declaration; definition is in some .cpp file.
// static int dummy = (global_init.initialize(), 0);
static nifty_counter_for<global_init> dummy; // definition
inline constinit Global& global = global_init.reference();

constinit Global copy = global; // compiler error
a.cpp
#include "header.h"
int a = global.foo();
b.cpp
#include "header.h"
int b = global.foo();
template<typename T>
class nifty_init {
public:
 constexpr nifty_init() = default;

 void initialize() {
 	if (counter_++ == 0)
		storage_.initialize();
 }

 void destroy() {
	if (--counter_ == 0)
		storage_.get().~T();
 }

 constexpr T& reference() { return storage_.get(); }

private:
 int counter_ = 0;
 storage<T> storage_;
};


template<auto& NiftyInitT>
struct nifty_counter_for {

  nifty_counter_for() {
  	NiftyInitT.initialize();
  }

  ~nifty_counter_for() {
    NiftyInitT.destroy();
  }
};

Conclusion

  • constinit is not always applicable
  • lazy initialization has to leak
  • nifty counters are black magic
  • Or, simply don't do anything before or after main()


Manual initialization (Google's way)

Guideline
Use manual initialization either for all globals in your project, or none.
And remember to initialize them all!


header.h
template<typename T>
class manual_init {
public:
  constexpr manual_init() = default;
  void initialize() { storage_.initialize(); }
  void destroy() { storage_.destroy(); }

  T& get() { return storage_.get(); }

private:
 storage<T> storage_;
};


template<auto& ... Init>
struct scoped_initializer {
  scoped_initializer() {
    (Init.initialize(), ...);
  }

  ~scoped_initializer() {
    (Init.destroy(), ...);
  }
};

main.cpp
#include "header.h"

constinit manual_init<Global> global;

int main() {
  scoped_initializer<global> initializer;
  global.get().foo();
}

Apr 29, 2025

[clang] --system-header-prefix flag

Controlling Diagnostics in System Headers

Pass the --system-header-prefix flag to indicate which headers are system headers, so that warnings emitted from those headers are ignored.

Its main purpose is to instruct the compiler to treat header files located in directories matching a specified prefix as "system headers."

Key Functionality

When a header file is designated as a system header, compilers like Clang typically suppress warnings that originate from within that header. This behavior is desirable because:

  • Third-Party Libraries: Developers often use external libraries with their own header files. These headers might generate warnings with the project's specific compiler settings, but the project developer cannot or should not modify these library headers.
  • Standard Library and OS Headers: Headers from the C++ Standard Library, C Standard Library, or the operating system itself are inherently system-level. Warnings from these are generally not actionable by the application developer.
  • Reduced Noise: Suppressing these warnings allows developers to focus on issues within their own codebase.
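For example, assuming third-party headers are included via a "third_party/..." path (hypothetical project layout), the prefix matches the include path as written in the source:

clang++ -Wall -Wextra -I. --system-header-prefix=third_party/ main.cpp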




#pragma clang enable-system-header-diagnostics

Apr 27, 2025

[C++] Object Lifetimes reading minute

Reference:
A Deep Dive Into C++ Object Lifetimes - Jonathan Müller - C++Now 2024
[C++] null pointer and memory laundering.
[C++] transparently replaceable
[Book]Inside the C++ Object Model
nifty counter


Category

Storage (i.e. either have in memory or in the instruction)

  • unit, in byte. Every byte has unique address.
  • What's on the storage can be anything.
  • When storage for an object with automatic or dynamic storage duration is obtained,
    the object has an indeterminate value, and if no initialization is performed for the object,
    that object retains an indeterminate value until that value is replaced. If an indeterminate
    value is produced by an evaluation, the behavior is undefined.
  • In C++26, a read of an indeterminate value is erroneous behavior, not undefined behavior. Ref: P2795

Duration

  • minimum potential lifetime of the storage containing the object.
  • Static, thread, and automatic storage durations are associated with objects introduced by declarations.

automatic storage durations

  • Lasts until the block in which they are created exits.

static storage duration

  • namespace scope, first declared with the static or extern keywords. Last the duration of the program.
  • function-local static vs. global scope
  • constinit vs. dynamic initialization
  • nifty counters, module dependency graph, inline variables.

thread storage duration

  • thread_local keyword. Last for the duration of the thread they are created.

Value (i.e. being initialized)

Type (determines the size of the storage allotted.)

  • Mapping the bits to the interpretation.

Object 

  • a particular type and occupies a region of storage at a particular
    address where its value is stored.
  • A function is not an object.
  • A reference is not an object. A pointer, however, is an object.

Lifetime

  • Lifetime of an object is a runtime property of the object.
  • Before the lifetime of an object starts and after its lifetime ends
    there are significant restrictions on the use of the object.

Object lifetime spans

  1. storage is allocated
  2. object is initialized, the lifetime starts
  3. object is used, its value changed or read.
  4. object is destroyed, the lifetime ends.
  5. storage is deallocated.
Object can be created: This does not necessarily start the lifetime yet.
Object can be destroyed: This ends the lifetime.

The lifetime of an object of type T begins when

  1. storage with the proper alignment and size for type T is obtained, and
  2. its initialization (if any) is complete.

int main() {
  int* i = new int; // indeterminate value; `new int()` would value-initialize it to 0.
  std::print("{}\n", *i); // UB
}


Whenever a prvalue is used in a context where an xvalue is expected, a temporary object is created

  1. binding a reference to a prvalue
  2. member-access on a prvalue
  3. using an array prvalue
  4. discarding the result of a function call that returns a prvalue.
Temporary objects are destroyed as the last step in evaluating the full-expression that contains the point where they were created.
std::vector<std::string> get_strings();

int main() {
    for (auto&& str: get_strings()) {
        std::print("{}\n", str);
    } // temporary destroyed here.

    // lifetime extended.
    auto&& str_vec = get_strings();

    // C++23, only for 'range for'
    // https://en.cppreference.com/w/cpp/language/lifetime
    for (auto&& c : get_strings()[0]) {
        std::print("{}\n", c);
    } // temporary destroyed here.

    // this is dangling
    // auto&& str = get_strings()[0];
}

void* memory = ::operator new(sizeof(int));

int* ptr = ::new(memory) int(11);
std::destroy_at(ptr);
::operator delete(memory);

alignas
alignas(int) unsigned char buffer[sizeof(int)];
int* ptr = ::new(static_cast<void*>(buffer)) int(11);
std::destroy_at(ptr);
int x = 11;
std::destroy_at(&x); // end lifetime
int* ptr = ::new(static_cast<void*>(&x)) int(42);

UB:
You cannot legally reuse the memory of an object originally declared const to construct a new object if that construction modifies the memory. The const promise extends to the storage in this scenario.
  • The C++ standard states ([dcl.type.cv] p4 in C++20, similar rules in earlier versions): "Except that any class member declared mutable can be modified, any attempt to modify an object declared with const-qualified type through a glvalue of other than const-qualified type results in undefined behavior."
  • While you technically ended the lifetime of the original const int object, you are attempting to write (int(42)) into the storage that was originally allocated for an object declared const.
  • The standard effectively forbids reusing the storage of a const object to create a new object if that creation involves modifying the storage. The "const-ness" is associated not just with the object's lifetime but also with the storage it occupied in this specific context.
  • Attempting to write 42 into memory that the compiler might have placed in a read-only segment (because x was const) could lead to a hardware exception (like a segmentation fault).
  • Even if not in read-only memory, the compiler's optimizations might rely on that memory location never changing from 11. Overwriting it violates the assumption.
const int x = 11;
std::destroy_at(&x); // end lifetime
// UB
::new(const_cast<int*>(&x)) int(42);

OK:
const int* ptr = new const int(11);
std::destroy_at(ptr); // end lifetime (destroys the pointee, not the pointer)
::new(const_cast<int*>(ptr)) int(42); // reusing the storage of a heap const object is allowed


transparently replaceable object

T is transparently replaceable by U if

  • T and U use the same storage, and
  • T and U have the same type (ignoring top-level cv-qualifiers)

T is not transparently replaceable if

  • const objects, const heap objects can be fixed through std::launder
  • base classes
  • [[no_unique_address]] members
When replacing sub-objects, (member variables or array elements), the rules apply
recursively to the parent object.
// x can't be in the register.
int x = 11;
std::destroy_at(&x);
::new(static_cast<void*>(&x)) int(42); // transparent replacement.
std::print("{}\n", x); // ok
foo& foo::operator=(const foo& other) {
  std::destroy_at(this);
  ::new(static_cast<void*>(this)) foo(other); // transparent replacement.
  return *this; // ok
}

non-transparent

const int* ptr = new const int(11);
std::destroy_at(ptr);
const int* new_ptr = ::new(const_cast<int*>(ptr)) const int(42); // non-transparent
std::print("{}\n", *new_ptr); // ok
std::print("{}\n", *ptr); // UB


std::launder  

  1. launder is for the /previous/ pointer, not the new one; the compiler always gives the right value through the pointer returned for the new object.
  2. launder updates the provenance of a pointer. (See below about provenance, a compiler-optimization term.)
const int* ptr = new const int(11);
std::destroy_at(ptr);
const int* new_ptr = ::new(const_cast<int*>(ptr)) const int(42); // non-transparent
std::print("{}\n", *new_ptr); // ok
std::print("{}\n", *std::launder(ptr)); // ok
Ref: P3006/Launder less 


Implicitly creating objects (created, but not initialized)

1) std::malloc and variants, ::operator new, std::allocator::allocate and other allocation functions.
int* ptr = static_cast<int*>(std::malloc(sizeof(int))); // create an int, not init.
*ptr = 11;

2) Anything that starts the lifetime of an unsigned char/std::byte array.
alignas(int) unsigned char buffer[sizeof(int)]; // create an int, not init.
int* ptr = std::launder(reinterpret_cast<int*>(buffer)); // P3006, launder can be avoided.
*ptr = 11;

3) std::memcpy(does not handle memory overlap), std::memmove
// creates nothing, because it is a char array, not an unsigned char array.
alignas(int) char buffer[sizeof(int)];
std::memcpy(buffer, &some_int, sizeof(int)); // create an int
int* ptr = std::launder(reinterpret_cast<int*>(buffer));
std::print("{}\n", *ptr);

4) Implementation-defined set of operations like mmap or VirtualAlloc(a M$ thing).
int* ptr = static_cast<int*>(mmap(...));
std::print("{}\n", *ptr);
// creates an int or a float; the compiler "time-travels" back here later to pick one.
alignas(int) unsigned char buffer[sizeof(int)];
if(...)
 *std::launder(reinterpret_cast<int*>(buffer)) = 11;
else
 *std::launder(reinterpret_cast<float*>(buffer)) = 11.1;
// Still UB
int i = 11;
float f = *std::launder(reinterpret_cast<float*>(&i)); // UB: no float object exists there.
struct data {
  std::uint8_t op;
  std::uint32_t a, b, c;
};

void process(unsigned char* buffer, std::size_t size) {
    data* ptr = std::launder(reinterpret_cast<data*>(buffer));
    std::print("{}\n", *ptr); // might be UB depends on how the buffer is created.
}
struct data {
  std::uint8_t op;
  std::uint32_t a, b, c;
};

void process(unsigned char* buffer, std::size_t size) {
    data* ptr = ::new(static_cast<void*>(buffer)) data;
    // ok to dereference, but could be wrong: placement new starts the lifetime of a
    // *new* object, so *ptr might not hold the previous buffer contents.
    std::print("{}\n", ptr->op);
}
// Fix, C++23,
// std::start_lifetime_as https://en.cppreference.com/w/cpp/memory/start_lifetime_as
// std::start_lifetime_as_array<data>(ptr, count);

struct data {
  std::uint8_t op;
  std::uint32_t a, b, c;
};

void process(unsigned char* buffer, std::size_t size) {
    data* ptr = std::start_lifetime_as<data>(buffer);
    std::print("{}\n", *ptr); // ok.
}

So how is start_lifetime_as implemented?
template<typename T>
T* start_lifetime_as(void* ptr) {
    std::memmove(ptr, ptr, sizeof(T));
    return std::launder(static_cast<T*>(ptr));
}


Implicit destruction of objects

The lifetime of an object o of type T ends when

  1. if T is a non-class type, the object is destroyed, or
  2. if T is a class type, the destructor call starts, or
  3. the storage which the object occupies is released, or is reused
    by an object that is not nested within o.
int x = 11;
::new(static_cast<void*>(&x)) int(42); // end + start new lifetime.
std::print("{}\n", x);

alignas(int) unsigned char buffer[sizeof(int)]; // start lifetime
int* ptr = ::new(static_cast<void*>(buffer)) int(11); // end + start new lifetime.
std::print("{}\n", x);

memory leaks are not UB, but just memory leak.

std::string str = "leaking"; // leaked after next line.
::new(static_cast<void*>(&str)) std::string("new str");


Provenance

  1. Each object has a unique provenance.
  2. All objects in an array have the same provenance.
  3. Re-using the memory of an object changes the provenance unless
    the object is transparently replaced. (std::launder)

A pointer T* is logically a pair(address, provenance)

  1. The address is the only thing that is physically observable.
  2. The provenance identifies the object or allocation the pointer was derived from.

A pointer dereference is only valid if

  1. The address is in the range of allowed addresses for the provenance.
  2. The current provenance of that address is the same as the provenance of the pointer.

The pointer provenance cannot be changed using pointer arithmetic.

Thus e.g.
void do_sth(int* ptr); // forward declaration

int foo() {
  int x, y;
  y = 11;

  if(&x + 1 == &y) {
    do_sth(&x);
  }

  return y;
}

void do_sth(int* ptr) {
  *(ptr + 1) = 42; // UB, address not in range.
}
const int* ptr = new const int(11); // provenance A
std::destroy_at(ptr);
const int* new_ptr = ::new(const_cast<int*>(ptr)) const int(42); // non-transparent, provenance B
std::print("{}\n", *new_ptr); // ok
std::print("{}\n", *ptr); // UB: the provenance does not match; launder comes into play.
// fix
std::print("{}\n", *std::launder(ptr)); // launder updates the provenance.

Reference has provenance as well.

const int* ptr = new const int(11); // provenance A
const int& ref = *ptr; // provenance B
std::destroy_at(ptr);
::new(const_cast<int*>(ptr)) const int(42); // non-transparent, provenance C

std::print("{}\n", ref); // UB: provenance B != provenance C
// fix
std::print("{}\n", *std::launder(&ref)); // launder updates the provenance.



Type punning

reinterpret_cast between unrelated types can be done but
dereferencing the cast pointer is UB.
int i = 11;
float* f_ptr = ::new(static_cast<void*>(&i)) float(3.14);
std::print("{}\n", *f_ptr); // ok
std::print("{}\n", i); // UB
int i = 11;
float* f_ptr = std::start_lifetime_as<float>(&i);
std::print("{}\n", *f_ptr); // ok
std::print("{}\n", i); // UB

Be careful about getting the pointer

int i = 11;
float* f_ptr = reinterpret_cast<float*>(&i);
::new(static_cast<void*>(&i)) float(3.14);
std::print("{}\n", *f_ptr); // UB
int i = 11;
::new(static_cast<void*>(&i)) float(3.14);
float* f_ptr = reinterpret_cast<float*>(&i);
std::print("{}\n", *f_ptr); // UB
int i = 11;
float* f_ptr = ::new(static_cast<void*>(&i)) float(3.14);
std::print("{}\n", *f_ptr); // ok
int i = 11;
float* f_ptr = reinterpret_cast<float*>(&i);
::new(static_cast<void*>(&i)) float(3.14);
std::print("{}\n", *std::launder(f_ptr)); // ok
alignas(int) unsigned char buffer[sizeof(int)];
int* ptr = reinterpret_cast<int*>(buffer);
*ptr = 11; // currently needs to call std::launder but fixed in P3006


When to use std::launder?

When want to re-use the storage of

  1. const heap objects (a const object with static or automatic storage duration cannot be fixed at all; once it's const, it's const for life)
  2. base classes
  3. [[no_unique_address]] members
  4. Or when re-using memory as storage for a different type.

There are exceptions for dereferencing from reinterpret_cast with different types.

i.e.
If a program attempts to access the stored value of an object through a glvalue whose type is not similar to one of the following types, the behavior is undefined:
  1. the dynamic type of the object,
  2. a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
  3. a char, unsigned char, or std::byte type.
int i = 11;
std::print("{}\n", *reinterpret_cast<unsigned*>(&i)); // ok
std::print("{}\n", *reinterpret_cast<std::byte*>(&i)); // ok



Object representation

Allow access to the object representation, the sequence of bytes the object represents in memory.

The code below currently doesn't work, but it is fixed by P1839.
int object = 11;
std::byte* ptr = reinterpret_cast<std::byte*>(&object);
for (auto i = 0uz; i != sizeof(object); ++i) {
    std::print("{:02x} ", static_cast<int>(*ptr++));
}


Type punning via std::memcpy

int i = 11;
float f;
std::memcpy(&f, &i, sizeof(f));
std::print("{}\n", f); // ok
std::print("{}\n", i); // ok
// C++20, std::bit_cast, doing same as std::memcpy, but constexpr
int i = 11;
float f = std::bit_cast<float>(i);
std::print("{}\n", f); // ok
std::print("{}\n", i); // ok


Another exceptions

If two objects are pointer-interconvertible, then they have the same address,
and it is possible to obtain a pointer to one from a pointer to the other via a
reinterpret_cast.

Two objects a and b are pointer-interconvertible if
  •  they are the same object, or
  •  one is a union object and the other is a non-static data member of that object ([class.union]), or
  •  one is a standard-layout class object and the other is the first non-static data member of that object or any base class sub-object of that object ([class.mem]), or
  •  there exists an object c such that a and c are pointer-interconvertible, and c and b are pointer-interconvertible.
struct A {
    int member;
};

A a{.member = 11};
int* i_ptr = reinterpret_cast<int*>(&a);
std::print("{}\n", *i_ptr); // ok
std::print("{}\n", reinterpret_cast<A*>(i_ptr)->member); // ok


Union

union U {
  int i;
  float f;
};

U u{.i = 11};
u.f = 3.14f; // now f is the active member of the union.
std::print("{}\n", u.f); // ok
std::print("{}\n", u.i); // UB
union U {
  struct A {
    int prefix;
    int i;
  } a;
  struct B {
    int prefix2;
    float f;
  } b;
};

U u{.a = {.prefix = 0, .i = 11}};
std::print("{}\n", u.a.prefix); // ok
std::print("{}\n", u.b.prefix2); // ok, due to same address with same /type/.


Take away

Don't rely on implicit object creation

  • Use placement new to explicitly create a new object, thus new provenance.
  • Use std::start_lifetime_as to re-interpret raw bytes as an object, thus new provenance.
  • Whenever possible, use the pointer from placement new and std::start_lifetime_as directly, thus new provenance.
  • Use union { char empty; T t; } instead of alignas(T) unsigned char buffer[sizeof(T)];
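A hedged sketch of that last point: union-based uninitialized storage instead of an alignas byte buffer, so get() needs no cast or std::launder (type and member names are mine):

#include <memory>

template<typename T>
class manual_storage {
  public:
    constexpr manual_storage() : empty_{} {} // no T constructed yet
    ~manual_storage() {}                     // caller destroys the T explicitly

    template<typename... Args>
    T& initialize(Args&&... args) {
      return *std::construct_at(&t_, static_cast<Args&&>(args)...); // starts T's lifetime
    }

    void destroy() { std::destroy_at(&t_); }

    T& get() { return t_; } // no reinterpret_cast / std::launder needed

  private:
    union {
      char empty_;
      T t_;
    };
};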

Mar 25, 2025

[Bits manipulation]

align:

return (value + align_size - 1) & ~(align_size - 1);

has_single_bit:

return x && !(x & (x - 1));

midpoint:

return (a & b) + (a ^ b) / 2;
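A quick sanity check of the three tricks against the standard library (C++20 <bit> and std::midpoint); align_size must be a power of two, and the midpoint form matches std::midpoint when a <= b:

#include <bit>
#include <cassert>
#include <cstdint>
#include <numeric>

constexpr std::uint32_t align_up(std::uint32_t value, std::uint32_t align_size) {
    return (value + align_size - 1) & ~(align_size - 1);
}

int main() {
    assert(align_up(13, 8) == 16);

    std::uint32_t x = 64;
    assert((x && !(x & (x - 1))) == std::has_single_bit(x));

    std::uint32_t a = 11, b = 42;
    assert((a & b) + ((a ^ b) / 2) == std::midpoint(a, b));
}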

Mnemonic: AND with the negative keeps the last set bit; AND with (value minus one) removes it.
Set union A | B 
Set intersection A & B 
Set subtraction A & ~B 
Set negation ALL_BITS ^ A or ~A 
Set bit A |= 1 << bit 
Clear bit A &= ~(1 << bit) 
Test bit (A & 1 << bit) != 0 
Extract last bit A & -A or A & ~(A - 1) or x ^ (x & (x - 1)) 
Remove last bit A & (A - 1) 
Get all 1-bits ~0
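A similar quick check for the set-style operations above:

#include <cassert>
#include <cstdint>

int main() {
    std::uint32_t A = 0b1010'1100;

    assert((A & -A) == 0b100);             // extract last (lowest) set bit
    assert((A & (A - 1)) == 0b1010'1000);  // remove last set bit

    int bit = 1;
    A |= 1u << bit;                        // set bit
    assert((A & (1u << bit)) != 0);        // test bit
    A &= ~(1u << bit);                     // clear bit
    assert((A & (1u << bit)) == 0);
}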