Aug 3, 2025

[C++] coroutine cheat sheet

Key components:

  • Return type
  • Promise type
  • Awaitable type
  • std::coroutine_handle<>

Concept:

`co_yield value` is syntactic sugar for `co_await promise.yield_value(value)`.
The value is thus stored in the promise_type instance directly, instead of having to go through
`Awaitable{value}` and storing the value into the promise_type instance via `Awaitable::await_suspend(handler)`.

The caller can retrieve the value through
`ReturnType::handler_.promise().value_;`
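
A minimal usage sketch tying these together (the coroutine name is hypothetical; handler_ and value_ come from the ReturnType cheat sheet below):

ReturnType r = MyCoroutine();      // suspended at initial_suspend()
r.handler_.resume();               // run until the first co_yield
T v = r.handler_.promise().value_; // the value cached by yield_value()
r.handler_.destroy();              // the frame isn't freed automatically when
                                   // final_suspend() returns suspend_always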

struct std::suspend_always {
  // always suspend.
  bool await_ready() const noexcept { return false; }
  // no-op
  void await_suspend(std::coroutine_handle<>) const noexcept {}
  // no-op; for co_yield the value is already stored in promise_type.value_.
  void await_resume() const noexcept {}
}; // https://en.cppreference.com/w/cpp/coroutine/suspend_always.html

struct std::suspend_never {
  // never suspend.
  bool await_ready() const noexcept { return true; }
  // no-op
  void await_suspend(std::coroutine_handle<>) const noexcept {}
  // no-op
  void await_resume() const noexcept {}
}; // https://en.cppreference.com/w/cpp/coroutine/suspend_never.html


struct Awaitable {
  T value_;
  bool await_ready() { return false; }
  void await_suspend(std::coroutine_handle<promise_type> handler) {
    // Could store the result into handler.promise().RESULT,
    //   or store value_ into handler.promise().RESULT.
    // The handle can be passed into a new thread.
    // When handler.resume() is called, execution restarts from where it
    // previously suspended.
  }

  // handler.resume() from the Awaitable's suspension comes here.
  // Can return a value to the co_await expression inside the coroutine;
  // the value can be obtained from handler.promise().RESULT.
  T await_resume() { return T{}; }

  Awaitable(T value) : value_(value) {}
};



struct ReturnType {
  struct promise_type {
    T value_;
    promise_type(T...); // optional

    ReturnType get_return_object() {
      return {};
      // or
      return {std::coroutine_handle<promise_type>::from_promise(*this)};
    }

    std::suspend_always initial_suspend() { return {}; }

    template<std::convertible_to<T> From> // C++20 concept
    std::suspend_always yield_value(From&& from) {
      value_ = std::forward<From>(from); // caching the result in the promise
      return {};
    }

    // above: startup; below: shutdown

    // Stores a value in the promise; the caller retrieves it through ReturnType.
    // A promise defines either return_value() or return_void(), never both.
    void return_value(T /* or T&&, or const T& */);
    void return_void();

    void unhandled_exception();
    std::suspend_always final_suspend() noexcept;
  };

  ReturnType(std::coroutine_handle<promise_type> handler) : handler_(handler) {}

  ReturnType() = default;

  std::coroutine_handle<promise_type> handler_;
};

Jul 29, 2025

[C++] Coroutine examples

  1. C++ coroutines have two scopes: the caller scope and the coroutine scope (if you're familiar with Python yield/send, it's the same concept, but more subtle around object lifetimes and more flexible; compare Go goroutines/channels [stackful], or C++ fibers [stackful]).
  2. Suspending suspends the coroutine and returns to the caller; the caller uses the handle to resume the coroutine.
  3. Underneath it uses JMP instead of CALL, since these are stackless coroutines.
  4. Beware that even for `void coroutine(const int& arg)` containing co_await/co_yield/co_return, by the time it suspends, the caller's frame binding arg may have been unwound, destroying the referent.
    i.e.
    If a coroutine has a parameter passed by reference, resuming the coroutine after the lifetime of the entity referred to by that parameter has ended is likely to result in undefined behavior.

    The C++ core guidelines say not to use references at all.
  5. "Just avoid references" isn't comprehensive: e.g. std::span and std::string_view, even though they are not references, can still face the same dangling issue after co_await/co_yield/co_return.
    Thus, a better contract is:
    Pass by Value with Owning Types.
  6. The way we ensure the caller doesn't compile is by making Co non-moveable, and accepting it by value in the co_await implementation in the promise type (see the sketch after this list).

    This means the only way to await it is if you do so immediately in the same full expression as the function call, so that guaranteed copy elision can kick in. 

    Because temporaries aren't destroyed until that full expression has been evaluated, the lifetimes work out perfectly using the usual language rules around lifetimes.

    Design TIP:
    Eliminate the problem with references by declaring an entire pattern of code illegal.
      Co<void> UseInt_Async(const int& x) {
         printf("%d\n", x);
         co_await DoSomething();
         printf("%d\n", x);
      }
      
      k3::Co<void> UseInt_Async(const int&);
      // Works fine
      co_await UseInt_Async(17);
        
      // Compiler error: k3::Co<void> is not moveable.
      k3::Co<void> co = UseInt_Async(17);
      co_await std::move(co);
  7. What if we want to use above pattern?
    Indirection, just like std::bind
    k3::Co<void> UseInt_Async(const int&);
    // Totally safe
    k3::Future<void> future{UseInt_Async, 17};
    printf("I created the future!\n");
    co_await std::move(future).Run();
  8. Co<void> has the interface concept of awaitable, as below.
  9.  Fan out or race; the implementations are easy to imagine.
      
    // Run all concurrently, finishing once all finish.
    k3::Co<void> FanOut(std::vector<k3::Future<void>> futures);
    // Run all concurrently, finishing when the first finishes.
    template <typename T>
    k3::Co<T> Race(std::vector<k3::Future<T>> futures);
    

  10. co_await awaitable;
    struct awaitable {
        // false: suspend; true: do not suspend.
        bool await_ready() { return false; }
        // What to do when suspended.
        void await_suspend(std::coroutine_handle<> h) {}
        // Calling .resume() on the coroutine_handle captured in await_suspend
        // lands here, and afterwards the coroutine resumes; the co_await
        // expression takes its value from await_resume's return, e.g.:
        int await_resume() { return 42; }
    };
    co_yield
    co_return
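
A minimal sketch of the non-moveable pattern from item 6; this Co is my own stripped-down stand-in, not the real k3::Co:

#include <coroutine>

template <typename T>
class Co {
 public:
  struct promise_type {
    Co get_return_object() { return Co{}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
    // Accept the awaited Co *by value*: only a prvalue produced in the same
    // full expression can bind here, thanks to guaranteed copy elision.
    std::suspend_never await_transform(Co) { return {}; }
  };

  Co(Co&&) = delete; // non-moveable (and therefore non-copyable)

 private:
  Co() = default;
};

Co<void> UseInt_Async(const int& x) { (void)x; co_return; }

Co<void> Caller() {
  co_await UseInt_Async(17);        // OK: awaited in the same full expression
  // Co<void> c = UseInt_Async(17); // error: call to deleted move constructor
}

int main() { Caller(); }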

Stack:

Goroutines have dynamically sized stacks, defaulting to 2 KB. Linux threads have a default stack size of 8 MB (often 64 KB in production).
So in coroutine, how do we avoid stack overflow?

tail call comes to the rescue;
c++ coroutine guarantees a tail call into the other coroutine, and then you can do it again on the way back out. There's no need to have a stack frame to return to; everything is a JUMP instruction.
This is the only guaranteed[as of c++20] tail call in the standard; it's actually kind of a unique mixing of abstraction levels.
Co<void> Foo();
Co<void> Bar() {
  // Tail call into the body of Foo.
  co_await Foo();
  // Tail call back from the body of Foo.
  [...]
}
This indicates that coroutines nested inside a coroutine all use the same stack; thus,
with enough nested resumptions, the stack can overflow.

c9 solution:
we are never allowed to have one coroutine directly resume another.

Possible solution to the Notify/Wait pattern:
class Event {
Co<void> Wait();
// If there is a waiter, it will start running concurrently
// on another thread. The calling thread continues on.
// Look at the co_await example below.
void Notify();
};
We could resolve this by breaking the assumption that there is only one thread available.
Instead, we could resume the coroutine on a different thread, letting it run concurrently with the notifier.


Avoid executors in the coroutine library design:

Don't offer unnecessary configurability.
The library should be agnostic to executors. All it has is its one thread-local queue of things that have been resumed by the running coroutine.
Of course individual things you wait for might have an executor internally. Like if you wait for an RPC to finish, you'll probably be resumed on one of a team of threads reading RPC replies from the network.
e.g.


// Hop to a specific executor. We need to run there because…
// look the co_await example below~
co_await Reschedule(my_executor);

Cancellation in library design

  • RPCs to stuck machines
  • Request hedging (idempotent operations)
  • Avoiding wasted work
  • Timeouts

The callee exists only to serve the caller. It must stop promptly if the caller loses interest.

  • the caller of a coroutine is always in control.
  • If the caller no longer wants the callee to run, the callee should stop running.
  • And it should do so promptly. Not after getting a response to its RPC. Not after a timeout expires. Immediately.
    All it should get to do is the kind of thing you want to do before you release a lock or unwind after an exception. Ensure internal invariants are restored; that kind of thing.
  • A coroutine's local automatic variables won't be destructed until it is resumed and runs to the end of the coroutine (they live in the coroutine frame on the heap).
  • If cancellation happens (that is: resume the coroutine after it suspended, passing a cancellation bit through the coroutine handle's promise; the awaitable's resume path checks the bit and abandons the original code path), the coroutine still continues to its end. A minimal sketch follows below.
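
A minimal sketch of that cancellation bit; all names here (Job, cancelled_, CheckCancel) are my own, not any real library's API:

#include <coroutine>
#include <cstdio>

struct Job {
    struct promise_type {
        bool cancelled_ = false;
        Job get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> handle_;
    ~Job() { if (handle_) handle_.destroy(); }
};

// Awaitable that consumes the cancellation bit when resumed.
struct CheckCancel {
    std::coroutine_handle<Job::promise_type> handle_;
    bool await_ready() { return false; }
    void await_suspend(std::coroutine_handle<Job::promise_type> h) { handle_ = h; }
    bool await_resume() { return !handle_.promise().cancelled_; } // false => cancelled
};

Job Work() {
    if (!co_await CheckCancel{}) {
        std::puts("cancelled: restore invariants, then stop promptly");
        co_return; // still runs to its end, just down the short path
    }
    std::puts("not cancelled: continue the original code path");
}

int main() {
    Job job = Work();                        // runs until CheckCancel suspends it
    job.handle_.promise().cancelled_ = true; // the caller loses interest
    job.handle_.resume();                    // resume; the awaitable sees the bit
}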

Notes on structured concurrency; idea:
Children should not out-live parent.
Prefer structured concurrency wherever possible.
Good API design also helps with usability and safety.

Handler:

// task, aka the handle wrapper, is created through the promise_type instance's
// get_return_object() and destructed once the coroutine function returns to the caller.
// Destructing the task does not mean the promise_type instance is destructed;
// it still resides on the heap, inside the coroutine frame.
struct task
{
    struct promise_type
    {
        // Always called first when the coroutine is called.
        task get_return_object() { return {}; }
        // Always called second when the coroutine is called.
        std::suspend_never initial_suspend() { return {}; }
        // Always called last, when the coroutine is finished (i.e. done).
        std::suspend_never final_suspend() noexcept { return {}; }
        // Called when returning from the coroutine with void.
        // (A promise defines either return_void() or return_value(), never both.)
        void return_void() {}

        std::suspend_always yield_value(T value) noexcept
        {
          // Used for co_yield: once co_yield is called, the yielded value is
          // installed here, and the coroutine suspends afterwards because the
          // returned type is `std::suspend_always`.
          return {};
        }

        void return_value(T value)
        {
          // Used for co_return expr: the value of expr is installed here.
          // The coroutine then proceeds to final_suspend(); unlike yield_value(),
          // the result of return_value() is not awaited.
        }
        void unhandled_exception() {}
    };

    task(std::coroutine_handle<promise_type> h) {
      // store the coroutine_handle inside the task.
    }
};

#include <iostream>
#include <coroutine>

// 1. The Promise Type
struct MyTaskPromise {
    // The compiler calls this to get the return object.
    // We return a 'MyTask' object, and in its constructor, we pass it the handle.
    struct MyTask get_return_object();

    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() noexcept {}
};

// 2. The Task Type
// This is the wrapper around the coroutine_handle.
struct MyTask {
    using promise_type = MyTaskPromise;
    std::coroutine_handle<MyTaskPromise> handle;

    // The constructor takes the handle from the promise's get_return_object() call.
    MyTask(std::coroutine_handle<MyTaskPromise> h) : handle(h) {}
};

// Now we can define get_return_object() because MyTask is defined.
MyTask MyTaskPromise::get_return_object() {
    // This is the key line: we construct the MyTask object with the handle
    // to the coroutine that owns this promise.
    return MyTask{std::coroutine_handle<MyTaskPromise>::from_promise(*this)};
}

// 3. A simple awaitable to demonstrate suspension.
struct Awaitable {
    bool await_ready() { return false; }
    void await_resume() {}

    // This await_suspend returns `true`, which suspends the coroutine
    // and returns control to the caller (main).
    bool await_suspend(std::coroutine_handle<>) noexcept {
        std::cout << "-> Awaitable: Coroutine is suspending." << std::endl;
        return true;
    }
};

// 4. The Coroutine Function
MyTask MyCoroutine() {
    std::cout << "Coroutine: Starting." << std::endl;
    co_await Awaitable{}; // Coroutine suspends here.
    std::cout << "Coroutine: Resumed and finishing." << std::endl;
    co_return;
}

// 5. The Caller (main function)
int main() {
    // 1. The call to MyCoroutine() returns a 'MyTask' object.
    // This object contains the handle to the suspended coroutine.
    MyTask task = MyCoroutine();

    std::cout << "\nMain: Coroutine is suspended. I am the caller.\n" << std::endl;

    // 2. We can now use the handle stored inside the 'task' object
    // to resume the coroutine.
    if (task.handle) {
        std::cout << "Main: Resuming the suspended coroutine." << std::endl;
        task.handle.resume();
    }

    std::cout << "\nMain: Coroutine has finished its execution." << std::endl;

    // 3. Clean up the coroutine's memory.
    if (task.handle) {
        task.handle.destroy();
    }

    return 0;
}



co_yield Example:

#include <coroutine>
#include <iostream>
#include <optional>

template<std::movable T>
class Generator
{
public:
    struct promise_type
    {
        Generator<T> get_return_object()
        {
          std::cout << "get_return_object()\n"; // -2; Generator created with pointer to the heap
          return Generator{Handle::from_promise(*this)};
        }
        static std::suspend_always initial_suspend() noexcept
        {
          std::cout << "suspend_always initial_suspend()\n"; // -4 init. suspend. Go to the caller.
          return {};
        }
        static std::suspend_always final_suspend() noexcept
        {
          std::cout << "suspend_always final_suspend()\n"; // -15 coroutine ends, call this and suspend.
          // Back to the caller. Go to (13)
          return {};
        }
        std::suspend_always yield_value(T value) noexcept
        {
          std::cout << "yield_value() : " << value << "\n"; // -9, suspend. Go to the caller.
          current_value = std::move(value);
          return {};
        }
        // Disallow co_await in generator coroutines.
        void await_transform() = delete;
        [[noreturn]]
        static void unhandled_exception() { throw; }

        std::optional<T> current_value;
    };

    using Handle = std::coroutine_handle<promise_type>;

    explicit Generator(const Handle coroutine) :
        m_coroutine{coroutine}
    {
      std::cout << "Generator constructor\n"; // -3
    }

    Generator() = default;
    ~Generator()
    {
        // final_suspend() returns std::suspend_always, so the frame is never
        // freed automatically; destroy it whenever we still own a handle.
        // (The move operations below null out the handle, preventing double free.)
        if (m_coroutine)
            m_coroutine.destroy();
        std::cout << "Generator destructor\n";
    }

    Generator(const Generator&) = delete;
    Generator& operator=(const Generator&) = delete;

    Generator(Generator&& other) noexcept :
        m_coroutine{other.m_coroutine}
    {
      std::cout << "Generator move constructor\n";
      other.m_coroutine = {};
    }
    Generator& operator=(Generator&& other) noexcept
    {
      std::cout << "Generator assign operator=\n";
        if (this != &other)
        {
            if (m_coroutine)
                m_coroutine.destroy();
            m_coroutine = other.m_coroutine;
            other.m_coroutine = {};
        }
        return *this;
    }

    // Range-based for loop support.
    class Iter
    {
    public:
        void operator++()
        {
          std::cout << "Iter ++ current value: " << *m_coroutine.promise().current_value << "\n";

          m_coroutine.resume(); // -11, resume from (9) yield's suspend; JUMPs into the coroutine right away!
          // The following cout does not run until the coroutine suspends again.

          // -13, follows (12), after (9) yield's suspend.
          std::cout << "Iter ++ resumed current value: " << *m_coroutine.promise().current_value << "\n";
        }
        const T& operator*() const
        {
          // -10, the caller prints the value.
          std::cout << " Iter* return value: " << *m_coroutine.promise().current_value << "\n";
          return *m_coroutine.promise().current_value;
        }
        bool operator==(std::default_sentinel_t) const
        {
          // -16, the caller calls this while the coroutine is suspended.
          std::cout << "Iter == called: !m_coroutine: " << (!m_coroutine) << " m_coroutine.done(): " << m_coroutine.done() << "\n";
          return !m_coroutine || m_coroutine.done();
        }

        explicit Iter(const Handle coroutine) :
            m_coroutine{coroutine}
        {}

    private:
        Handle m_coroutine;
    };

    Iter begin()
    {
      std::cout << "Iter begin\n"; // -5, caller range for calls begin()
        if (m_coroutine)
            m_coroutine.resume(); // -6 (4) suspended resumed. Go to coroutine.
        return Iter{m_coroutine};
    }

    std::default_sentinel_t end() { return {}; }

private:
    Handle m_coroutine;
};

template<std::integral T>
Generator<T> range(T first, const T last)
{
  // suspended right away since initial_suspend() returns std::suspend_always
  // returns to the caller with Generator instance
  std::cout << "Range\n"; // -7 from (6)
  while (first < last){
    std::cout << "Range first: " << first << "\n";
    co_yield first++; // -8, go to yield_value(), install the value of `first`, increment first, and suspend.
    // -12, resume from (11) operator++()
    std::cout << "Range after first++: " << first << "\n";
  }
  // -14, coroutine ends; suspend_always final_suspend() is called.
}

int main()
{
  std::cout << "Start for loop\n"; // -1
  // Generator is destructed only once due to range loop extends lifetime.  
  for (const char i : range(65, 67))
      std::cout << i << "\n";
  // -17, out of range for loop scope, 
  // range(65, 67) returned `Generator` destructs.

  std::cout << "End for loop\n";
  std::cout << '\n';
}
stdout:
Start for loop
get_return_object()
Generator constructor
suspend_always initial_suspend()
Iter begin
Range
Range first: 65
yield_value() : 65
Iter == called: !m_coroutine: 0 m_coroutine.done(): 0
 Iter* return value: 65
A
Iter ++ current value: 65
Range after first++: 66
Range first: 66
yield_value() : 66
Iter ++ resumed current value: 66
Iter == called: !m_coroutine: 0 m_coroutine.done(): 0
 Iter* return value: 66
B
Iter ++ current value: 66
Range after first++: 67
suspend_always final_suspend()
Iter ++ resumed current value: 66
Iter == called: !m_coroutine: 0 m_coroutine.done(): 1
Generator destructor
End for loop


co_await Example:

#include <coroutine>
#include <iostream>
#include <stdexcept>
#include <thread>
 
auto switch_to_new_thread(std::jthread& out) {
    std::cout << "switch_to_new_thread start\n";
    struct awaitable {
        std::jthread* p_out;

        bool await_ready() {
            std::cout << "await ready\n";
            return false;
        }

        void await_suspend(std::coroutine_handle<> h) {
            std::cout << "await_suspend\n";
            std::jthread& out = *p_out;
            if (out.joinable())
                throw std::runtime_error("Output jthread parameter not empty");
            out = std::jthread([h] {
                std::cout << "calling handler.resume()\n";
                h.resume();
            });

            // Potential undefined behavior: accessing potentially destroyed *this
            // std::cout << "New thread ID: " << p_out->get_id() << '\n';
            std::cout << "New thread ID: " << out.get_id() << '\n'; // this is OK
        }

        void await_resume() {
            std::cout << "await_resume\n";
        }
    };

    std::cout << "switch_to_new_thread about to return\n";

    return awaitable{&out};
}
 
struct task {
    struct promise_type {
        task get_return_object() {
            std::cout << "get_return_object()\n";
            return {};
        }

        std::suspend_never initial_suspend() {
            std::cout << "initial_suspend()\n";
            return {};
        }

        std::suspend_never final_suspend() noexcept {
            std::cout << "final_suspend()\n";
            return {};
        }

        void return_void() {
            std::cout << "return_void\n";
        }

        void unhandled_exception() {}
    };

    ~task() {
        std::cout << "task destruct\n";
    }
};
 
task resuming_on_new_thread(std::jthread& out) {
    std::cout << "Coroutine started on thread: " << std::this_thread::get_id() << '\n';
    co_await switch_to_new_thread(out);
    // awaiter destroyed here
    std::cout << "Coroutine resumed on thread: " << std::this_thread::get_id() << '\n';
}
 
int main() {
    std::jthread out;
    std::cout << "start\n";
    resuming_on_new_thread(out);
    std::cout << "ending main()\n";
}
stdout:
start
get_return_object()
initial_suspend()
Coroutine started on thread: 140248882116480
switch_to_new_thread start
switch_to_new_thread about to return
await ready
await_suspend
New thread ID: 140248877143744
task destruct
ending main()
calling handler.resume()
await_resume
Coroutine resumed on thread: 140248877143744
return_void
final_suspend()

Jul 21, 2025

[C++] P1787R6: Declarations and where to find them

P1787R6: Declarations and where to find them


https://timsong-cpp.github.io/cppwp/n4868/temp.local#7

https://timsong-cpp.github.io/cppwp/n4950/temp.local#7


https://godbolt.org/z/hrE3YEzPd

<source>:12:53: error: 'Write' is not a type

#include <cstdint>

class Foo {
  template <typename Write>
  void WriteNestedMessage(uint32_t field_number, Write write_message);

 protected:
  void Write();
};

template <typename Write>
void Foo::WriteNestedMessage(uint32_t field_number, Write write_message) {}

In the out-of-line definition, lookup in Foo's scope finds the member Foo::Write, which hides the template parameter Write ([temp.local]/7, linked above), hence "'Write' is not a type". Renaming either the template parameter or the member fixes it.

Jul 13, 2025

[C++][Rust] default Lifetime annotation from Rust

  1. The first rule is that the compiler assigns a lifetime parameter to each parameter that’s a reference. In other words, a function with one parameter gets one lifetime parameter: fn foo<'a>(x: &'a i32); a function with two parameters gets two separate lifetime parameters: fn foo<'a, 'b>(x: &'a i32, y: &'b i32); and so on.
  2. The second rule is that, if there is exactly one input lifetime parameter, that lifetime is assigned to all output lifetime parameters: fn foo<'a>(x: &'a i32) -> &'a i32.
  3. The third rule is that, if there are multiple input lifetime parameters, but one of them is &self or &mut self because this is a method, the lifetime of self is assigned to all output lifetime parameters. This third rule makes methods much nicer to read and write because fewer symbols are necessary.

Consider the idea and apply it to C++ with
[[clang::lifetimebound]]
#include <map>
#include <string>

using namespace std::literals;

// Returns m[key] if key is present, or default_value if not.
template<typename T, typename U>
const U &get_or_default(const std::map<T, U> &m [[clang::lifetimebound]],
                        const T &key, /* note, not lifetimebound */
                        const U &default_value [[clang::lifetimebound]]) {
  if (auto iter = m.find(key); iter != m.end()) return iter->second;
  else return default_value;
}

int main() {
  std::map<std::string, std::string> m;
  // warning: temporary bound to local reference 'val1' will be destroyed
  // at the end of the full-expression
  const std::string &val1 = get_or_default(m, "foo"s, "bar"s);

  // No warning in this case.
  std::string def_val = "bar"s;
  const std::string &val2 = get_or_default(m, "foo"s, def_val);

  return 0;
} 
Output:
<source>:19:55: warning: temporary bound to local reference 'val1' will be destroyed at the end of the full-expression [-Wdangling]
   19 |   const std::string &val1 = get_or_default(m, "foo"s, "bar"s);

Jul 11, 2025

[c++] why consteval since we have constexpr?

Reference:
https://vsdmars.blogspot.com/2022/06/cc20-consteval-constexpr.html

Use case: https://godbolt.org/z/a7v15TrWK

#include <type_traits>

constexpr bool ConstExpr() {
    return std::is_constant_evaluated();
}

consteval bool ConstEval() {
    return std::is_constant_evaluated();
}

void Fun(bool b) {}

int main() {
  const bool cbool = ConstExpr(); // True
  ConstExpr(); // False; here we can't force ConstExpr() to run as const expression.
  bool cbool_2 = ConstEval(); // True
  ConstEval(); // True
  Fun(ConstEval());
}

May 15, 2025

[Book] Programming massively parallel processors(PMPP) - Wen-mei Hwu, reading minute - ch1 - ch6

Reference:

CUDA C++ Programming Guide
NVIDIA Nsight Compute CUDA code optimization

[virtual memory] recap

Intel TBB
Intel TBB Task Scheduler
[oneTBB] accessor note

Error handling recap; TLPI
EINTR and What It Is Good For

kernal namespace recap
https://vsdmars.blogspot.com/2018/12/linuxnamespace-wrap-up.html
https://vsdmars.blogspot.com/2018/12/linuxnamespace-mount-mnt.html
https://vsdmars.blogspot.com/2018/06/linuxkernel-namespace.html

cache
[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan
[Go][design] high through-put low contention in memory cache

Locking
[futex] futex skim through
[Concurrency] [C++][Go] Wrap up - 2018
[C++] memory model in depth; memory spilling.

Algorithm/Implementation
Under my github/leetcode project



Software abstraction (CUDA runtime)

A similar idea exists in, e.g., Golang (ref: [Go][note] Analysis of the Go runtime scheduler paper note); the difference comes from Go/C++ being general-purpose languages running on the CPU (sequential), while CUDA/OpenCL are C extensions running on the GPU. As the underlying hardware architectures differ, so do the abstractions.


SPMD: single-program multiple-data
host: CPU-based code
kernel: GPU device code / function; more details later. Basically the same code / IR run in parallel.
grid: a group of threads
_h: suffix for a host variable in CPU code
_d: suffix for a device variable in CPU code
__host__: callable from host, executed on host, executed by a host thread (e.g. a Linux thread).
__global__: callable from host or device, executed on device, executed by a grid of device threads.
__device__: callable from device, executed on device, executed by the calling device thread.
If a function is declared with both __host__ and __device__, NVCC generates two versions: one for host and one for device.
block: sized in multiples of 32 (for hardware-efficiency reasons); all blocks are the same size, which sets how many GPU threads a block holds. Threads in a block can execute in any order with respect to each other.
SM: streaming multiprocessor; each SM has several processing units called CUDA cores. It is designed to execute all threads in a warp following the single-instruction multiple-data (SIMD) model.
HBM: high-bandwidth memory
Warp: a warp groups 32 threads together; a block of threads is partitioned into warps of 32 threads each. Scheduling is warp-based. Also think of it as single-instruction, multiple-threads.

FLOP: floating-point operations
FLOP/B: FLOP-to-byte ratio
GPU global memory bandwidth: 1555 GB/second; 1555 * 0.25 (FLOP/B) = 389 GFLOPS

Built-in read-only variables (the book's phone-system analogy):

 blockIdx: the area code
 blockDim: the number of threads per block (the size of the local number space)
 threadIdx: the local phone line
These three variables let each thread work out which data element it is responsible for, as sketched below.
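
The standard indexing idiom, as a minimal sketch (the kernel name vecAdd and the data are my own):

__global__ void vecAdd(const float* a_d, const float* b_d, float* c_d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // global element index
  if (i < n) // guard: the grid may be larger than the data
    c_d[i] = a_d[i] + b_d[i];
}
// Launch with a ceil-divide so every element is covered:
// vecAdd<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);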


OUR_GLOBAL_FUNC<<<number of block, threads per block>>>(args...);


Thread Scheduling

Block scheduling

When a kernel is called, the CUDA runtime launches a grid of threads to execute the kernel code. These threads are assigned to SMs on a block-by-block basis: all threads in a block are simultaneously assigned to the same SM. Some blocks are reserved for the system to execute, so not all of an SM's blocks are scheduled to the user kernel.
Multiple blocks are likely to be simultaneously assigned to the same SM. The idea behind warp scheduling is that the threads inside the same warp run the same instruction stream (same kernel), so fetching an instruction is a one-time effort; also, the data that threads in the same warp access is linear, and therefore prefetchable/cache friendly.
Moreover, threads in the same block can interact with each other in ways that threads across different blocks cannot, such as barrier synchronization.

synchronization / transparent scalability

  __syncthreads() blocks until every thread in the same block reaches the code location. If a __syncthreads() statement is present, it must be executed by all threads in the block. i.e.
__global__ void incorrect_barrier_example(int n) {
    if (threadIdx.x % 2) {
        __syncthreads(); // sync point-1
    } else {
        __syncthreads(); // sync point-2
    }
}
Wrong, because not all threads run into the same barrier synchronization point.
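A corrected sketch (my own minimal example): keep the divergent work inside the branch, and let every thread in the block reach the single barrier.

__global__ void correct_barrier_example(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && threadIdx.x % 2) // divergent work, but no barrier inside
    data[i] *= 2.0f;
  __syncthreads(); // one barrier, executed by all threads in the block
}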
Not only do all threads in a block have to be assigned to the same SM, they also need to be assigned to that SM simultaneously: i.e. a block can begin execution only when the runtime system has secured all the resources needed by all threads in the block to complete execution.

The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability.


Control Divergence

The execution works well when either all threads in a warp execute the if-path or all execute the else-path. Otherwise, the warp has to make two passes over the code: one pass where the cores on the if-path execute while the cores on the else-path are no-ops, and another pass where the else-path cores execute while the if-path cores are no-ops. On older architectures these two passes run in sequence; on newer architectures they can run in parallel, which is called independent thread scheduling.
Because of this, do not branch on threadIdx; use the data itself for divergence control, which also relates to data locality in cache. One important fact: the performance impact of control divergence decreases as the size of the vectors being processed increases.
One cannot assume that all threads in a warp have the same execution timing (even though they run the same fetched instructions); use the __syncwarp() barrier synchronization instead.



Latency tolerance

Simple on a CPU: context switches on a single core (due to limited resources: registers, cache, etc.) make sure code runs in a preemptive-scheduling fashion.
An SM, likewise, only has enough execution units to execute a subset of all the threads assigned to it at any point in time.
In recent SMs, each SM can execute instructions for a small number of warps at any given point in time.
GPU SMs achieve zero-overhead scheduling by holding all the execution states for the assigned warps in hardware registers, so there is no need to save and restore state when switching from one warp to another.
This allows the GPU to oversubscribe threads to SMs.
Automatic/Local variables declared in the kernel are placed into registers.
Each SM in A100 GPU has 65,536 registers.
65536/2048(threads) = 32 registers per thread/kernel.

In some cases, the compiler may perform register spilling to reduce the register requirement per thread and thus raise the level of occupancy. However, this can increase latency, due to needing to fetch data from memory instead of directly from a register.

The cudaDeviceProp struct has a bunch of fields representing the hardware SPEC, e.g.:
 multiProcessorCount: number of multiprocessors on the device
 clockRate: clock frequency in kilohertz
 regsPerBlock: 32-bit registers available per block
 warpSize: warp size in threads
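
A minimal query sketch using the runtime call cudaGetDeviceProperties (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  std::printf("SMs: %d, clock: %d kHz, regs/block: %d, warp size: %d\n",
              prop.multiProcessorCount, prop.clockRate,
              prop.regsPerBlock, prop.warpSize);
}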


Variable declaration scope and lifetime

automatic scalar variables    [mem] register    [scope] thread    [lifetime] grid
automatic array variables     [mem] local       [scope] thread    [lifetime] grid
__device__ __shared__         [mem] shared      [scope] block     [lifetime] grid
__device__                    [mem] global      [scope] grid      [lifetime] application
__device__ __constant__       [mem] constant    [scope] grid      [lifetime] application
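
A sketch mapping the table rows to declarations (all names are my own):

__device__ int g_counter;      // global memory, grid scope, application lifetime
__constant__ float coeffs[16]; // constant memory, grid scope, application lifetime

__global__ void kernel() {
  int x = 0;                   // register, thread scope, grid lifetime
  float scratch[8];            // local memory, thread scope, grid lifetime
  __shared__ float tile[64];   // shared memory, block scope, grid lifetime
}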

API





[C++] global const variable initialization

Reference:

[C++][C++20] consteval / constexpr
[C++/Rust] use of thread_local in code
[Note] linking notes


Initializing order

  1. Perform constant initialization and zero initialization
  2. Perform dynamic initialization
  3. Start executing main()  How main() is executed on Linux
  4. Perform dynamic initialization of function-local static variables, as needed.
  5. It is impossible to read uninitialized global memory.
  6. Destroy function-local static variables that were initialized in reverse order
  7. Destroy other globals in reverse order.

header.h
inline int global = compute();
// Always OK: within this header the order is top to bottom, and `h` is inline
// (the compiler takes care of ordering its initialization).
inline int h = global;

tu.cpp
#include "header.h"
int a = global; // OK only because the header defining `global` is included above; across TUs the order is otherwise not guaranteed.

Templated variables are initialized at some point before main(); their dynamic initialization is unordered.

template<typename T>
int type_id = get_id<T>();

template<typename T>
struct templ {
  static inline int type_id = get_id<T>();
};

Compiler is allowed to perform dynamic initialization at compile-time.

int f(int i) {
  return 2 * i; // not constexpr
}

int the_answer = f(21); // might happen at compile-time; even not constexpr.

Compiler is allowed to perform dynamic init. after main() has started executing.

i.e. an unused global variable might not be initialized at all.

const auto start_time = std::chrono::system_clock::now();

int main() {
  auto t = start_time; // might be init. at this point.
}

The static initialization order fiasco

Dynamic initialization order of globals in different translation units is not specified.

Guideline
Unless otherwise allowed by its documentation, do not access global state in the constructor of another global.

* Solution 1

Globals initialized by some constant expression are initialized during static initialization without dynamic initialization.
Constant expressions cannot access globals that are dynamically initialized.

Guideline
Whenever possible, make global variables constexpr.

constexpr float pi = 3.14;
constexpr std::size_t threshold = 3;

Guideline
Whenever possible, use constant initialization.

// https://en.cppreference.com/w/cpp/thread/mutex/mutex
std::mutex mutex; // Has constexpr constructor. 


Always look into code paths that call functions which are not constexpr.

e.g.
constexpr int compute(bool use_default) { // note: `default` is a keyword, so rename
  if (use_default)
    return 42; // constexpr
  else
    return std::getchar(); // runtime.
}

int default_val = compute(true); // constant-initialized
int non_default = compute(false); // runtime

C++20

constinit = constexpr - const

  • variable is initialized using constant initialization (or error)
  • variable is not const
thus:
constinit std::mutex mutex; // OK.

constinit int default_val = compute(true); // constant-initialized; OK.
constinit int non_default = compute(false); // runtime, thus error.

Guideline
Declare global variables constinit to check for initialization problems.

Guideline
Try to add a constexpr constructor to every type.
e.g.
Default constructors of containers (with default allocator)
Default constructors of resource handles (files, sockets, ...)

Guideline
Do not use constinit if need to do lazy initialization.

* Solution 2:

Lazy initialization.

Global& global() {
  static Global g;
  return g;
}

// The global's init. order is not determined.
Global& g_global = global(); // Bad because it can't be constinit

Guideline
Never cache a (reference to) a function-local static in a global variable.
template<typename Tag, typename T>
class lazy_init {
  public:
    constexpr lazy_init() = default;

    T& get() {
      static T global;
      return global;
    }

    T& operator*() { return get();}
    T* operator->() { return &get();}
};

constinit lazy_init<TAG, Logger> global; // OK
global->log(); // The first call takes time, beware.
logger.h
class Logger{...};

extern constinit lazy_init<TAG, Logger> global;
logger.cpp
constinit lazy_init<TAG, Logger> global;

Global variable destruction

Global variables are destroyed in reverse dynamic initialization order.
Beware: constinit variables are always initialized first (that is their purpose), and are therefore destroyed last.
A a;
constinit B b;

void func(){
  static C c;
}

int main() {
  func();
}
init. b
init. a
init. c
destroy c
destroy a
destroy b


Destruction order of globals in different TUs is not specified.

Guideline
Unless otherwise allowed by its documentation, do not access global state in the destructor of another global.
This applies to constinit globals.

Rule
  • The order of dynamic initialization is not specified.
  • Exception: within one translation unit, variables are initialized from top to bottom. So:
    1) you must include the header that declares the global before using it, and then
    2) every global defined in that header is initialized before your global.
  • Do not use function-local static variables if there's a chance they might be used after main()
  • Do not use function-local static variables.

template<typename Tag, typename T>
class lazy_init {
  public:
    constexpr lazy_init() = default;

    T& get() {
      // Technique: 1) thread-safe, 2) initializes the data member exactly once.
      static bool dummy = (storage_.initialize(), true);
      return storage_.get();
    }
  private:
    storage<T> storage_;
};

constinit lazy_init<struct global_tag, std::string> global;

Nifty counters

  • Ensure the nifty counter object is included in the definition file.
  • The nifty counter approach doesn't work for templated objects.

header.h
extern constinit nifty_init<Global> global_init; // declaration; definition is in some .cpp file.
// static int dummy = (global_init.initialize(), 0);
static nifty_counter_for<global_init> dummy; // definition
inline constinit Global& global = global_init.reference();

constinit Global copy = global; // compiler error
a.cpp
#include "header.h"
int a = global->foo();
b.cpp
#include "header.h"
int b = global->foo();

template<typename T>
class nifty_init {
public:
 constexpr nifty_init() = default;
 
 void initialize() {
 	if (counter_++ == 0)
		storage_.initialize();
 }
 
 void destroy() {
	if (--counter_ == 0)
		storage_.get().~T();
 }
 
 constexpr T& reference() { return storage_.get(); }

private:
 int counter_ = 0;
 storage<T> storage_;
};


template<auto& NiftyInitT>
struct nifty_counter_for {
 
  nifty_counter_for() {
  	NiftyInitT.initialize();
  }
  
  ~nifty_counter_for() {
    NiftyInitT.destroy();
  }
};

Conclusion

  • constinit is not always applicable
  • lazy initialization has to leak
  • nifty counters are black magic
  • Or, simply don't do anything before or after main()


Manual initialization (Google's way)

Guideline
Use manual initialization either for all globals in your project, or none.
And remember to initialize them all!


header.h
template<typename T>
class manual_init {
public:
  constexpr manual_init() = default;
  void initialize() { storage_.initialize(); }
  void destroy() { storage_.destroy(); }
  
  T& get() { return storage_.get(); }
  
private:
 storage<T> storage_;
};


template<auto& ... Init>
struct scoped_initializer {
  scoped_initializer() {
    (Init.initialize(), ...); // fold: initialize all, left to right
  }

  ~scoped_initializer() {
    (Init.destroy(), ...);
  }
};

main.cpp
#include "header.h"

constinit manual_init<Global> global;

int main() {
  scoped_initializer<global> initializer;
  global.get().foo();
}