May 15, 2025

[Book] Programming Massively Parallel Processors (PMPP) - Wen-mei Hwu, reading minute - ch1 - ch6

Reference:

CUDA C++ Programming Guide
NVIDIA Nsight Compute CUDA code optimization

[virtual memory] recap

Intel TBB
Intel TBB Task Scheduler
[oneTBB] accessor note

Error handling recap; TLPI
EINTR and What It Is Good For

kernel namespace recap
https://vsdmars.blogspot.com/2018/12/linuxnamespace-wrap-up.html
https://vsdmars.blogspot.com/2018/12/linuxnamespace-mount-mnt.html
https://vsdmars.blogspot.com/2018/06/linuxkernel-namespace.html

cache
[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan
[Go][design] high through-put low contention in memory cache

Locking
[futex] futex skim through
[Concurrency] [C++][Go] Wrap up - 2018
[C++] memory model in depth; memory spilling.

Algorithm/Implementation
Under my github/leetcode project



Software abstraction (CUDA runtime)

A similar idea exists in Golang, ref: [Go][note] Analysis of the Go runtime scheduler paper note. The difference comes from Go/C++ being general-purpose languages running on the CPU (sequential), while CUDA/OpenCL are C extensions running on the GPU. Because the underlying hardware architectures differ, the abstractions differ as well.


SPMD single-program multiple-data 
host CPU-based code
kernel GPU device code/function, more details later. Basically the same code/IR run in parallel by many threads.
grid the group of threads launched by one kernel call
_h host variable in CPU code
_d device variable (pointer to device memory) used in CPU code
__host__ callable from host, executed on host, executed by a host thread (e.g. a Linux thread).
__global__ callable from host or device, executed on device, executed by a grid of device threads
__device__ callable from device, executed on device, executed by the calling device thread.
If a function is declared with both the __host__ and __device__ qualifiers, NVCC generates two versions, one for the host and one for the device (see the sketch after this list).
block a thread block's size is typically a multiple of 32 (for hardware-efficiency reasons); all blocks in a grid have the same size, i.e. the same number of GPU threads. Threads in a block can execute in any order with respect to each other.
SM streaming multiprocessor; each SM has several processing units called CUDA cores and is designed to execute all threads in a warp following the single-instruction, multiple-data (SIMD) model.
HBM high-bandwidth memory
Warp a warp groups 32 threads together; a block of threads is partitioned into warps of 32 threads each. Scheduling is done per warp. Think of it as single-instruction, multiple-threads (SIMT).
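
A minimal sketch of how these qualifiers combine; the function names are my own, not from the book:

#include <cuda_runtime.h>

// __host__ __device__: NVCC compiles this twice, once for the host and once for the device.
__host__ __device__ float square(float x) { return x * x; }

// __device__: callable only from device code, runs on the device, in the caller's thread.
__device__ float scaled_square(float x, float s) { return square(x) * s; }

// __global__: a kernel, launched from the host and executed by a grid of device threads.
__global__ void squareKernel(const float* in_d, float* out_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out_d[i] = scaled_square(in_d[i], 2.0f);
}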

FLOP floating-point operations
FLOP/B FLOP to byte ratio.
GPU global memory bandwidth (A100): 1555 GB/s; 1555 * 0.25 (FLOP/B) ≈ 389 GFLOPS, the memory-bound performance ceiling for a kernel with that ratio.
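
Written out (assuming the 0.25 FLOP/B comes from 2 FLOPs per two 4-byte float loads, as in a naive matrix-multiply inner loop):

\[
\text{attainable FLOPS} \;\le\; \text{bandwidth} \times \frac{\text{FLOP}}{\text{byte}}
= 1555\,\text{GB/s} \times 0.25\,\text{FLOP/B} \approx 389\,\text{GFLOPS}
\]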

const readonly variables

 blockIdx block index within the grid (analogy: area code)
 blockDim number of threads per block, in each dimension
 threadIdx thread index within the block (analogy: local phone line)
These three built-in variables let each thread of the kernel work out which data element it is responsible for, e.g. i = blockIdx.x * blockDim.x + threadIdx.x.


OUR_GLOBAL_FUNC<<<number of blocks, threads per block>>>(args...);
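
Putting the launch syntax and the index variables together, a hedged sketch (vecAdd and the block size of 256 are my own choices, not taken from the notes above):

// Kernel: each thread derives its global index from blockIdx/blockDim/threadIdx.
__global__ void vecAdd(const float* a_d, const float* b_d, float* c_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // "area code" * block size + "local line"
    if (i < n) c_d[i] = a_d[i] + b_d[i];             // boundary check for the last block
}

void vecAddHost(const float* a_h, const float* b_h, float* c_h, int n) {
    float *a_d, *b_d, *c_d;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&a_d, bytes);
    cudaMalloc((void**)&b_d, bytes);
    cudaMalloc((void**)&c_d, bytes);
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                                // a multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a_d, b_d, c_d, n);

    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
}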


Thread Scheduling

Block scheduling

When a kernel is called, the CUDA runtime launches a grid of threads that execute the kernel code. These threads are assigned to SMs on a block-by-block basis: all threads in a block are simultaneously assigned to the same SM. Some block slots are reserved for the system to execute, so not all of an SM's blocks are scheduled to the user kernel.
Multiple blocks are likely to be simultaneously assigned to the same SM. The idea behind warp scheduling is that the threads in the same warp run the same instructions (the same kernel), so each instruction is fetched once for the whole warp. In addition, the data the threads of a warp access is typically laid out linearly, and is therefore prefetchable and cache friendly.
Moreover, threads in the same block can interact with each other in ways that threads across different blocks cannot, such as barrier synchronization.

synchronization / transparent scalability

__syncthreads() blocks until every thread in the same block reaches that code location. If a __syncthreads() statement is present, it must be executed by all threads in the block, e.g.:
__global__ void incorrect_barrier_example(int n) {
	if (threadIdx.x % 2) {
		__syncthreads(); // sync point-1: reached only by odd threads
	} else {
		__syncthreads(); // sync point-2: reached only by even threads
	}
}
This is wrong because not all threads reach the same barrier synchronization point.
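A corrected sketch (the kernel body is my own illustration): the barrier sits outside the divergent branch, so every thread in the block reaches the same __syncthreads() call.

__global__ void correct_barrier_example(float* data_d, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	float v;
	if (threadIdx.x % 2) {
		v = 1.0f;              // odd threads take this path
	} else {
		v = 2.0f;              // even threads take this path
	}
	__syncthreads();           // single barrier, outside the branch: all threads in the block reach it
	if (i < n) data_d[i] = v;  // code here may rely on every thread having finished the branch
}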
Not only do all threads in a block have to be assigned to the same SM, they also need to be assigned to that SM simultaneously, i.e. a block can begin execution only when the runtime system has secured all the resources needed by all threads in the block to complete execution.

The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability.


Control Divergence

Execution works well when either all threads in a warp take the if-path or all take the else-path. Otherwise the warp has to run the code twice: in one pass, the cores holding the if-path threads execute while the else-path threads do no-ops; in the other pass, the if-path threads do no-ops while the else-path threads execute (all within the same warp). On older architectures these two passes run in sequence; on newer architectures they can run in parallel, which is called independent thread scheduling.
Because of this, avoid branching on threadIdx where possible; drive divergence from the data instead, which also relates to data locality in the cache. One important fact: the performance impact of control divergence decreases as the size of the vectors being processed increases.
One cannot assume that all threads in a warp have the same execution timing (even though they run the same fetched instructions). When warp-level synchronization is needed, use the __syncwarp() barrier.
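
A hedged sketch tying boundary-check divergence and __syncwarp() together; warpSum, the one-warp-per-block launch assumption, and the shared-memory buffer are my own illustration:

__global__ void warpSum(const float* in_d, float* out_d, int n) {
	// Assumes a launch with exactly 32 threads per block (one warp per block).
	__shared__ float buf[32];
	int i = blockIdx.x * 32 + threadIdx.x;

	float v = 0.0f;
	if (i < n) v = in_d[i];    // boundary check: only the warp straddling n diverges,
	buf[threadIdx.x] = v;      // so its relative cost shrinks as n grows
	__syncwarp();              // do not assume the warp runs in lockstep

	// Tree reduction within the warp; each round halves the active lanes
	// (another, intentional, form of divergence).
	for (int offset = 16; offset > 0; offset /= 2) {
		if (threadIdx.x < offset) buf[threadIdx.x] += buf[threadIdx.x + offset];
		__syncwarp();          // make this round's writes visible before the next round reads
	}
	if (threadIdx.x == 0) out_d[blockIdx.x] = buf[0];
}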



Latency tolerance

In the simple case, i.e. the CPU, context switching on a single core is needed because of limited resources (registers, cache, etc.) and is what keeps code running in a preemptive-scheduling fashion.
An SM, likewise, only has enough execution units to execute a subset of all the threads assigned to it at any point in time; in recent designs, each SM can execute instructions for only a small number of warps at any given point in time.
GPU SMs achieve zero-overhead scheduling by holding all the execution state of the assigned warps in hardware registers, so there is no need to save and restore state when switching from one warp to another.
This is what allows the GPU to oversubscribe threads to SMs.
Automatic (local) scalar variables declared in a kernel are placed into registers; automatic arrays go to local memory (see the table below).
Each SM in an A100 GPU has 65,536 registers:
65,536 / 2,048 (max threads per SM) = 32 registers per thread/kernel.

In some cases, the compiler may perform register spilling to reduce the per-thread register requirement and thus raise the level of occupancy. However, this can increase latency, because data then has to be fetched from memory instead of directly from registers.
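
Two knobs that interact with this trade-off, shown as a hedged sketch (the numbers 256, 4, and 32 are arbitrary): the __launch_bounds__() qualifier and nvcc's --maxrregcount flag.

// Tells the compiler this kernel is never launched with more than 256 threads per
// block and that at least 4 blocks per SM are desired; the compiler may then limit
// registers per thread (possibly spilling) to hit that occupancy.
__global__ void __launch_bounds__(256, 4) boundedKernel(float* x_d, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) x_d[i] *= 2.0f;
}

// Alternatively, cap registers per thread for the whole compilation unit:
//   nvcc --maxrregcount=32 kernel.cu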

The cudaDeviceProp struct has a number of fields that describe the hardware spec, e.g.:
 multiProcessorCount Number of multiprocessors on device
 clockRate Clock frequency in kilohertz
 regsPerBlock 32-bit registers available per block
 warpSize  Warp size in threads
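
A minimal host-side sketch that prints these fields for every device (the field names are from the CUDA runtime API; the layout of the printout is my own):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s\n", dev, prop.name);
        printf("  SMs:               %d\n", prop.multiProcessorCount);
        printf("  clock rate (kHz):  %d\n", prop.clockRate);
        printf("  regs per block:    %d\n", prop.regsPerBlock);
        printf("  warp size:         %d\n", prop.warpSize);
        printf("  max threads/block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}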


Variable declaration scope and lifetime

automatic scalar variables    [mem] register    [scope] thread    [lifetime] grid
automatic array variables     [mem] local       [scope] thread    [lifetime] grid
__device__ __shared__         [mem] shared      [scope] block     [lifetime] grid
__device__                    [mem] global      [scope] grid      [lifetime] application
__device__ __constant__       [mem] constant    [scope] grid      [lifetime] application
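
A hedged sketch pairing each row of the table with a declaration; the names, sizes, and kernel body are mine (note that __shared__ variables are normally declared inside the kernel, with __device__ being optional there):

__constant__ float coeff_c[4];      // constant memory: grid scope, application lifetime
__device__   float bias_d = 0.5f;   // global memory:   grid scope, application lifetime

__global__ void scopeKernel(const float* in_d, float* out_d, int n) {
	__shared__ float tile_s[256];   // shared memory: block scope, grid lifetime (assumes blockDim.x <= 256)
	float acc = 0.0f;               // automatic scalar: register, thread scope
	float local[4];                 // automatic array: local memory, thread scope

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	tile_s[threadIdx.x] = (i < n) ? in_d[i] : 0.0f;
	__syncthreads();

	for (int k = 0; k < 4; ++k) local[k] = coeff_c[k] * tile_s[threadIdx.x];
	for (int k = 0; k < 4; ++k) acc += local[k];
	if (i < n) out_d[i] = acc + bias_d;
}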

API




