Reference:
CUDA C++ Programming Guide
NVIDIA Nsight Compute CUDA code optimization
[virtual memory] recap
Intel TBB
Intel TBB Task Scheduler
[oneTBB] accessor note
Error handling recap; TLPI
EINTR and What It Is Good For
kernel namespace recap
https://vsdmars.blogspot.com/2018/12/linuxnamespace-wrap-up.html
https://vsdmars.blogspot.com/2018/12/linuxnamespace-mount-mnt.html
https://vsdmars.blogspot.com/2018/06/linuxkernel-namespace.html
cache
[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan
[Go][design] high through-put low contention in memory cache
Locking
[futex] futex skim through
[Concurrency] [C++][Go] Wrap up - 2018
[C++] memory model in depth; memory spilling.
Algorithm/Implementation
Under my github/leetcode project
Software abstraction (CUDA runtime)
A similar idea exists in Golang; ref:
[Go][note] Analysis of the Go runtime scheduler paper note. The difference comes from Go/C++ being general-purpose languages running on the CPU (sequential), while CUDA/OpenCL are C extensions running on the GPU. Since the underlying hardware architectures differ, the abstractions differ as well.
SPMD single-program multiple-data
host CPU-based code
kernel GPU device code/function, more details later. Basically the same code/IR run in parallel.
grid the group of threads launched by a kernel call
_h suffix for a host variable in CPU code
_d suffix for a device variable referenced in CPU code
__host__ callable from host, executed on host, executed by a host thread (e.g. a Linux thread).
__global__ callable from host or device, executed on device, executed by a grid of device threads.
__device__ callable from device, executed on device, executed by the calling device thread.
If a function is declared with both the __host__ and __device__ qualifiers, NVCC generates two versions: one for the host and one for the device (see the sketch after the launch syntax below).
block its size should be a multiple of 32 (for hardware-efficiency reasons); all blocks in a grid are the same size, measured in number of GPU threads. Threads in a block can execute in any order with respect to each other.
SM Streaming multiprocessor; each SM has several processing units called CUDA cores. It is designed to execute all threads in a warp following the single-instruction multiple-data (SIMD) model.
HBM high-bandwidth memory
Warp a warp groups 32 threads together. Thus a block of threads is partitioned into warps of 32 threads each. Scheduling is warp-based. Also think of it as single-instruction, multiple-threads (SIMT).
FLOP floating-point operations
FLOP/B FLOP to byte ratio.
GPU global memory bandwidth (e.g. A100): 1555 GB/s; 1555 GB/s * 0.25 FLOP/B = 389 GFLOPS
const readonly variables
blockIdx; block index within the grid (analogy: area code)
blockDim; number of threads per block
threadIdx; thread index within the block (analogy: phone line)
These three variables let the kernel determine which part of the data it is working on.
OUR_GLOBAL_FUNC<<<number of blocks, threads per block>>>(args...);
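
To make the launch syntax and the blockIdx/blockDim/threadIdx indexing concrete, here is a minimal vector-add sketch (kernel and variable names are mine, not from the guide); it also shows a function compiled for both host and device via __host__ __device__:

#include <cstdio>
#include <cuda_runtime.h>

// NVCC compiles this once for the host and once for the device.
__host__ __device__ float add1(float a, float b) { return a + b; }

// __global__ kernel: launched from the host, executed on the device by a grid of threads.
__global__ void vecAddKernel(const float* A_d, const float* B_d, float* C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // "area code" * block size + "phone line"
    if (i < n)                                       // boundary check: n need not be a multiple of the block size
        C_d[i] = add1(A_d[i], B_d[i]);
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *A_h = new float[n], *B_h = new float[n], *C_h = new float[n];  // _h: host copies
    for (int i = 0; i < n; ++i) { A_h[i] = 1.0f; B_h[i] = 2.0f; }

    float *A_d, *B_d, *C_d;                          // _d: device pointers held in host code
    cudaMalloc(&A_d, bytes); cudaMalloc(&B_d, bytes); cudaMalloc(&C_d, bytes);
    cudaMemcpy(A_d, A_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;                 // a multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(A_d, B_d, C_d, n);   // <<<number of blocks, threads per block>>>

    cudaMemcpy(C_h, C_d, bytes, cudaMemcpyDeviceToHost);
    printf("C_h[0] = %f\n", C_h[0]);

    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    delete[] A_h; delete[] B_h; delete[] C_h;
    return 0;
}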
Thread Scheduling
Block scheduling
When a kernel is called, the CUDA runtime launches a grid of threads that execute the kernel code. These threads are assigned to SMs on a block-by-block basis: all threads in a block are assigned to the same SM simultaneously. Some blocks are reserved for system execution, so not all of an SM's block slots are scheduled to the user kernel.
Multiple blocks are likely to be assigned to the same SM at the same time. The idea behind warp scheduling is that the threads inside the same warp run the same instructions (the same kernel), so the instruction fetch is a one-time effort. Also, the data that threads in the same warp access is laid out linearly, and is therefore prefetchable/cache friendly.
Moreover, threads in the same block can interact with each other in ways that threads across different blocks cannot, such as barrier synchronization.
synchronization / transparent scalability
__syncthreads() blocks until every thread in the same block reaches that code location. If a __syncthreads() statement is present, it must be executed by all threads in the block, i.e.
it is wrong when not all threads reach the same barrier-synchronization point (e.g. __syncthreads() placed inside divergent branches).
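
A minimal sketch of the wrong vs. right placement (kernel names are illustrative): __syncthreads() inside divergent branches creates two different barrier points, which is undefined behavior; hoisting the barrier out of the branch fixes it.

// WRONG: threads of the same block arrive at two different __syncthreads() barriers.
__global__ void barrierWrong(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
        __syncthreads();   // only the "even" threads arrive at this barrier
    } else {
        data[i] += 1.0f;
        __syncthreads();   // the "odd" threads arrive at a different barrier -> undefined behavior
    }
}

// RIGHT: every thread in the block reaches the same barrier.
__global__ void barrierRight(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
    __syncthreads();       // outside the divergent branches
}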
Not only do all threads in a block have to be assigned to the same SM, they also need to be assigned to that SM simultaneously. I.e. a block can begin execution only when the runtime system has secured all the resources needed by all threads in the block to complete execution.
The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability.
Control Divergence
Execution works well when either all threads in a warp execute the if-path or all execute the else-path. Otherwise the hardware has to go over the code twice: one pass in which the cores run the if-path while the else-path threads sit idle (no-op), and another pass in which the if-path threads sit idle while the else-path runs (all within the same warp). On older architectures these two passes run sequentially; on newer architectures they can run concurrently. This is called independent thread scheduling.
Thus, do not branch on threadIdx; use the data itself for divergence control, which also relates to data locality in the cache. One important fact: the performance impact of control divergence decreases as the size of the vectors being processed increases (worked example below).
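
A rough worked example of why the impact shrinks with data size (my numbers, assuming the usual if (i < n) boundary check and 256-thread blocks): for n = 1,000 the grid is ceil(1000/256) = 4 blocks = 1,024 threads = 32 warps, and only the warp covering indices 992..1023 diverges at the check, i.e. 1/32 of the warps. For n = 10,000 the grid is 40 blocks = 320 warps and still at most one warp diverges, i.e. 1/320. In general at most one warp in the whole grid diverges on the boundary check, so its relative cost shrinks as the data grows.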
One cannot assume that all threads in a warp have the same execution timing (even though they run the same fetched instructions). Thus, use the
__syncwarp() barrier synchronization instead of relying on implicit lockstep execution (see the sketch below).
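
A minimal __syncwarp() sketch (illustrative names; assumes the kernel is launched with blockDim.x == 32, i.e. one warp per block): lane 0 writes a value to shared memory and the other lanes read it, so a warp-level barrier is needed instead of assuming lockstep timing.

__global__ void warpBroadcast(const int* in, int* out) {
    __shared__ int slot;                  // one shared slot for the single warp in this block
    int lane = threadIdx.x;               // 0..31
    if (lane == 0)
        slot = in[blockIdx.x];            // only lane 0 writes
    __syncwarp();                         // warp barrier + memory ordering; do not rely on implicit lockstep
    out[blockIdx.x * 32 + lane] = slot;   // every lane now safely reads lane 0's value
}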
Latency tolerance
Compare with the simple case, i.e. a CPU: it context switches between threads on a single core due to limited resources (registers, cache, etc.) and makes sure code runs in a preemptive-scheduling fashion.
Likewise, an SM only has enough execution units to execute a subset of all the threads assigned to it at any point in time.
In recent designs, each SM can execute instructions for only a small number of warps at any given point in time.
GPU SMs achieve zero-overhead scheduling by holding all the execution state for the assigned warps in hardware registers, so there is no need to save and restore state when switching from one warp to another.
This is what allows the GPU to oversubscribe threads to SMs.
Automatic/Local variables declared in the kernel are placed into registers.
Each SM in an A100 GPU has 65,536 registers.
65,536 registers / 2,048 threads (max threads per SM) = 32 registers per thread/kernel.
In some cases, the compiler may perform register spilling to reduce the per-thread register requirement and thus raise the level of occupancy. However, this can increase latency due to the need to fetch data from memory instead of directly from registers.
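
A hedged sketch of checking this in code: the CUDA runtime's occupancy API reports how many blocks of a given kernel can be resident on one SM given its compiled register/shared-memory usage (the kernel name is illustrative); nvcc's --maxrregcount flag or the __launch_bounds__ qualifier can cap per-thread registers and trade spilling for occupancy.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float* x, int n) {      // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int blockSize = 256;
    int blocksPerSM = 0;
    // How many blocks of someKernel fit on one SM, given its register usage? (0 = no dynamic shared memory)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, someKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("resident blocks/SM: %d -> %d threads/SM (hardware limit: %d threads/SM)\n",
           blocksPerSM, blocksPerSM * blockSize, prop.maxThreadsPerMultiProcessor);
    return 0;
}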
The cudaDeviceProp struct has a bunch of fields that describe the hardware spec.
e.g.
multiProcessorCount Number of multiprocessors on device
regsPerBlock 32-bit registers available per block
warpSize Warp size in threads
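
A small query sketch for the fields above (plus the device name), using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s\n", dev, prop.name);
        printf("  multiProcessorCount : %d\n", prop.multiProcessorCount);
        printf("  regsPerBlock        : %d\n", prop.regsPerBlock);
        printf("  warpSize            : %d\n", prop.warpSize);
    }
    return 0;
}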
Variable declaration scope and lifetime
automatic scalar variables [mem]register [scope]thread [lifetime]grid
automatic array variables [mem]local [scope]thread [lifetime]grid
__device__ __shared__ [mem]shared [scope]block [lifetime]grid
__device__ [mem]global [scope]grid [lifetime]application
__device__ __constant__ [mem]constant [scope]grid [lifetime]application
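
A small sketch mapping the table above onto declarations (variable names are mine; assumes a 256-thread block):

__constant__ float coeff_c[16];          // constant memory: grid scope, application lifetime
__device__   int   hits_d;               // global memory:   grid scope, application lifetime

__global__ void scopeDemo(const float* in, float* out, int n) {
    __shared__ float tile_s[256];        // shared memory: one copy per block, lifetime of the grid
    int   i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic scalar -> register, per thread
    float v = 0.0f;                                    // automatic scalar -> register, per thread
    // An automatic array (e.g. float scratch[8];) would go to per-thread local memory.

    tile_s[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // guarded load into shared memory
    __syncthreads();                                   // every thread in the block reaches this barrier
    v = tile_s[threadIdx.x] * coeff_c[0];
    if (i < n) out[i] = v;
    if (threadIdx.x == 0) atomicAdd(&hits_d, 1);       // __device__ global visible to the whole grid
}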
API