GPU / CUDA Programming
See also GPUs
Programming Massively Parallel Processors (2022)
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
Programming Heterogeneous Computing Systems with GPUs and other Accelerators (Fall 2022)
- Khushi Agrawal: a blog series diving deep into CUDA programming, inspired by the PMPP book and CUDA MODE (https://t.co/rPtlFr30wg)
Kernel
2D Blocks
Row major
gridDim.x and gridDim.y
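A minimal sketch tying these notes together: a kernel launched over a 2D grid of 2D blocks, using a row-major index into an M x N matrix (the matrix shape, block size, and kernel name are illustrative assumptions):

```cuda
// Each thread handles one element of an M x N row-major matrix.
__global__ void scale(float *a, int M, int N, float s) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N)
        a[row * N + col] *= s;   // row-major: index = row * N + col
}

// Launch: ceil-divide so the grid covers the whole matrix.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// scale<<<grid, block>>>(d_a, M, N, 2.0f);
// Inside the kernel, gridDim.x and gridDim.y report the grid's block
// counts in x and y (here, the number of 16x16 tiles per dimension).
```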
GPU Architecture
HetSys Course: Lecture 4: GPU Memory Hierarchy (Fall 2022)
H100 (Nvidia Hopper Architecture)
NVIDIA Hopper Architecture In-Depth
H100 Thread Block Clusters
Thread blocks in the same cluster can synchronize and exchange data directly, which avoids round-tripping intermediate results through global memory
thread < thread block < thread block cluster < grid
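A hedged sketch of launching a kernel with a fixed cluster size on Hopper (sm_90+), using the cooperative groups cluster API; the cluster shape and kernel body are illustrative assumptions:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// __cluster_dims__ fixes the cluster size at compile time: here,
// 2 thread blocks per cluster.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *out) {
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();  // this block's rank in the cluster
    // ... each block produces a partial result in its shared memory ...
    cluster.sync();  // all blocks in the cluster reach this point
    // After the sync, blocks may read each other's shared memory
    // (distributed shared memory) instead of staging via global memory.
}
```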
GH100 (full die): 144 SMs, 60 MB L2 cache
TMA (Tensor Memory Accelerator) - hardware engine for asynchronous bulk copies of tensor tiles between global and shared memory; reduces per-thread addressing overhead
Distributed Shared Memory
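A hedged sketch of distributed shared memory: one block reads a peer block's shared memory through `cluster.map_shared_rank` (cooperative groups, Hopper sm_90+); the 2-block cluster and rank-swap pattern are illustrative assumptions:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) dsm_kernel(int *out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    if (threadIdx.x == 0)
        smem[0] = cluster.block_rank();      // each block writes its own rank
    cluster.sync();                          // make writes visible cluster-wide
    // Map the peer block's shared-memory address into this block's view.
    unsigned int peer = cluster.block_rank() ^ 1;  // the other block (0 <-> 1)
    int *peer_smem = cluster.map_shared_rank(smem, peer);
    if (threadIdx.x == 0)
        out[cluster.block_rank()] = peer_smem[0];  // read peer's value directly
    cluster.sync();  // keep peer shared memory alive until all reads finish
}
```

No data touched global memory between the blocks: the exchange happens entirely through the SM-to-SM network within the cluster.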