CUDA Execution Model
How To Read A CUDA Kernel
Use this short checklist every time:
- Find the logical work unit. Ask what one thread, warp, or block is responsible for.
- Decode the index math. Look for blockIdx, threadIdx, blockDim, and any derived offsets.
- Inspect the memory accesses. Separate global loads, shared-memory loads, stores, and reductions.
- Find synchronization points. Every __syncthreads() should protect a clear shared-memory phase boundary.
- Check boundary conditions. Out-of-range reads and stores are a common first bug.
- Compare against the reference implementation. Make sure the math, masking, and shape conventions still match.
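The checklist can be rehearsed on a minimal kernel. The sketch below (a made-up example, not one of the lab files) is small enough that every checklist item has an obvious answer:

```cuda
// Work unit: each thread owns exactly one output element.
__global__ void scale_add(const float* x, const float* y, float* out,
                          float a, int n) {
    // Index math: global element id from block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Boundary condition: the last block may overshoot n.
    if (i < n) {
        // Memory accesses: two global loads, one global store.
        // No shared memory is used, so no __syncthreads() is needed.
        out[i] = a * x[i] + y[i];
    }
}
```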
Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context
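On the host side, the hierarchy shows up as the launch configuration. A common pattern for a 1D problem (kernel name and device pointers here are placeholders) is to fix a block size and ceil-divide the problem size to get the grid:

```cuda
// Hypothetical launch of a 1D kernel over n elements.
int n = 1 << 20;
int threadsPerBlock = 256;  // one block = a team of 256 threads
// Ceil-divide so every element is covered even when n % 256 != 0.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// d_x, d_y, d_out are assumed to be device pointers allocated with cudaMalloc.
my_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_x, d_y, d_out, n);
```

The overshoot from the ceil-divide is exactly why kernels need the boundary check from the checklist above.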
CUDA makes several things explicit that Triton abstracts:
- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
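All five of these explicit choices appear in even a small shared-memory kernel. The block-wide sum below is a sketch (it assumes blockDim.x is exactly 256 and a power of two), annotated to show where each item from the list lands:

```cuda
// Sketch: each block reduces its slice of x to one partial sum.
__global__ void block_sum(const float* x, float* block_out, int n) {
    __shared__ float buf[256];              // explicit shared-memory allocation
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;  // explicit thread/block decomposition
    buf[tid] = (i < n) ? x[i] : 0.0f;       // boundary-checked global load
    __syncthreads();                        // phase boundary: all loads visible
    // Tree reduction in shared memory; each halving is a distinct phase.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                    // explicit synchronization per phase
    }
    if (tid == 0) block_out[blockIdx.x] = buf[0];  // one store per block
}
```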
Reading Order For This Lab
- vector_add.cu: pure indexing
- row_softmax.cu: reduction structure
- tiled_matmul.cu: shared-memory tiling
- online_softmax.cu: stateful reduction recurrence
- flash_attention_fwd.cu: composition of multiple ideas
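One idea worth previewing before reading online_softmax.cu is the stateful recurrence itself, independent of any parallel decomposition. A scalar sketch (this is not the lab kernel; it is the single-pass recurrence the kernel parallelizes):

```cuda
// Online softmax normalizer: one pass over x, carrying a running max m
// and a running sum s of exp(x[j] - m). Whenever m grows, the old sum is
// rescaled by exp(m_old - m_new) so it stays expressed relative to the new max.
__device__ void online_softmax_norm(const float* x, int n,
                                    float* m_out, float* s_out) {
    float m = -INFINITY;  // running max
    float s = 0.0f;       // running rescaled sum
    for (int j = 0; j < n; ++j) {
        float m_new = fmaxf(m, x[j]);
        s = s * expf(m - m_new) + expf(x[j] - m_new);
        m = m_new;
    }
    *m_out = m;
    *s_out = s;
}
```

The same rescaling trick, applied blockwise, is what flash_attention_fwd.cu composes with tiling.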