# CUDA Execution Model

## How To Read A CUDA Kernel

Use this short checklist every time; the annotated sketches at the end of this section apply it to a minimal kernel:

1. Find the logical work unit. Ask what one thread, warp, or block is responsible for.
2. Decode the index math. Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets.
3. Inspect the memory accesses. Separate global loads, shared-memory loads, stores, and reductions.
4. Find the synchronization points. Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check the boundary conditions. Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation. Make sure the math, masking, and shape conventions still match.

## Execution Hierarchy

- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context

CUDA makes several things explicit that Triton abstracts (the launch-configuration and shared-memory sketches below show the last three):

- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices

## Reading Order For This Lab

- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence
- `flash_attention_fwd.cu`: composition of multiple ideas
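## Annotated Sketches

The sketch below applies the checklist to the simplest case. It is a hypothetical stand-in, not necessarily the lab's `vector_add.cu`; the kernel name and signature are assumptions.

```cuda
#include <cuda_runtime.h>

// (1) Work unit: one thread computes one output element.
__global__ void vector_add(const float* a, const float* b, float* out, int n) {
    // (2) Index math: a global element index derived from block and thread ids.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // (5) Boundary condition: the last block may be partially full.
    if (i < n) {
        // (3) Memory accesses: two global loads, one global store.
        out[i] = a[i] + b[i];
    }
    // (4) No __syncthreads(): threads never exchange data, so there is
    //     no shared-memory phase to protect.
}
```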
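Launch configuration is one of the choices CUDA leaves to the programmer. Here is a minimal host-side sketch, assuming the `vector_add` kernel above and hypothetical device-pointer names:

```cuda
#include <cuda_runtime.h>

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Hypothetical device buffers; real code would also check errors and
    // copy input data from the host.
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);

    // Ceiling division: the grid covers all n elements even when n is not a
    // multiple of the block size; the in-kernel bounds check absorbs the slack.
    int threads_per_block = 256;
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
    vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_out, n);

    cudaDeviceSynchronize();
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```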
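Shared-memory allocation and synchronization are the other two explicit choices. Below is a minimal sketch of a shared-memory phase boundary, using a hypothetical block-level sum reduction (not one of the lab files) and assuming a 256-thread, power-of-two block size:

```cuda
// Each block reduces up to 256 input elements to one partial sum.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];        // explicit shared-memory allocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: every thread stages one bounds-checked element.
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                  // boundary: all stores land before any reads

    // Phase 2: tree reduction, halving the active range each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        }
        __syncthreads();              // boundary between reduction rounds
    }

    // One global store per block.
    if (threadIdx.x == 0) {
        out[blockIdx.x] = buf[0];
    }
}
```

Note that the `__syncthreads()` inside the loop is reached by every thread in the block, not only the active ones; a sync inside divergent control flow would deadlock.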