# CUDA Execution Model
## How To Read A CUDA Kernel
Use this short checklist every time:
1. Find the logical work unit.
   Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
   Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets (see the sketch after this list).
3. Inspect the memory accesses.
   Separate global loads, shared-memory loads, stores, and reductions.
4. Find synchronization points.
   Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
   Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation.
   Make sure the math, masking, and shape conventions still match.
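
As a concrete target for the checklist, here is a minimal sketch in the spirit of `vector_add.cu` (not necessarily the lab file's exact code): item 2's index math is the first line of the body, item 3's accesses are two global loads and one store, and item 5's boundary check is the `if` guard.

```cuda
__global__ void vector_add(const float* a, const float* b,
                           float* out, int n) {
    // Item 2: one global element index per thread, derived from the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Item 5: the last block may extend past the array end, so guard it.
    if (i < n) {
        // Item 3: two global loads and one global store; no shared memory.
        out[i] = a[i] + b[i];
    }
}
```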
## Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context (see the 2D launch sketch below)
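
A small 2D launch shows all three levels at once. This is an illustrative sketch: the kernel name `fill_2d`, the 16x16 block shape, and the launcher are assumptions, not code from the lab.

```cuda
// Thread: one scalar context, located by (blockIdx, threadIdx).
__global__ void fill_2d(float* out, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)  // edge blocks overhang the matrix
        out[row * cols + col] = 1.0f;
}

void launch_fill_2d(float* d_out, int rows, int cols) {
    dim3 block(16, 16);                             // Block: a 16x16 team of threads
    dim3 grid((cols + 15) / 16, (rows + 15) / 16);  // Grid: all blocks for one kernel
    fill_2d<<<grid, block>>>(d_out, rows, cols);
}
```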
CUDA makes several things explicit that Triton abstracts (the sketch after this list makes most of them concrete):

- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
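
A block-level sum reduction makes these explicit choices visible in one place. This is a hedged sketch, not code from the lab: the kernel name, the fixed 256-thread block, and the host launcher are illustrative assumptions.

```cuda
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];                      // explicit shared-memory allocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // manual thread/block decomposition
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // 0 is the sum's identity, so padding is safe
    __syncthreads();                                // phase boundary: all loads visible

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                            // phase boundary after each halving step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];                   // one partial sum per block
}

// Host side: the launch configuration is an explicit, manual choice.
void launch_block_sum(const float* d_in, float* d_out, int n) {
    int block = 256;                     // must match the shared buffer size
    int grid = (n + block - 1) / block;  // ceil-division over the input
    block_sum<<<grid, block>>>(d_in, d_out, n);
}
```

The load / `__syncthreads()` / compute shape here is exactly the phase structure that checklist item 4 asks you to look for.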
## Reading Order For This Lab
- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence (sketched after this list)
- `flash_attention_fwd.cu`: composition of multiple ideas
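
The recurrence behind `online_softmax.cu` is worth having in mind before reading the kernel. Below is a scalar, single-threaded sketch of the standard online-softmax update, not the lab's kernel; the function name and signature are hypothetical. One pass tracks the running maximum `m` and the running denominator `d`, rescaling `d` whenever the maximum grows.

```cuda
#include <math.h>

// Illustrative scalar version of the online-softmax statistics.
void online_softmax_stats(const float* x, int n, float* m_out, float* d_out) {
    float m = -INFINITY;  // running max of x[0..i]
    float d = 0.0f;       // running sum of exp(x[j] - m)
    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, x[i]);
        // Rescale the old sum to the new max, then add the new term.
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    *m_out = m;  // final maximum
    *d_out = d;  // final denominator: sum over j of exp(x[j] - m)
    // Afterwards, softmax(x)[i] = expf(x[i] - m) / d.
}
```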