# CUDA Execution Model
## How To Read A CUDA Kernel
Use this short checklist every time:
1. Find the logical work unit.
   Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
   Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets.
3. Inspect the memory accesses.
   Separate global loads, shared-memory loads, stores, and reductions.
4. Find synchronization points.
   Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
   Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation.
   Make sure the math, masking, and shape conventions still match.
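
The checklist can be practiced on a minimal kernel. This is an illustrative sketch (names like `vecAdd` are placeholders, not the lab's `vector_add.cu`):

```cuda
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Steps 1-2: one thread owns one output element; the index math is
    // the standard global offset from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Step 5: boundary check — the last block may overshoot n.
    if (i < n) {
        // Step 3: two global loads, one global store; no shared memory,
        // so there are no __syncthreads() phase boundaries (step 4).
        c[i] = a[i] + b[i];
    }
}
```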
## Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Warp: 32 threads that execute in lockstep (the hardware's SIMT unit)
- Thread: one scalar execution context
CUDA makes several things explicit that Triton abstracts:
- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
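
All five of those explicit concerns show up in a block-level sum reduction. A minimal sketch, assuming `blockDim.x` is a power of two (names are illustrative):

```cuda
__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float tile[];                  // explicit shared-memory allocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // manual thread/block decomposition
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // guarded global load into shared memory
    __syncthreads();                                 // phase boundary: tile fully written

    // Tree reduction in shared memory; each halving is a new phase.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0]; // one partial sum per block
}

// The launch configuration is also explicit, including the dynamic
// shared-memory size, e.g.:
//   blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```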
## Reading Order For This Lab
- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence
- `flash_attention_fwd.cu`: composition of multiple ideas
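
The "stateful reduction recurrence" in `online_softmax.cu` refers to the standard online-softmax update: a running maximum and a running rescaled sum replace the usual two-pass max-then-sum. A sketch of the per-element step (function name is illustrative):

```cuda
// Maintain running max m and running sum s so that, after each step,
// s equals sum_j exp(x_j - m) over the elements seen so far.
__device__ void onlineStep(float x, float& m, float& s) {
    float mNew = fmaxf(m, x);
    s = s * expf(m - mNew) + expf(x - mNew);  // rescale old sum to the new max
    m = mNew;
}
// Initialize with m = -INFINITY, s = 0.0f before scanning a row.
```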