# CUDA Execution Model
## How To Read A CUDA Kernel
Use this short checklist every time:
1. Find the logical work unit.
   Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
   Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets (see the sketch after this list).
3. Inspect the memory accesses.
   Separate global loads, shared-memory loads, stores, and reductions.
4. Find synchronization points.
   Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
   Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation.
   Make sure the math, masking, and shape conventions still match.
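
As a concrete target for the checklist, here is a minimal sketch in the spirit of `vector_add.cu` (not necessarily the lab file's exact code): item 2's index math is the first line of the body, item 3's accesses are two global loads and one store, and item 5's boundary check is the `if` guard.

```cuda
__global__ void vector_add(const float* a, const float* b,
                           float* out, int n) {
    // Item 2: one global element index per thread, derived from the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Item 5: the last block may extend past the array end, so guard it.
    if (i < n) {
        // Item 3: two global loads and one global store; no shared memory.
        out[i] = a[i] + b[i];
    }
}
```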
## Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context (see the 2D launch sketch below)
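
A small 2D launch shows all three levels at once. This is an illustrative sketch: the kernel name `fill_2d`, the 16x16 block shape, and the launcher are assumptions, not code from the lab.

```cuda
// Thread: one scalar context, located by (blockIdx, threadIdx).
__global__ void fill_2d(float* out, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)  // edge blocks overhang the matrix
        out[row * cols + col] = 1.0f;
}

void launch_fill_2d(float* d_out, int rows, int cols) {
    dim3 block(16, 16);                             // Block: a 16x16 team of threads
    dim3 grid((cols + 15) / 16, (rows + 15) / 16);  // Grid: all blocks for one kernel
    fill_2d<<<grid, block>>>(d_out, rows, cols);
}
```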
CUDA makes several things explicit that Triton abstracts (the sketch after this list makes most of them concrete):

- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
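
A block-level sum reduction makes these explicit choices visible in one place. This is a hedged sketch, not code from the lab: the kernel name, the fixed 256-thread block, and the host launcher are illustrative assumptions.

```cuda
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];                      // explicit shared-memory allocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // manual thread/block decomposition
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // 0 is the sum's identity, so padding is safe
    __syncthreads();                                // phase boundary: all loads visible

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                            // phase boundary after each halving step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];                   // one partial sum per block
}

// Host side: the launch configuration is an explicit, manual choice.
void launch_block_sum(const float* d_in, float* d_out, int n) {
    int block = 256;                     // must match the shared buffer size
    int grid = (n + block - 1) / block;  // ceil-division over the input
    block_sum<<<grid, block>>>(d_in, d_out, n);
}
```

The load / `__syncthreads()` / compute shape here is exactly the phase structure that checklist item 4 asks you to look for.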
## Reading Order For This Lab
- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence (sketched after this list)
- `flash_attention_fwd.cu`: composition of multiple ideas
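
The recurrence behind `online_softmax.cu` is worth having in mind before reading the kernel. Below is a scalar, single-threaded sketch of the standard online-softmax update, not the lab's kernel; the function name and signature are hypothetical. One pass tracks the running maximum `m` and the running denominator `d`, rescaling `d` whenever the maximum grows.

```cuda
#include <math.h>

// Illustrative scalar version of the online-softmax statistics.
void online_softmax_stats(const float* x, int n, float* m_out, float* d_out) {
    float m = -INFINITY;  // running max of x[0..i]
    float d = 0.0f;       // running sum of exp(x[j] - m)
    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, x[i]);
        // Rescale the old sum to the new max, then add the new term.
        d = d * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    *m_out = m;  // final maximum
    *d_out = d;  // final denominator: sum over j of exp(x[j] - m)
    // Afterwards, softmax(x)[i] = expf(x[i] - m) / d.
}
```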