kernel-lab/docs/cuda_execution_model.md

CUDA Execution Model

How To Read A CUDA Kernel

Use this short checklist every time:

  1. Find the logical work unit. Ask what one thread, warp, or block is responsible for.
  2. Decode the index math. Look for blockIdx, threadIdx, blockDim, and any derived offsets.
  3. Inspect the memory accesses. Separate global loads, shared memory loads, stores, and reductions.
  4. Find synchronization points. Every __syncthreads() should protect a clear shared-memory phase boundary.
  5. Check boundary conditions. Out-of-range reads and stores are a common first bug.
  6. Compare against the reference implementation. Make sure the math, masking, and shape conventions still match.
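To make the checklist concrete, here is a CPU model of the canonical vector-add pattern, with the index math (step 2) and boundary mask (step 5) called out. This is a hypothetical sketch that emulates a 1-D launch with host loops, not the lab's actual vector_add.cu:

```cpp
#include <cassert>
#include <vector>

// CPU model of the canonical CUDA pattern:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < n) out[i] = a[i] + b[i];
// blockDim is a made-up launch parameter; the loops stand in for the grid.
std::vector<float> vector_add_model(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    int blockDim) {
    int n = static_cast<int>(a.size());
    int gridDim = (n + blockDim - 1) / blockDim;  // ceil-divide: enough blocks to cover n
    std::vector<float> out(n, 0.0f);
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            int i = blockIdx * blockDim + threadIdx;  // step 2: derived global offset
            if (i < n) out[i] = a[i] + b[i];          // step 5: boundary mask
        }
    }
    return out;
}
```

Note that the last block may be partially full, which is exactly why the `i < n` mask exists.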

Execution Hierarchy

  • Grid: all blocks launched for one kernel
  • Block: a cooperating team of threads
  • Thread: one scalar execution context
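The hierarchy is just nested index spaces: a grid of blocks, each block a team of threads. A small sketch of flattening a 2-D launch into one scalar context id (row-major; the names mirror CUDA's built-ins, the shapes are illustrative):

```cpp
#include <cassert>

// One scalar execution context per (blockIdx, threadIdx) pair.
// Flatten block coordinates, then thread coordinates, row-major.
int flat_id(int blockIdxX, int blockIdxY, int gridDimX,
            int threadIdxX, int threadIdxY, int blockDimX, int blockDimY) {
    int block  = blockIdxY * gridDimX + blockIdxX;     // which block in the grid
    int thread = threadIdxY * blockDimX + threadIdxX;  // which thread in the block
    return block * (blockDimX * blockDimY) + thread;   // unique id per context
}
```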

CUDA makes explicit several things that Triton abstracts away:

  • manual thread/block decomposition
  • pointer arithmetic
  • shared-memory allocation and reuse
  • synchronization
  • launch configuration choices
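The last item, launch configuration, is plain host-side arithmetic in CUDA. A hedged sketch of what that bookkeeping typically looks like (the struct and numbers are illustrative, not an API from this lab):

```cpp
#include <cassert>
#include <cstddef>

// The launch-shape arithmetic a CUDA host program does by hand:
// pick a block size, ceil-divide to cover all n elements, and size
// any dynamic shared-memory request explicitly.
struct LaunchConfig {
    int blocks;
    int threadsPerBlock;
    std::size_t sharedBytes;
};

LaunchConfig make_config(int n, int threadsPerBlock, std::size_t bytesPerThread) {
    LaunchConfig cfg;
    cfg.threadsPerBlock = threadsPerBlock;
    cfg.blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover every element
    cfg.sharedBytes = threadsPerBlock * bytesPerThread;        // per-block shared memory
    return cfg;
}
```

In a real program these values would feed a `kernel<<<blocks, threadsPerBlock, sharedBytes>>>(...)` launch.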

Reading Order For This Lab

  • vector_add.cu: pure indexing
  • row_softmax.cu: reduction structure
  • tiled_matmul.cu: shared-memory tiling
  • online_softmax.cu: stateful reduction recurrence
  • flash_attention_fwd.cu: composition of multiple ideas
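The "stateful reduction recurrence" behind online softmax (and reused inside flash attention) can be sketched on the CPU. This is the standard single-pass recurrence — carry a running max m and a running denominator d, rescaling d whenever m grows — and is the general technique, not necessarily the exact code in online_softmax.cu:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Single-pass softmax statistics: whenever the running max m increases,
// the accumulated denominator d is rescaled by exp(m_old - m_new) so all
// terms stay expressed relative to the current max.
std::vector<float> online_softmax(const std::vector<float>& x) {
    float m = -INFINITY, d = 0.0f;
    for (float v : x) {
        float m_new = std::max(m, v);
        d = d * std::exp(m - m_new) + std::exp(v - m_new);  // rescale old sum, add new term
        m = m_new;
    }
    std::vector<float> out;
    out.reserve(x.size());
    for (float v : x) out.push_back(std::exp(v - m) / d);
    return out;
}
```

Because m and d are updated element by element, the same recurrence works when the input arrives in tiles, which is what makes it composable with the tiling ideas from tiled_matmul.cu.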