CUDA Execution Model
How To Read A CUDA Kernel
Use this short checklist every time:
- Find the logical work unit. Ask what one thread, warp, or block is responsible for.
- Decode the index math. Look for blockIdx, threadIdx, blockDim, and any derived offsets.
- Inspect the memory accesses. Separate global loads, shared-memory loads, stores, and reductions.
- Find synchronization points. Every __syncthreads() should protect a clear shared-memory phase boundary.
- Check boundary conditions. Out-of-range reads and stores are a common first bug.
- Compare against the reference implementation. Make sure the math, masking, and shape conventions still match.
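The checklist can be rehearsed on a minimal kernel. The sketch below (a made-up example, not one of the lab files) is small enough that every checklist item has an obvious answer:

```cuda
// Work unit: each thread owns exactly one output element.
__global__ void scale_add(const float* x, const float* y, float* out,
                          float a, int n) {
    // Index math: global element id from block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Boundary condition: the last block may overshoot n.
    if (i < n) {
        // Memory accesses: two global loads, one global store.
        // No shared memory is used, so no __syncthreads() is needed.
        out[i] = a * x[i] + y[i];
    }
}
```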
Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context
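On the host side, the hierarchy shows up as the launch configuration. A common pattern for a 1D problem (kernel name and device pointers here are placeholders) is to fix a block size and ceil-divide the problem size to get the grid:

```cuda
// Hypothetical launch of a 1D kernel over n elements.
int n = 1 << 20;
int threadsPerBlock = 256;  // one block = a team of 256 threads
// Ceil-divide so every element is covered even when n % 256 != 0.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// d_x, d_y, d_out are assumed to be device pointers allocated with cudaMalloc.
my_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_x, d_y, d_out, n);
```

The overshoot from the ceil-divide is exactly why kernels need the boundary check from the checklist above.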
CUDA makes several things explicit that Triton abstracts:
- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
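All five of these explicit choices appear in even a small shared-memory kernel. The block-wide sum below is a sketch (it assumes blockDim.x is exactly 256 and a power of two), annotated to show where each item from the list lands:

```cuda
// Sketch: each block reduces its slice of x to one partial sum.
__global__ void block_sum(const float* x, float* block_out, int n) {
    __shared__ float buf[256];              // explicit shared-memory allocation
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;  // explicit thread/block decomposition
    buf[tid] = (i < n) ? x[i] : 0.0f;       // boundary-checked global load
    __syncthreads();                        // phase boundary: all loads visible
    // Tree reduction in shared memory; each halving is a distinct phase.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                    // explicit synchronization per phase
    }
    if (tid == 0) block_out[blockIdx.x] = buf[0];  // one store per block
}
```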
Reading Order For This Lab
- vector_add.cu: pure indexing
- row_softmax.cu: reduction structure
- tiled_matmul.cu: shared-memory tiling
- online_softmax.cu: stateful reduction recurrence
- flash_attention_fwd.cu: composition of multiple ideas
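One idea worth previewing before reading online_softmax.cu is the stateful recurrence itself, independent of any parallel decomposition. A scalar sketch (this is not the lab kernel; it is the single-pass recurrence the kernel parallelizes):

```cuda
// Online softmax normalizer: one pass over x, carrying a running max m
// and a running sum s of exp(x[j] - m). Whenever m grows, the old sum is
// rescaled by exp(m_old - m_new) so it stays expressed relative to the new max.
__device__ void online_softmax_norm(const float* x, int n,
                                    float* m_out, float* s_out) {
    float m = -INFINITY;  // running max
    float s = 0.0f;       // running rescaled sum
    for (int j = 0; j < n; ++j) {
        float m_new = fmaxf(m, x[j]);
        s = s * expf(m - m_new) + expf(x[j] - m_new);
        m = m_new;
    }
    *m_out = m;
    *s_out = s;
}
```

The same rescaling trick, applied blockwise, is what flash_attention_fwd.cu composes with tiling.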