# Triton vs. CUDA

## Concept Mapping Table

| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic in element units | typed pointer math and explicit indexing | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
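
To make the first few rows concrete, here is a minimal Triton vector-add sketch. The kernel name, the `add` wrapper, and the `BLOCK_SIZE` of 1024 are illustrative choices for this document, not code from this repo; the comments point back at the CUDA concepts in the table.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One program instance owns one block of elements, the way
    # blockIdx.x identifies a block's chunk of work in CUDA.
    pid = tl.program_id(axis=0)
    # A whole vector of indices at once, instead of per-thread
    # threadIdx.x arithmetic.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask plays the role of CUDA's explicit `if (idx < n)` check.
    mask = offsets < n_elements
    # Pointer arithmetic is in element units; no byte bookkeeping.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are assumed to be same-shape tensors already on the GPU.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```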

## How To Compare Side By Side

  1. Start from the reference PyTorch function and identify the mathematical operator.
  2. In the Triton version, ask what one program instance owns.
  3. In the CUDA version, ask what one block and one thread own (see the CUDA sketch after this list).
  4. Match the memory reads and writes, not just the variable names.
  5. Write down where reduction state lives in each version.
  6. For tiled code, identify when data moves from global memory to on-chip storage.
  7. Only then compare performance.
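
To walk steps 2 through 4 concretely, here is a CUDA sketch of the same vector add as the Triton kernel above; the names and the 256-thread block size are illustrative, not code from this repo. One thread owns one element, one block owns a contiguous chunk, and the bounds check is the counterpart of the Triton mask.

```cuda
#include <cuda_runtime.h>

__global__ void add_kernel(const float* x, const float* y,
                           float* out, int n) {
    // One thread owns one element; the block owns a contiguous
    // chunk of blockDim.x elements (compare tl.program_id + tl.arange).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The explicit bounds check is the counterpart of Triton's mask.
    if (idx < n) {
        out[idx] = x[idx] + y[idx];
    }
}

// Illustrative launch: x, y, out are assumed to be device pointers.
void add(const float* x, const float* y, float* out, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;
    add_kernel<<<grid, block>>>(x, y, out, n);
}
```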

## Rule Of Thumb

Triton usually compresses the "how" so you can focus on the blocked tensor math, while CUDA exposes the "how" directly. That contrast is why it is valuable to study both.