# Triton vs. CUDA

## Concept Mapping Table

| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic in element units | typed pointer math and explicit indexing | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
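
To make the first few rows concrete, here is a minimal Triton vector-add sketch. The kernel name, the `add` wrapper, and the `BLOCK_SIZE` of 1024 are illustrative choices for this document, not code from this repo; the comments point back at the CUDA concepts in the table.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One program instance owns one block of elements, the way
    # blockIdx.x identifies a block's chunk of work in CUDA.
    pid = tl.program_id(axis=0)
    # A whole vector of indices at once, instead of per-thread
    # threadIdx.x arithmetic.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask plays the role of CUDA's explicit `if (idx < n)` check.
    mask = offsets < n_elements
    # Pointer arithmetic is in element units; no byte bookkeeping.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are assumed to be same-shape tensors already on the GPU.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```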

## How To Compare Side By Side

  1. Start from the reference PyTorch function and identify the mathematical operator.
  2. In the Triton version, ask what one program instance owns.
  3. In the CUDA version, ask what one block and one thread own (see the CUDA sketch after this list).
  4. Match the memory reads and writes, not just the variable names.
  5. Write down where reduction state lives in each version.
  6. For tiled code, identify when data moves from global memory to on-chip storage.
  7. Only then compare performance.
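
To walk steps 2 through 4 concretely, here is a CUDA sketch of the same vector add as the Triton kernel above; the names and the 256-thread block size are illustrative, not code from this repo. One thread owns one element, one block owns a contiguous chunk, and the bounds check is the counterpart of the Triton mask.

```cuda
#include <cuda_runtime.h>

__global__ void add_kernel(const float* x, const float* y,
                           float* out, int n) {
    // One thread owns one element; the block owns a contiguous
    // chunk of blockDim.x elements (compare tl.program_id + tl.arange).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The explicit bounds check is the counterpart of Triton's mask.
    if (idx < n) {
        out[idx] = x[idx] + y[idx];
    }
}

// Illustrative launch: x, y, out are assumed to be device pointers.
void add(const float* x, const float* y, float* out, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;
    add_kernel<<<grid, block>>>(x, y, out, n);
}
```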

## Rule Of Thumb

Triton usually compresses the "how" so you can focus on the blocked tensor math, while CUDA exposes the "how" directly. That contrast is why it is valuable to study both.