# Triton Vs CUDA
## Concept Mapping Table
| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic in element units | byte-addressed pointer math and indexing | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
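The first three rows of the table can be made concrete with a pure-Python sketch of the index arithmetic on each side. The sizes (`n`, `BLOCK`) and the helper names are illustrative assumptions, not anything Triton or CUDA mandates:

```python
# Pure-Python model of the first three table rows.
n = 10        # total number of elements (illustrative)
BLOCK = 4     # elements owned by one Triton program / one CUDA block

def triton_style(pid):
    # tl.program_id(axis=0) -> pid; tl.arange(0, BLOCK) produces a whole
    # vector of lane offsets at once, and the mask replaces an if-check
    # inside masked tl.load / tl.store.
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    mask = [off < n for off in offsets]
    return offsets, mask

def cuda_style(block_idx):
    # One scalar index per thread: blockIdx.x * blockDim.x + threadIdx.x,
    # guarded by an explicit `if (idx < n)` bounds check.
    pairs = []
    for tid in range(BLOCK):
        idx = block_idx * BLOCK + tid
        pairs.append((idx, idx < n))
    return pairs

offsets, mask = triton_style(pid=2)
pairs = cuda_style(block_idx=2)
print(offsets, mask)   # [8, 9, 10, 11] [True, True, False, False]
print(pairs)           # [(8, True), (9, True), (10, False), (11, False)]
```

Both decompositions cover exactly the same logical work; Triton states it as one vector expression per program, CUDA as one scalar per thread.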
## How To Compare Side By Side
1. Start from the reference PyTorch function and identify the mathematical operator.
2. In the Triton version, ask what one program instance owns.
3. In the CUDA version, ask what one block and one thread own.
4. Match the memory reads and writes, not just the variable names.
5. Write down where reduction state lives in each version.
6. For tiled code, identify when data moves from global memory to on-chip storage.
7. Only then compare performance.

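Step 5 is the one most often skipped, so here is a CPU-only sketch of tracking where reduction state lives for a blocked sum. The data and block size are illustrative assumptions:

```python
# Reference operator (step 1): total = sum(data).
data = list(range(1, 101))   # illustrative input
BLOCK = 32                   # tile owned by one program / one block

# Stage 1: each "program" (Triton) or "block" (CUDA) reduces its own
# tile to one partial. In Triton this is a tl.sum over a masked tile;
# in CUDA it is a warp/block reduction in registers or shared memory.
# Slicing past the end handles the ragged tail, like a mask would.
partials = [sum(data[start:start + BLOCK])
            for start in range(0, len(data), BLOCK)]

# Stage 2: combine the partials. In practice this is a second kernel
# launch or atomic adds; here the state lives in one Python list.
total = sum(partials)
print(len(partials), total)   # 4 5050
```

Writing down the two stages explicitly makes it easy to see that the Triton and CUDA versions share the same algorithmic structure and differ only in who carries the partial-sum state.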
## Rule Of Thumb
Triton usually compresses the "how" so you can focus on the blocked tensor math. CUDA exposes the "how" directly, which is why it is valuable to study both.