Initial project scaffold

2026-04-10 13:22:19 +00:00
commit 7fa69b1354
94 changed files with 3964 additions and 0 deletions

docs/triton_vs_cuda.md Normal file

@@ -0,0 +1,30 @@
# Triton Vs CUDA
## Concept Mapping Table
| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic in element units | manual pointer arithmetic and explicit index math | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
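For concreteness, here is a minimal sketch (not part of this commit) of an elementwise-add kernel, with comments tying lines back to the table rows. It assumes Triton is installed; the kernel name and `BLOCK` size are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # tl.program_id(axis=0) ~ blockIdx.x: which chunk this instance owns.
    pid = tl.program_id(axis=0)
    # tl.arange(0, BLOCK) ~ threadIdx.x: a whole vector of lane offsets at once.
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # The mask ~ CUDA's explicit `if (idx < n)` boundary check.
    mask = offsets < n_elements
    # Pointer arithmetic is in element units, not bytes.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Implicit vectorized math: the body reads like array algebra.
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

A launch such as `add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)` then plays the role of CUDA's `<<<gridDim, blockDim>>>` configuration.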
## How To Compare Side By Side
1. Start from the reference PyTorch function and identify the mathematical operator.
2. In the Triton version, ask what one program instance owns.
3. In the CUDA version, ask what one block and one thread own.
4. Match the memory reads and writes, not just the variable names.
5. Write down where reduction state lives in each version.
6. For tiled code, identify when data moves from global memory to on-chip storage.
7. Only then compare performance; the row-sum sketch below walks steps 2 through 5 on the Triton side.
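As a worked example, here is a hypothetical row-sum kernel, again a sketch rather than code from this repo, assuming each row fits in a single `BLOCK` (a power of two at least `n_cols`):

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, n_cols, stride_row, BLOCK: tl.constexpr):
    # Step 2: one program instance owns one entire row.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    # Step 4: one masked read per row element; out-of-bounds lanes read 0.0.
    x = tl.load(x_ptr + row * stride_row + cols, mask=mask, other=0.0)
    # Step 5: reduction state lives in this instance's registers. A CUDA
    # version would stage partials in shared memory or warp shuffles.
    total = tl.sum(x, axis=0)
    # One write per row: the memory traffic to match against the CUDA version.
    tl.store(out_ptr + row, total)
```

The CUDA counterpart makes step 6 explicit: cooperative loads into shared memory, a synchronized tree reduction, then a single write per block.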
## Rule Of Thumb
Triton usually compresses the "how" so you can focus on the blocked tensor math. CUDA exposes the "how" directly, which is exactly why it pays to study both.