Initial project scaffold

This commit is contained in:
wjh
2026-04-10 13:15:06 +00:00
commit a4a6b1f1c8
94 changed files with 3964 additions and 0 deletions

@@ -0,0 +1,51 @@
# Task 03: Tiled Matmul
## 1. Problem Statement
Implement a tiled matrix multiplication and compare the tile abstraction in Triton with the explicit shared-memory strategy in CUDA.
## 2. Expected Input/Output Shapes
- Input `A`: `[M, K]`
- Input `B`: `[K, N]`
- Output `C`: `[M, N]`
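A quick NumPy sanity check of the shape contract (the sizes here are arbitrary):

```python
import numpy as np

# A: [M, K], B: [K, N] -> C: [M, N]; the inner K dimension must match.
M, K, N = 128, 64, 96
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = A @ B
assert C.shape == (M, N)
```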
## 3. Performance Intuition
Matmul becomes interesting once data reuse matters. Re-reading the same `A` and `B` values from global memory is expensive; tiling exists to reuse those values across many multiply-accumulate operations.
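A back-of-the-envelope traffic model makes the reuse argument concrete. The sizes and tile width below are illustrative, and the model assumes every consumed value is fetched from global memory unless it was staged in a tile:

```python
# Rough global-memory traffic model for C[M, N] = A[M, K] @ B[K, N].
# Assumption: every value a multiply-accumulate consumes is fetched from
# global memory unless it was staged in an on-chip tile.
M, N, K, T = 1024, 1024, 1024, 32  # T = square tile width (illustrative)

# Naive: each of the M*N outputs walks a length-K row of A and column of B.
naive_reads = M * N * K * 2

# Tiled: each A tile is reused across N/T output tiles along its row, so A
# is read N/T times in total; symmetrically, B is read M/T times.
tiled_reads = M * K * (N // T) + K * N * (M // T)

# With square tiles the reuse factor works out to exactly T.
print(naive_reads // tiled_reads)
```

This is why larger tiles help, until they no longer fit in on-chip storage.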
## 4. Memory Access Discussion
Think about which `A` tile and `B` tile each work unit (a Triton program instance, or a CUDA thread block) needs. The performance win comes from moving those tiles into on-chip storage and reusing them before fetching the next tile.
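Before touching a GPU, the tile loop can be prototyped in plain NumPy. This is a sketch, not either target implementation; the function name and tile width `T` are my own:

```python
import numpy as np

def tiled_matmul(A, B, T=16):
    """Blocked matmul: stage one A tile and one B tile at a time,
    mirroring what an on-chip tile (shared memory) does on a GPU."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, T):
        for n0 in range(0, N, T):
            acc = np.zeros((T, T), dtype=A.dtype)
            for k0 in range(0, K, T):
                a_tile = A[m0:m0+T, k0:k0+T]  # "load" one A tile...
                b_tile = B[k0:k0+T, n0:n0+T]  # ...and the matching B tile
                acc += a_tile @ b_tile        # reuse both across T*T outputs
            C[m0:m0+T, n0:n0+T] = acc
    return C

A = np.random.rand(64, 32).astype(np.float32)
B = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

Each `a_tile`/`b_tile` pair is fetched once and participates in `T * T` multiply-accumulates, which is exactly the reuse the section above describes.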
## 5. What Triton Is Abstracting
Triton lets you think in output tiles and blocked pointer arithmetic. The tile loads and accumulations read like tensor operations.
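Triton's model can be emulated without a GPU: each program instance owns one output tile and loops over `K` in blocks. In this NumPy stand-in (my own sketch, not Triton code), slicing plays the role of `tl.load` with blocked pointers, `@` plays the role of `tl.dot`, and assignment plays the role of `tl.store`:

```python
import numpy as np

def matmul_program(pid_m, pid_n, A, B, C, BLOCK_M=16, BLOCK_N=16, BLOCK_K=8):
    """One 'program instance' in Triton's sense: it owns the output tile
    at (pid_m, pid_n) and accumulates over K in BLOCK_K-wide steps."""
    m0, n0 = pid_m * BLOCK_M, pid_n * BLOCK_N
    acc = np.zeros((BLOCK_M, BLOCK_N), dtype=np.float32)
    for k0 in range(0, A.shape[1], BLOCK_K):
        a = A[m0:m0+BLOCK_M, k0:k0+BLOCK_K]  # stands in for tl.load of A
        b = B[k0:k0+BLOCK_K, n0:n0+BLOCK_N]  # stands in for tl.load of B
        acc += a @ b                         # stands in for tl.dot
    C[m0:m0+BLOCK_M, n0:n0+BLOCK_N] = acc   # stands in for tl.store

M, K, N = 32, 24, 32
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)
# The launch grid: one program per output tile.
for pid_m in range(M // 16):
    for pid_n in range(N // 16):
        matmul_program(pid_m, pid_n, A, B, C)
assert np.allclose(C, A @ B, atol=1e-4)
```

Note what is absent: no thread indices, no shared-memory allocation, no barriers. That is the abstraction the task asks you to compare against CUDA.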
## 6. What CUDA Makes Explicit
CUDA makes you choose block dimensions, allocate shared memory, manage cooperative loads, and synchronize between load and compute phases.
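The CUDA side can likewise be simulated in Python to expose the extra moving parts before writing the real kernel (all names here are illustrative): a `TILE x TILE` thread block, a cooperative load phase into "shared memory" arrays, and the two barriers that must separate load from compute:

```python
import numpy as np

def cuda_block_sim(bx, by, A, B, C, TILE=8):
    """Simulate one CUDA thread block of TILE x TILE threads.
    Phase 1: cooperative load into 'shared memory' (As, Bs).
    Phase 2: every thread (ty, tx) accumulates one output element.
    The sequential Python loops make the barriers implicit; on a real
    GPU, __syncthreads() is required where marked."""
    K = A.shape[1]
    acc = np.zeros((TILE, TILE), dtype=np.float32)  # per-thread accumulators
    for k0 in range(0, K, TILE):
        As = np.empty((TILE, TILE), dtype=np.float32)
        Bs = np.empty((TILE, TILE), dtype=np.float32)
        # cooperative load: thread (ty, tx) fetches one element of each tile
        for ty in range(TILE):
            for tx in range(TILE):
                As[ty, tx] = A[by*TILE + ty, k0 + tx]
                Bs[ty, tx] = B[k0 + ty, bx*TILE + tx]
        # --- __syncthreads(): tiles must be fully loaded before use ---
        for ty in range(TILE):
            for tx in range(TILE):
                acc[ty, tx] += As[ty, :] @ Bs[:, tx]
        # --- __syncthreads(): compute must finish before tiles are reused ---
    C[by*TILE:(by+1)*TILE, bx*TILE:(bx+1)*TILE] = acc

M = N = K = 16
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)
for by in range(M // 8):    # the launch grid: one block per output tile
    for bx in range(N // 8):
        cuda_block_sim(bx, by, A, B, C)
assert np.allclose(C, A @ B, atol=1e-4)
```

Everything inside the `k0` loop is what Triton hides: the block geometry, the staging arrays, the per-thread load assignments, and the synchronization points.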
## 7. Reflection Questions
- Which values in `A` and `B` are reused across multiple output elements?
- Why does tiling reduce global-memory traffic?
- How does a Triton tile map to CUDA shared-memory tiles and threads?
## 8. Implementation Checklist
- Confirm the reference matmul
- Draw a block/tile diagram before coding
- Implement the Triton tile loop over `K`
- Implement the CUDA shared-memory tile loop
- Benchmark against `torch.matmul` on small and medium sizes
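A minimal benchmarking harness for the last checklist item might look like the following. It is a sketch: it times against `np.matmul` so it runs anywhere, but the same `bench` helper (my own) applies to `torch.matmul` once your kernels exist. Always confirm correctness before timing:

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Reference triple loop -- keep sizes small, it is very slow."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

def bench(fn, *args, reps=3):
    """Best-of-reps wall-clock time; crude but enough for a first look."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 64
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
# Correctness first, timing second.
assert np.allclose(naive_matmul(A, B), A @ B, atol=1e-3)
print(f"naive: {bench(naive_matmul, A, B):.4f}s  library: {bench(np.matmul, A, B):.6f}s")
```

On a GPU, remember to synchronize (e.g. `torch.cuda.synchronize()`) before reading the clock, or the timings will measure kernel launch rather than kernel execution.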
## Tile Diagram Prompt
Sketch:
- one output tile `C[m0:m1, n0:n1]`
- the matching `A[m0:m1, k0:k1]`
- the matching `B[k0:k1, n0:n1]`
That sketch should tell you what belongs in shared memory.