# Task 03: Tiled Matmul

## 1. Problem Statement

Implement a tiled matrix multiplication and compare the tile abstraction in Triton with the explicit shared-memory strategy in CUDA.

## 2. Expected Input/Output Shapes

- Input `A`: `[M, K]`
- Input `B`: `[K, N]`
- Output `C`: `[M, N]`

## 3. Performance Intuition

Matmul becomes interesting once data reuse matters. Re-reading the same `A` and `B` values from global memory is expensive; tiling exists to reuse those values across many multiply-accumulate operations.

## 4. Memory Access Discussion

Think about which `A` tile and `B` tile each work unit needs. The performance win comes from moving those tiles into on-chip storage and reusing them before fetching the next tile.

## 5. What Triton Is Abstracting

Triton lets you think in output tiles and blocked pointer arithmetic. The tile loads and accumulations read like tensor operations.

## 6. What CUDA Makes Explicit

CUDA makes you choose block dimensions, allocate shared memory, manage cooperative loads, and synchronize between load and compute phases.

## 7. Reflection Questions

- Which values in `A` and `B` are reused across multiple output elements?
- Why does tiling reduce global-memory traffic?
- How does a Triton tile map to CUDA shared-memory tiles and threads?

## 8. Implementation Checklist

- Confirm the reference matmul
- Draw a block/tile diagram before coding
- Implement the Triton tile loop over `K`
- Implement the CUDA shared-memory tile loop
- Benchmark against `torch.matmul` on small and medium sizes

## Tile Diagram Prompt

Sketch:

- one output tile `C[m0:m1, n0:n1]`
- the matching `A[m0:m1, k0:k1]`
- the matching `B[k0:k1, n0:n1]`

That sketch should tell you what belongs in shared memory.
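
## Appendix: Triton Tile Loop Sketch

The kernel below is a minimal sketch of the Triton tile loop over `K` from the checklist, not a reference solution for this task. It assumes row-major float32 tensors on a CUDA device and a float32 accumulator; the name `matmul_kernel` and the `BLOCK_M`/`BLOCK_N`/`BLOCK_K` parameters are illustrative choices, not part of the task spec. Each program instance owns one output tile of `C` and steps its `A` and `B` tile pointers along `K`.

```python
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Blocked pointers into the first A and B tiles for this output tile.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads guard the ragged edges when M, N, K are not
        # multiples of the block sizes.
        a = tl.load(
            a_ptrs,
            mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K),
            other=0.0,
        )
        b = tl.load(
            b_ptrs,
            mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N),
            other=0.0,
        )
        acc += tl.dot(a, b)
        # Advance both tile pointers one BLOCK_K step along K.
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)
```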
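
A host-side launch plus a quick correctness and timing check against `torch.matmul` might look like the sketch below, covering the "confirm the reference matmul" and "benchmark" checklist items. It assumes the `matmul_kernel` defined above is in scope; the wrapper name, block sizes, and test shapes are again illustrative, and `triton.testing.do_bench` is used only to get rough millisecond timings.

```python
import torch
import triton


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    # One program per output tile of C.
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(512, 384, device="cuda", dtype=torch.float32)
    b = torch.randn(384, 256, device="cuda", dtype=torch.float32)
    ref = torch.matmul(a, b)  # reference matmul from the checklist
    out = matmul(a, b)
    print("max abs error:", (out - ref).abs().max().item())
    # Rough timings; do_bench reports milliseconds.
    ms_triton = triton.testing.do_bench(lambda: matmul(a, b))
    ms_torch = triton.testing.do_bench(lambda: torch.matmul(a, b))
    print(f"triton: {ms_triton:.3f} ms  torch: {ms_torch:.3f} ms")
```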