Initial project scaffold

tasks/03_tiled_matmul/spec.md (new file, 51 lines)
# Task 03: Tiled Matmul
## 1. Problem Statement
Implement a tiled matrix multiplication and compare the tile abstraction in Triton with the explicit shared-memory strategy in CUDA.
## 2. Expected Input/Output Shapes
- Input `A`: `[M, K]`
- Input `B`: `[K, N]`
- Output `C`: `[M, N]`
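As a concreteness check, a minimal NumPy reference pins down the shapes (the `float32` dtype and the example sizes are illustrative assumptions, not stated above):

```python
import numpy as np

def reference_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reference C = A @ B used to validate the tiled kernels."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    return A @ B

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16)).astype(np.float32)   # [M, K]
B = rng.standard_normal((16, 4)).astype(np.float32)   # [K, N]
C = reference_matmul(A, B)
assert C.shape == (8, 4)                              # [M, N]
```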
## 3. Performance Intuition
Matmul becomes interesting once data reuse matters: every element of `A` contributes to `N` outputs and every element of `B` to `M` outputs. Re-reading the same `A` and `B` values from global memory on each use is expensive; tiling exists to fetch them once into on-chip storage and reuse them across many multiply-accumulate operations.
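To make the reuse argument concrete, a back-of-envelope traffic count (the sizes and tile edge are illustrative): a naive kernel re-reads `K` elements of `A` and `K` of `B` per output, while a tiled kernel loads each tile only once per tile-row or tile-column of `C` that needs it.

```python
M = N = K = 1024
T = 32                       # tile edge (illustrative)

# Naive: every C[i, j] re-reads a row of A and a column of B.
naive_reads = 2 * M * N * K

# Tiled: each A tile is loaded once per tile-column of C,
# each B tile once per tile-row of C.
tiled_reads = M * K * (N // T) + K * N * (M // T)

assert naive_reads // tiled_reads == T   # traffic drops by the tile edge
```

The payoff scales with the tile edge: doubling `T` halves global-memory traffic, until shared-memory capacity caps the tile size.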
## 4. Memory Access Discussion
Think about which `A` tile and `B` tile each work unit needs. The performance win comes from moving those tiles into on-chip storage and reusing them before fetching the next tile.
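The tile-reuse idea can be sketched in NumPy, where each slice stands in for a tile staged in on-chip storage (the tile size and the assumption that all dimensions divide evenly by `T` are simplifications):

```python
import numpy as np

def tiled_matmul(A, B, T=4):
    """C = A @ B computed one T x T output tile at a time.
    Assumes M, N, K are multiples of T for brevity."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, T):
        for n0 in range(0, N, T):
            acc = np.zeros((T, T), dtype=A.dtype)
            for k0 in range(0, K, T):           # loop over K tiles
                a_tile = A[m0:m0+T, k0:k0+T]    # "on-chip" stage of A
                b_tile = B[k0:k0+T, n0:n0+T]    # "on-chip" stage of B
                acc += a_tile @ b_tile          # each tile value reused T times
            C[m0:m0+T, n0:n0+T] = acc
    return C

A = np.arange(64, dtype=np.float64).reshape(8, 8)
B = np.eye(8)
assert np.allclose(tiled_matmul(A, B), A @ B)
```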
## 5. What Triton Is Abstracting
Triton lets you think in output tiles and blocked pointer arithmetic. The tile loads and accumulations read like tensor operations.
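What "blocked pointer arithmetic" means can be emulated in NumPy: build row and column offset vectors (the analog of Triton's `tl.arange`), combine them into a 2-D block of flat addresses, and gather a whole tile in one operation (a sketch of the addressing pattern only, not real Triton code; the sizes are illustrative):

```python
import numpy as np

M, K, T = 8, 8, 4
A = np.arange(M * K, dtype=np.float32).reshape(M, K)

pid_m, pid_k = 1, 0                  # which tile this "program" owns
rows = pid_m * T + np.arange(T)      # analog of tl.arange over rows
cols = pid_k * T + np.arange(T)      # analog of tl.arange over cols

# Broadcast the offsets into a T x T block of flat addresses, then
# gather the whole tile at once (what a blocked load does).
flat = rows[:, None] * K + cols[None, :]
tile = A.ravel()[flat]

assert np.array_equal(tile, A[T:2*T, 0:T])
```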
## 6. What CUDA Makes Explicit
CUDA makes you choose block dimensions, allocate shared memory, manage cooperative loads, and synchronize between load and compute phases.
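Those explicit choices come with concrete numbers attached. A quick budget check (the tile sizes are illustrative, and the 48 KiB static shared-memory figure is one common per-block limit; the real limit varies by architecture):

```python
TILE_M = TILE_N = TILE_K = 32
BYTES_PER_FLOAT = 4

# One block stages one A tile and one B tile in shared memory.
smem_bytes = (TILE_M * TILE_K + TILE_K * TILE_N) * BYTES_PER_FLOAT

# One thread per output element of the tile.
threads_per_block = TILE_M * TILE_N

M, N = 1024, 2048
grid = ((M + TILE_M - 1) // TILE_M, (N + TILE_N - 1) // TILE_N)

assert smem_bytes == 8192            # 8 KiB, well under a 48 KiB budget
assert threads_per_block == 1024     # at the common per-block maximum
assert grid == (32, 64)
```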
## 7. Reflection Questions
- Which values in `A` and `B` are reused across multiple output elements?
- Why does tiling reduce global-memory traffic?
- How does a Triton tile map to CUDA shared-memory tiles and threads?
## 8. Implementation Checklist
- Confirm the reference matmul
- Draw a block/tile diagram before coding
- Implement the Triton tile loop over `K`
- Implement the CUDA shared-memory tile loop
- Benchmark against `torch.matmul` on small and medium sizes
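A minimal harness for the last two checklist items, with NumPy's `matmul` standing in as the reference (swap in `torch.matmul` and the real kernels once they exist; `tiled_matmul` here is a hypothetical placeholder for the function under test):

```python
import time
import numpy as np

def tiled_matmul(A, B):          # placeholder for the kernel under test
    return A @ B

def check_and_time(M, K, N, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, K)).astype(np.float32)
    B = rng.standard_normal((K, N)).astype(np.float32)
    ref = np.matmul(A, B)                       # reference result
    out = tiled_matmul(A, B)
    assert np.allclose(out, ref, atol=1e-3)     # fp32 accumulation tolerance
    t0 = time.perf_counter()
    tiled_matmul(A, B)
    return time.perf_counter() - t0

for size in [(64, 64, 64), (256, 256, 256)]:    # small and medium sizes
    elapsed = check_and_time(*size)
    assert elapsed >= 0.0
```

Checking correctness before timing matters: a kernel that skips the `K`-loop tail can look fast while being wrong.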
## Tile Diagram Prompt
Sketch:
- one output tile `C[m0:m1, n0:n1]`
- the matching `A[m0:m1, k0:k1]`
- the matching `B[k0:k1, n0:n1]`
That sketch should tell you what belongs in shared memory.
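The sketch also translates directly into a shared-memory budget: what belongs on chip is exactly the two input tiles. With illustrative extents `m1-m0 = n1-n0 = k1-k0 = 64` and `float32` elements:

```python
TM = TN = TK = 64                # tile extents from the sketch (illustrative)
elem = 4                         # bytes per float32 element

a_tile = TM * TK * elem          # A[m0:m1, k0:k1]
b_tile = TK * TN * elem          # B[k0:k1, n0:n1]
# The output tile C[m0:m1, n0:n1] lives in registers as the accumulator,
# so only the two input tiles occupy shared memory.
assert a_tile + b_tile == 32768  # 32 KiB of shared memory per block
```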