Task 03: Tiled Matmul
1. Problem Statement
Implement a tiled matrix multiplication and compare the tile abstraction in Triton with the explicit shared-memory strategy in CUDA.
2. Expected Input/Output Shapes
- Input `A: [M, K]`
- Input `B: [K, N]`
- Output `C: [M, N]`
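A correctness baseline can be a one-liner. A minimal sketch, assuming PyTorch and a CUDA device; the sizes are placeholders:

```python
import torch

# Reference matmul and shape check (sizes are placeholders).
M, K, N = 256, 128, 512
A = torch.randn(M, K, device="cuda")
B = torch.randn(K, N, device="cuda")
C = A @ B
assert C.shape == (M, N)   # both kernels below are validated against this C
```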
3. Performance Intuition
Matmul becomes interesting once data reuse matters. Re-reading the same `A` and `B` values from global memory is expensive; tiling exists to reuse those values across many multiply-accumulate operations.
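A back-of-envelope count makes the reuse argument concrete. Untiled, every output element re-reads a full row of `A` and a full column of `B`; tiled, each loaded element is shared across a whole block of outputs. The sizes and tile shape below are illustrative assumptions:

```python
# Global-memory traffic in elements loaded (illustrative sizes and tile shape).
M, N, K = 1024, 1024, 1024
BM, BN = 64, 64   # output-tile shape; the K chunk size cancels out of the total

naive = 2 * M * N * K                    # each of M*N outputs reads K from A and K from B
tiled = M * N * K * (1 / BM + 1 / BN)    # each A/B tile load is reused by BM*BN outputs
print(f"naive: {naive:.2e} loads, tiled: {tiled:.2e} loads, {naive / tiled:.0f}x less traffic")
```

With 64×64 output tiles the global traffic drops by roughly 64×, which is exactly the reuse the next two sections extract with on-chip storage.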
4. Memory Access Discussion
Think about which A tile and B tile each work unit needs. The performance win comes from moving those tiles into on-chip storage and reusing them before fetching the next tile.
5. What Triton Is Abstracting
Triton lets you think in output tiles and blocked pointer arithmetic. The tile loads and accumulations read like tensor operations.
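The whole pattern fits in one kernel whose body reads like blocked tensor algebra. The following is a minimal sketch along the lines of the standard Triton matmul pattern; the kernel name, argument layout, and block sizes are illustrative assumptions, not a required interface:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N output tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)   # rows of this C tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)   # cols of this C tile
    rk = tl.arange(0, BLOCK_K)

    # Blocked pointer arithmetic: pointers to the first A and B tiles.
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads handle ragged edges; Triton stages the tiles on chip.
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)               # one tile multiply-accumulate
        a_ptrs += BLOCK_K * stride_ak     # slide both tiles along K
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))
```

A 2D launch grid of `(triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))` assigns one program per output tile. Note that no shared memory or synchronization appears anywhere in the kernel body.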
6. What CUDA Makes Explicit
CUDA makes you choose block dimensions, allocate shared memory, manage cooperative loads, and synchronize between load and compute phases.
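A minimal sketch of the same computation in CUDA, assuming square `TILE x TILE` thread blocks and row-major float inputs; every step Triton hid above is now spelled out:

```cuda
#define TILE 16

// Each thread block computes one TILE x TILE tile of C, staging the matching
// A and B tiles through shared memory.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // this thread's C row
    int col = blockIdx.x * TILE + threadIdx.x;   // this thread's C col
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative load: each thread fetches one element of each tile,
        // padding with zeros past the matrix edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();   // tiles must be fully loaded before compute

        for (int kk = 0; kk < TILE; ++kk)
            acc += As[threadIdx.y][kk] * Bs[kk][threadIdx.x];
        __syncthreads();   // finish compute before the next load overwrites the tiles
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE, TILE)` and `dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)`, this is the explicit counterpart of the Triton sketch: block shape, shared-memory tiles, cooperative loads, and both barriers are the programmer's responsibility.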
7. Reflection Questions
- Which values in `A` and `B` are reused across multiple output elements?
- Why does tiling reduce global-memory traffic?
- How does a Triton tile map to CUDA shared-memory tiles and threads?
8. Implementation Checklist
- Confirm the reference matmul
- Draw a block/tile diagram before coding
- Implement the Triton tile loop over `K`
- Implement the CUDA shared-memory tile loop
- Benchmark against `torch.matmul` on small and medium sizes (a timing harness sketch follows this list)
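For the benchmarking item, CUDA-event timing measures device time without host overhead. A minimal harness sketch, assuming a CUDA device; `my_matmul` is a hypothetical stand-in for whichever kernel wrapper is being tested:

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    # Time with CUDA events so only device time is measured.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

for M, K, N in [(128, 128, 128), (512, 512, 512), (1024, 1024, 1024)]:
    A = torch.randn(M, K, device="cuda")
    B = torch.randn(K, N, device="cuda")
    print(f"{M}x{K}x{N}: torch.matmul {bench(torch.matmul, A, B):.3f} ms")
    # print(f"           my_matmul   {bench(my_matmul, A, B):.3f} ms")  # hypothetical wrapper
```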
Tile Diagram Prompt
Sketch:
- one output tile `C[m0:m1, n0:n1]`
- the matching `A[m0:m1, k0:k1]`
- the matching `B[k0:k1, n0:n1]`
That sketch should tell you what belongs in shared memory.
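The same relationship can be checked numerically: the `C` tile equals the sum over `K` chunks of the matching `A` and `B` tile products. A small sketch with placeholder sizes and tile bounds:

```python
import torch

M, K, N, T = 128, 96, 160, 32   # placeholder sizes; T is the tile width
A, B = torch.randn(M, K), torch.randn(K, N)
C = A @ B

m0, m1, n0, n1 = 0, T, 0, T     # one output tile, as in the sketch
tile = torch.zeros(T, T)
for k0 in range(0, K, T):
    k1 = k0 + T
    tile += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]   # accumulate one K step

assert torch.allclose(tile, C[m0:m1, n0:n1], atol=1e-4)
```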