Task 03: Tiled Matmul
1. Problem Statement
Implement a tiled matrix multiplication and compare the tile abstraction in Triton with the explicit shared-memory strategy in CUDA.
2. Expected Input/Output Shapes
- Input `A: [M, K]`
- Input `B: [K, N]`
- Output `C: [M, N]`
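A correctness baseline can be a one-liner. A minimal sketch, assuming PyTorch and a CUDA device; the sizes are placeholders:

```python
import torch

# Reference matmul and shape check (sizes are placeholders).
M, K, N = 256, 128, 512
A = torch.randn(M, K, device="cuda")
B = torch.randn(K, N, device="cuda")
C = A @ B
assert C.shape == (M, N)   # both kernels below are validated against this C
```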
3. Performance Intuition
Matmul becomes interesting once data reuse matters. Re-reading the same `A` and `B` values from global memory is expensive; tiling exists to reuse those values across many multiply-accumulate operations.
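A back-of-envelope count makes the reuse argument concrete. Untiled, every output element re-reads a full row of `A` and a full column of `B`; tiled, each loaded element is shared across a whole block of outputs. The sizes and tile shape below are illustrative assumptions:

```python
# Global-memory traffic in elements loaded (illustrative sizes and tile shape).
M, N, K = 1024, 1024, 1024
BM, BN = 64, 64   # output-tile shape; the K chunk size cancels out of the total

naive = 2 * M * N * K                    # each of M*N outputs reads K from A and K from B
tiled = M * N * K * (1 / BM + 1 / BN)    # each A/B tile load is reused by BM*BN outputs
print(f"naive: {naive:.2e} loads, tiled: {tiled:.2e} loads, {naive / tiled:.0f}x less traffic")
```

With 64×64 output tiles the global traffic drops by roughly 64×, which is exactly the reuse the next two sections extract with on-chip storage.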
4. Memory Access Discussion
Think about which A tile and B tile each work unit needs. The performance win comes from moving those tiles into on-chip storage and reusing them before fetching the next tile.
5. What Triton Is Abstracting
Triton lets you think in output tiles and blocked pointer arithmetic. The tile loads and accumulations read like tensor operations.
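The whole pattern fits in one kernel whose body reads like blocked tensor algebra. The following is a minimal sketch along the lines of the standard Triton matmul pattern; the kernel name, argument layout, and block sizes are illustrative assumptions, not a required interface:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N output tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)   # rows of this C tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)   # cols of this C tile
    rk = tl.arange(0, BLOCK_K)

    # Blocked pointer arithmetic: pointers to the first A and B tiles.
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads handle ragged edges; Triton stages the tiles on chip.
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)               # one tile multiply-accumulate
        a_ptrs += BLOCK_K * stride_ak     # slide both tiles along K
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))
```

A 2D launch grid of `(triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))` assigns one program per output tile. Note that no shared memory or synchronization appears anywhere in the kernel body.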
6. What CUDA Makes Explicit
CUDA makes you choose block dimensions, allocate shared memory, manage cooperative loads, and synchronize between load and compute phases.
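A minimal sketch of the same computation in CUDA, assuming square `TILE x TILE` thread blocks and row-major float inputs; every step Triton hid above is now spelled out:

```cuda
#define TILE 16

// Each thread block computes one TILE x TILE tile of C, staging the matching
// A and B tiles through shared memory.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // this thread's C row
    int col = blockIdx.x * TILE + threadIdx.x;   // this thread's C col
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative load: each thread fetches one element of each tile,
        // padding with zeros past the matrix edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();   // tiles must be fully loaded before compute

        for (int kk = 0; kk < TILE; ++kk)
            acc += As[threadIdx.y][kk] * Bs[kk][threadIdx.x];
        __syncthreads();   // finish compute before the next load overwrites the tiles
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE, TILE)` and `dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)`, this is the explicit counterpart of the Triton sketch: block shape, shared-memory tiles, cooperative loads, and both barriers are the programmer's responsibility.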
7. Reflection Questions
- Which values in `A` and `B` are reused across multiple output elements?
- Why does tiling reduce global-memory traffic?
- How does a Triton tile map to CUDA shared-memory tiles and threads?
8. Implementation Checklist
- Confirm the reference matmul
- Draw a block/tile diagram before coding
- Implement the Triton tile loop over `K`
- Implement the CUDA shared-memory tile loop
- Benchmark against `torch.matmul` on small and medium sizes (a timing harness sketch follows this list)
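For the benchmarking item, CUDA-event timing measures device time without host overhead. A minimal harness sketch, assuming a CUDA device; `my_matmul` is a hypothetical stand-in for whichever kernel wrapper is being tested:

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    # Time with CUDA events so only device time is measured.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

for M, K, N in [(128, 128, 128), (512, 512, 512), (1024, 1024, 1024)]:
    A = torch.randn(M, K, device="cuda")
    B = torch.randn(K, N, device="cuda")
    print(f"{M}x{K}x{N}: torch.matmul {bench(torch.matmul, A, B):.3f} ms")
    # print(f"           my_matmul   {bench(my_matmul, A, B):.3f} ms")  # hypothetical wrapper
```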
Tile Diagram Prompt
Sketch:
- one output tile `C[m0:m1, n0:n1]`
- the matching `A[m0:m1, k0:k1]`
- the matching `B[k0:k1, n0:n1]`
That sketch should tell you what belongs in shared memory.
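The same relationship can be checked numerically: the `C` tile equals the sum over `K` chunks of the matching `A` and `B` tile products. A small sketch with placeholder sizes and tile bounds:

```python
import torch

M, K, N, T = 128, 96, 160, 32   # placeholder sizes; T is the tile width
A, B = torch.randn(M, K), torch.randn(K, N)
C = A @ B

m0, m1, n0, n1 = 0, T, 0, T     # one output tile, as in the sketch
tile = torch.zeros(T, T)
for k0 in range(0, K, T):
    k1 = k0 + T
    tile += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]   # accumulate one K step

assert torch.allclose(tile, C[m0:m1, n0:n1], atol=1e-4)
```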