Task 01: Vector Add
1. Problem Statement
Implement out[i] = x[i] + y[i] in both Triton and CUDA, then compare both against the PyTorch reference.
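Before writing either kernel, it helps to pin down the reference behavior. A minimal stand-in for the PyTorch reference, sketched here with NumPy so it runs anywhere (the function name `vector_add_reference` is illustrative, not part of the task scaffold):

```python
import numpy as np

def vector_add_reference(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Element-wise add; stands in for the PyTorch reference out = x + y."""
    assert x.shape == y.shape, "inputs must have identical shapes"
    return x + y
```

Both the Triton and CUDA results should match this to within floating-point tolerance.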
2. Expected Input/Output Shapes
- Input: two tensors with identical 1D or flattened shapes
- Output: one tensor with the same shape
3. Performance Intuition
Vector add is simple enough that launch overhead and memory bandwidth dominate quickly. It is a good place to learn indexing before the math becomes interesting.
4. Memory Access Discussion
This kernel should read x[i] and y[i] once and write out[i] once. The main thing to inspect is whether neighboring threads or lanes access neighboring elements.
5. What Triton Is Abstracting
Triton lets you express one block of contiguous offsets with program_id and tl.arange, then apply a mask on the tail.
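The block-of-offsets pattern can be emulated in plain NumPy, which makes the indexing easy to check before touching a GPU. In this sketch one loop iteration corresponds to one kernel instance; the names `BLOCK_SIZE` and `triton_style_add` are illustrative, and the real kernel would use tl.load/tl.store with the same mask:

```python
import numpy as np

def triton_style_add(x, y, BLOCK_SIZE=128):
    """NumPy emulation of the Triton block pattern."""
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE      # ceil-div, like the launch grid
    for pid in range(num_programs):                        # pid ~ tl.program_id(0)
        offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE) # ~ tl.arange(0, BLOCK_SIZE)
        mask = offsets < n                                 # guards the ragged tail block
        valid = offsets[mask]
        out[valid] = x[valid] + y[valid]                   # ~ masked load, add, masked store
    return out
```

Only the final program ever has a partially false mask; every earlier block is fully in bounds.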
6. What CUDA Makes Explicit
CUDA makes you compute global_idx from block and thread indices yourself and write the boundary check explicitly.
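The same ownership rule, emulated in Python to make the index arithmetic explicit. Each (block, thread) pair computes its global index and bounds-checks it, mirroring the canonical CUDA body `int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = x[i] + y[i];` (the function name `cuda_style_add` is illustrative):

```python
import numpy as np

def cuda_style_add(x, y, block_dim=128):
    """Python emulation of CUDA thread ownership with an explicit bounds check."""
    n = x.shape[0]
    out = np.empty_like(x)
    grid_dim = (n + block_dim - 1) // block_dim     # number of blocks launched
    for block_idx in range(grid_dim):               # ~ blockIdx.x
        for thread_idx in range(block_dim):         # ~ threadIdx.x
            i = block_idx * block_dim + thread_idx  # global index
            if i < n:                               # explicit boundary check
                out[i] = x[i] + y[i]
    return out
```

Note the structural match with the Triton version: the bounds check here plays exactly the role of the mask there.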
7. Reflection Questions
- What is the exact correspondence between program_id and blockIdx.x here?
- Why is a mask or bounds check required on the final block?
- How would the ownership change if one thread handled multiple elements?
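For the last question, one common answer is a grid-stride loop: launch fewer threads than elements and have each thread stride by the total thread count. A hedged Python emulation of that ownership pattern (small illustrative launch sizes, not tuned values):

```python
import numpy as np

def grid_stride_add(x, y, block_dim=4, grid_dim=2):
    """Emulates a CUDA grid-stride loop: the launch is smaller than the data,
    so each thread covers i, i + stride, i + 2*stride, ..."""
    n = x.shape[0]
    out = np.empty_like(x)
    stride = grid_dim * block_dim              # ~ gridDim.x * blockDim.x
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            while i < n:                       # this thread owns every stride-th element
                out[i] = x[i] + y[i]
                i += stride
    return out
```

Ownership changes from "one thread, one element" to "one thread, one arithmetic progression of elements", and the bounds check moves into the loop condition.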
8. Implementation Checklist
- Confirm the reference implementation
- Fill in the Triton masked loads, add, and store
- Fill in the CUDA thread ownership and store
- Test small and non-multiple-of-block-size shapes
- Benchmark bandwidth on larger vectors
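For the benchmarking step, effective bandwidth for vector add counts three transfers per element (read x, read y, write out). A rough timing sketch, using the NumPy add as a stand-in for whichever kernel is under test (the function name and default size are illustrative; a real GPU measurement would need device synchronization around the timed region):

```python
import time
import numpy as np

def measure_bandwidth_gbs(n=1 << 24, dtype=np.float32):
    """Effective bandwidth estimate: 3 * n * itemsize bytes moved / elapsed seconds."""
    x = np.random.rand(n).astype(dtype)
    y = np.random.rand(n).astype(dtype)
    start = time.perf_counter()
    out = x + y                        # stand-in for the kernel under test
    elapsed = time.perf_counter() - start
    bytes_moved = 3 * n * x.itemsize   # two loads + one store per element
    return bytes_moved / elapsed / 1e9
```

Comparing this number against the device's peak memory bandwidth tells you how close the kernel is to being purely bandwidth-bound.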