Initial project scaffold
tasks/01_vector_add/spec.md

# Task 01: Vector Add
## 1. Problem Statement
Implement `out[i] = x[i] + y[i]` in both Triton and CUDA, then compare both against the PyTorch reference.
## 2. Expected Input/Output Shapes
- Input: two tensors with identical 1D or flattened shapes
- Output: one tensor with the same shape
## 3. Performance Intuition
Vector add is simple enough that launch overhead and memory bandwidth dominate quickly. It is a good place to learn indexing before the math becomes interesting.
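To make "memory bandwidth dominates" concrete, a back-of-envelope lower bound on runtime, assuming float32 elements and a hypothetical device with 1 TB/s of memory bandwidth (both numbers are illustrative, not from the task):

```python
n = 1 << 28                      # 2^28 elements (~268M), an illustrative size
bytes_moved = 3 * 4 * n          # read x, read y, write out; 4 bytes per float32
bandwidth = 1e12                 # assumed 1 TB/s peak bandwidth (hypothetical device)
t_min = bytes_moved / bandwidth  # seconds; no compute tuning can beat this floor
print(f"{bytes_moved / 1e9:.1f} GB moved, >= {t_min * 1e3:.2f} ms at 1 TB/s")
```

Because the kernel does one add per 12 bytes of traffic, its arithmetic intensity is tiny; measured time should track this floor, not FLOP throughput.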
## 4. Memory Access Discussion
This kernel should read `x[i]` and `y[i]` once and write `out[i]` once. The main thing to inspect is whether neighboring threads or lanes access neighboring elements.
## 5. What Triton Is Abstracting
Triton lets you express one block of contiguous offsets with `program_id` and `tl.arange`, then apply a mask on the tail.
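That indexing model can be emulated in NumPy. This sketch is not real Triton (which would use `tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store` on device pointers); it just shows what each program instance computes:

```python
import numpy as np

def triton_style_add(x: np.ndarray, y: np.ndarray, BLOCK: int = 4) -> np.ndarray:
    out = np.empty_like(x)
    n = x.size
    num_programs = (n + BLOCK - 1) // BLOCK        # ceil-div grid, like triton.cdiv(n, BLOCK)
    for pid in range(num_programs):                # each iteration = one program instance
        offsets = pid * BLOCK + np.arange(BLOCK)   # tl.arange(0, BLOCK) shifted by program_id
        mask = offsets < n                         # guards the ragged final block
        idx = offsets[mask]
        out[idx] = x[idx] + y[idx]                 # masked load, add, masked store
    return out
```

Note that the loop over `pid` is sequential here purely for emulation; on the GPU every program instance runs concurrently.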
## 6. What CUDA Makes Explicit
CUDA makes you compute `global_idx` from block and thread indices yourself and write the boundary check explicitly.
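The same ownership rule, spelled out the way a CUDA kernel would write it. The nested Python loops stand in for the parallel launch (in real CUDA, `block_idx` and `thread_idx` come from `blockIdx.x` and `threadIdx.x`, not loops, and the grid/block sizes here are illustrative):

```python
def cuda_style_add(x: list, y: list, block_dim: int = 4) -> list:
    out = [0.0] * len(x)
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim              # ceil-div, as in <<<grid, block>>>
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            global_idx = block_idx * block_dim + thread_idx  # blockIdx.x * blockDim.x + threadIdx.x
            if global_idx < n:                               # explicit bounds check on the tail
                out[global_idx] = x[global_idx] + y[global_idx]
    return out
```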
## 7. Reflection Questions
- What is the exact correspondence between `program_id` and `blockIdx.x` here?
- Why is a mask or bounds check required on the final block?
- How would the ownership change if one thread handled multiple elements?
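For the last question, the common answer is a grid-stride loop: each thread owns every `stride`-th element rather than exactly one. A Python emulation of that ownership pattern (in CUDA the stride would be `blockDim.x * gridDim.x`; `total_threads` here is a stand-in):

```python
def grid_stride_add(x: list, y: list, total_threads: int = 4) -> list:
    out = [0.0] * len(x)
    n = len(x)
    for tid in range(total_threads):        # tid plays the role of the global thread index
        i = tid
        while i < n:                        # thread tid owns elements tid, tid + stride, ...
            out[i] = x[i] + y[i]
            i += total_threads              # stride = blockDim.x * gridDim.x in CUDA
    return out
```

The per-thread bounds check becomes the loop condition, so no separate mask is needed, and the launch size can stay fixed while `n` grows.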
## 8. Implementation Checklist
- Confirm the reference implementation
- Fill in the Triton masked loads, add, and store
- Fill in the CUDA thread ownership and store
- Test small and non-multiple-of-block-size shapes
- Benchmark bandwidth on larger vectors
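For the last checklist item, the real benchmark should time the GPU kernels themselves (for example with `triton.testing.do_bench` or CUDA events). This CPU-side NumPy sketch only shows the effective-bandwidth arithmetic you would apply to those timings:

```python
import time
import numpy as np

n = 1 << 22
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

start = time.perf_counter()
out = x + y                                # kernel under test (here: NumPy on CPU)
elapsed = time.perf_counter() - start

bytes_moved = 3 * x.nbytes                 # read x, read y, write out
gbps = bytes_moved / elapsed / 1e9
print(f"{gbps:.1f} GB/s effective bandwidth")
```

Compare the number you measure against the device's peak memory bandwidth; for vector add, a well-written kernel on large inputs should land at a substantial fraction of peak.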