Initial project scaffold
docs/roadmap.md
# Roadmap

## Week 1 Study Plan

Day 1:

- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py` (a sketch of the shape it can take follows this list)
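
If you stall on the skeleton, a minimal Triton vector add usually has this shape. The names and `BLOCK_SIZE=1024` below are illustrative choices, not the skeleton's actual contents:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n                      # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element chunk
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```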
Day 2:

- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in the vector add indexing TODOs in Triton and CUDA (the shared index math is sketched after this list)
- Run `pytest -q tasks/01_vector_add/test_task.py`
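
Both indexing TODOs reduce to the same map from a 1-D grid to element offsets. A hypothetical plain-Python helper, with the CUDA and Triton spellings noted in comments:

```python
# CUDA spells this  i = blockIdx.x * blockDim.x + threadIdx.x  with an  if (i < n)  guard;
# Triton spells it  offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  with  mask = offs < n.
def global_indices(pid: int, block_size: int, n: int) -> list[int]:
    """Elements owned by program/block `pid`, with out-of-range ones masked off."""
    start = pid * block_size
    return [i for i in range(start, start + block_size) if i < n]
```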
Day 3:

- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first (one stable pattern is sketched after this list)
- Compare against the CUDA skeleton and map the reduction strategy
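
A sketch of the stable pattern the Triton version can follow, assuming each row fits in one block (`BLOCK_SIZE` a power of two with `BLOCK_SIZE >= n_cols`). Illustrative only, not the task's solution:

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)          # launch with grid = (n_rows,)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    # masked lanes load -inf, so they contribute exp(-inf) = 0 to the sum
    x = tl.load(x_ptr + row * row_stride + offs, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + offs, num / tl.sum(num, axis=0), mask=mask)
```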
Day 4:

- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path with correctness as the only priority (the loop structure is sketched after this list)
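
The tile decomposition is easiest to see in plain NumPy before touching a kernel. A correctness-only sketch, assuming every dimension is a multiple of the tile size `T`:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, T: int = 4) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, T):            # each (i0, j0) output tile maps to one program/CTA
        for j0 in range(0, N, T):
            acc = np.zeros((T, T), dtype=A.dtype)
            for k0 in range(0, K, T):    # march along K, accumulating partial tile products
                acc += A[i0:i0+T, k0:k0+T] @ B[k0:k0+T, j0:j0+T]
            C[i0:i0+T, j0:j0+T] = acc
    return C
```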
Day 5:

- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally (a reference loop to check your derivation against follows this list)
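
The recurrence you derive can be sanity-checked in a few lines of Python. A sketch, assuming finite inputs:

```python
import math

def online_softmax_stats(xs):
    """One streaming pass: returns the running max m and s = sum(exp(x - m))."""
    m, s = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # rescale the old partial sum into the new max's frame, then add this term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s
```

Given `m` and `s`, each softmax output is `exp(x_i - m) / s`, identical to the two-pass stable version.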
Day 6:

- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen (an annotated reference follows this list)
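
A minimal PyTorch reference with the four stages marked; the repo's version may differ in details such as masking or dtype handling:

```python
import torch

def attention_fwd(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # 1. Q/K/V loads: plain tensors here; tiled block loads in the fused kernel
    scale = Q.shape[-1] ** -0.5
    # 2. score computation: S = Q K^T / sqrt(d)
    S = (Q @ K.transpose(-2, -1)) * scale
    # 3. normalization: row softmax over the key axis (done online in the fused kernel)
    P = torch.softmax(S, dim=-1)
    # 4. output accumulation: O = P V
    return P @ V
```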
Day 7:

- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization (a timing harness that makes both effects visible is sketched after this list)
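
A common CUDA-event timing pattern; a sketch, not the repo's benchmark code. Dropping the warmup loop or the final synchronize changes the numbers, which is exactly the effect Day 7 asks you to record:

```python
import torch

def time_kernel(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    for _ in range(warmup):
        fn(*args)                        # first calls pay JIT/compile and cache-warming costs
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()             # without this you time the launches, not the kernels
    return start.elapsed_time(end) / iters   # milliseconds per call
```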
## Recommended TODO Order

1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding (a minimal binding sketch follows this list)
13. Profiling passes and benchmark validation
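
For step 12, one route is the `torch.library.custom_op` decorator (PyTorch 2.4+); the op name and eager body below are hypothetical placeholders, and the repo's binding may use `torch.utils.cpp_extension` instead:

```python
import torch

# Hypothetical op name; the eager body stands in for a real kernel launch.
@torch.library.custom_op("roadmap::vector_add", mutates_args=())
def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

@vector_add.register_fake
def _(x, y):
    return torch.empty_like(x)           # shape/dtype propagation for torch.compile
```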
## What To Focus On First

- Correctness on tiny shapes (a check of this kind is sketched after this list)
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
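
A tiny-shape check might look like this; the helper names are hypothetical, and `shape=(3, 5)` is deliberately small enough to debug by printing both tensors:

```python
import torch

def check_tiny(kernel_fn, ref_fn, shape=(3, 5)):
    x = torch.randn(shape, device="cuda", dtype=torch.float32)
    torch.testing.assert_close(kernel_fn(x), ref_fn(x), rtol=1e-4, atol=1e-5)
```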
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.
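
For vector add, that explanation fits on the back of an envelope; a sketch assuming float32 inputs and a 1-D launch:

```python
def vector_add_footprint(n: int, dtype_bytes: int = 4, block: int = 1024):
    """Numbers you should be able to state for any kernel you benchmark."""
    bytes_moved = 3 * n * dtype_bytes        # read x, read y, write out
    grid_blocks = (n + block - 1) // block   # launch geometry: 1-D grid
    return bytes_moved, grid_blocks

# e.g. n = 1 << 20  ->  12 MiB of traffic across 1024 blocks
```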