Roadmap
Week 1 Study Plan
Day 1:
- Run tools/check_env.py
- Read docs/gpu_basics.md
- Read docs/cuda_execution_model.md
- Inspect reference/torch_vector_add.py
- Implement or partially implement tasks/01_vector_add/triton_skeleton.py
Day 2:
- Read docs/triton_vs_cuda.md
- Inspect kernels/cuda/src/vector_add.cu
- Fill in the vector add indexing TODOs in Triton and CUDA (see the sketch after this list)
- Run pytest -q tasks/01_vector_add/test_task.py
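If the indexing TODOs feel opaque, here is a minimal sketch of the standard Triton pattern: one program per BLOCK-wide slice of the vectors, with a mask guarding the ragged final block. The kernel and parameter names are illustrative, not the ones in the skeleton.

```python
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                    # one program per BLOCK-wide slice
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # global element indices for this program
    mask = offs < n                           # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```

Launched with a grid of triton.cdiv(n, BLOCK) programs, this is the same index math the CUDA version expresses as blockIdx.x * blockDim.x + threadIdx.x.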
Day 3:
- Read reference/torch_row_softmax.py
- Read tasks/02_row_softmax/spec.md
- Implement a numerically stable row softmax in Triton first (see the sketch after this list)
- Compare against the CUDA skeleton and map the reduction strategy
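A minimal sketch of the stable-softmax pattern the task is after, assuming one program per row and a BLOCK (a power of two at least n_cols) wide enough to hold the whole row; names are illustrative, not the skeleton's.

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK: tl.constexpr):
    row = tl.program_id(0)                 # one program per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=float('-inf'))
    x = x - tl.max(x, axis=0)              # subtract the row max so exp never overflows
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * row_stride + cols, out, mask=mask)
```

The max-subtraction is the whole point of "numerically stable": it changes nothing mathematically but keeps every exponent at or below zero.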
Day 4:
- Study tasks/03_tiled_matmul/spec.md
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over speed (see the sketch after this list)
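Before writing the kernel, it can help to see the tile decomposition as plain loops. A correctness-only NumPy sketch, with an illustrative tile size:

```python
import numpy as np

def tiled_matmul(A, B, TILE=16):
    # Each (i0, j0) pair owns one output tile of C; the k0 loop
    # accumulates partial products along the shared dimension.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            acc = np.zeros_like(C[i0:i0 + TILE, j0:j0 + TILE])
            for k0 in range(0, K, TILE):
                acc += A[i0:i0 + TILE, k0:k0 + TILE] @ B[k0:k0 + TILE, j0:j0 + TILE]
            C[i0:i0 + TILE, j0:j0 + TILE] = acc
    return C
```

The two outer loops become the kernel's launch grid; the accumulator becomes the per-program registers.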
Day 5:
- Read docs/flashattention_notes.md
- Read tasks/04_online_softmax/spec.md
- Derive the running max / running sum recurrence informally (a worked sketch follows this list)
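The recurrence can be checked in a few lines of plain Python before touching any kernel. A sketch with illustrative names: keep a running max m and a running sum s of exp(x - m), rescaling s whenever the max grows.

```python
import math

def online_softmax_stats(xs):
    # One pass over a stream: m is the running max, s the running
    # sum of exp(x - m); s is rescaled by exp(m_old - m_new)
    # whenever a new element raises the max.
    m, s = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s
```

After the stream is exhausted, softmax(x_i) = exp(x_i - m) / s, matching the two-pass result exactly.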
Day 6:
- Inspect tasks/05_flash_attention_fwd/spec.md
- Trace the PyTorch reference line by line
- Annotate where the Q/K/V loads, score computation, normalization, and output accumulation happen (see the annotated reference after this list)
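For orientation, a single-head, unmasked PyTorch reference of the shape you are tracing might look like this; the repo's actual reference will differ in detail.

```python
import torch

def attention_forward(Q, K, V):
    # Q, K, V: (seq_len, head_dim); single head, no masking.
    scale = Q.shape[-1] ** -0.5
    S = (Q @ K.T) * scale          # score computation: all-pairs scaled dot products
    P = torch.softmax(S, dim=-1)   # row-wise normalization
    return P @ V                   # output accumulation: weighted sum of V rows
```

Flash attention fuses these three stages so the full S and P matrices never hit global memory, which is why the online softmax recurrence from Day 5 matters.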
Day 7:
- Read docs/profiling_guide.md
- Run one benchmark and one profiler command (a timing sketch follows this list)
- Write down which numbers changed after warmup and synchronization
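A minimal CUDA-event timing sketch, assuming a PyTorch-visible GPU; the profiling guide's actual commands take precedence. Without the warmup loop and the synchronize calls, the first numbers you read are dominated by compilation and by work the GPU has not finished yet.

```python
import torch

def time_ms(fn, iters=100, warmup=10):
    for _ in range(warmup):                 # absorb JIT compilation and cache warmup
        fn()
    torch.cuda.synchronize()                # drain pending work before timing starts
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                # wait for the timed kernels to finish
    return start.elapsed_time(end) / iters  # average milliseconds per call
```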
Recommended TODO Order
- Environment checks
- Vector add Triton
- Vector add CUDA
- Row softmax Triton
- Row softmax CUDA
- Tiled matmul Triton
- Tiled matmul CUDA
- Online softmax Triton
- Online softmax CUDA
- Flash attention forward Triton
- Flash attention forward CUDA
- PyTorch custom op binding
- Profiling passes and benchmark validation
What To Focus On First
- Correctness on tiny shapes (see the check sketched after this list)
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
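One way to make the first bullet concrete: compare your kernel against a trusted reference on a tiny, odd-sized input. The helper name and shape here are illustrative.

```python
import torch

def check_tiny(kernel_fn, reference_fn, shape=(3, 5)):
    # Tiny, odd shapes expose indexing and masking bugs
    # long before performance is worth measuring.
    x = torch.randn(shape, device='cuda')
    torch.testing.assert_close(kernel_fn(x), reference_fn(x), rtol=1e-4, atol=1e-5)
```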
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.