Initial project scaffold

wjh
2026-04-10 13:15:06 +00:00
commit a4a6b1f1c8
94 changed files with 3964 additions and 0 deletions

docs/roadmap.md Normal file

@@ -0,0 +1,75 @@
# Roadmap
## Week 1 Study Plan
Day 1:
- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py` (a sketch of the finished shape follows below)
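A minimal sketch of what the finished skeleton tends to look like, assuming flat contiguous float32 tensors; the kernel and wrapper names here are illustrative placeholders, not the skeleton's actual TODO markers.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```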
Day 2:
- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py` (an example of this style of check follows below)
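The test file's actual contents aren't reproduced here, but a tiny-shape comparison against PyTorch in roughly this style is what to expect; the import path below is hypothetical.

```python
import pytest
import torch

from tasks.vector_add import vector_add  # hypothetical import path

@pytest.mark.parametrize("n", [1, 17, 1024, 4099])  # tiny and ragged sizes first
def test_vector_add_matches_torch(n):
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.randn(n, device="cuda", dtype=torch.float32)
    torch.testing.assert_close(vector_add(x, y), x + y)
```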
Day 3:
- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first (see the sketch after this list)
- Compare against the CUDA skeleton and map the reduction strategy
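A sketch of the stable version under the usual first-pass simplification that each row fits in a single block (`BLOCK_SIZE >= n_cols`); it also assumes input and output share one row stride. Names are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    # One program per row; assumes the whole row fits in BLOCK_SIZE lanes.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max so exp cannot overflow
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + cols, num / tl.sum(num, axis=0), mask=mask)
```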
Day 4:
- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over performance (see the sketch below)
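One correctness-first shape the tile path can take, assuming strides are passed in explicitly; block sizes and names are placeholders, not the task's required signature.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program owns one BLOCK_M x BLOCK_N tile of C and walks K in BLOCK_K steps.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # accumulate this K-slab into the output tile
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))
```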
Day 5:
- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally (a plain-Python version follows below)
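The recurrence in plain Python, handy for checking the informal derivation against a two-pass computation; this is a sketch, not the task's required interface.

```python
import math

def online_softmax_denominator(xs):
    # Maintain running max m and running sum s of exp(x - m) in a single pass.
    m, s = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum into the new reference frame, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s  # softmax(x_i) = exp(x_i - m) / s

# Agrees with the two-pass result on any sequence:
xs = [3.0, -1.0, 10.0, 2.5]
m, s = online_softmax_denominator(xs)
two_pass = sum(math.exp(x - max(xs)) for x in xs)
assert abs(s - two_pass) < 1e-12
```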
Day 6:
- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen (see the annotated sketch below)
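A compact naive forward with those four stages labeled, as a map for the trace; the repo's actual reference may be organized differently.

```python
import torch

def naive_attention_fwd(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    scale = q.shape[-1] ** -0.5
    # (1) Q/K loads feed the score computation: S = Q K^T / sqrt(d)
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # (2) normalization: numerically stable row softmax over the key axis
    scores = scores - scores.amax(dim=-1, keepdim=True)
    probs = scores.softmax(dim=-1)
    # (3) V loads and (4) output accumulation: O = P V
    return torch.matmul(probs, v)
```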
Day 7:
- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization (the timing sketch below shows where both enter)
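A minimal timing loop, assuming a CUDA device; `benchmark` is an illustrative helper, not a repo utility.

```python
import time
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Warmup absorbs one-time costs: compilation, caches, allocator growth.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()  # drain queued work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()  # kernels are async; wait before reading the clock
    return (time.perf_counter() - t0) / iters
```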
## Recommended TODO Order
1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding
13. Profiling passes and benchmark validation
## What To Focus On First
- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.
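As a worked instance of that accounting, take the float32 vector add from Day 1; the ~1 TB/s bandwidth figure is an assumed round number, not a measurement.

```python
# Memory traffic and launch geometry for float32 vector add, n = 1 << 20.
n = 1 << 20
bytes_moved = 3 * n * 4  # read x, read y, write out: 12 MiB per call
num_programs = (n + 1024 - 1) // 1024  # ceil(n / BLOCK_SIZE) = 1024 programs
# At an assumed ~1 TB/s of DRAM bandwidth the floor is ~12.6 microseconds,
# so vector add is bandwidth-bound: report GB/s, not FLOP/s.
print(bytes_moved / 2**20, "MiB moved across", num_programs, "programs")
```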