Initial project scaffold

wjh
2026-04-10 13:15:06 +00:00
commit a4a6b1f1c8
94 changed files with 3964 additions and 0 deletions

docs/roadmap.md Normal file

@@ -0,0 +1,75 @@
# Roadmap
## Week 1 Study Plan
Day 1:
- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py` (a sketch of the finished shape follows below)
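A minimal sketch of what the finished skeleton tends to look like, assuming flat contiguous float32 tensors; the kernel and wrapper names here are illustrative placeholders, not the skeleton's actual TODO markers.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```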
Day 2:
- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py` (an example of this style of check follows below)
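The test file's actual contents aren't reproduced here, but a tiny-shape comparison against PyTorch in roughly this style is what to expect; the import path below is hypothetical.

```python
import pytest
import torch

from tasks.vector_add import vector_add  # hypothetical import path

@pytest.mark.parametrize("n", [1, 17, 1024, 4099])  # tiny and ragged sizes first
def test_vector_add_matches_torch(n):
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.randn(n, device="cuda", dtype=torch.float32)
    torch.testing.assert_close(vector_add(x, y), x + y)
```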
Day 3:
- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first (see the sketch after this list)
- Compare against the CUDA skeleton and map the reduction strategy
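A sketch of the stable version under the usual first-pass simplification that each row fits in a single block (`BLOCK_SIZE >= n_cols`); it also assumes input and output share one row stride. Names are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    # One program per row; assumes the whole row fits in BLOCK_SIZE lanes.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max so exp cannot overflow
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + cols, num / tl.sum(num, axis=0), mask=mask)
```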
Day 4:
- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over performance (see the sketch below)
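One correctness-first shape the tile path can take, assuming strides are passed in explicitly; block sizes and names are placeholders, not the task's required signature.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program owns one BLOCK_M x BLOCK_N tile of C and walks K in BLOCK_K steps.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # accumulate this K-slab into the output tile
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))
```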
Day 5:
- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally (a plain-Python version follows below)
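The recurrence in plain Python, handy for checking the informal derivation against a two-pass computation; this is a sketch, not the task's required interface.

```python
import math

def online_softmax_denominator(xs):
    # Maintain running max m and running sum s of exp(x - m) in a single pass.
    m, s = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum into the new reference frame, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s  # softmax(x_i) = exp(x_i - m) / s

# Agrees with the two-pass result on any sequence:
xs = [3.0, -1.0, 10.0, 2.5]
m, s = online_softmax_denominator(xs)
two_pass = sum(math.exp(x - max(xs)) for x in xs)
assert abs(s - two_pass) < 1e-12
```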
Day 6:
- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen (see the annotated sketch below)
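A compact naive forward with those four stages labeled, as a map for the trace; the repo's actual reference may be organized differently.

```python
import torch

def naive_attention_fwd(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    scale = q.shape[-1] ** -0.5
    # (1) Q/K loads feed the score computation: S = Q K^T / sqrt(d)
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # (2) normalization: numerically stable row softmax over the key axis
    scores = scores - scores.amax(dim=-1, keepdim=True)
    probs = scores.softmax(dim=-1)
    # (3) V loads and (4) output accumulation: O = P V
    return torch.matmul(probs, v)
```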
Day 7:
- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization (the timing sketch below shows where both enter)
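A minimal timing loop, assuming a CUDA device; `benchmark` is an illustrative helper, not a repo utility.

```python
import time
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    # Warmup absorbs one-time costs: compilation, caches, allocator growth.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()  # drain queued work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()  # kernels are async; wait before reading the clock
    return (time.perf_counter() - t0) / iters
```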
## Recommended TODO Order
1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding
13. Profiling passes and benchmark validation
## What To Focus On First
- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.
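As a worked instance of that accounting, take the float32 vector add from Day 1; the ~1 TB/s bandwidth figure is an assumed round number, not a measurement.

```python
# Memory traffic and launch geometry for float32 vector add, n = 1 << 20.
n = 1 << 20
bytes_moved = 3 * n * 4  # read x, read y, write out: 12 MiB per call
num_programs = (n + 1024 - 1) // 1024  # ceil(n / BLOCK_SIZE) = 1024 programs
# At an assumed ~1 TB/s of DRAM bandwidth the floor is ~12.6 microseconds,
# so vector add is bandwidth-bound: report GB/s, not FLOP/s.
print(bytes_moved / 2**20, "MiB moved across", num_programs, "programs")
```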