# Roadmap
## Week 1 Study Plan

Day 1:

- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py`

Day 2:

- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in the vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py`
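
The indexing pattern behind the vector add TODOs can be sketched in plain NumPy (a hedged illustration, not the repo's skeleton; the function and parameter names here are made up): each program instance handles one contiguous block of elements, and a mask guards the ragged final block.

```python
import numpy as np

def vector_add_blocked(x, y, block_size=4):
    # Emulates the grid / program id / offsets / mask pattern of a
    # Triton vector-add kernel, purely on the CPU.
    n = x.shape[0]
    out = np.empty_like(x)
    num_blocks = (n + block_size - 1) // block_size   # ceil-div launch grid
    for pid in range(num_blocks):                     # pid plays the role of the program id
        offsets = pid * block_size + np.arange(block_size)
        mask = offsets < n                            # guard the ragged last block
        idx = offsets[mask]
        out[idx] = x[idx] + y[idx]                    # masked load, add, store
    return out
```

The same three steps, offsets, mask, masked access, are what the indexing TODOs ask for in both skeletons; only the syntax differs between Triton and CUDA.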

Day 3:

- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first
- Compare against the CUDA skeleton and map the reduction strategy
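
The numerical-stability trick is small enough to state as a NumPy sketch (an illustration under the usual definition, not the repo's reference code): subtract the row max before exponentiating so `exp` never overflows.

```python
import numpy as np

def row_softmax(x):
    # Numerically stable softmax along the last axis.
    m = x.max(axis=-1, keepdims=True)   # row max: the stabilizer
    e = np.exp(x - m)                   # shifted exponentials, all <= 1
    return e / e.sum(axis=-1, keepdims=True)
```

The row max and the row sum are the two reductions whose mapping onto threads you compare between the Triton and CUDA versions.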

Day 4:

- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over performance
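
The tile decomposition you draw on paper can be checked with a correctness-first NumPy sketch (illustrative only; tile size and function name are made up): each output tile accumulates partial products over tiles of the shared K dimension.

```python
import numpy as np

def tiled_matmul(a, b, tile=2):
    # Correctness-only tiled matmul: no data layout or vectorization tricks,
    # just the loop structure a tiled GPU kernel mirrors.
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = np.zeros((M, N), dtype=a.dtype)
    for i in range(0, M, tile):              # output tile rows
        for j in range(0, N, tile):          # output tile cols
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=a.dtype)
            for k in range(0, K, tile):      # walk the shared K dimension
                acc += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
            c[i:i+tile, j:j+tile] = acc      # write the finished tile once
    return c
```

In the real kernel the `acc` tile lives in registers and the `a`/`b` slices come from shared memory, but the index math is the same.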

Day 5:

- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally
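
A sketch of the recurrence you should arrive at (stated informally, in NumPy, not the task's expected form): when the running max grows from `m` to `m'`, rescale the accumulated sum by `exp(m - m')` so all terms stay relative to the current max.

```python
import numpy as np

def online_softmax(x):
    # One-pass softmax via the running-max / running-sum recurrence.
    m = -np.inf          # running max seen so far
    s = 0.0              # running sum of exp(x_i - m)
    for v in x:
        m_new = max(m, v)
        s = s * np.exp(m - m_new) + np.exp(v - m_new)  # rescale old sum, add new term
        m = m_new
    return np.exp(x - m) / s
```

This single pass over the data is the ingredient that lets flash attention normalize without materializing the full score row.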

Day 6:

- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen
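
As a map for that annotation exercise, here is a naive single-head forward pass in NumPy with the four stages labeled (a sketch of the standard formulation, not the repo's PyTorch reference):

```python
import numpy as np

def attention_forward(Q, K, V):
    # Naive attention forward; comments mark the stages to find in the reference.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # score computation
    scores -= scores.max(axis=-1, keepdims=True)    # stabilizer (row max)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # normalization
    return weights @ V                              # output accumulation
```

In the flash attention kernel these stages interleave per K/V block instead of running as four full-matrix passes, which is exactly what the online softmax recurrence enables.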

Day 7:

- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization
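
The warmup discipline can be sketched generically (a CPU-only illustration with made-up names; on GPU you would additionally synchronize, e.g. with `torch.cuda.synchronize()`, before reading the clock, since kernel launches are asynchronous):

```python
import time

def bench(fn, warmup=3, iters=10):
    # Time a callable, discarding warmup iterations so one-time costs
    # (compilation, cache fills, allocator growth) don't pollute the mean.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters   # mean seconds per call
```

Comparing the first raw call time against the post-warmup mean is a quick way to see which numbers the warmup actually changed.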

## Recommended TODO Order

1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding
13. Profiling passes and benchmark validation

## What To Focus On First

- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement

Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.