Roadmap

Week 1 Study Plan

Day 1:

  • Run tools/check_env.py
  • Read docs/gpu_basics.md
  • Read docs/cuda_execution_model.md
  • Inspect reference/torch_vector_add.py
  • Implement, or at least start, tasks/01_vector_add/triton_skeleton.py (a kernel sketch follows this list)
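
A minimal sketch of the kernel the skeleton builds toward, assuming the conventional Triton vector-add layout (the real skeleton's signature and TODO structure may differ):

```python
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```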

Day 2:

  • Read docs/triton_vs_cuda.md
  • Inspect kernels/cuda/src/vector_add.cu
  • Fill in the vector-add indexing TODOs in both the Triton and CUDA kernels (the index mapping is sketched after this list)
  • Run pytest -q tasks/01_vector_add/test_task.py
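
For the indexing TODOs, Triton's `pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` plays the role of CUDA's `blockIdx.x * blockDim.x + threadIdx.x`. A host-side launch for the Day 1 kernel sketch might look like this (block size and wrapper name are illustrative, not the repo's actual values):

```python
import torch
import triton

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # vector_add_kernel is the Day 1 sketch above.
    out = torch.empty_like(x)
    n = x.numel()
    # One program per BLOCK_SIZE-element chunk, like one CUDA block per chunk.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```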

Day 3:

  • Read reference/torch_row_softmax.py
  • Read tasks/02_row_softmax/spec.md
  • Implement a numerically stable row softmax in Triton first (see the sketch after this list)
  • Compare against the CUDA skeleton and map the reduction strategy
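
The stable pattern is to subtract the row max before exponentiating, so exp never overflows. A one-program-per-row sketch, assuming each row fits in one block of BLOCK_SIZE lanes (names and signature are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Out-of-range lanes read -inf so they vanish under exp().
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # shift by the row max for stability
    num = tl.exp(x)
    tl.store(out_ptr + row * row_stride + cols, num / tl.sum(num, axis=0), mask=mask)
```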

Day 4:

  • Study tasks/03_tiled_matmul/spec.md
  • Draw the tile decomposition on paper
  • Implement one matmul tile path, prioritizing correctness over performance (see the sketch after this list)
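
Under the usual decomposition, each program owns one (BLOCK_M, BLOCK_N) tile of C and walks the K dimension in BLOCK_K steps. A correctness-first sketch with illustrative names and strides:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)   # rows of this C tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)   # cols of this C tile
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads keep ragged edges correct; speed comes later.
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
    c_mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc, mask=c_mask)
```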

Day 5:

  • Read docs/flashattention_notes.md
  • Read tasks/04_online_softmax/spec.md
  • Derive the running max / running sum recurrence informally (a worked sketch follows this list)
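
In plain Python rather than kernel code, the recurrence keeps a running max m and a running sum s of exp(x - m), rescaling s whenever the max moves up. A tiny sketch to sanity-check the derivation:

```python
import math

def online_softmax_stats(xs):
    # Single pass: running max m and running sum s of exp(x - m).
    m, s = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # exp(m - m_new) rescales the old sum when the max moves up;
        # it is exp(0) = 1 when the max is unchanged.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s

xs = [1.0, 3.0, 2.0]
m, s = online_softmax_stats(xs)
softmax = [math.exp(x - m) / s for x in xs]  # matches the two-pass result
```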

Day 6:

  • Inspect tasks/05_flash_attention_fwd/spec.md
  • Trace the PyTorch reference line by line
  • Annotate where the Q/K/V loads, score computation, normalization, and output accumulation happen (marked in the sketch after this list)
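
As a condensed stand-in for the reference (single head, no masking; the actual file may differ), with the four stages marked for annotation:

```python
import torch

def attention_fwd_reference(q, k, v):
    # q, k, v: (seq_len, head_dim) -- the Q/K/V loads happen where
    # these tensors are first read below.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # score computation
    probs = torch.softmax(scores, dim=-1)       # normalization (row softmax)
    return probs @ v                            # output accumulation
```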

Day 7:

  • Read docs/profiling_guide.md
  • Run one benchmark and one profiler command
  • Write down which numbers changed after warmup and synchronization (see the harness sketched below)
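
A minimal timing harness showing where warmup and synchronization enter (the repo's benchmark script may differ; this is the pattern to look for):

```python
import time
import torch

def bench(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):       # amortize JIT compilation and cache warm-up
        fn(*args)
    torch.cuda.synchronize()      # drain queued kernels before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()      # wait for async launches to finish
    return (time.perf_counter() - t0) / iters
```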

Task Order

  1. Environment checks
  2. Vector add Triton
  3. Vector add CUDA
  4. Row softmax Triton
  5. Row softmax CUDA
  6. Tiled matmul Triton
  7. Tiled matmul CUDA
  8. Online softmax Triton
  9. Online softmax CUDA
  10. Flash attention forward Triton
  11. Flash attention forward CUDA
  12. PyTorch custom op binding
  13. Profiling passes and benchmark validation

What To Focus On First

  • Correctness on tiny shapes
  • Clear index math
  • Explicit shape assumptions
  • Numerically stable reductions
  • Repeatable measurement

Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.
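
For example, vector add on n float32 elements moves two reads plus one write, and launches one program per BLOCK_SIZE chunk; a back-of-the-envelope check:

```python
def vector_add_back_of_envelope(n, block_size=1024, bytes_per_elem=4):
    # Memory traffic: read x, read y, write out.
    traffic_bytes = 3 * bytes_per_elem * n
    # Launch geometry: one program (one CUDA block) per BLOCK_SIZE chunk.
    num_programs = (n + block_size - 1) // block_size
    return traffic_bytes, num_programs

traffic, programs = vector_add_back_of_envelope(1 << 20)
# 12 MiB of traffic across 1024 programs; divide traffic by measured time
# to compare against your GPU's memory bandwidth.
```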