# Profiling Guide

## Profile One Kernel At A Time

Good profiling starts narrow:

- one implementation
- one shape
- one dtype
- one device
- one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.
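
As a concrete starting point, here is a minimal sketch of such a narrow harness. It assumes PyTorch with a CUDA device; `my_kernels` and `vector_add` are hypothetical stand-ins for whichever kernel you are isolating:

```python
# bench_one.py -- hypothetical minimal harness: one implementation, one
# shape, one dtype, one device, rerunnable as `python bench_one.py`.
import torch

from my_kernels import vector_add  # hypothetical kernel under test

def main():
    torch.manual_seed(0)  # fixed inputs make reruns comparable
    x = torch.randn(1 << 20, dtype=torch.float32, device="cuda")
    y = torch.randn_like(x)
    out = vector_add(x, y)    # the one implementation being profiled
    torch.cuda.synchronize()  # make sure the kernel actually ran
    print(out.sum().item())   # cheap sanity output

if __name__ == "__main__":
    main()
```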

## Why Warmup Matters

The first iterations may include:

- lazy module loading
- JIT compilation
- cache effects
- allocator setup

Warm up first, then measure.
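
A minimal warmup sketch, assuming a callable `fn` that launches the kernel under test:

```python
import torch

def warmup(fn, iters=10):
    """Run fn a few times untimed so JIT compilation, lazy loading,
    cache effects, and allocator growth happen before measurement."""
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # make sure all warmup work has finished
```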

## Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use `torch.cuda.synchronize()` around timed regions.
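
For example, a host-side timing sketch that brackets the timed region with synchronization (again assuming a callable `fn` that launches the work):

```python
import time
import torch

def time_once(fn):
    torch.cuda.synchronize()   # drain pending work before starting the clock
    start = time.perf_counter()
    fn()                       # enqueue the kernel(s)
    torch.cuda.synchronize()   # wait for the GPU before stopping the clock
    return time.perf_counter() - start
```

CUDA events (`torch.cuda.Event(enable_timing=True)`) are an alternative that timestamps on the device timeline rather than the host clock.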

## How To Avoid Misleading Timings

- keep shapes fixed
- use multiple repetitions
- report median, not only minimum (see the sketch after this list)
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
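
A sketch that combines these points, reusing the hypothetical `warmup` and `time_once` helpers from the sections above:

```python
import statistics

def benchmark(fn, warmup_iters=10, reps=50):
    warmup(fn, warmup_iters)                      # warmup, as described earlier
    times = [time_once(fn) for _ in range(reps)]  # multiple repetitions, fixed shape
    return statistics.median(times)               # median resists stray slow runs
```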

## First Metrics To Inspect

- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding

## Practical `ncu` Examples

```bash
ncu --set full --target-processes all \
    python bench/bench_vector_add.py --device cuda --mode cuda
```

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
    python bench/bench_softmax.py --device cuda --mode triton
```
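
`--set full` collects every metric section and is correspondingly slow, while the `--metrics` form collects only the named counters and runs much faster; `--target-processes all` ensures kernels launched from child processes of the Python script are profiled too.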

## Practical `nsys` Examples

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/attention_triton \
    python bench/bench_attention.py --device cuda --mode triton
```

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/matmul_cuda \
    python bench/bench_matmul.py --device cuda --mode cuda
```
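
Because `--trace` includes `nvtx`, named ranges added in the benchmark script appear on the nsys timeline. A minimal annotation sketch using PyTorch's NVTX bindings, assuming a callable `fn` for the workload:

```python
import torch

# Named NVTX ranges make regions easy to find on the nsys timeline.
torch.cuda.nvtx.range_push("warmup")
for _ in range(10):
    fn()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("timed_region")
fn()
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```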

## Checklist Before Trusting A Benchmark Result

- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference? (see the sketch below)
- Were shapes and dtypes identical?
- Was one implementation silently skipped or falling back to CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?
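
For the reference check, one common pattern is to compare against a trusted implementation before any timing. A sketch, with `my_softmax` as a hypothetical implementation under test:

```python
import torch

def check_against_reference(my_softmax, x):
    ref = torch.softmax(x, dim=-1)  # trusted reference implementation
    out = my_softmax(x)             # hypothetical implementation under test
    torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```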