# Profiling Guide

## Profile One Kernel At A Time

Good profiling starts narrow:

- one implementation
- one shape
- one dtype
- one device
- one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.

## Why Warmup Matters

The first iterations may include:

- lazy module loading
- JIT compilation
- cache effects
- allocator setup

Warm up first, then measure.

## Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime. Use `torch.cuda.synchronize()` around timed regions. A minimal harness that applies both warmup and synchronization is sketched at the end of this guide.

## How To Avoid Misleading Timings

- keep shapes fixed
- use multiple repetitions
- report median, not only minimum
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU

## First Metrics To Inspect

- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding

## Practical `ncu` Examples

```bash
ncu --set full --target-processes all \
    python bench/bench_vector_add.py --device cuda --mode cuda
```

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
    python bench/bench_softmax.py --device cuda --mode triton
```

## Practical `nsys` Examples

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/attention_triton \
    python bench/bench_attention.py --device cuda --mode triton
```

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/matmul_cuda \
    python bench/bench_matmul.py --device cuda --mode cuda
```

## Checklist Before Trusting A Benchmark Result

- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference?
- Were shapes and dtypes identical?
- Was one implementation silently skipped or falling back to CPU?
- Did you report median time over several repetitions?
- Is the measured operation bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?
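
## A Minimal Timing Harness

The sketch below puts the warmup, synchronization, and median-reporting advice into practice. It assumes PyTorch with a CUDA device; `time_kernel` and the vector-add workload are illustrative, not part of the `bench/` scripts above.

```python
import statistics
import time

import torch

def time_kernel(fn, *args, warmup=10, reps=50):
    """Median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):       # absorb lazy loading, JIT compilation, allocator setup
        fn(*args)
    torch.cuda.synchronize()      # drain queued work before timing starts

    times_ms = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()  # wait for the kernel, not just the launch
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times_ms)  # median is robust to stray slow runs

if __name__ == "__main__":
    x = torch.randn(1 << 24, device="cuda")  # inputs already on the GPU
    y = torch.randn(1 << 24, device="cuda")
    print(f"vector add: {time_kernel(torch.add, x, y):.3f} ms")
```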
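
A variant using CUDA events records the interval on the GPU stream itself, which excludes host-side launch overhead from the measurement. Same assumptions as above; the single timed call keeps the sketch short, but in practice you would still repeat and report the median.

```python
import torch

x = torch.randn(1 << 24, device="cuda")
y = torch.randn(1 << 24, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for _ in range(10):       # warmup, as in the harness above
    torch.add(x, y)

start.record()            # enqueue the start marker on the stream
torch.add(x, y)
end.record()              # enqueue the end marker after the kernel
torch.cuda.synchronize()  # make sure both events have completed
print(f"GPU-side time: {start.elapsed_time(end):.3f} ms")
```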
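
## Checking Outputs Against A Reference

Correctness belongs in its own step, before any timing. A sketch of one way to do it, assuming PyTorch; `naive_softmax` is a hypothetical candidate standing in for whichever implementation you are benchmarking.

```python
import torch

def naive_softmax(x):
    # hypothetical candidate: numerically stable row-wise softmax
    x = x - x.max(dim=-1, keepdim=True).values
    e = x.exp()
    return e / e.sum(dim=-1, keepdim=True)

x = torch.randn(256, 1024, device="cuda", dtype=torch.float16)
expected = torch.softmax(x.float(), dim=-1)  # fp32 reference
actual = naive_softmax(x).float()            # candidate under test
# assert_close raises with a detailed mismatch report on failure
torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)
print("outputs match the reference within tolerance")
```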
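
## Annotating Regions For `nsys`

The `nsys` commands above trace NVTX, so naming regions makes the timeline easier to read. A sketch using PyTorch's bundled NVTX bindings; the region names and matmul workload are illustrative.

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.nvtx.range_push("warmup")
for _ in range(10):
    torch.matmul(x, x)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("measured_matmul")
torch.matmul(x, x)
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()  # wait for queued work before the process exits
```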