
Profiling Guide

Profile One Kernel At A Time

Good profiling starts narrow:

  • one implementation
  • one shape
  • one dtype
  • one device
  • one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.
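
As a sketch (assuming a PyTorch-based benchmark; the shape, dtype, and the torch.add call are placeholders for whatever you are actually profiling), a suitably narrow setup can be as small as:

# Hypothetical minimal benchmark: pins down exactly what is being profiled.
import torch

device = "cuda"                    # one device
dtype = torch.float16              # one dtype
shape = (1 << 20,)                 # one shape

x = torch.randn(shape, device=device, dtype=dtype)
y = torch.randn(shape, device=device, dtype=dtype)

out = torch.add(x, y)              # one implementation
torch.cuda.synchronize()           # make sure the work has actually run

Saved as its own script, this gives you a single command you can rerun unchanged under ncu or nsys.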

Why Warmup Matters

The first iterations may include:

  • lazy module loading
  • JIT compilation
  • cache effects
  • allocator setup

Warm up first, then measure.
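
A minimal warmup sketch, again assuming PyTorch; the torch.add call stands in for the kernel under test:

import torch

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")

# Warm up: let lazy module loading, JIT compilation, cache effects,
# and allocator setup happen outside the measured region.
for _ in range(10):
    torch.add(x, y)

torch.cuda.synchronize()  # warmup work must be finished before measurement starts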

Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use torch.cuda.synchronize() around timed regions.
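
For example (a sketch assuming PyTorch and a CPU-side timer such as time.perf_counter):

import time
import torch

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")

torch.cuda.synchronize()   # drain pending GPU work before starting the clock
t0 = time.perf_counter()
out = torch.add(x, y)      # the launch returns immediately; the GPU keeps working
torch.cuda.synchronize()   # wait for the kernel to finish before stopping the clock
t1 = time.perf_counter()

print(f"elapsed: {(t1 - t0) * 1e3:.3f} ms")

CUDA events are an alternative, but the synchronize-before-and-after pattern above is usually enough for whole-kernel timings.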

How To Avoid Misleading Timings

  • keep shapes fixed
  • use multiple repetitions
  • report the median, not only the minimum (see the sketch after this list)
  • separate correctness from performance testing
  • compare implementations under the same dtype and device conditions
  • check that all inputs are already on the GPU
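
A sketch combining several of these points, assuming PyTorch; the softmax here is a stand-in for the implementation you are comparing:

import statistics
import time
import torch

x = torch.randn((4096, 4096), device="cuda", dtype=torch.float16)

def run():                                   # stand-in for the implementation under test
    return torch.softmax(x, dim=-1)

# Correctness, checked once against a reference, kept separate from timing.
ref = torch.softmax(x.float(), dim=-1).to(torch.float16)
assert torch.allclose(run(), ref, atol=1e-2, rtol=1e-2)

# Warm up, then time several repetitions and report the median.
for _ in range(10):
    run()
torch.cuda.synchronize()

times_ms = []
for _ in range(50):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    run()
    torch.cuda.synchronize()
    times_ms.append((time.perf_counter() - t0) * 1e3)

print(f"median {statistics.median(times_ms):.3f} ms, min {min(times_ms):.3f} ms")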

First Metrics To Inspect

  • kernel duration
  • achieved memory throughput
  • occupancy
  • DRAM transactions or bandwidth
  • shared-memory throughput when tiling is relevant
  • eligible warps per cycle when investigating latency hiding

Practical ncu Examples

# Full metric set for a single CUDA benchmark (thorough, but slow to collect):
ncu --set full --target-processes all \
  python bench/bench_vector_add.py --device cuda --mode cuda

# Only compute- and memory-throughput metrics for a Triton benchmark:
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
  python bench/bench_softmax.py --device cuda --mode triton

Practical nsys Examples

# Timeline trace (CUDA, NVTX, and OS runtime events) of the Triton attention benchmark:
nsys profile --trace=cuda,nvtx,osrt --sample=none \
  -o profile-output/attention_triton \
  python bench/bench_attention.py --device cuda --mode triton

# Same trace settings for the CUDA matmul benchmark:
nsys profile --trace=cuda,nvtx,osrt --sample=none \
  -o profile-output/matmul_cuda \
  python bench/bench_matmul.py --device cuda --mode cuda

Checklist Before Trusting A Benchmark Result

  • Was there a warmup phase?
  • Was the device synchronized before and after timing?
  • Did all implementations run the same math?
  • Were outputs checked against a reference?
  • Were shapes and dtypes identical?
  • Was any implementation silently skipped, or did it silently fall back to the CPU?
  • Did you report median time over several repetitions?
  • Is the measured kernel bandwidth-bound or compute-bound?
  • Did you accidentally include setup or compilation time?