
Profiling Guide

Profile One Kernel At A Time

Good profiling starts narrow:

  • one implementation
  • one shape
  • one dtype
  • one device
  • one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.
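
As a sketch (assuming a PyTorch-based benchmark; the shape, dtype, and the torch.add call are placeholders for whatever you are actually profiling), a suitably narrow setup can be as small as:

# Hypothetical minimal benchmark: pins down exactly what is being profiled.
import torch

device = "cuda"                    # one device
dtype = torch.float16              # one dtype
shape = (1 << 20,)                 # one shape

x = torch.randn(shape, device=device, dtype=dtype)
y = torch.randn(shape, device=device, dtype=dtype)

out = torch.add(x, y)              # one implementation
torch.cuda.synchronize()           # make sure the work has actually run

Saved as its own script, this gives you a single command you can rerun unchanged under ncu or nsys.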

Why Warmup Matters

The first iterations may include:

  • lazy module loading
  • JIT compilation
  • cache effects
  • allocator setup

Warm up first, then measure.
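
A minimal warmup sketch, again assuming PyTorch; the torch.add call stands in for the kernel under test:

import torch

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")

# Warm up: let lazy module loading, JIT compilation, cache effects,
# and allocator setup happen outside the measured region.
for _ in range(10):
    torch.add(x, y)

torch.cuda.synchronize()  # warmup work must be finished before measurement starts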

Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use torch.cuda.synchronize() around timed regions.
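
For example (a sketch assuming PyTorch and a CPU-side timer such as time.perf_counter):

import time
import torch

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")

torch.cuda.synchronize()   # drain pending GPU work before starting the clock
t0 = time.perf_counter()
out = torch.add(x, y)      # the launch returns immediately; the GPU keeps working
torch.cuda.synchronize()   # wait for the kernel to finish before stopping the clock
t1 = time.perf_counter()

print(f"elapsed: {(t1 - t0) * 1e3:.3f} ms")

CUDA events are an alternative, but the synchronize-before-and-after pattern above is usually enough for whole-kernel timings.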

How To Avoid Misleading Timings

  • keep shapes fixed
  • use multiple repetitions
  • report the median, not only the minimum (see the sketch after this list)
  • separate correctness from performance testing
  • compare implementations under the same dtype and device conditions
  • check that all inputs are already on the GPU
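
A sketch combining several of these points, assuming PyTorch; the softmax here is a stand-in for the implementation you are comparing:

import statistics
import time
import torch

x = torch.randn((4096, 4096), device="cuda", dtype=torch.float16)

def run():                                   # stand-in for the implementation under test
    return torch.softmax(x, dim=-1)

# Correctness, checked once against a reference, kept separate from timing.
ref = torch.softmax(x.float(), dim=-1).to(torch.float16)
assert torch.allclose(run(), ref, atol=1e-2, rtol=1e-2)

# Warm up, then time several repetitions and report the median.
for _ in range(10):
    run()
torch.cuda.synchronize()

times_ms = []
for _ in range(50):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    run()
    torch.cuda.synchronize()
    times_ms.append((time.perf_counter() - t0) * 1e3)

print(f"median {statistics.median(times_ms):.3f} ms, min {min(times_ms):.3f} ms")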

First Metrics To Inspect

  • kernel duration
  • achieved memory throughput
  • occupancy
  • DRAM transactions or bandwidth
  • shared-memory throughput when tiling is relevant
  • eligible warps per cycle when investigating latency hiding

Practical ncu Examples

# Full metric set for a single CUDA benchmark (thorough, but slow to collect):
ncu --set full --target-processes all \
  python bench/bench_vector_add.py --device cuda --mode cuda

# Only compute- and memory-throughput metrics for a Triton benchmark:
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
  python bench/bench_softmax.py --device cuda --mode triton

Practical nsys Examples

# Timeline trace (CUDA, NVTX, and OS runtime events) of the Triton attention benchmark:
nsys profile --trace=cuda,nvtx,osrt --sample=none \
  -o profile-output/attention_triton \
  python bench/bench_attention.py --device cuda --mode triton

# Same trace settings for the CUDA matmul benchmark:
nsys profile --trace=cuda,nvtx,osrt --sample=none \
  -o profile-output/matmul_cuda \
  python bench/bench_matmul.py --device cuda --mode cuda

Checklist Before Trusting A Benchmark Result

  • Was there a warmup phase?
  • Was the device synchronized before and after timing?
  • Did all implementations run the same math?
  • Were outputs checked against a reference?
  • Were shapes and dtypes identical?
  • Was any implementation silently skipped, or did it silently fall back to the CPU?
  • Did you report median time over several repetitions?
  • Is the measured kernel bandwidth-bound or compute-bound?
  • Did you accidentally include setup or compilation time?