# Profiling Examples

## Nsight Compute

```bash
./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode triton
./tools/profile_ncu.sh python bench/bench_softmax.py --device cuda --mode torch
```

## Nsight Systems

```bash
./tools/profile_nsys.sh python bench/bench_matmul.py --device cuda --mode triton
./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode torch
```

## First Things To Inspect

- median runtime from the benchmark harness
- whether warmup was excluded
- whether kernels overlap or serialize
- whether memory throughput is near a practical ceiling
- whether a kernel launch is tiny enough that launch overhead matters
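
To make the "median runtime" and "warmup excluded" checks concrete, here is a minimal sketch of a timing loop. It is not the repository's benchmark harness: `time_median`, its parameters, and the optional `sync` hook are hypothetical names chosen for illustration. The sketch assumes the workload is a zero-argument callable and that any asynchronous work can be flushed by calling `sync`.

```python
import statistics
import time
from typing import Callable, List, Optional


def time_median(
    fn: Callable[[], object],
    warmup: int = 5,
    iters: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock time of fn() in milliseconds.

    Hypothetical helper: warmup calls run first and are never recorded,
    so one-time costs (compilation, caching) stay out of the result.
    """

    def flush() -> None:
        # Wait for queued asynchronous work, if a sync hook was given.
        if sync is not None:
            sync()

    # Warmup phase: executed but not timed.
    for _ in range(warmup):
        fn()
    flush()

    # Measured phase: one sample per iteration.
    samples_ms: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        flush()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    # The median is less sensitive to occasional slow outliers than the mean.
    return statistics.median(samples_ms)


if __name__ == "__main__":
    # CPU-only usage example; no GPU is required to run this sketch.
    print(f"{time_median(lambda: sum(range(100_000))):.3f} ms")
```

For a CUDA workload driven from PyTorch, passing `sync=torch.cuda.synchronize` is one way to keep the timer from stopping before the kernel has finished, since kernel launches return to the host asynchronously.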
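
The memory-throughput and launch-overhead checks come down to simple arithmetic on the measured runtime. The sketch below shows one way to do it. All function names are hypothetical, and the peak bandwidth and per-launch overhead are inputs you must supply for your own hardware (from a datasheet or a measured copy benchmark). The traffic estimate assumes a vector add that reads two inputs and writes one output exactly once.

```python
def vector_add_bytes(n: int, dtype_bytes: int = 4) -> int:
    """Bytes moved by out = a + b: two reads and one write per element."""
    return 3 * n * dtype_bytes


def achieved_bandwidth_gbs(bytes_moved: int, runtime_ms: float) -> float:
    """Effective memory throughput in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (runtime_ms * 1e-3) / 1e9


def fraction_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Share of the supplied peak bandwidth that the kernel reached."""
    return achieved_gbs / peak_gbs


def launch_overhead_fraction(runtime_us: float, overhead_us: float) -> float:
    """Share of the measured runtime attributable to launch overhead.

    overhead_us is an assumption supplied by the caller, not a constant.
    """
    return min(1.0, overhead_us / runtime_us)


if __name__ == "__main__":
    n = 2**24           # number of float32 elements (illustrative)
    runtime_ms = 0.5    # illustrative median runtime, not a real measurement
    peak_gbs = 800.0    # placeholder; substitute the figure for your GPU

    gbs = achieved_bandwidth_gbs(vector_add_bytes(n), runtime_ms)
    print(f"achieved: {gbs:.1f} GB/s")
    print(f"fraction of peak: {fraction_of_peak(gbs, peak_gbs):.2f}")
    # Placeholder overhead of 5 us against the same 500 us runtime.
    print(f"launch share: {launch_overhead_fraction(500.0, 5.0):.2%}")
```

If the fraction of peak is already high, the kernel is likely memory-bound and further tuning of arithmetic will not help much. If the launch share is large, batching or fusing small kernels is usually a better lever than optimizing the kernel body.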