# Profiling Guide
## Profile One Kernel At A Time
Good profiling starts narrow:
- one implementation
- one shape
- one dtype
- one device
- one command you can rerun
If you profile a full training script too early, you will not know which kernel you are looking at.
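A minimal sketch of what "narrow" means in practice (the shape, dtype, and the `torch.add` stand-in are illustrative, not part of this repository):
```python
# One implementation, one shape, one dtype, one device, one command you can rerun.
# `torch.add` is a stand-in; swap in the kernel under test.
import torch

SHAPE = (1 << 24,)        # one shape
DTYPE = torch.float32     # one dtype
DEVICE = "cuda"           # one device
kernel = torch.add        # one implementation (stand-in)

def main() -> None:
    x = torch.randn(SHAPE, dtype=DTYPE, device=DEVICE)
    y = torch.randn(SHAPE, dtype=DTYPE, device=DEVICE)
    out = kernel(x, y)
    torch.cuda.synchronize()           # make sure the kernel actually ran
    print(out.float().mean().item())   # touch the output so the work is not skipped

if __name__ == "__main__":
    main()
```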
## Why Warmup Matters
The first iterations may include:
- lazy module loading
- JIT compilation
- cache effects
- allocator setup
Warm up first, then measure.
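A minimal warmup helper, assuming the kernel and its arguments are already on the GPU:
```python
import torch

def warmup(kernel, *args, iters: int = 10) -> None:
    """Run the kernel a few times so lazy loading, JIT compilation, cache
    effects, and allocator setup happen before the timed region."""
    for _ in range(iters):
        kernel(*args)
    torch.cuda.synchronize()  # make sure the warmup work has actually finished
```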
## Why Synchronization Matters
GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.
Use `torch.cuda.synchronize()` around timed regions.
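A sketch of the pattern, assuming the kernel has already been warmed up:
```python
import time
import torch

def timed_run(kernel, *args, iters: int = 100) -> float:
    """Average per-call time in seconds for an already warmed-up kernel."""
    torch.cuda.synchronize()   # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        kernel(*args)
    torch.cuda.synchronize()   # wait for completion, not just for the launches
    return (time.perf_counter() - start) / iters
```
Pairs of `torch.cuda.Event(enable_timing=True)` with `elapsed_time` are an alternative that records timestamps on the device timeline instead of the host clock.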
## How To Avoid Misleading Timings
- keep shapes fixed
- use multiple repetitions
- report the median, not only the minimum (see the sketch after this list)
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
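One way to put several of these points together, sketched rather than prescribed (the tensor check and repetition counts are illustrative):
```python
import statistics
import time
import torch

def median_ms(kernel, *args, warmup: int = 10, reps: int = 50) -> float:
    """Median per-call time in milliseconds over `reps` timed repetitions."""
    assert all(t.is_cuda for t in args if torch.is_tensor(t)), \
        "inputs must already be on the GPU"
    for _ in range(warmup):
        kernel(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(*args)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)
```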
## First Metrics To Inspect
- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding
## Practical `ncu` Examples
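The first command collects the full metric set and follows child processes, which matters when the benchmark is launched through Python; the second collects only the SM and DRAM throughput percentages for a quick read on whether a kernel is compute- or bandwidth-bound.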
```bash
ncu --set full --target-processes all \
python bench/bench_vector_add.py --device cuda --mode cuda
```
```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
python bench/bench_softmax.py --device cuda --mode triton
```
## Practical `nsys` Examples
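Both commands record a timeline of CUDA activity, NVTX ranges, and OS runtime calls, disable CPU sampling to keep overhead down, and write the report under `profile-output/` for inspection in the Nsight Systems GUI.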
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/attention_triton \
python bench/bench_attention.py --device cuda --mode triton
```
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/matmul_cuda \
python bench/bench_matmul.py --device cuda --mode cuda
```
## Checklist Before Trusting A Benchmark Result
- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference (see the sketch below)?
- Were shapes and dtypes identical?
- Was one implementation silently skipped, or did it silently fall back to the CPU?
- Did you report median time over several repetitions?
- Is the kernel you measured bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?
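For the reference check, a minimal sketch (the kernel, the reference math, and the tolerances are placeholders; loosen `rtol`/`atol` for lower-precision dtypes):
```python
import torch

def check_against_reference(kernel, x: torch.Tensor, y: torch.Tensor) -> None:
    """Compare a custom kernel's output against a trusted PyTorch reference."""
    out = kernel(x, y)   # implementation under test
    ref = x + y          # reference math for a vector-add-style kernel
    torch.testing.assert_close(out, ref, rtol=1e-5, atol=1e-5)
```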