# Profiling Guide
## Profile One Kernel At A Time
Good profiling starts narrow:
- one implementation
- one shape
- one dtype
- one device
- one command you can rerun
If you profile a full training script too early, you will not know which kernel you are looking at.
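A minimal sketch of what "narrow" means in practice (the shape, dtype, and the `torch.add` stand-in are illustrative, not part of this repository):
```python
# One implementation, one shape, one dtype, one device, one command you can rerun.
# `torch.add` is a stand-in; swap in the kernel under test.
import torch

SHAPE = (1 << 24,)        # one shape
DTYPE = torch.float32     # one dtype
DEVICE = "cuda"           # one device
kernel = torch.add        # one implementation (stand-in)

def main() -> None:
    x = torch.randn(SHAPE, dtype=DTYPE, device=DEVICE)
    y = torch.randn(SHAPE, dtype=DTYPE, device=DEVICE)
    out = kernel(x, y)
    torch.cuda.synchronize()           # make sure the kernel actually ran
    print(out.float().mean().item())   # touch the output so the work is not skipped

if __name__ == "__main__":
    main()
```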
## Why Warmup Matters
The first iterations may include:
- lazy module loading
- JIT compilation
- cache effects
- allocator setup
Warm up first, then measure.
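A minimal warmup helper, assuming the kernel and its arguments are already on the GPU:
```python
import torch

def warmup(kernel, *args, iters: int = 10) -> None:
    """Run the kernel a few times so lazy loading, JIT compilation, cache
    effects, and allocator setup happen before the timed region."""
    for _ in range(iters):
        kernel(*args)
    torch.cuda.synchronize()  # make sure the warmup work has actually finished
```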
## Why Synchronization Matters
GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.
Use `torch.cuda.synchronize()` around timed regions.
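A sketch of the pattern, assuming the kernel has already been warmed up:
```python
import time
import torch

def timed_run(kernel, *args, iters: int = 100) -> float:
    """Average per-call time in seconds for an already warmed-up kernel."""
    torch.cuda.synchronize()   # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        kernel(*args)
    torch.cuda.synchronize()   # wait for completion, not just for the launches
    return (time.perf_counter() - start) / iters
```
Pairs of `torch.cuda.Event(enable_timing=True)` with `elapsed_time` are an alternative that records timestamps on the device timeline instead of the host clock.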
## How To Avoid Misleading Timings
- keep shapes fixed
- use multiple repetitions
- report the median, not only the minimum (see the sketch after this list)
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
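One way to put several of these points together, sketched rather than prescribed (the tensor check and repetition counts are illustrative):
```python
import statistics
import time
import torch

def median_ms(kernel, *args, warmup: int = 10, reps: int = 50) -> float:
    """Median per-call time in milliseconds over `reps` timed repetitions."""
    assert all(t.is_cuda for t in args if torch.is_tensor(t)), \
        "inputs must already be on the GPU"
    for _ in range(warmup):
        kernel(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(*args)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)
```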
## First Metrics To Inspect
- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding
## Practical `ncu` Examples
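The first command collects the full metric set and follows child processes, which matters when the benchmark is launched through Python; the second collects only the SM and DRAM throughput percentages for a quick read on whether a kernel is compute- or bandwidth-bound.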
```bash
ncu --set full --target-processes all \
python bench/bench_vector_add.py --device cuda --mode cuda
```
```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
python bench/bench_softmax.py --device cuda --mode triton
```
## Practical `nsys` Examples
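Both commands record a timeline of CUDA activity, NVTX ranges, and OS runtime calls, disable CPU sampling to keep overhead down, and write the report under `profile-output/` for inspection in the Nsight Systems GUI.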
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/attention_triton \
python bench/bench_attention.py --device cuda --mode triton
```
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/matmul_cuda \
python bench/bench_matmul.py --device cuda --mode cuda
```
## Checklist Before Trusting A Benchmark Result
- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference (see the sketch below)?
- Were shapes and dtypes identical?
- Was one implementation silently skipped, or did it silently fall back to the CPU?
- Did you report median time over several repetitions?
- Is the kernel you measured bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?
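For the reference check, a minimal sketch (the kernel, the reference math, and the tolerances are placeholders; loosen `rtol`/`atol` for lower-precision dtypes):
```python
import torch

def check_against_reference(kernel, x: torch.Tensor, y: torch.Tensor) -> None:
    """Compare a custom kernel's output against a trusted PyTorch reference."""
    out = kernel(x, y)   # implementation under test
    ref = x + y          # reference math for a vector-add-style kernel
    torch.testing.assert_close(out, ref, rtol=1e-5, atol=1e-5)
```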