# Profiling Guide

## Profile One Kernel At A Time

Good profiling starts narrow:

- one implementation
- one shape
- one dtype
- one device
- one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.

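A minimal sketch of such a narrow benchmark, assuming PyTorch on a CUDA device; the kernel (vector add), shape, dtype, and iteration counts are illustrative, not the repository's actual `bench/` scripts:

```python
# One implementation, one shape, one dtype, one device,
# one command you can rerun: python bench_vector_add.py
import torch

def bench_vector_add(n=1 << 24, dtype=torch.float32, device="cuda"):
    a = torch.randn(n, dtype=dtype, device=device)
    b = torch.randn(n, dtype=dtype, device=device)

    for _ in range(10):                  # warmup (see the next section)
        _ = a + b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        _ = a + b
    end.record()
    torch.cuda.synchronize()             # wait for the timed kernels to finish
    print(f"{start.elapsed_time(end) / 100:.4f} ms per iteration")

if __name__ == "__main__":
    bench_vector_add()
```
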
## Why Warmup Matters

The first iterations may include:

- lazy module loading
- JIT compilation
- cache effects
- allocator setup

Warm up first, then measure.

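A sketch of the effect, assuming PyTorch; the matmul and iteration counts are arbitrary, and the point is only that the first call pays one-time costs that later calls do not:

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

def timed_ms(fn):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

# The very first matmul pays one-time costs (library initialization,
# allocator growth); steady-state calls do not.
print(f"first call:   {timed_ms(lambda: x @ x):.3f} ms")
for _ in range(10):                      # warmup
    _ = x @ x
print(f"after warmup: {timed_ms(lambda: x @ x):.3f} ms")
```
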
## Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use `torch.cuda.synchronize()` around timed regions.

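A sketch of the difference, assuming PyTorch; without the synchronize, the timer stops after the kernel is enqueued, not after it finishes:

```python
import time
import torch

x = torch.randn(8192, 8192, device="cuda")
for _ in range(3):
    _ = x @ x        # warmup so one-time costs do not pollute the comparison
torch.cuda.synchronize()

# Misleading: the matmul is only enqueued; the timer stops before it finishes.
t0 = time.perf_counter()
y = x @ x
launch_ms = (time.perf_counter() - t0) * 1e3

# Correct: synchronize before starting and before stopping the timer.
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
kernel_ms = (time.perf_counter() - t0) * 1e3

print(f"launch only: {launch_ms:.3f} ms   with sync: {kernel_ms:.3f} ms")
```
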
## How To Avoid Misleading Timings

- keep shapes fixed
- use multiple repetitions
- report median, not only minimum
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU

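One way to apply these rules, sketched with PyTorch; the helper name, repetition count, shape, and dtype are arbitrary choices, not requirements:

```python
import statistics
import time
import torch

def time_median_ms(fn, reps=50, warmup=10):
    """Run fn() reps times on the current CUDA device and return the median time in ms."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Fixed shape and dtype, inputs already on the GPU.
x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
print(f"softmax: {time_median_ms(lambda: torch.softmax(x, dim=-1)):.4f} ms")
```
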
## First Metrics To Inspect

- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding

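Kernel duration alone is hard to interpret; achieved memory throughput relates it to the bytes the kernel has to move. Profilers report this directly, but a back-of-the-envelope estimate is a useful sanity check. A sketch for a bandwidth-bound vector add (two reads and one write per element); the measured time and peak bandwidth below are illustrative placeholders, not real numbers:

```python
# Manual achieved-bandwidth estimate for an fp32 vector add.
n = 1 << 24
bytes_moved = 3 * n * 4          # read a, read b, write c (4 bytes each in fp32)
time_ms = 1.10                   # measured kernel time (illustrative)
peak_gb_s = 900.0                # peak DRAM bandwidth of your GPU (from its spec)

achieved_gb_s = bytes_moved / (time_ms * 1e-3) / 1e9
print(f"achieved: {achieved_gb_s:.1f} GB/s "
      f"({100 * achieved_gb_s / peak_gb_s:.1f}% of peak)")
```
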
## Practical `ncu` Examples

```bash
ncu --set full --target-processes all \
    python bench/bench_vector_add.py --device cuda --mode cuda
```

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
    python bench/bench_softmax.py --device cuda --mode triton
```

## Practical `nsys` Examples

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/attention_triton \
    python bench/bench_attention.py --device cuda --mode triton
```

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/matmul_cuda \
    python bench/bench_matmul.py --device cuda --mode cuda
```

## Checklist Before Trusting A Benchmark Result

- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference?
- Were shapes and dtypes identical?
- Was one implementation silently skipped or falling back to CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?

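For the correctness items on this checklist, a sketch assuming PyTorch; `candidate_softmax` stands in for whatever implementation you are benchmarking, and the shape, dtype, and tolerances are illustrative:

```python
import torch

def check_against_reference(candidate_softmax, shape=(4096, 4096), dtype=torch.float16):
    # Same device, shape, and dtype as the timed run.
    x = torch.randn(shape, device="cuda", dtype=dtype)
    expected = torch.softmax(x.float(), dim=-1).to(dtype)   # eager fp32 reference
    actual = candidate_softmax(x)
    torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)

# Example: the candidate here is just eager softmax, so the check trivially passes.
check_against_reference(lambda x: torch.softmax(x, dim=-1))
print("outputs match the reference within tolerance")
```
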