# Profiling Guide

## Profile One Kernel At A Time

Good profiling starts narrow:

- one implementation
- one shape
- one dtype
- one device
- one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.
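
As a concrete starting point, here is a minimal sketch of such a narrow harness. It assumes PyTorch with a CUDA device; `my_kernels` and `vector_add` are hypothetical stand-ins for whichever kernel you are isolating:

```python
# bench_one.py -- hypothetical minimal harness: one implementation, one
# shape, one dtype, one device, rerunnable as `python bench_one.py`.
import torch

from my_kernels import vector_add  # hypothetical kernel under test

def main():
    torch.manual_seed(0)  # fixed inputs make reruns comparable
    x = torch.randn(1 << 20, dtype=torch.float32, device="cuda")
    y = torch.randn_like(x)
    out = vector_add(x, y)    # the one implementation being profiled
    torch.cuda.synchronize()  # make sure the kernel actually ran
    print(out.sum().item())   # cheap sanity output

if __name__ == "__main__":
    main()
```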

## Why Warmup Matters

The first iterations may include:

- lazy module loading
- JIT compilation
- cache effects
- allocator setup

Warm up first, then measure.
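
A minimal warmup sketch, assuming a callable `fn` that launches the kernel under test:

```python
import torch

def warmup(fn, iters=10):
    """Run fn a few times untimed so JIT compilation, lazy loading,
    cache effects, and allocator growth happen before measurement."""
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # make sure all warmup work has finished
```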

## Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use `torch.cuda.synchronize()` around timed regions.
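
For example, a host-side timing sketch that brackets the timed region with synchronization (again assuming a callable `fn` that launches the work):

```python
import time
import torch

def time_once(fn):
    torch.cuda.synchronize()   # drain pending work before starting the clock
    start = time.perf_counter()
    fn()                       # enqueue the kernel(s)
    torch.cuda.synchronize()   # wait for the GPU before stopping the clock
    return time.perf_counter() - start
```

CUDA events (`torch.cuda.Event(enable_timing=True)`) are an alternative that timestamps on the device timeline rather than the host clock.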

## How To Avoid Misleading Timings

- keep shapes fixed
- use multiple repetitions
- report median, not only minimum (see the sketch after this list)
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
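
A sketch that combines these points, reusing the hypothetical `warmup` and `time_once` helpers from the sections above:

```python
import statistics

def benchmark(fn, warmup_iters=10, reps=50):
    warmup(fn, warmup_iters)                      # warmup, as described earlier
    times = [time_once(fn) for _ in range(reps)]  # multiple repetitions, fixed shape
    return statistics.median(times)               # median resists stray slow runs
```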

## First Metrics To Inspect

- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding

## Practical `ncu` Examples

```bash
ncu --set full --target-processes all \
    python bench/bench_vector_add.py --device cuda --mode cuda
```

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
    python bench/bench_softmax.py --device cuda --mode triton
```
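
`--set full` collects every metric section and is correspondingly slow, while the `--metrics` form collects only the named counters and runs much faster; `--target-processes all` ensures kernels launched from child processes of the Python script are profiled too.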

## Practical `nsys` Examples

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/attention_triton \
    python bench/bench_attention.py --device cuda --mode triton
```

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
    -o profile-output/matmul_cuda \
    python bench/bench_matmul.py --device cuda --mode cuda
```
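
Because `--trace` includes `nvtx`, named ranges added in the benchmark script appear on the nsys timeline. A minimal annotation sketch using PyTorch's NVTX bindings, assuming a callable `fn` for the workload:

```python
import torch

# Named NVTX ranges make regions easy to find on the nsys timeline.
torch.cuda.nvtx.range_push("warmup")
for _ in range(10):
    fn()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("timed_region")
fn()
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```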

## Checklist Before Trusting A Benchmark Result

- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference? (see the sketch below)
- Were shapes and dtypes identical?
- Was one implementation silently skipped or falling back to CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?
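
For the reference check, one common pattern is to compare against a trusted implementation before any timing. A sketch, with `my_softmax` as a hypothetical implementation under test:

```python
import torch

def check_against_reference(my_softmax, x):
    ref = torch.softmax(x, dim=-1)  # trusted reference implementation
    out = my_softmax(x)             # hypothetical implementation under test
    torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```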