agentic-ctx/references/benchmarking-crimes.md
2026-05-13 21:36:34 +08:00


# Benchmarking Crimes Reference
Source: Gernot Heiser, “Systems Benchmarking Crimes”
https://gernot-heiser.org/benchmarking-crimes.html
This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.
## Crime Index
### 1. Selective Benchmarking
- Not evaluating potential performance degradation.
- Benchmark subsetting without strong justification.
- Selective data set hiding deficiencies.
Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.
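
One way to act on this is to compare the whole suite rather than only the target case. The sketch below is illustrative: `find_regressions`, the 5% tolerance, and the benchmark names and timings are all hypothetical, not part of Heiser's text.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return benchmarks where the candidate is slower than baseline by more than tolerance.

    Both arguments map benchmark name -> runtime in seconds (lower is better).
    The 5% default tolerance is an arbitrary illustrative choice.
    """
    # Refuse to compare if the candidate run silently dropped part of the suite.
    missing = set(baseline) - set(candidate)
    if missing:
        raise ValueError(f"candidate skips benchmarks: {sorted(missing)}")

    regressions = {}
    for name, base_time in baseline.items():
        slowdown = candidate[name] / base_time - 1.0
        if slowdown > tolerance:
            regressions[name] = slowdown
    return regressions


# Made-up timings: the target case improves, but another workload degrades.
baseline = {"target_case": 2.00, "small_io": 1.00, "mixed": 4.00}
candidate = {"target_case": 1.20, "small_io": 1.30, "mixed": 4.10}

regressions = find_regressions(baseline, candidate)
for name, slowdown in regressions.items():
    print(f"{name}: {slowdown:+.1%} relative to baseline")
```

Raising on a missing benchmark makes subsetting an explicit decision that has to be justified, not something that happens by omission.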
### 2. Improper Handling of Benchmark Results
- Pretending microbenchmarks represent overall performance.
- Treating throughput degradation as equal to overhead.
- Downplaying overheads through wrong arithmetic or wrong denominators.
- No indication of statistical significance.
- Arithmetic mean over normalized benchmark scores.
Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
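
Two of the items above can be made concrete with a little arithmetic. The sketch below uses made-up numbers. The names `cost_overhead`, `baseline_s`, and `candidate_s` are hypothetical, and modelling cost as CPU utilisation per unit of throughput is an assumption about what "overhead" should mean for a throughput benchmark.

```python
from statistics import fmean, geometric_mean

# Made-up runtimes in seconds for two benchmarks on two systems.
baseline_s = [10.0, 100.0]
candidate_s = [20.0, 50.0]

# Normalize in both directions.
cand_over_base = [c / b for c, b in zip(candidate_s, baseline_s)]  # [2.0, 0.5]
base_over_cand = [b / c for c, b in zip(candidate_s, baseline_s)]  # [0.5, 2.0]

# The arithmetic mean of normalized scores depends on the chosen reference:
# each system looks 25% slower than the other.
arith_cand = fmean(cand_over_base)
arith_base = fmean(base_over_cand)

# The geometric mean is consistent under a change of reference.
geo_cand = geometric_mean(cand_over_base)
geo_base = geometric_mean(base_over_cand)

print(f"arithmetic: {arith_cand:.2f} vs {arith_base:.2f}")
print(f"geometric:  {geo_cand:.2f} vs {geo_base:.2f}")


def cost_overhead(base_tput, base_cpu, new_tput, new_cpu):
    """Overhead measured as the relative increase in CPU cost per unit of work."""
    base_cost = base_cpu / base_tput
    new_cost = new_cpu / new_tput
    return new_cost / base_cost - 1.0


# Throughput falls by 10%, but CPU utilisation rises from 60% to 100%.
tput_drop = 1.0 - 90.0 / 100.0
overhead = cost_overhead(100.0, 0.60, 90.0, 1.00)
print(f"throughput drop: {tput_drop:.0%}, cost overhead: {overhead:.0%}")
```

In the second half, the throughput figure alone hides the fact that the baseline had spare CPU capacity, so the drop in throughput understates the per-unit processing cost.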
### 3. Using the Wrong Benchmarks
- Benchmarking simplified simulated systems without validating assumptions.
- Inappropriate or misleading benchmarks.
- Using the same data for calibration and validation.
Core idea: the workload must stress the phenomenon behind the claim.
### 4. Improper Comparison of Benchmark Results
- No proper baseline.
- Only evaluate against yourself.
- Unfair benchmarking of competitors.
- Inflating gains by not comparing against the state of the art.
Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.
### 5. Missing Information
- Missing platform specification.
- Missing sub-benchmark results.
- Relative numbers only.
Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
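
As a minimal sketch of the idea, a benchmark script can record the platform next to its results and print absolute values alongside any ratio. The helper names `describe_platform` and `format_result` are hypothetical, and the fields shown are only what Python's standard library can report. A real report would likely also need hardware details (CPU model, memory, clock settings) gathered by other means.

```python
import os
import platform
import sys


def describe_platform():
    """Collect basic platform facts that the standard library can report."""
    return {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string on some systems
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "logical_cpus": os.cpu_count(),
        "argv": list(sys.argv),  # the command line used for this run
    }


def format_result(name, baseline_s, candidate_s):
    """Report absolute runtimes together with the ratio, never the ratio alone."""
    ratio = candidate_s / baseline_s
    return (
        f"{name}: baseline {baseline_s:.3f} s, "
        f"candidate {candidate_s:.3f} s, ratio {ratio:.2f}x"
    )


info = describe_platform()
for key, value in info.items():
    print(f"{key}: {value}")
print(format_result("target_case", 2.0, 1.2))
```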
## Best Practice Index
- Ensure quiescence before timing.
- Make benchmarking part of regression testing.
- Document commands, platform, versions, and configuration.
- Verify transferred or stored data.
- Use different data per run when stale data or caching could matter.
- Repeat points consecutively and separated in time.
- Invert measurement order to detect interference.
- Include non-regular and pathological data points.
- Compare configurations at exactly the same data points.
- Run several trials and report variance.
- Use principled outlier handling.
- Warm up before timed runs.
- Use enough iterations that clock granularity is negligible relative to the measured time.
- Eliminate loop overhead.
- Inspect generated machine code for low-level timing loops when needed.
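
Several of these practices (warm-up, repeated trials, enough iterations, loop-overhead correction, reporting variance) can be combined in a small harness. The following is a sketch under stated assumptions, not a complete methodology: `time_trials` and `summarize` are hypothetical names, the default counts are arbitrary, and the correction removes only the cost of the empty `for` loop, not the cost of calling `fn`.

```python
import statistics
import time


def time_trials(fn, *, warmup=3, trials=5, iterations=1000):
    """Return one per-call time estimate (in seconds) for each trial."""
    # Warm up caches, allocators, and lazy initialisation before timing.
    for _ in range(warmup):
        fn()

    # Estimate the cost of the bare loop so it can be subtracted later.
    start = time.perf_counter()
    for _ in range(iterations):
        pass
    loop_overhead = time.perf_counter() - start

    samples = []
    for _ in range(trials):
        # Time many iterations at once so clock granularity is amortised.
        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        elapsed = time.perf_counter() - start
        samples.append(max(0.0, elapsed - loop_overhead) / iterations)
    return samples


def summarize(samples):
    """Report spread as well as the central value."""
    return {
        "trials": len(samples),
        "mean_s": statistics.fmean(samples),
        "stdev_s": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min_s": min(samples),
        "max_s": max(samples),
    }


data = list(range(1000, 0, -1))
samples = time_trials(lambda: sorted(data), iterations=200)
summary = summarize(samples)
print(summary)
```

This sketch does not cover quiescence, inverted measurement order, or fresh data per run. Those depend on the system under test and would need to be added around the harness.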
## Local Policy
In this agent context, benchmark crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, unfair baseline, wrong overhead accounting, or SOTA omission is `Blocking` until fixed.
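
The policy above could be encoded as a simple lookup. This is a hypothetical sketch: the identifiers in `BLOCKING_CRIMES` and the `severity` function are invented for illustration, and since this file defines no severity other than `Blocking`, every other case returns `None`.

```python
# Hypothetical identifiers for the four crimes named in the local policy.
BLOCKING_CRIMES = frozenset(
    {
        "selective_benchmarking",
        "unfair_baseline",
        "wrong_overhead_accounting",
        "sota_omission",
    }
)


def severity(crime, affects_headline_claim):
    """Return "Blocking" when a listed crime affects a headline claim, else None.

    Other severity levels are not defined in this file, so they are not modelled.
    """
    if affects_headline_claim and crime in BLOCKING_CRIMES:
        return "Blocking"
    return None


print(severity("unfair_baseline", affects_headline_claim=True))
print(severity("unfair_baseline", affects_headline_claim=False))
```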