# Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes”
https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.

## Crime Index

### 1. Selective Benchmarking

- Not evaluating potential performance degradation.
- Benchmark subsetting without strong justification.
- Selective data set hiding deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.

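
As a rough illustration of this core idea, the sketch below compares a candidate against a baseline over the whole suite rather than a hand-picked subset, and lists every benchmark that got slower. The benchmark names, the timings, and the 2% tolerance are hypothetical, and lower-is-better runtimes are assumed.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {benchmark: relative_change} for every benchmark that got slower.

    Both arguments map benchmark name -> runtime in seconds (lower is better).
    """
    mismatched = set(baseline) ^ set(candidate)
    if mismatched:
        # Refuse to silently drop benchmarks: that would be subsetting.
        raise ValueError(f"benchmarks missing from one side: {sorted(mismatched)}")
    regressions = {}
    for name, base_time in baseline.items():
        change = (candidate[name] - base_time) / base_time
        if change > tolerance:
            regressions[name] = change
    return regressions


# Hypothetical runtimes in seconds.
baseline = {"small_io": 1.00, "large_io": 4.00, "cpu_bound": 2.00}
candidate = {"small_io": 0.60, "large_io": 4.60, "cpu_bound": 2.02}

print(find_regressions(baseline, candidate))
```

With these made-up numbers the headline gain on `small_io` comes with a regression of roughly 15% on `large_io`, which is the kind of result a fair evaluation would report alongside the improvement.
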
### 2. Improper Handling of Benchmark Results

- Pretending microbenchmarks represent overall performance.
- Treating throughput degradation as equal to overhead.
- Downplaying overheads through wrong arithmetic or wrong denominators.
- No indication of statistical significance.
- Arithmetic mean over normalized benchmark scores.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, averages, and overheads are not interchangeable.

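
Two of these items are easy to check with a few lines of arithmetic. The sketch below uses hypothetical numbers throughout: it shows that the arithmetic mean of normalized scores depends on which system is the reference while the geometric mean does not, and that a small throughput drop can hide a much larger overhead once CPU load is taken into account. The function names are illustrative, not part of any existing tool.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Mean of logs, then exponentiate; requires strictly positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def overhead_from_cost(base_throughput, base_cpu, new_throughput, new_cpu):
    """Overhead as the relative increase in CPU cost per unit of work."""
    base_cost = base_cpu / base_throughput
    new_cost = new_cpu / new_throughput
    return new_cost / base_cost - 1.0


# Hypothetical scores of system B normalized to system A, and the reverse.
b_relative_to_a = [2.0, 0.5]
a_relative_to_b = [1.0 / s for s in b_relative_to_a]

# The arithmetic mean says each system beats the other, which cannot be true.
print(arithmetic_mean(b_relative_to_a), arithmetic_mean(a_relative_to_b))
# The geometric mean gives the same verdict regardless of the reference system.
print(geometric_mean(b_relative_to_a), geometric_mean(a_relative_to_b))

# Hypothetical run: throughput drops 10% while CPU load rises from 60% to 100%.
print(overhead_from_cost(100.0, 0.60, 90.0, 1.00))
```

In the last example the throughput degradation is 10%, but the cost per unit of work rises by roughly 85%, so quoting 10% as the overhead would understate it considerably.
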
### 3. Using the Wrong Benchmarks

- Benchmarking simplified simulated systems without validating assumptions.
- Inappropriate or misleading benchmarks.
- Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.

### 4. Improper Comparison of Benchmark Results

- No proper baseline.
- Only evaluate against yourself.
- Unfair benchmarking of competitors.
- Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

### 5. Missing Information

- Missing platform specification.
- Missing sub-benchmark results.
- Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.

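
One low-effort way to avoid a missing platform specification is to record basic platform facts next to every result file. The following is a minimal sketch using only the Python standard library; the `describe_platform` helper and the `bench.py` command line are hypothetical, and the fields shown are a starting point rather than a complete specification (CPU model, memory, kernel settings, and compiler flags would usually need to be added by hand).

```python
import json
import os
import platform
import sys


def describe_platform(command=None):
    """Collect basic platform facts to store next to benchmark results."""
    return {
        "command": command if command is not None else sys.argv,
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be empty on some systems
        "logical_cpus": os.cpu_count(),
        "python": platform.python_version(),
    }


# Hypothetical invocation being documented.
print(json.dumps(describe_platform(["bench.py", "--size", "4096"]), indent=2))
```
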
## Best Practice Index

- Ensure quiescence before timing.
- Make benchmarking part of regression testing.
- Document commands, platform, versions, and configuration.
- Verify transferred or stored data.
- Use different data per run when stale data or caching could matter.
- Repeat each data point both consecutively and separated in time.
- Invert measurement order to detect interference.
- Include non-regular and pathological data points.
- Compare configurations at exactly the same data points.
- Run several trials and report variance.
- Use principled outlier handling.
- Warm up before timed runs.
- Use enough iterations that clock granularity does not distort the result.
- Eliminate loop overhead from the measurement.
- Inspect generated machine code for low-level timing loops when needed.

## Local Policy

In this agent context, benchmark crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or SOTA omission is `Blocking` until fixed.

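
The policy can be read as a simple predicate. The sketch below is only an illustration of that reading: the crime identifiers and the `is_blocking` helper are hypothetical names, and the policy above does not say how findings that fall outside this rule are classified, so the function deliberately returns only a yes/no answer.

```python
# Hypothetical identifiers for the four crimes the policy names.
BLOCKING_CRIMES = frozenset({
    "selective_benchmarking",
    "unfair_baseline",
    "wrong_overhead_accounting",
    "sota_omission",
})


def is_blocking(crimes_found, affects_headline_claim):
    """Return True if the finding should be marked Blocking under the policy."""
    return bool(affects_headline_claim) and bool(BLOCKING_CRIMES & set(crimes_found))


print(is_blocking(["unfair_baseline", "missing_platform_spec"], True))
print(is_blocking(["missing_platform_spec"], True))
```
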