initial commit
references/benchmarking-crimes.md
# Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes”
https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.

## Crime Index

### 1. Selective Benchmarking

- Not evaluating potential performance degradation.
- Benchmark subsetting without strong justification.
- Selective data set hiding deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.
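
A minimal sketch of what reporting both sides can look like, assuming per-benchmark scores where higher is better. The benchmark names, the scores, and the 2% tolerance are hypothetical illustration values, not taken from the source.

```python
def relative_changes(baseline, candidate):
    """Return the relative change of every benchmark, not just the winners.

    Assumes higher scores are better and that both dicts cover the same suite.
    """
    mismatched = set(baseline) ^ set(candidate)
    if mismatched:
        # Refuse to silently subset the suite.
        raise ValueError(f"benchmarks missing from one side: {sorted(mismatched)}")
    return {name: candidate[name] / baseline[name] - 1.0 for name in baseline}


def regressions(changes, tolerance=0.02):
    """List benchmarks that got worse by more than `tolerance` (hypothetical threshold)."""
    return sorted(name for name, delta in changes.items() if delta < -tolerance)


# Hypothetical scores in operations per second, for illustration only.
baseline = {"target_workload": 100.0, "small_messages": 200.0, "mixed_workload": 50.0}
candidate = {"target_workload": 130.0, "small_messages": 180.0, "mixed_workload": 50.0}

changes = relative_changes(baseline, candidate)
worse = regressions(changes)
```

Here the target workload improves by about 30%, but `worse` also surfaces the 10% regression on `small_messages`, which a report limited to the target regime would hide.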

### 2. Improper Handling of Benchmark Results

- Pretending microbenchmarks represent overall performance.
- Treating throughput degradation as equal to overhead.
- Downplaying overheads through wrong arithmetic or wrong denominators.
- No indication of statistical significance.
- Arithmetic mean over normalized benchmark scores.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
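
The arithmetic behind three of these crimes can be sketched as follows. All numbers are hypothetical and chosen only to make the effect visible; the helper names are illustrative, not part of any existing tool.

```python
from statistics import geometric_mean, mean

# Hypothetical run times in seconds (lower is better) for two systems.
times_a = {"bench1": 10.0, "bench2": 100.0}
times_b = {"bench1": 20.0, "bench2": 50.0}


def normalized(times, reference):
    """Normalize each run time to the same benchmark on the reference system."""
    return [times[name] / reference[name] for name in reference]


b_rel_a = normalized(times_b, times_a)  # [2.0, 0.5]
a_rel_b = normalized(times_a, times_b)  # [0.5, 2.0]

# The arithmetic mean of normalized scores depends on the reference chosen:
# each system looks 25% slower than the other.
arith_b_rel_a = mean(b_rel_a)
arith_a_rel_b = mean(a_rel_b)

# The geometric mean gives the same verdict from either side (a ratio of 1).
geo_b_rel_a = geometric_mean(b_rel_a)
geo_a_rel_b = geometric_mean(a_rel_b)


def overhead(base_util, base_tput, new_util, new_tput):
    """Relative increase in CPU cost per unit of work (utilization / throughput)."""
    base_cost = base_util / base_tput
    new_cost = new_util / new_tput
    return new_cost / base_cost - 1.0


# Throughput drops 10% while CPU utilization rises from 60% to 100%.
# The overhead is about 85%, not 10%.
ovh = overhead(0.60, 1000.0, 1.00, 900.0)


def relative_increase(before, after):
    """Relative change using the original value as the denominator."""
    return (after - before) / before


# Utilization rising from 6% to 13% is 7 percentage points,
# but the cost has more than doubled (about a 117% increase).
rel = relative_increase(0.06, 0.13)
```

The geometric mean is used here because it is the usual choice for averaging ratios; whether a single summary number is appropriate at all still depends on the claim being made.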

### 3. Using the Wrong Benchmarks

- Benchmarking simplified simulated systems without validating assumptions.
- Inappropriate or misleading benchmarks.
- Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.
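
To illustrate the calibration/validation point, here is a minimal sketch that fits a simple cost model on one part of the measurements and checks it on a disjoint part. The synthetic data, the linear model, and the 30/10 split are assumptions made for the example.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    """Average absolute prediction error of the fitted line on `points`."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


# Hypothetical measurements: (input size, measured cost) with bounded noise.
rng = random.Random(0)
measurements = [(x, 3.0 * x + 5.0 + rng.uniform(-1.0, 1.0)) for x in range(1, 41)]

# Shuffle, then keep calibration and validation sets disjoint.
rng.shuffle(measurements)
calibration, validation = measurements[:30], measurements[30:]

model = fit_line(calibration)
validation_error = mean_abs_error(model, validation)
```

Reporting `validation_error` rather than the error on `calibration` is the point: a model evaluated on the data it was fitted to will always look better than it is.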

### 4. Improper Comparison of Benchmark Results

- No proper baseline.
- Evaluating only against yourself.
- Unfair benchmarking of competitors.
- Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

### 5. Missing Information

- Missing platform specification.
- Missing sub-benchmark results.
- Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
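
One way to keep this information attached to results is sketched below, using only the Python standard library. The field names and the sub-benchmark name are hypothetical, and a fuller platform specification (for example CPU model, clock rate, and memory sizes) would typically need additional tooling.

```python
import os
import platform
import sys


def platform_record():
    """Collect basic platform facts to publish alongside the results."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string on some systems
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }


def report_row(name, baseline_seconds, candidate_seconds):
    """Keep the absolute times next to the ratio derived from them."""
    return {
        "benchmark": name,
        "baseline_s": baseline_seconds,
        "candidate_s": candidate_seconds,
        "speedup": baseline_seconds / candidate_seconds,
    }


record = platform_record()
# Hypothetical timings for a single sub-benchmark.
row = report_row("hypothetical_sub_benchmark", 2.0, 1.6)
```

Emitting one `report_row` per sub-benchmark, rather than a single aggregate, keeps both the absolute numbers and the per-benchmark breakdown available to readers.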

## Best Practice Index

- Ensure quiescence before timing.
- Make benchmarking part of regression testing.
- Document commands, platform, versions, and configuration.
- Verify transferred or stored data.
- Use different data per run when stale data or caching could matter.
- Repeat each data point both consecutively and separated in time.
- Invert measurement order to detect interference.
- Include non-regular and pathological data points.
- Compare configurations at exactly the same data points.
- Run several trials and report variance.
- Use principled outlier handling.
- Warm up before timed runs.
- Use enough iterations to make the effect of clock granularity negligible.
- Eliminate loop overhead.
- Inspect generated machine code for low-level timing loops when needed.
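
A few of the practices above (warm-up, several trials with reported variance, enough iterations per sample, and subtracting loop overhead) can be sketched in a small timing harness. This is an illustrative sketch, not a replacement for a proper benchmarking tool: the iteration and trial counts are arbitrary defaults, and in an interpreted language the empty-loop subtraction is only a rough estimate.

```python
import statistics
import time


def time_once(func, iterations):
    """Return the mean time per call in ns, minus an empty-loop estimate."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    loop_and_work = time.perf_counter_ns() - start

    # Time the same loop with an empty body to estimate loop overhead.
    # The cost of the function call itself stays in the measurement.
    start = time.perf_counter_ns()
    for _ in range(iterations):
        pass
    loop_only = time.perf_counter_ns() - start

    return max(loop_and_work - loop_only, 0) / iterations


def benchmark(func, iterations=2_000, trials=5, warmup=2):
    """Run warm-up rounds, then several timed trials, and report the spread."""
    for _ in range(warmup):
        time_once(func, iterations)  # warm-up results are discarded
    samples = [time_once(func, iterations) for _ in range(trials)]
    return {
        "samples_ns": samples,
        "mean_ns": statistics.mean(samples),
        "stdev_ns": statistics.stdev(samples),
    }


def work():
    """Hypothetical workload used only to exercise the harness."""
    sum(range(100))


result = benchmark(work)
```

Keeping the raw `samples_ns` alongside the mean and standard deviation makes it possible to apply outlier handling afterwards without rerunning the measurement.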

## Local Policy

In this agent context, benchmarking crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or omission of the state of the art (SOTA) is `Blocking` until fixed.
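
The policy can be expressed as a small check, sketched here under stated assumptions: the crime tag strings and the function name are hypothetical, and because the policy text does not define a severity for other crimes, the sketch returns `None` for them rather than inventing one.

```python
# Hypothetical tags for the four crimes the policy names as blocking.
BLOCKING_CRIMES = frozenset({
    "selective_benchmarking",
    "unfair_baseline",
    "wrong_overhead_accounting",
    "sota_omission",
})


def headline_severity(crimes_found):
    """Classify a headline claim given the crime tags found for it.

    Returns ("Blocking", [matching tags]) if any blocking crime is present,
    otherwise (None, []) because the policy leaves other cases unspecified.
    """
    hits = sorted(BLOCKING_CRIMES.intersection(crimes_found))
    return ("Blocking", hits) if hits else (None, [])
```

For example, `headline_severity({"sota_omission", "missing_platform_spec"})` reports the claim as `Blocking` because of the SOTA omission, while a claim tagged only with `"relative_numbers_only"` gets no severity from this policy.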